on June 21, 2011 by Colin Gillespie in Transcriptomics, Tutorials, Comments (9)

Volcano plots of microarray data

To run the R commands in this post, you should first work through Analysing microarry data in BioConductor.

A volcano plot is a scatter plot that is often used when analysing micro-array data sets to give an overview of interesting genes. The log fold change is plotted on the x-axis and the negative log10 p-value is plotted on the y-axis.

Using the GSE20986 data set, the x-axis shows the log fold change between the HUVEC control cell line and one of the three treatment endothelial cell lines. After running the preliminarily analysis, we construct a table containing the fold change and p-values for all changes:

##Complete list of genes with p-values and fold change
##Coef=1, so we are just looking at huvec_choroid
gene_list <- topTable(huvec_ebFit, coef=1, number=1000000, sort.by="logFC")

The relevant columns can be obtained via the “$” operator, viz:

##The head command prints the first few values of a vector
head(gene_list$logFC)
head(gene_list$P.Value)

The volcano plot can then be generated using the standard plot command:

##The par command sets "nice" graphical defaults
par(mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01)
plot(gene_list$logFC, -log10(gene_list$P.Value),
     xlim=c(-10, 10), ylim=c(0, 15), #Set limits
     xlab="log2 fold change", ylab="-log10 p-value")#Set axis labels

which gives:

We can use R to determine how many genes are above certain cut-offs. For example, the number of genes that have an absolute fold change greater than 2 and a p-value less than 0.001 can be found using the command:

##Gives 717
sum(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.001) 

A more principled way of choosing a p-value cut-off is to use a multiple testing rule. Common rules are Bonferroni and FDR. When we carry out a large number of statistical tests, the Bonferroni cut-off is approximately 0.05/#tests. So

##no_of_genes=27,306
no_of_genes = dim(probeset.list)[1]
##Gives 203 genes
sum(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05/no_of_genes)

If we wanted to use the false discovery rate as a cut-off (FDR), then we would use the adjusted p-value:

##Gives 933
sum(abs(gene_list$logFC) > 2 & gene_list$adj.P.Val < 0.05)

In many microarray studies, the FDR cut-off is used. However, when we have a large number of statistically significant genes, as in this example, a more conservative rule can be useful.

Using ggplot2 for volcano plots

The problem with the above volcano plot is that since we are plotting around 27,000 points – one point per gene – then the points overlap and the colour becomes dense. An alternative graphical framework to the base graphics is the ggplot2 package. This package can be installed directly from CRAN, in the usual manner:

install.packages("ggplot2")

Then to construct the volcano plot, we use the following commands:

require(ggplot2)
##Highlight genes that have an absolute fold change > 2 and a p-value < Bonferroni cut-off
gene_list$threshold = as.factor(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05/no_of_genes)

##Construct the plot object
g = ggplot(data=gene_list, aes(x=logFC, y=-log10(P.Value), colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  opts(legend.position = "none") +
  xlim(c(-10, 10)) + ylim(c(0, 15)) +
  xlab("log2 fold change") + ylab("-log10 p-value")
g

which yields

By using the “alpha” and “size” options in ggplot2, we can control the transparency and size of the points.

To add text to the plot, we use geom_text. For example,

##Graph not shown
g + geom_text(aes(x=gene1$logFC, y=-log10(gene1$P.Value),
                     label=gene1$ID, size=1.2), colour="black")

You can also pass vectors of text labels.

9 Comments

  1. Review of “Volcano plots of microarray data” » The Bioinformatics Knowledgeblog

    June 21, 2011 @ 2:37 pm

    […] is a review for “Volcano plots of microarray data” by Colin […]

  2. Analysing microarray data in BioConductor » The Bioinformatics Knowledgeblog

    June 21, 2011 @ 2:41 pm

    […] from Simon Cockell on analyising the biological significance of results and another article by Colin Gillespie on visualising data from expression analysis as volcano plots. Post […]

  3. ishmael

    June 23, 2011 @ 10:51 am

    It looks like missing a “+” in
    opts(legend.position = “none”)

    It should be:
    opts(legend.position = “none”) +

  4. Colin Gillespie

    June 23, 2011 @ 10:53 am

    Thanks (and fixed).

  5. Holger

    July 14, 2011 @ 8:36 pm

    Great tutorial! Many thanks for the work!

    It seems as in the ggplot version of the plot is a little typo:
    aes(x=logFC, y=-log(P.Value), colour=threshold))
    should be
    aes(x=logFC, y=-log10(P.Value), colour=threshold))
    otherwise the natural logarithms are calculated.

  6. Colin Gillespie

    July 18, 2011 @ 8:56 am

    Good spot. Thanks (and fixed).

  7. JLA

    February 14, 2012 @ 1:59 pm

    Thanks for a helpful tutorial!

  8. KP

    February 14, 2012 @ 5:36 pm

    Hello,

    Thanks for the useful script and examples.

    When you use geom_text to plot the gene names, how do you do it just for the genes from what you defined in the column threshold?

    Thank you!

  9. Colin Gillespie

    February 14, 2012 @ 5:44 pm

    If I understand you correctly, you should create a new data frame containing just the genes of interest. So something like:

    dd_text = gene_list[(abs(gene_list$logFC) > 2) & (gene_list$P.Value < 0.05/no_of_genes),]
    g + geom_text(data=dd_text, aes(x=logFC, y=-log10(P.Value), labels=ID), colour="black")

    I'll try and find time in the next few days to double check this.

Leave a comment

Login