Volcano plots of microarray data
To run the R commands in this post, you should first work through Analysing microarry data in BioConductor.
A volcano plot is a scatter plot that is often used when analysing micro-array data sets to give an overview of interesting genes. The log fold change is plotted on the x-axis and the negative log10 p-value is plotted on the y-axis.
Using the GSE20986 data set, the x-axis shows the log fold change between the HUVEC control cell line and one of the three treatment endothelial cell lines. After running the preliminarily analysis, we construct a table containing the fold change and p-values for all changes:
##Complete list of genes with p-values and fold change ##Coef=1, so we are just looking at huvec_choroid gene_list <- topTable(huvec_ebFit, coef=1, number=1000000, sort.by="logFC")
The relevant columns can be obtained via the “$” operator, viz:
##The head command prints the first few values of a vector head(gene_list$logFC) head(gene_list$P.Value)
The volcano plot can then be generated using the standard plot command:
##The par command sets "nice" graphical defaults par(mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01) plot(gene_list$logFC, -log10(gene_list$P.Value), xlim=c(-10, 10), ylim=c(0, 15), #Set limits xlab="log2 fold change", ylab="-log10 p-value")#Set axis labels
We can use R to determine how many genes are above certain cut-offs. For example, the number of genes that have an absolute fold change greater than 2 and a p-value less than 0.001 can be found using the command:
##Gives 717 sum(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.001)
A more principled way of choosing a p-value cut-off is to use a multiple testing rule. Common rules are Bonferroni and FDR. When we carry out a large number of statistical tests, the Bonferroni cut-off is approximately 0.05/#tests. So
##no_of_genes=27,306 no_of_genes = dim(probeset.list) ##Gives 203 genes sum(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05/no_of_genes)
If we wanted to use the false discovery rate as a cut-off (FDR), then we would use the adjusted p-value:
##Gives 933 sum(abs(gene_list$logFC) > 2 & gene_list$adj.P.Val < 0.05)
In many microarray studies, the FDR cut-off is used. However, when we have a large number of statistically significant genes, as in this example, a more conservative rule can be useful.
Using ggplot2 for volcano plots
The problem with the above volcano plot is that since we are plotting around 27,000 points – one point per gene – then the points overlap and the colour becomes dense. An alternative graphical framework to the base graphics is the ggplot2 package. This package can be installed directly from CRAN, in the usual manner:
Then to construct the volcano plot, we use the following commands:
require(ggplot2) ##Highlight genes that have an absolute fold change > 2 and a p-value < Bonferroni cut-off gene_list$threshold = as.factor(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05/no_of_genes) ##Construct the plot object g = ggplot(data=gene_list, aes(x=logFC, y=-log10(P.Value), colour=threshold)) + geom_point(alpha=0.4, size=1.75) + opts(legend.position = "none") + xlim(c(-10, 10)) + ylim(c(0, 15)) + xlab("log2 fold change") + ylab("-log10 p-value") g
By using the “alpha” and “size” options in ggplot2, we can control the transparency and size of the points.
To add text to the plot, we use geom_text. For example,
##Graph not shown g + geom_text(aes(x=gene1$logFC, y=-log10(gene1$P.Value), label=gene1$ID, size=1.2), colour="black")
You can also pass vectors of text labels.