Analysing microarray data in BioConductor
Microarray analysis is a complex procedure, and using R and the associated BioConductor packages can be a daunting task. In this tutorial we are going to show you how to install R and BioConductor, download a dataset from GEO, normalise it, do some quality control checks and finally analyse it to find differentially expressed genes. The following instructions assume that you are working with a fresh install of an Ubuntu based Linux distribution, and includes the installation of all the pre-requisites for the analysis. Some familiarity with Linux is ideal and the instructions were developed on Ubuntu 11.04, R 2.12.1. For a full code listing for this tutorial and figures resulting from it see the second part of the article.
Open up a terminal (Applications->Accessories->Terminal from the the toolbar)
Install R and BioConductor dependencies, and then start R:
$ sudo apt-get install r-base-core libxml2-dev libcurl4-openssl-dev curl $ R
Now at the R prompt:
> # download the BioC installation routines > source("http://bioconductor.org/biocLite.R") > # install the core packages > biocLite() > # install the GEO libraries > biocLite("GEOquery")
If you are running as an unprivileged user you will be prompted to install the packages into a personal library. We’re also install the GEOquery package at this point. Installing the core BioConductor packages will take some time. GEOquery is an interface to the Gene Expression Omnibus at the NCBI, which holds transcriptomics data in a standardised format.
Getting the data
For this tutorial we are going to be working with dataset GSE20986, generated by Dr Andrew Browning. HUVECs are cells often used in research, derived from human umbilical vein, used for investigations into the pathology and physiology of endothelial cells. In this experiment iris, retina and choroidal microvascular endothelial cells were isolated from donor eyes, and compared to HUVECs in order to ascertain if the HUVEC line is a suitable surrogate for studying ocular disorders. The GEO page also includes information on the design of the experiment. We can see that there are triplicate measurements for each set of samples. This splits the samples into four groups “iris” “retina” “HUVEC” and “choroidal”.
The GEO Platform is GPL570 – the GEO short code for the Affymetrix Human Genome U133 Plus 2.0 Array, a commonly used microarray chip for human transcriptome studies. If we investigate the GSM links such as GSM524662 we get more detailed information about the experimental conditions used on each chip. They should be the same for each chip, as they are derived from the same experiment. What is captured is information about how the cell lines were extracted, grown, processed for RNA extraction etc. They also include details on how the RNA was processed and hybridised to the microarray chip, and what machine was used to scan the chips. This data is all captured to ensure compliance to MIAME standards.
For each chip, the data table shows the probeset alongside a normalised expression value for the probeset. The table header description let you know how the data was normalised. In this case you can see the GC-RMA algorithm was used. For more information about GC-RMA you can read this paper.
First of all we need to acquire the raw data directly from GEO. We can do this by loading the GEOquery library, and pointing it to the experiment we want to analyse. This download is approximately 53MB.
> library(GEOquery) > getGEOSuppFiles("GSE20986")
If you open a file browser you will see this creates a GSE20986 directory under the directory you launched R in:
$ ls GSE20986/ filelist.txt GSE20986_RAW.tar
The GSE20986_RAW.tar file is a compressed archive of the CEL files (the Affymetrix native file format). Before we can work on them we need to uncompress them. We can achieve this at the R prompt. This uncompresses the tar file (which contains compressed CEL files). The CEL files are subsequently decompressed themselves to the directory in which we are working with R.
> untar("GSE20986/GSE20986_RAW.tar", exdir="data") > cels <- list.files("data/", pattern = "[gz]") > sapply(paste("data", cels, sep="/"), gunzip) > cels
Describing the experiment
We are now ready to start analysis, however in order to do this we need to capture the experimental information. This is just a text file which describes the chip names, and the source of the biological samples hybridised to them.
To load the data into R using simpleaffy we need to create a file to represent this information. Open a new terminal window and type:
$ ls data/*.CEL > data/phenodata.txt
Open this file in a text editor. Currently this is a single column listing the files just a list of the CEL files. The final file needs 3 tab-delimited columns named ‘Name’ ‘FileName’ and ‘Target’. The FileName and Name columns will be identical in this case, although the Name column can be used to provide a more human readable sample name if desired.
The Target column is information about the samples taken from the GEO page. Label the samples appropriately either ‘iris’ ‘retina’ ‘choroid’ or ‘huvec’. This defines our replicate groups for further analysis when we compare the cell line with the tissue samples.
The final result should look like this (ensure that there are tabs between the columns and not spaces!). You can produce this file in any spreadsheet program providing you save it as ‘tab-delimited text’.
Name FileName Target GSM524662.CEL GSM524662.CEL iris GSM524663.CEL GSM524663.CEL retina GSM524664.CEL GSM524664.CEL retina GSM524665.CEL GSM524665.CEL iris GSM524666.CEL GSM524666.CEL retina GSM524667.CEL GSM524667.CEL iris GSM524668.CEL GSM524668.CEL choroid GSM524669.CEL GSM524669.CEL choroid GSM524670.CEL GSM524670.CEL choroid GSM524671.CEL GSM524671.CEL huvec GSM524672.CEL GSM524672.CEL huvec GSM524673.CEL GSM524673.CEL huvec
Loading and normalising the data
The simpleaffy package provides routines for handling CEL files including normalisation and loading data with sample information. We are going to load the data into an R object called ‘celfiles’
> library(simpleaffy) > celfiles <- read.affy(covdesc="phenodata.txt", path="data")
You can confirm this has worked, and download the chip annotations at the same time (if this is your first run) by typing:
> celfiles AffyBatch object size of arrays=1164x1164 features (12 kb) cdf=HG-U133_Plus_2 (54675 affyids) number of samples=12 number of genes=54675 annotation=hgu133plus2 notes=
Now we can normalise the data. As with the data deposited in GEO we are going to use the GC-RMA algorithm. If this is the first time you have run it, it will download additional files during the process.
> celfiles.gcrma <- gcrma(celfiles) Adjusting for optical effect............Done. Computing affinitiesLoading required package: AnnotationDbi .Done. Adjusting for non-specific binding............Done. Normalizing Calculating Expression
If you want to see the data associated with the normalised object you can – you will notice this is no longer an AffyBatch object, but an ExpressionSet object, which holds normalised values.
> celfiles.gcrma ExpressionSet (storageMode: lockedEnvironment) assayData: 54675 features, 12 samples element names: exprs protocolData sampleNames: GSM524662.CEL GSM524663.CEL ... GSM524673.CEL (12 total) varLabels: ScanDate varMetadata: labelDescription phenoData sampleNames: GSM524662.CEL GSM524663.CEL ... GSM524673.CEL (12 total) varLabels: sample FileName Target varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: hgu133plus2
Quality control checks
Before we analyse the data it is imperative that we do some quality control checks to make sure there are no issues with the dataset. The first thing we can do is check the effects of the GC-RMA normalisation, by plotting a boxplot of probe intensities before and after normalisation.
> # load colour libraries > library(RColorBrewer) > # set colour palette > cols <- brewer.pal(8, "Set1") > # plot a boxplot of unnormalised intensity values > boxplot(celfiles, col=cols) > # plot a boxplot of normalised intensity values, affyPLM is required to interrogate celfiles.gcrma > library(affyPLM) > boxplot(celfiles.gcrma, col=cols) > # the boxplots are somewhat skewed by the normalisation algorithm > # and it is often more informative to look at density plots > # Plot a density vs log intensity histogram for the unnormalised data > hist(celfiles, col=cols) > # Plot a density vs log intensity histogram for the normalised data > hist(celfiles.gcrma, col=cols)
From these plots we can conclude that there are no major deviations amongst the 12 chips, and normalisation has brought the intensities from all of the chips into distributions with similar characteristics. To take a closer look at the situation on a per-chip level we can use affyPLM. affyPLM allows us to visualise statistical characteristics of the CEL files.
> # Perform probe-level metric calculations on the CEL files: > celfiles.qc <- fitPLM(celfiles) > # Create an image of GSM24662.CEL: > image(celfiles.qc, which=1, add.legend=TRUE) > # Create an image of GSM524665.CEL > # There is a spatial artifact present > image(celfiles.qc, which=4, add.legend=TRUE) > # affyPLM also provides more informative boxplots > # RLE (Relative Log Expression) plots should have > # values close to zero. GSM524665.CEL is an outlier > RLE(celfiles.qc, main="RLE") > # We can also use NUSE (Normalised Unscaled Standard Errors). > # The median standard error should be 1 for most genes. > # GSM524665.CEL appears to be an outlier on this plot too > NUSE(celfiles.qc, main="NUSE")
We can also look at the relationships between the samples using heirarchical clustering:
> eset <- exprs(celfiles.gcrma) > distance <- dist(t(eset),method="maximum") > clusters <- hclust(distance) > plot(clusters)
This suggests our HUVEC samples are very much a separate group compared to the eye tissues, which show some evidence of clustering by tissue type. GSM524665.CEL does not appear to be a particular outlier in this view.
Now we have looked at the data, we can go on to analyse it. The first stage of analysis is to filter out uninformative data such as control probesets and other internal controls as well as removing genes with low variance, that would be unlikely to pass statistical tests for differential expression, or are expressed uniformly close to background detection levels. The modifiers to nsFilter below tell nsFilter not to remove probesets without Entrez Gene identifiers, or have duplicated Entrez Gene identifiers.
> celfiles.filtered <- nsFilter(celfiles.gcrma, require.entrez=FALSE, remove.dupEntrez=FALSE) > # What got removed and why? > celfiles.filtered$filter.log $numLowVar  27307 $feature.exclude  62
From this we conclude 27,307 probesets have been removed for reasons of low variance, and 62 control probesets have been removed as well.
Finding differentially expressed probesets
Now we have a filtered dataset, we can send the information to limma for differential gene expression analysis. First of all we need to extract information about the samples:
> samples <- celfiles.gcrma$Target > # check the results of this > samples  "iris" "retina" "retina" "iris" "retina" "iris" "choroid"  "choroid" "choroid" "huvec" "huvec" "huvec" > # convert into factors > samples <- as.factor(samples) > # check factors have been assigned > samples  iris retina retina iris retina iris choroid choroid choroid  huvec huvec huvec Levels: choroid huvec iris retina > # set up the experimental design > design <- model.matrix(~0 + samples) > colnames(design) <- c("choroid", "huvec", "iris", "retina") > # inspect the experiment design > design choroid huvec iris retina 1 0 0 1 0 2 0 0 0 1 3 0 0 0 1 4 0 0 1 0 5 0 0 0 1 6 0 0 1 0 7 1 0 0 0 8 1 0 0 0 9 1 0 0 0 10 0 1 0 0 11 0 1 0 0 12 0 1 0 0 attr(,"assign")  1 1 1 1 attr(,"contrasts") attr(,"contrasts")$samples  "contr.treatment"
At this point we have normalised filtered data, and a description of the data and the samples and experimental design. This can be fed into limma for analysis.
> library(limma) > # fit the linear model to the filtered expression set > fit <- lmFit(exprs(celfiles.filtered$eset), design) > # set up a contrast matrix to compare tissues v cell line > contrast.matrix <- makeContrasts(huvec_choroid = huvec - choroid, huvec_retina = huvec - retina, huvec_iris <- huvec - iris, levels=design) > # check the contrast matrix > contrast.matrix Contrasts Levels huvec_choroid huvec_retina huvec_iris choroid -1 0 0 huvec 1 1 1 iris 0 0 -1 retina 0 -1 0 > # Now the contrast matrix is combined with the per-probeset linear model fit. > huvec_fits <- contrasts.fit(fit, contrast.matrix) > huvec_ebFit <- eBayes(huvec_fits) > # return the top 10 results for any given contrast > # coef=1 is huvec_choroid, coef=2 is huvec_retina > topTable(huvec_ebFit, number=10, coef=1) ID logFC AveExpr t P.Value adj.P.Val 6147 204779_s_at 7.367947 4.171874 72.77016 3.290669e-15 8.985500e-11 7292 207016_s_at 6.934796 4.027229 57.37259 3.712060e-14 5.068076e-10 8741 209631_s_at 5.193313 4.003541 51.22423 1.177337e-13 1.071612e-09 26309 242809_at 6.433514 4.167462 48.52518 2.042404e-13 1.394247e-09 6828 205893_at 4.480463 3.544350 40.59376 1.253534e-12 6.845801e-09 20232 227377_at 3.670688 3.209217 36.03427 4.200854e-12 1.911809e-08 6222 204882_at -5.351976 6.512018 -34.70239 6.154318e-12 2.400711e-08 27051 38149_at -5.051906 6.482418 -31.44098 1.672263e-11 5.282592e-08 6663 205576_at 6.586372 4.139236 31.31584 1.741131e-11 5.282592e-08 6589 205453_at 3.623706 3.210306 30.72793 2.109110e-11 5.759136e-08 B 6147 20.25563 7292 19.44681 8741 18.96282 26309 18.70767 6828 17.75331 20232 17.01960 6222 16.77196 27051 16.08854 6663 16.05990 6589 15.92277
To impose a fold change cut off, and see how many genes are returned you can use the lfc modifier for topTable, here we show the results for fold changes of 5,4,3 and 2 in terms of the number of probesets.
> nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=5))  88 > nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=4))  194 > nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=3))  386 > nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=2))  1016 > # Get a list for probesets with a four fold change or more > probeset.list <- topTable(huvec_ebFit, coef=1, number=10000, lfc=4)
Annotating the results with associated gene symbols
In order to annotate the probesets into gene symbols we need to install and load the associated database package and the annotate package, then we can extract the probeset ID’s from the topTable results, and match the symbols
> biocLite("hgu133plus2.db") > library(hgu133plus2.db) > library(annotate) > gene.symbols <- getSYMBOL(probeset.list$ID, "hgu133plus2") > results <- cbind(probeset.list, gene.symbols) > head(results) ID logFC AveExpr t P.Value adj.P.Val 6147 204779_s_at 7.367947 4.171874 72.77016 3.290669e-15 8.985500e-11 7292 207016_s_at 6.934796 4.027229 57.37259 3.712060e-14 5.068076e-10 8741 209631_s_at 5.193313 4.003541 51.22423 1.177337e-13 1.071612e-09 26309 242809_at 6.433514 4.167462 48.52518 2.042404e-13 1.394247e-09 6828 205893_at 4.480463 3.544350 40.59376 1.253534e-12 6.845801e-09 6222 204882_at -5.351976 6.512018 -34.70239 6.154318e-12 2.400711e-08 B gene.symbols 6147 20.25563 HOXB7 7292 19.44681 ALDH1A2 8741 18.96282 GPR37 26309 18.70767 IL1RL1 6828 17.75331 NLGN1 6222 16.77196 ARHGAP25 > write.table(results, "results.txt", sep="\t", quote=FALSE)
There is a follow-on article from Simon Cockell on analyising the biological significance of results and another article by Colin Gillespie on visualising data from expression analysis as volcano plots.