on June 20, 2011 by Daniel Swan in Transcriptomics, Tutorials, Comments (29)

# Analysing microarray data in BioConductor

## Introduction

Microarray analysis is a complex procedure, and using R and the associated BioConductor packages can be a daunting task. In this tutorial we are going to show you how to install R and BioConductor, download a dataset from GEO, normalise it, do some quality control checks and finally analyse it to find differentially expressed genes. The following instructions assume that you are working with a fresh install of an Ubuntu based Linux distribution, and includes the installation of all the pre-requisites for the analysis. Some familiarity with Linux is ideal and the instructions were developed on Ubuntu 11.04, R 2.12.1. For a full code listing for this tutorial and figures resulting from it see the second part of the article.

## Installation

Open up a terminal (Applications->Accessories->Terminal from the the toolbar)

Install R and BioConductor dependencies, and then start R:

$sudo apt-get install r-base-core libxml2-dev libcurl4-openssl-dev curl$ R

Now at the R prompt:

> source("http://bioconductor.org/biocLite.R")
> # install the core packages
> biocLite()
> # install the GEO libraries
> biocLite("GEOquery")

If you are running as an unprivileged user you will be prompted to install the packages into a personal library. We’re also install the GEOquery package at this point. Installing the core BioConductor packages will take some time. GEOquery is an interface to the Gene Expression Omnibus at the NCBI, which holds transcriptomics data in a standardised format.

## Getting the data

For this tutorial we are going to be working with dataset GSE20986, generated by Dr Andrew Browning. HUVECs are cells often used in research, derived from human umbilical vein, used for investigations into the pathology and physiology of endothelial cells. In this experiment iris, retina and choroidal microvascular endothelial cells were isolated from donor eyes, and compared to HUVECs in order to ascertain if the HUVEC line is a suitable surrogate for studying ocular disorders. The GEO page also includes information on the design of the experiment. We can see that there are triplicate measurements for each set of samples. This splits the samples into four groups “iris” “retina” “HUVEC” and “choroidal”.

The GEO Platform is GPL570 – the GEO short code for the Affymetrix Human Genome U133 Plus 2.0 Array, a commonly used microarray chip for human transcriptome studies. If we investigate the GSM links such as GSM524662 we get more detailed information about the experimental conditions used on each chip. They should be the same for each chip, as they are derived from the same experiment. What is captured is information about how the cell lines were extracted, grown, processed for RNA extraction etc. They also include details on how the RNA was processed and hybridised to the microarray chip, and what machine was used to scan the chips. This data is all captured to ensure compliance to MIAME standards.

For each chip, the data table shows the probeset alongside a normalised expression value for the probeset. The table header description let you know how the data was normalised. In this case you can see the GC-RMA algorithm was used. For more information about GC-RMA you can read this paper.

First of all we need to acquire the raw data directly from GEO. We can do this by loading the GEOquery library, and pointing it to the experiment we want to analyse. This download is approximately 53MB.

> library(GEOquery)
> getGEOSuppFiles("GSE20986")

If you open a file browser you will see this creates a GSE20986 directory under the directory you launched R in:

$ls GSE20986/ filelist.txt GSE20986_RAW.tar The GSE20986_RAW.tar file is a compressed archive of the CEL files (the Affymetrix native file format). Before we can work on them we need to uncompress them. We can achieve this at the R prompt. This uncompresses the tar file (which contains compressed CEL files). The CEL files are subsequently decompressed themselves to the directory in which we are working with R. > untar("GSE20986/GSE20986_RAW.tar", exdir="data") > cels <- list.files("data/", pattern = "[gz]") > sapply(paste("data", cels, sep="/"), gunzip) > cels ## Describing the experiment We are now ready to start analysis, however in order to do this we need to capture the experimental information. This is just a text file which describes the chip names, and the source of the biological samples hybridised to them. To load the data into R using simpleaffy we need to create a file to represent this information. Open a new terminal window and type:$ ls data/*.CEL > data/phenodata.txt

Open this file in a text editor. Currently this is a single column listing the files just a list of the CEL files. The final file needs 3 tab-delimited columns named ‘Name’ ‘FileName’ and ‘Target’. The FileName and Name columns will be identical in this case, although the Name column can be used to provide a more human readable sample name if desired.

The Target column is information about the samples taken from the GEO page. Label the samples appropriately either ‘iris’ ‘retina’ ‘choroid’ or ‘huvec’. This defines our replicate groups for further analysis when we compare the cell line with the tissue samples.

The final result should look like this (ensure that there are tabs between the columns and not spaces!). You can produce this file in any spreadsheet program providing you save it as ‘tab-delimited text’.

Name FileName Target
GSM524662.CEL GSM524662.CEL iris
GSM524663.CEL GSM524663.CEL retina
GSM524664.CEL GSM524664.CEL retina
GSM524665.CEL GSM524665.CEL iris
GSM524666.CEL GSM524666.CEL retina
GSM524667.CEL GSM524667.CEL iris
GSM524668.CEL GSM524668.CEL choroid
GSM524669.CEL GSM524669.CEL choroid
GSM524670.CEL GSM524670.CEL choroid
GSM524671.CEL GSM524671.CEL huvec
GSM524672.CEL GSM524672.CEL huvec
GSM524673.CEL GSM524673.CEL huvec

The simpleaffy package provides routines for handling CEL files including normalisation and loading data with sample information. We are going to load the data into an R object called ‘celfiles’

> library(simpleaffy)

You can confirm this has worked, and download the chip annotations at the same time (if this is your first run) by typing:

> celfiles
AffyBatch object
size of arrays=1164x1164 features (12 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=12
number of genes=54675
annotation=hgu133plus2
notes=

Now we can normalise the data. As with the data deposited in GEO we are going to use the GC-RMA algorithm. If this is the first time you have run it, it will download additional files during the process.

> celfiles.gcrma <- gcrma(celfiles)
.Done.
Normalizing
Calculating Expression

If you want to see the data associated with the normalised object you can – you will notice this is no longer an AffyBatch object, but an ExpressionSet object, which holds normalised values.

> celfiles.gcrma
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 12 samples
element names: exprs
protocolData
sampleNames: GSM524662.CEL GSM524663.CEL ... GSM524673.CEL (12 total)
varLabels: ScanDate
phenoData
sampleNames: GSM524662.CEL GSM524663.CEL ... GSM524673.CEL (12 total)
varLabels: sample FileName Target
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2

## Quality control checks

Before we analyse the data it is imperative that we do some quality control checks to make sure there are no issues with the dataset. The first thing we can do is check the effects of the GC-RMA normalisation, by plotting a boxplot of probe intensities before and after normalisation.

> library(RColorBrewer)
> # set colour palette
> cols <- brewer.pal(8, "Set1")
> # plot a boxplot of unnormalised intensity values
> boxplot(celfiles, col=cols)
> # plot a boxplot of normalised intensity values, affyPLM is required to interrogate celfiles.gcrma
> library(affyPLM)
> boxplot(celfiles.gcrma, col=cols)
> # the boxplots are somewhat skewed by the normalisation algorithm
> # and it is often more informative to look at density plots
> # Plot a density vs log intensity histogram for the unnormalised data
> hist(celfiles, col=cols)
> # Plot a density vs log intensity histogram for the normalised data
> hist(celfiles.gcrma, col=cols)

From these plots we can conclude that there are no major deviations amongst the 12 chips, and normalisation has brought the intensities from all of the chips into distributions with similar characteristics. To take a closer look at the situation on a per-chip level we can use affyPLM. affyPLM allows us to visualise statistical characteristics of the CEL files.

> # Perform probe-level metric calculations on the CEL files:
> celfiles.qc <- fitPLM(celfiles)
> # Create an image of GSM24662.CEL:
> # Create an image of GSM524665.CEL
> # There is a spatial artifact present
> # RLE (Relative Log Expression) plots should have
> # values close to zero. GSM524665.CEL is an outlier
> RLE(celfiles.qc, main="RLE")
> # We can also use NUSE (Normalised Unscaled Standard Errors).
> # The median standard error should be 1 for most genes.
> # GSM524665.CEL appears to be an outlier on this plot too
> NUSE(celfiles.qc, main="NUSE")

We can also look at the relationships between the samples using heirarchical clustering:

> eset <- exprs(celfiles.gcrma)
> distance <- dist(t(eset),method="maximum")
> clusters <- hclust(distance)
> plot(clusters)

This suggests our HUVEC samples are very much a separate group compared to the eye tissues, which show some evidence of clustering by tissue type. GSM524665.CEL does not appear to be a particular outlier in this view.

## Filtering data

Now we have looked at the data, we can go on to analyse it. The first stage of analysis is to filter out uninformative data such as control probesets and other internal controls as well as removing genes with low variance, that would be unlikely to pass statistical tests for differential expression, or are expressed uniformly close to background detection levels. The modifiers to nsFilter below tell nsFilter not to remove probesets without Entrez Gene identifiers, or have duplicated Entrez Gene identifiers.

> celfiles.filtered <- nsFilter(celfiles.gcrma, require.entrez=FALSE, remove.dupEntrez=FALSE)
> # What got removed and why?
> celfiles.filtered$filter.log$numLowVar
[1] 27307

$feature.exclude [1] 62 From this we conclude 27,307 probesets have been removed for reasons of low variance, and 62 control probesets have been removed as well. ## Finding differentially expressed probesets Now we have a filtered dataset, we can send the information to limma for differential gene expression analysis. First of all we need to extract information about the samples: > samples <- celfiles.gcrma$Target
> # check the results of this
> samples
[1] "iris" "retina" "retina" "iris" "retina" "iris" "choroid"
[8] "choroid" "choroid" "huvec" "huvec" "huvec"
> # convert into factors
> samples <- as.factor(samples)
> # check factors have been assigned
> samples
[1] iris retina retina iris retina iris choroid choroid choroid
[10] huvec huvec huvec
Levels: choroid huvec iris retina
> # set up the experimental design
> design <- model.matrix(~0 + samples)
> colnames(design) <- c("choroid", "huvec", "iris", "retina")
> # inspect the experiment design
> design
choroid huvec iris retina
1 0 0 1 0
2 0 0 0 1
3 0 0 0 1
4 0 0 1 0
5 0 0 0 1
6 0 0 1 0
7 1 0 0 0
8 1 0 0 0
9 1 0 0 0
10 0 1 0 0
11 0 1 0 0
12 0 1 0 0
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$samples [1] "contr.treatment" At this point we have normalised filtered data, and a description of the data and the samples and experimental design. This can be fed into limma for analysis. > library(limma) > # fit the linear model to the filtered expression set > fit <- lmFit(exprs(celfiles.filtered$eset), design)
> # set up a contrast matrix to compare tissues v cell line
> contrast.matrix <- makeContrasts(huvec_choroid = huvec - choroid, huvec_retina = huvec - retina, huvec_iris <- huvec - iris, levels=design)
> # check the contrast matrix
> contrast.matrix
Contrasts
Levels huvec_choroid huvec_retina huvec_iris
choroid -1 0 0
huvec 1 1 1
iris 0 0 -1
retina 0 -1 0
> # Now the contrast matrix is combined with the per-probeset linear model fit.
> huvec_fits <- contrasts.fit(fit, contrast.matrix)
> huvec_ebFit <- eBayes(huvec_fits)
> # return the top 10 results for any given contrast
> # coef=1 is huvec_choroid, coef=2 is huvec_retina
> topTable(huvec_ebFit, number=10, coef=1)
ID logFC AveExpr t P.Value adj.P.Val
6147 204779_s_at 7.367947 4.171874 72.77016 3.290669e-15 8.985500e-11
7292 207016_s_at 6.934796 4.027229 57.37259 3.712060e-14 5.068076e-10
8741 209631_s_at 5.193313 4.003541 51.22423 1.177337e-13 1.071612e-09
26309 242809_at 6.433514 4.167462 48.52518 2.042404e-13 1.394247e-09
6828 205893_at 4.480463 3.544350 40.59376 1.253534e-12 6.845801e-09
20232 227377_at 3.670688 3.209217 36.03427 4.200854e-12 1.911809e-08
6222 204882_at -5.351976 6.512018 -34.70239 6.154318e-12 2.400711e-08
27051 38149_at -5.051906 6.482418 -31.44098 1.672263e-11 5.282592e-08
6663 205576_at 6.586372 4.139236 31.31584 1.741131e-11 5.282592e-08
6589 205453_at 3.623706 3.210306 30.72793 2.109110e-11 5.759136e-08
B
6147 20.25563
7292 19.44681
8741 18.96282
26309 18.70767
6828 17.75331
20232 17.01960
6222 16.77196
27051 16.08854
6663 16.05990
6589 15.92277

To impose a fold change cut off, and see how many genes are returned you can use the lfc modifier for topTable, here we show the results for fold changes of 5,4,3 and 2 in terms of the number of probesets.

> nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=5))
[1] 88
> nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=4))
[1] 194
> nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=3))
[1] 386
> nrow(topTable(huvec_ebFit, coef=1, number=10000, lfc=2))
[1] 1016
> # Get a list for probesets with a four fold change or more
> probeset.list <- topTable(huvec_ebFit, coef=1, number=10000, lfc=4)

### Annotating the results with associated gene symbols

In order to annotate the probesets into gene symbols we need to install and load the associated database package and the annotate package, then we can extract the probeset ID’s from the topTable results, and match the symbols

> biocLite("hgu133plus2.db")
> library(hgu133plus2.db)
> library(annotate)
> gene.symbols <- getSYMBOL(probeset.list$ID, "hgu133plus2") > results <- cbind(probeset.list, gene.symbols) > head(results) ID logFC AveExpr t P.Value adj.P.Val 6147 204779_s_at 7.367947 4.171874 72.77016 3.290669e-15 8.985500e-11 7292 207016_s_at 6.934796 4.027229 57.37259 3.712060e-14 5.068076e-10 8741 209631_s_at 5.193313 4.003541 51.22423 1.177337e-13 1.071612e-09 26309 242809_at 6.433514 4.167462 48.52518 2.042404e-13 1.394247e-09 6828 205893_at 4.480463 3.544350 40.59376 1.253534e-12 6.845801e-09 6222 204882_at -5.351976 6.512018 -34.70239 6.154318e-12 2.400711e-08 B gene.symbols 6147 20.25563 HOXB7 7292 19.44681 ALDH1A2 8741 18.96282 GPR37 26309 18.70767 IL1RL1 6828 17.75331 NLGN1 6222 16.77196 ARHGAP25 > write.table(results, "results.txt", sep="\t", quote=FALSE) There is a follow-on article from Simon Cockell on analyising the biological significance of results and another article by Colin Gillespie on visualising data from expression analysis as volcano plots. Tags: , , , ## 29 Comments 1. Analysing microarray data in BioConductor: figures and code » The Bioinformatics Knowledgeblog June 20, 2011 @ 9:58 am […] Analysing microarray data in BioConductor […] 2. Downstream functional analysis of an ‘omics experiment using freely available tools » The Bioinformatics Knowledgeblog June 20, 2011 @ 4:45 pm […] will be introduced. We will focus on the downstream analysis of the gene list produced by the Analysing microarray data in R and BioConductor tutorial. Hypergeometric tests for over-representation of functional […] 3. Volcano plots of microarray data » The Bioinformatics Knowledgeblog June 21, 2011 @ 2:14 pm […] Volcano plots of microarray data Transcriptomics, Tutorials Add comments Jun 212011 To run the R commands in this post, you should first work through Analysing microarry data in BioConductor. […] 4. Andreas June 22, 2011 @ 5:56 am HI Daniel, Thank you for your wonderful Tutorial. I try to analyze some data, but even when I try to test your example I will get the following warnings and no data was downloaded. > getGEOSuppFiles(“GSE20986”) [1] “ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE20986/” Warnmeldungen: 1: In download.file(file.path(url, i), destfile = file.path(storedir, : download had nonzero exit status 2: In download.file(file.path(url, i), destfile = file.path(storedir, : download had nonzero exit status I am using Win Vista (german language) with R 2.13.0. Even with google I couldn’t find any solution on this problem. I would be very thankful, if you could help me. Thanks in advance Andreas 5. Review of “Analysing microarray data in BioConductor” » The Bioinformatics Knowledgeblog June 23, 2011 @ 10:58 am […] article describes a statistical analysis of the GSE20986 data set. This data set consists of four […] 6. guqi August 22, 2011 @ 12:40 am HI Daniel, Thank you for your Tutorial.But I am facing a problem. boxplot(celfiles.gcrma,col=cols) Error in x[!xna] : object of type ‘S4’ is not subsettable In addition: Warning messages: 1: In is.na(x) : is.na() applied to non-(list or vector) of type ‘S4’ 2: In is.na(x) : is.na() applied to non-(list or vector) of type ‘S4’ 7. wkwok September 14, 2011 @ 9:11 pm Hi, I’m also having the same problem as guqi. I was wondering if anyone found a solution or a work around yet? 8. Kumar March 13, 2012 @ 5:28 am Hi I am also facing the same problem as wkwok and guqi. I hope author should be able to sort it out. Regards 9. Sylvie April 9, 2012 @ 8:05 pm Hello Daniel… I am running R 2.10 under Linux and I have the same error message as guqi… Have you found a solution? 10. Ulrik April 13, 2012 @ 9:40 am I had the same problem: >hist(celfiles.gcrma, col=cols) Error in hist.default(celfiles.gcrma, col = cols) : ‘x’ must be numeric > boxplot(celfiles.gcrma, col=cols) Error in x[!xna] : object of type ‘S4’ is not subsettable In addition: Warning messages: 1: In is.na(x) : is.na() applied to non-(list or vector) of type ‘S4’ 2: In is.na(x) : is.na() applied to non-(list or vector) of type ‘S4’ I loaded affyPLM and problems were solved 11. sarita April 21, 2012 @ 1:28 pm when i run to programm to on my sample data.i got a geo file then i run a commpressed file into uncompressed file ,i run on this command convert tar into untar the error appeared untar(“GSE16070/GSE16070_RAW.tar”, exdir=”data”) Error in getOct(block, 100, 8) : invalid octal digit pls,what is solution to remove this error.pls sir, help me. 12. Jane May 24, 2012 @ 4:06 am Hi Daniel, I’m trying to follow your example before working with a data set that I’m interested as I’m new to working with analysing data in bioconductor. I’ve created the phenodata.txt file through the shell: ls unzip/*.CEL > unzip/phenodata.txt and then when working in R and using the command: celfiles<-read.affy(covdesc="phenodata.txt", path="unzip") it comes back with this : Error: the following are not valid files: unzip/unzip/GSM524663.CEL unzip/unzip/GSM524664.CEL unzip/unzip/GSM524665.CEL unzip/unzip/GSM524666.CEL unzip/unzip/GSM524667.CEL unzip/unzip/GSM524668.CEL unzip/unzip/GSM524669.CEL unzip/unzip/GSM524670.CEL unzip/unzip/GSM524671.CEL unzip/unzip/GSM524672.CEL unzip/unzip/GSM524673.CEL Any advice in this roadblock would be very much appreciated! 13. Jane May 24, 2012 @ 4:48 am One more question: I’m using the sample data set still as a run through and am returned with “argument is not a matrix” when I try to run the below assignment. distance <- dist(t(celfiles.gcrma$eset),method="maximum")
Error in t.default(celfiles.gcrma$eset) : argument is not a matrix 14. Sanchari May 26, 2012 @ 7:21 pm Hi Daniel, I am a newbie with R. There was a problem in uncompressing the GSE tar file of the above dataset. So I downloaded the GSMs individually in the working directory. I am confused regarding which command I should execute next ? Thanks ! 15. Sanchari May 28, 2012 @ 6:33 pm Hi Daniel, Sorry to bother again. I am getting the following error when I am typing: > distance <- dist(t(celfiles.gcrma$eset),method="maximum")"
Error in t.default(celfiles.gcrma\$eset) : argument is not a matrix"

Do you have any idea why I am getting this ?

Thanks!

16. Sanchari

May 31, 2012 @ 7:15 am

Thanks Daniel. I was able to complete the tutorial.

17. Oskari

August 28, 2012 @ 7:44 am

Hello,
Very nice and well explained tutorial.

I am mostly a newbie at this sort of thing, so here’s a sort of a question conserning the use of nsFilter after normalization. nsFilter documents references R. Bourgon et al. (www.pnas.org/cgi/doi/10.1073/pnas.0914005107) in which article the use of filtering is not recommended when using limma with small sample sizes. As this example has three samples per type of different sample, could you elaborate a bit on the usage of filtering (of course I am not exactly sure if the article refers to total sample size or sample size per different type of sample).

18. Omax

September 12, 2012 @ 7:33 pm

Hi,Mr.Daniel, I was quite impressed by the way of presenting the tutorial.Thanks a lot.

19. Feseha

January 6, 2013 @ 12:20 pm

Hi Daniel;
Thanks for an excellent tutorial.
It run without a problem on Windows7.
Since I have GnuWin32 package on my PATH,
even the linux commands worked fine!
Thank you and Keep up the good work.
Feseha A.

20. Feseha A.

January 6, 2013 @ 12:31 pm

Hi Daniel;
a package prior to starting the tutorial and so the
link to biocLite was open. As I progressed in the
install packages it needed with one exception and
even that I was able to install without interrupting
the tutorial by manually invoking on the ptompt like:
>biocLite(“limma”)
… Just in case it hepls future “students”
Thanks again
Feseha

21. Leandro Mattos

January 28, 2013 @ 8:44 pm

Hy Daniel, your tutorial is really very good. Packages in this tutorial are run in several papers published and the step for step of tutorial is very explained. Congratulations!
The same time, I would like of ask you if you have some tutorial about analysis of transcriptional regulatory networks ou Protein-protein interaction using igraph package in R. Because, I’ didn’t found any tutorial step-step explained, please send for my E-mail: mattoslmp@yahoo.com.br, or if it is possible to publish your bloog.