Gene Set Enrichment Analysis
The enrichment analysis explored in the previous article can be powerful provided the mechanism used for identifying candidate genes finds all of the genes that are interesting in a given system (which itself relies upon the assumption that all interesting genes will alter their expression levels between given conditions). This type of analysis also ignores what we already know about the biology of a system, and identifies enriched terms with no regard for how they are related to one another. Gene Set Enrichment Analysis operates over the whole of the data from a microarray experiment, so no data points are lost prior to functional analysis. It also makes uses of prior knowledge to group sets of genes by related function, meaning that over-representation of a group of functions is considered, as opposed to isolated individual functions. It is suggested this could lead to more biologically meaningful results.
The most commonly used tool for Gene Set Enrichment Analysis is GSEA, developed by a team at the Broad Institute (10.1073/pnas.0506580102).
GSEA requires the full data from a microarray experiment in a specific format. There are instructions for generating this format on the GSEA Wiki. We will analyse the microarray data used in this Microarray analysis article, which compares gene expression in unpassaged, proliferating HUVEC and human iris, retinal and choroidal microvascular endothelial cells. We are going to get the data from R, by first following the instructions in the aforementioned article for importing the microarray data, until the point where you have the `celfiles.exprs` object containing all the expression values. Then print this object out to a tab-delimited file.
write.table(celfiles.exprs, "celfiles.txt", sep="\t", quote=FALSE)
Open celfiles.txt in Excel, to add the additional information required by GSEA.
- Shift the column names 1 cell to the right (so they are over the correct columns).
- Add a column to the spreadsheet between columns A and B.
- Add 2 rows to the top of the spreadsheet.
- Enter ‘#1.2’ in cell A1.
- Enter the number of probesets in cell B1 (54,675), and the number of samples in cell B2 (12).
- Type ‘NAME’; in cell A3 and ‘Description’ in cell B3.
- Enter ‘NA’ or similar in every other cell in column B.
Your Excel sheet should now look something like Figure 1.
Make sure you save the spreadsheet as tab-delimited text. This is the GCT format required for data input to GSEA.
The next file needed is the CLS file describing the phenotype of the samples. Create a new text file, and enter the following information:
12 4 1 #IRIS RETINA CHOROID HUVEC IRIS RETINA RETINA IRIS RETINA IRIS CHOROID CHOROID CHOROID HUVEC HUVEC HUVEC
Save the file as phenotype.cls or similar.
Finally, GSEA requires the gene set database, but it provides a number of these for immediate use.
- Open GSEA by running it from the link provided on the GSEA website.
- Click the ‘Load Data’ icon and load in the GCT and CLS files we created above. Now click ‘Run GSEA’.
- Select the Biological Process gene sets from the list that appears when you click the […] icon to the right of the ‘Gene Sets Database’ field in the form that appears:
- Pick HUVEC_VS_CHOROID from the list of comparisons that appears when you click the […] icon next to ‘Phenotype labels’
- Select ‘gseaftpbroadinstitute.org://pub/gsea/annotations/HG_U133_Plus_2.chip’ from the list of ‘Chip platform(s)’.
- Now click ‘Run’.
When the analysis has finished running, you will get a ‘Success’ message in the left hand panel of GSEA. Click this to open a summary of the results in your browser.
The Enrichment Score (ES) is the means by which a given set of terms is assessed in GSEA – the ES shows by how much a gene set is over-represented in a ranked list of genes. GSEA walks down the ranked list of genes, increasing the enrichment score if the gene is present in a gene set, and decreasing the score if it is not. The magnitude of the increase or decrease is determined by how well gene expression correlates with phenotype.
The results of the GSEA analysis can be browsed online.
Parametric Analysis of Gene Set Enrichment (PAGE) (10.1186/1471-2105-6-144). PAGE uses the normal distribution for statistical inference (as opposed to the non-parametric Kolmogorov-Smirnov used by GSEA), and so is considerably lighter on computational resources. PAGE is reported to improve analysis of minimally changed gene expression profiles.
Irizarry et al (10.1177/0962280209351908) report a similarly parametric approach, using the one-sample -test and test to perform gene set enrichment analysis. However, there is currently no software publicly available to implement this analysis.
Gene Set Enrichment Analysis is potentially more biologically meaningful, and more robust than the gene list functional over-representation analysis presented in the previous article. The consideration of the entire set of data, and of sets of functionally related gene annotations, rather than isolated terms, make it more powerful than the more simple analysis. However, it is not a broadly applicable approach to downstream ‘omics data analysis, since it requires microarray data as input, and so is incompatible with proteomics data. It is also conceptually much more complicated than hypergeometric analysis, and more difficult to perform.