on June 20, 2011 by Simon Cockell in Transcriptomics, Comments (4)

Gene Set Enrichment Analysis

Introduction

The enrichment analysis explored in the previous article can be powerful provided the mechanism used for identifying candidate genes finds all of the genes that are interesting in a given system (which itself relies upon the assumption that all interesting genes will alter their expression levels between given conditions). This type of analysis also ignores what we already know about the biology of a system, and identifies enriched terms with no regard for how they are related to one another. Gene Set Enrichment Analysis operates over the whole of the data from a microarray experiment, so no data points are lost prior to functional analysis. It also makes uses of prior knowledge to group sets of genes by related function, meaning that over-representation of a group of functions is considered, as opposed to isolated individual functions. It is suggested this could lead to more biologically meaningful results.

Tools

GSEA

The most commonly used tool for Gene Set Enrichment Analysis is GSEA, developed by a team at the Broad Institute (10.1073/pnas.0506580102).

GSEA requires the full data from a microarray experiment in a specific format. There are instructions for generating this format on the GSEA Wiki. We will analyse the microarray data used in this Microarray analysis article, which compares gene expression in unpassaged, proliferating HUVEC and human iris, retinal and choroidal microvascular endothelial cells. We are going to get the data from R, by first following the instructions in the aforementioned article for importing the microarray data, until the point where you have the `celfiles.exprs` object containing all the expression values. Then print this object out to a tab-delimited file.

[sourcecode language=”r”]
write.table(celfiles.exprs, "celfiles.txt", sep="\t", quote=FALSE)
[/sourcecode]

Open celfiles.txt in Excel, to add the additional information required by GSEA.

  • Shift the column names 1 cell to the right (so they are over the correct columns).
  • Add a column to the spreadsheet between columns A and B.
  • Add 2 rows to the top of the spreadsheet.
  • Enter ‘#1.2’ in cell A1.
  • Enter the number of probesets in cell B1 (54,675), and the number of samples in cell B2 (12).
  • Type ‘NAME’; in cell A3 and ‘Description’ in cell B3.
  • Enter ‘NA’ or similar in every other cell in column B.

Your Excel sheet should now look something like Figure 1.

Excel with formatted GCT file displaying

Figure 1 - Formatted GCT file for use in GSEA (Click for full size).

Make sure you save the spreadsheet as tab-delimited text. This is the GCT format required for data input to GSEA.

The next file needed is the CLS file describing the phenotype of the samples. Create a new text file, and enter the following information:

12 4 1
#IRIS RETINA CHOROID HUVEC
IRIS RETINA RETINA IRIS RETINA IRIS CHOROID CHOROID CHOROID HUVEC HUVEC HUVEC

Save the file as phenotype.cls or similar.

Finally, GSEA requires the gene set database, but it provides a number of these for immediate use.

  • Open GSEA by running it from the link provided on the GSEA website.
  • Click the ‘Load Data’ icon and load in the GCT and CLS files we created above. Now click ‘Run GSEA’.
  • Select the Biological Process gene sets from the list that appears when you click the […] icon to the right of the ‘Gene Sets Database’ field in the form that appears: gseaftp.broadinstitute.org://pub/gsea/gene_sets/c5.bp.v2.5.symbols.gmt.
  • Pick HUVEC_VS_CHOROID from the list of comparisons that appears when you click the […] icon next to ‘Phenotype labels’
  • Select ‘gseaftpbroadinstitute.org://pub/gsea/annotations/HG_U133_Plus_2.chip’ from the list of ‘Chip platform(s)’.
  • Now click ‘Run’.

When the analysis has finished running, you will get a ‘Success’ message in the left hand panel of GSEA. Click this to open a summary of the results in your browser.

The Enrichment Score (ES) is the means by which a given set of terms is assessed in GSEA – the ES shows by how much a gene set is over-represented in a ranked list of genes. GSEA walks down the ranked list of genes, increasing the enrichment score if the gene is present in a gene set, and decreasing the score if it is not. The magnitude of the increase or decrease is determined by how well gene expression correlates with phenotype.

The results of the GSEA analysis can be browsed online.

Other Tools

Parametric Analysis of Gene Set Enrichment (PAGE) (10.1186/1471-2105-6-144). PAGE uses the normal distribution for statistical inference (as opposed to the non-parametric Kolmogorov-Smirnov used by GSEA), and so is considerably lighter on computational resources. PAGE is reported to improve analysis of minimally changed gene expression profiles.

Irizarry et al (10.1177/0962280209351908) report a similarly parametric approach, using the one-sample -test and test to perform gene set enrichment analysis. However, there is currently no software publicly available to implement this analysis.

Discussion

Gene Set Enrichment Analysis is potentially more biologically meaningful, and more robust than the gene list functional over-representation analysis presented in the previous article. The consideration of the entire set of data, and of sets of functionally related gene annotations, rather than isolated terms, make it more powerful than the more simple analysis. However, it is not a broadly applicable approach to downstream ‘omics data analysis, since it requires microarray data as input, and so is incompatible with proteomics data. It is also conceptually much more complicated than hypergeometric analysis, and more difficult to perform.

Bibliography

Tags: , , ,

4 Comments

  1. Jan Aerts

    June 21, 2011 @ 2:46 pm

    Hi Simon,

    Even though you refer to the other blog post, it might be good to make this one more stand on its own and actually briefly explain the dataset itself again (what data does this tutorial refer to). For the commands to load that data into R people can still be redirected to the other post. Another suggestion to make it more standalone is to add some indication of the file formats that we’re working with here. For example, it’d be nice if there would be a (part of a) screenshot of the celfiles.txt file in Excel after modification.

    Good post.
    jan.

  2. Simon Cockell

    June 21, 2011 @ 3:43 pm

    Thanks Jan, I have incorporated your suggested improvements.

  3. Hussein

    August 26, 2011 @ 9:30 am

    Dear Dr Cockell,
    Actually, I am performing microarray analysis on a treatment that causes minimal changes when compared to the controls. After analyzing a set of arrays, I found no individual genes that were differentially expressed (statistics used 3-way ANOVA). I submitted the data to GSEA and pathways related to the treatment showed to be deregulated. I am not sure if I can consider the genes showing up as differentially expressed or not? Do you think that further confirmation is needed (Q-RT-PCR for example), but I am concerned about this approach as it might be not sensitive to small changes.
    Greetings and thanks in advance!
    Very Interesting Blog!
    Hussein

  4. Jonathan C

    January 21, 2013 @ 3:37 pm

    Hi Simon,

    I find this and may of the articles here very useful so firstly I would like to thank you.
    My question relates to the formatting of the .gct file. I understand from the previous article that using affy library you can read in the cel file and gather the expression values for each probe on the chip. In my case I have previously fRMA corrected values from insilicoDB and have imported the data from as an ExpressionSet. In this case I get the samples perfectly across the top in each column however the rows instead of being the probe IDs i have the genes. I am pretty sure this is causing the Genepattern GSEA tool to fail as the gene names are not matching the expected Probe IDs for the specific chip platform the data has been run on.

    my question is, How am I able to back track in R to generate the exprs with Probe ID names rather than the gene names that are occurring now?
    Or will my best option be to go back and take the raw .CEL files and start from scratch?

    Thanks in advance for your help,
    Jonathan

Leave a comment

Login