GSEA (v17)

Gene Set Enrichment Analysis

Author: Aravind Subramanian, Pablo Tamayo, Broad Institute

Contact:

gp-help@broadinstitute.org

Algorithm Version:

Description

Evaluates a genomewide expression profile and determines whether a priori defined sets of genes show statistically significant, cumulative changes in gene expression that are correlated with a phenotype.  The phenotype may be categorical (e.g., tumor vs. normal) or continuous (e.g., a numerical profile across all samples in the expression dataset).

Summary

Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data.  It evaluates cumulative changes in the expression of groups of multiple genes defined based on prior biological knowledge.  It first ranks all genes in a data set, then calculates an enrichment score for each gene set, which reflects how often members of that gene set occur at the top or bottom of the ranked data set (for example, in expression data, in either the most highly expressed genes or the most underexpressed genes).

Introduction

Microarray experiments profile the expression of tens of thousands of genes over a number of samples that can vary from as few as two to several hundred. One common approach to analyzing these data is to identify a limited number of the most interesting genes for closer analysis. This usually means identifying the genes with the largest changes in expression based on a t-test or similar statistic, and then picking a significance cutoff that trims the list down to a handful of genes for further research.

Gene Set Enrichment Analysis (GSEA) takes an alternative approach to analyzing genomic data: it focuses on cumulative changes in the expression of multiple genes as a group, which shifts the focus from individual genes to groups of genes.  By looking at several genes at once, GSEA can identify pathways whose genes each change by only a small amount, but in a coordinated way.  This approach helps reflect many of the complexities of co-regulation and modular expression.

GSEA therefore takes as input two distinct types of data for its analysis:

  • the gene expression data set
  • gene sets, where each set consists of a list of genes whose grouping together has some biological meaning; these gene sets can be drawn from the Molecular Signatures Database (MSigDB) or from other sources

The GSEA GenePattern module uses either categorical or continuous phenotype data for its analysis.  In the case of a categorical phenotype, a dataset would contain two different classes of samples, such as "tumor" and "normal."  In the case of a continuous phenotype, a dataset would contain a numerical value for each sample.  Examples of numerical profiles include the expression level of a specific gene or a measure of cell viability over the course of a time series experiment. The GSEA desktop application, available on the GSEA website, has additional functionalities.  For instance, the GSEA desktop application can conduct an enrichment analysis against a ranked list of genes, or analyze the leading-edge subsets within each gene set.  Many of these capabilities are also available in separate GP modules (see GSEAPreranked and GSEALeadingEdgeViewer). 

Algorithm

GSEA first ranks the genes based on a measure of each gene's differential expression with respect to the two phenotypes (for example, tumor versus normal) or correlation with a continuous phenotype.  Then the entire ranked list is used to assess how the genes of each gene set are distributed across the ranked list.  To do this, GSEA walks down the ranked list of genes, increasing a running-sum statistic when a gene belongs to the set and decreasing it when the gene does not.  A simplified example is shown in the following figure.

The enrichment score (ES) is the maximum deviation from zero encountered during that walk.  The ES reflects the degree to which the genes in a gene set are overrepresented at the top or bottom of the entire ranked list of genes.  A set that is not enriched will have its genes spread more or less uniformly through the ranked list.  An enriched set, on the other hand, will have a larger portion of its genes at one or the other end of the ranked list. The extent of enrichment is captured mathematically as the ES statistic.
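The walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the module's Java implementation: the function name and argument layout are assumptions made for the example, and the step sizes follow the weighted running sum described in the 2005 PNAS paper, where the exponent p corresponds to the scoring scheme parameter (p=0 for classic, p=1 for weighted).

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return the signed maximum deviation from zero of the running sum.

    ranked_genes -- gene identifiers, already sorted by the ranking metric
    scores       -- ranking metric values, in the same order
    gene_set     -- identifiers of the genes in the set being tested
    p            -- weight exponent (0 = unweighted, 1 = weighted by |score|)
    """
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_miss = hits.count(False)
    # Sum of hit weights, used so that all upward steps add up to 1.
    hit_weight = sum(abs(s) ** p for s, is_hit in zip(scores, hits) if is_hit)
    if n_miss == 0 or hit_weight == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running = 0.0
    es = 0.0
    for s, is_hit in zip(scores, hits):
        if is_hit:
            running += abs(s) ** p / hit_weight  # step up at a set member
        else:
            running -= 1.0 / n_miss              # step down otherwise
        if abs(running) > abs(es):
            es = running                         # keep the largest deviation
    return es


# Toy data (hypothetical): six genes ranked from most up to most down.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
top_es = enrichment_score(genes, metric, {"g1", "g2"})     # set at the top
bottom_es = enrichment_score(genes, metric, {"g5", "g6"})  # set at the bottom
```

A set concentrated at the top of the list yields a positive ES, and a set concentrated at the bottom yields a negative ES.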

Next, GSEA estimates the statistical significance of the ES by a permutation test.  To do this, GSEA creates a version of the data set with phenotype labels randomly scrambled, produces the corresponding ranked list, and recomputes the ES of the gene set for this permuted data set. GSEA repeats this many times (1000 is the default) and produces an empirical null distribution of ES values.  Alternatively, permutations may be generated by creating “random” gene sets (genes randomly selected from those in the expression dataset) of equal size to the gene set under analysis.

The nominal p-value estimates the statistical significance of a single gene set's enrichment score, based on the permutation-generated null distribution.  It is the probability, under that null distribution, of obtaining an ES at least as strong as the one observed for your experiment.
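The permutation test and nominal p-value can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: gene_set_null, nominal_p_value, and the score_fn callable (which maps a gene set to its ES) are hypothetical names introduced here. The null distribution is built from random gene sets of equal size, and the p-value uses only the portion of the null distribution that has the same sign as the observed ES, as described in the 2005 PNAS paper.

```python
import random


def gene_set_null(ranked_genes, set_size, score_fn, n_perm=1000, seed=0):
    """Build a null distribution by scoring random gene sets of equal size.

    score_fn is a hypothetical callable that maps a gene set to its ES.
    """
    rng = random.Random(seed)  # fixed seed gives reproducible permutations
    return [score_fn(rng.sample(ranked_genes, set_size)) for _ in range(n_perm)]


def nominal_p_value(observed_es, null_es):
    """Fraction of same-sign null ES values at least as extreme as observed."""
    if observed_es >= 0:
        same_sign = [es for es in null_es if es >= 0]
        extreme = [es for es in same_sign if es >= observed_es]
    else:
        same_sign = [es for es in null_es if es < 0]
        extreme = [es for es in same_sign if es <= observed_es]
    if not same_sign:
        raise ValueError("null distribution has no values with the observed sign")
    return len(extreme) / len(same_sign)


# Toy null distribution (hypothetical values, far fewer than the default 1000).
null = [0.1, 0.6, -0.3, 0.5]
p_pos = nominal_p_value(0.5, null)  # 2 of the 3 non-negative null values are >= 0.5
```

With few permutations the p-value is coarse, which is one reason a full run of 1000 permutations is recommended.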

Typically, GSEA is run with a large number of gene sets.  For example, the MSigDB collection and subcollections each contain hundreds to thousands of gene sets.  This has implications when comparing enrichment results for the many sets:

  • The ES must be adjusted to account for differences in the gene set sizes and in correlations between gene sets and the expression data set. The resulting normalized enrichment scores (NES) allow you to compare the analysis results across gene sets.
  • The nominal p-values need to be corrected to adjust for multiple hypothesis testing. For a large number of sets (rule of thumb: more than 30), we recommend paying attention to the False Discovery Rate (FDR) q-values: consider a set significantly enriched if its NES has an FDR q-value below 0.25.
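As a rough illustration of the normalization step, the sketch below divides an observed ES by the mean magnitude of the null ES values that share its sign, in the spirit of the meandiv normalization mode. The function name is hypothetical, and the exact computation used by the module is the one described under Normalized Enrichment Score (NES) in the GSEA User Guide; the FDR calculation is not shown.

```python
def normalized_es(observed_es, null_es):
    """Scale an ES by the mean magnitude of same-sign null ES values."""
    positive = observed_es >= 0
    same_sign = [es for es in null_es if (es >= 0) == positive]
    if not same_sign:
        raise ValueError("null distribution has no values with the observed sign")
    mean_magnitude = sum(abs(es) for es in same_sign) / len(same_sign)
    if mean_magnitude == 0:
        raise ValueError("same-sign null values are all zero")
    # Dividing by a magnitude keeps the sign of the observed ES.
    return observed_es / mean_magnitude


# Toy values (hypothetical): the negative null values average 0.25 in magnitude.
nes = normalized_es(-0.5, [0.2, -0.25, -0.25])
```

Because each gene set is scaled by its own null distribution, the resulting scores are comparable across sets of different sizes.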

For more information, see http://www.broadinstitute.org/gsea.

Known Issues

File names

Input expression datasets with the character '-' or spaces in their file names cause GSEA to fail with an error.

CLS Files

The GSEA GenePattern module interprets the sample labels in categorical CLS files by their order of appearance, rather than via their numerical value, unlike some other GenePattern modules. For example, in the CLS file below:

13 2 1

# resistant sensitive

1 1 1 1 1 1 1 1 0 0 0 0 0

Most other GenePattern modules would interpret the first 8 samples to be sensitive and the remaining 5 to be resistant. However, GSEA assigns resistant to the first 8 samples and sensitive to the rest. This is because GSEA assigns the first name in the second line to the first symbol found on the third line.

If the sample labels are in numerical order, as below, no difference in behavior will be noted.

13 2 1

# resistant sensitive

0 0 0 0 0 1 1 1 1 1 1 1 1
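The difference between the two interpretations can be made concrete with a small parser. This is a sketch for illustration only: parse_categorical_cls and its by_appearance flag are hypothetical names, not part of GenePattern, and the parser handles only the three-line categorical layout shown above.

```python
def parse_categorical_cls(text, by_appearance=True):
    """Return one class name per sample from a categorical CLS file.

    by_appearance=True  -- the first name goes to the first symbol seen (GSEA)
    by_appearance=False -- symbol i is taken as an index into the names
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    n_samples, n_classes = (int(tok) for tok in lines[0].split()[:2])
    names = lines[1].lstrip("#").split()
    symbols = lines[2].split()
    if len(symbols) != n_samples or len(names) != n_classes:
        raise ValueError("CLS header does not match its contents")

    if by_appearance:
        seen = []
        for sym in symbols:
            if sym not in seen:
                seen.append(sym)  # record symbols in order of first appearance
        if len(seen) != len(names):
            raise ValueError("number of distinct symbols differs from names")
        name_of = dict(zip(seen, names))
    else:
        name_of = {str(i): name for i, name in enumerate(names)}
    return [name_of[sym] for sym in symbols]


cls_text = """13 2 1
# resistant sensitive
1 1 1 1 1 1 1 1 0 0 0 0 0
"""
gsea_labels = parse_categorical_cls(cls_text)                      # order of appearance
other_labels = parse_categorical_cls(cls_text, by_appearance=False)  # numeric value
```

For the first example file, the order-of-appearance reading labels the first 8 samples resistant, while the numeric reading labels them sensitive. For the second file, both readings agree.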

References

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005;102(43):15545-15550.

Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34:267-273.

GSEA User Guide: http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html

Parameters

Name Description
expression dataset * This is a file in either GCT or RES format that contains the expression dataset. 
gene sets database

This drop-down allows you to select gene sets from the Molecular Signatures Database (MSigDB) on the GSEA website.  This provides access to only the most current version of MSigDB. 

If you want to use files from an earlier version of MSigDB, you will need to download that file from the archived releases on the website and specify it in the gene sets database file parameter.

If you do not select an option here, you MUST upload a file in the gene sets database file parameter.
gene sets database file

Allows you to upload a gene set file not available in the current version of MSigDB (and thus not listed in the gene sets database parameter drop-down).  This file must be in GMT, GMX, or GRP format.
number of permutations * Specifies the number of permutations to perform in assessing the statistical significance of the enrichment score. It is best to start with a small number, such as 10, in order to check that your analysis will complete successfully (e.g., ensuring that you have gene sets that satisfy the minimum and maximum size requirements and that collapsing probes to gene symbols works correctly). After the analysis completes successfully, run it again with a full set of permutations. The recommended number of permutations is 1000. Default: 1000
phenotype labels *

A phenotype label file defines categorical or continuous-valued phenotypes and for each sample in your expression dataset assigns a label or numerical value for the phenotype.  This is a tab-delimited text file in CLS format.

A categorical phenotype CLS file should contain only two labels, such as tumor and normal.

A continuous phenotype CLS file may define one or more continuous-valued phenotypes.  Each phenotype definition includes a profile, assigning a numerical value to each sample in the expression dataset.

GSEA interprets CLS files differently than many GenePattern modules.  See CLS Files under Known Issues for details.
target profile

Name of the target phenotype for a continuous phenotype CLS. This parameter must be left blank in the case of a categorical CLS file.
collapse dataset *

Select true to have GSEA collapse each probe set in the expression dataset into a single line of data for the gene, which is identified by its HUGO gene symbol. Be sure that your gene sets and array annotations also use gene symbols as the gene identifier format.

Select false to use your expression dataset as is, with its native feature identifiers. When you select this option, the chip annotation file (chip platform parameter) is optional and you must specify a gene set file (gene sets database file parameter) that identifies genes using the same feature (gene or probe) identifiers as are used in your expression dataset.

Default: true
permutation type *

Type of permutations to perform in assessing the statistical significance of the enrichment score. Options are:

  • phenotype (default): Random phenotypes are created by shuffling the phenotype labels on the samples. For each random phenotype, GSEA ranks the genes and calculates the enrichment score for all gene sets. These enrichment scores are used to create a distribution from which the significance of the actual enrichment score (for the actual expression data and gene set) is calculated. This is the recommended method when there are at least 7 samples in each phenotype.
  • gene_set: Random gene sets, size matched to the actual gene set, are created and their enrichment scores calculated. These enrichment scores are used to create a null distribution from which the significance of the actual enrichment score (for the actual gene set) is calculated. This method is useful when you have too few samples to do phenotype permutations (that is, when you have fewer than 7 samples in any phenotype).

Phenotype permutation is recommended whenever possible. The phenotype permutation shuffles the phenotype labels on the samples in the dataset; it does not modify gene sets. Therefore, the correlations between the genes in the dataset and the genes in a gene set are preserved across phenotype permutations. The gene_set permutation creates random gene sets; therefore, the correlations between the genes in the dataset and the genes in the gene set are not preserved across gene_set permutations. Preserving the gene-to-gene correlation across permutations provides a more biologically reasonable (more stringent) assessment of significance.
chip platform

This drop-down allows you to specify the chip annotation file for the expression array, which lists each probe on a chip and its matching HUGO gene symbol.  This parameter is required if collapse dataset is set to true.  The chip files listed here are from the GSEA website: http://www.broadinstitute.org/gsea/downloads.jsp.  If your chip annotation file is not listed here, provide it (in CHIP format) using 'Upload your own file'.

scoring scheme *

The enrichment statistic.  This parameter affects the running-sum statistic used for the enrichment analysis, controlling the value of p used in the enrichment score calculation.  Options are:

  • classic: p=0
  • weighted (default): p=1; a running sum statistic that is incremented by the absolute value of the ranking metric when a gene belongs to the set (see the 2005 PNAS paper for details)
  • weighted_p2: p=2
  • weighted_p1.5: p=1.5
metric for ranking genes *

GSEA ranks the genes in the expression dataset and then analyzes that ranked list of genes. Use this parameter to select the metric used to score and rank the genes. The default metric for ranking genes is the signal-to-noise ratio. To use this metric, your expression dataset must contain at least three (3) samples for each phenotype.

For descriptions of the ranking metrics, see Metrics for Ranking Genes in the GSEA User Guide.
gene list sorting mode * Specifies whether to sort the genes using the real (default) or absolute value of the gene-ranking metric score.
gene list ordering mode * Specifies the direction in which the gene list should be ordered (ascending or descending).
max gene set size * After filtering from the gene sets any gene not in the expression dataset, gene sets larger than this are excluded from the analysis. Default: 500
min gene set size * After filtering from the gene sets any gene not in the expression dataset, gene sets smaller than this are excluded from the analysis. Default: 15
collapsing mode for probe sets with more than one match *

Collapsing mode for sets of multiple probes for a single gene. Used only when the collapse dataset parameter is set to true. Select the expression values to use for the single probe that will represent all probe sets for the gene. Options are:

  • Max_probe (default): For each sample, use the maximum expression value for the probe set.  That is, if there are three probes that map to a single gene, the expression value that will represent the collapsed probe set will be the maximum expression value from those three probes.
  • Median_of_probes: For each sample, use the median expression value for the probe set.
normalization mode *

Method used to normalize the enrichment scores across analyzed gene sets. Options are:

  • meandiv (default): GSEA normalizes the enrichment scores as described in Normalized Enrichment Score (NES) in the GSEA User Guide.
  • None: GSEA does not normalize the enrichment scores.
randomization mode *

Method used to randomly assign phenotype labels to samples for phenotype permutations. ONLY used for phenotype permutations. Options are:

  • no_balance (default): Permutes labels without regard to number of samples per phenotype. For example, if your dataset has 12 samples in phenotype_a and 10 samples in phenotype_b, any permutation of phenotype_a has 12 samples randomly chosen from the dataset.
  • equalize_and_balance: Permutes labels by equalizing the number of samples per phenotype and then balancing the number of samples contributed by each phenotype. For example, if your dataset has 12 samples in phenotype_a and 10 samples in phenotype_b, any permutation of phenotype_a has 10 samples: 5 randomly chosen from phenotype_a and 5 randomly chosen from phenotype_b.
omit features with no symbol match * Used only when collapse dataset is set to true. By default (true), the new dataset excludes probes/genes that have no gene symbols. Set to false to have the new dataset contain all probes/genes that were in the original dataset. 
make detailed gene set report * Create detailed gene set report (heat map, mountain plot, etc.) for each enriched gene set. Default: true
median for class metrics * Specifies whether to use the median of each class, instead of the mean, in the metric for ranking genes. Default: false
number of markers * Number of features (genes or probes) to include in the butterfly plot in the Gene Markers section of the gene set enrichment report. Default: 100
plot graphs for the top sets of each phenotype * Generates summary plots and detailed analysis results for the top x gene sets in each phenotype, where x is 20 by default. The top gene sets are those with the largest normalized enrichment scores. Default: 20
random seed * Seed used to generate a random number for phenotype and gene_set permutations. Timestamp is the default. Using a specific integer-valued seed generates consistent results, which is useful when testing software.
save random ranked lists * Specifies whether to save the random ranked lists of genes created by phenotype permutations. When you save random ranked lists, for each permutation, GSEA saves the rank metric score for each gene (the score used to position the gene in the ranked list). Saving random ranked lists is very memory intensive; therefore, this parameter is set to false by default. 
output file name * Name of the output file. The name cannot include spaces. Default: <expression.dataset_basename>.zip

* - required
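To illustrate the default choice for the metric for ranking genes parameter, the following sketch computes a signal-to-noise score, taken here as the difference of the class means divided by the sum of the class standard deviations, and ranks genes by it. The function names and the dictionary layout are assumptions made for the example. The sketch uses the sample standard deviation and omits any adjustments the GSEA implementation applies, so see Metrics for Ranking Genes in the GSEA User Guide for the exact definition.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for one gene's expression values."""
    noise = stdev(class_a) + stdev(class_b)  # sample standard deviations (assumption)
    if noise == 0:
        raise ValueError("both classes have zero variance")
    return (mean(class_a) - mean(class_b)) / noise


def rank_genes(expression, descending=True):
    """Rank genes by signal-to-noise.

    expression maps gene -> (class_a values, class_b values).
    Returns a list of (gene, score) pairs in ranked order.
    """
    scored = [(gene, signal_to_noise(a, b)) for gene, (a, b) in expression.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=descending)


# Toy dataset (hypothetical): three samples per phenotype, as the metric requires.
expression = {
    "down": ([1, 2, 3], [10, 12, 14]),
    "up": ([10, 12, 14], [1, 2, 3]),
    "flat": ([5, 6, 7], [5, 6, 7]),
}
ranked = rank_genes(expression)
```

Genes that are higher in the first phenotype land at the top of the descending list, and genes that are higher in the second phenotype land at the bottom.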

Input Files

1. expression dataset: GCT or RES file

This file contains the expression dataset.

2. gene sets database file: GMT, GMX, or GRP file (optional, if you do not select a gene set database from the drop-down)

A gene set file not available in the current version of MSigDB (and thus not listed in the gene sets database parameter drop-down).

3. phenotype labels: CLS file

The GSEA module supports two kinds of class (CLS) files: categorical phenotype and continuous phenotype. 

A categorical phenotype CLS file must define a single phenotype having two categorical labels, such as tumor and normal. 

A continuous phenotype CLS may define multiple phenotypes.  Each phenotype definition assigns a numerical value for each sample.  This series of values defines the phenotype profile.  For example,

  • For a continuous phenotype representing the expression levels of a gene of interest, the value for each sample is the expression value of the gene.
  • For a continuous phenotype representing cell viability in a time series experiment, the value for each sample is a measure of cell viability at a distinct time in the experiment.

4. chip platform: an optional CHIP file may be provided if you do not select a chip platform from the drop-down
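When collapse dataset is set to true, the chip annotation is what maps each probe to a gene symbol. The sketch below shows how the two collapsing modes might combine several probes for one gene, sample by sample. It is an illustration only: collapse_probes and its arguments are hypothetical, and the inputs are plain dictionaries rather than GCT or CHIP files.

```python
from statistics import median


def collapse_probes(expression, probe_to_symbol, mode="Max_probe", omit_unmatched=True):
    """Collapse probe-level rows into one row per gene symbol.

    expression      -- probe id -> list of per-sample expression values
    probe_to_symbol -- probe id -> gene symbol (from the chip annotation)
    mode            -- "Max_probe" or "Median_of_probes"
    omit_unmatched  -- drop probes with no symbol instead of keeping them as is
    """
    if mode not in ("Max_probe", "Median_of_probes"):
        raise ValueError("unknown collapsing mode: " + mode)
    grouped = {}
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            if omit_unmatched:
                continue
            symbol = probe  # keep the row under its native identifier
        grouped.setdefault(symbol, []).append(values)

    reducer = max if mode == "Max_probe" else median
    # zip(*rows) regroups the values by sample so each sample is reduced separately.
    return {
        symbol: [reducer(per_sample) for per_sample in zip(*rows)]
        for symbol, rows in grouped.items()
    }


# Toy data (hypothetical): three probes for TP53, one probe with no symbol.
probes = {"p1": [1.0, 5.0], "p2": [3.0, 2.0], "p3": [2.0, 4.0], "p4": [9.0, 9.0]}
chip = {"p1": "TP53", "p2": "TP53", "p3": "TP53"}
by_max = collapse_probes(probes, chip)
by_median = collapse_probes(probes, chip, mode="Median_of_probes")
```

Note that the maximum is taken per sample, so the collapsed row may mix values from different probes.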

Output Files

1. output file name: ZIP

ZIP file containing the result files.  For more information on interpreting these results, see Interpreting GSEA Results in the GSEA User Guide.

Platform Dependencies

Task Type:
Gene List Selection

CPU Type:
any

Operating System:
any

Language:
Java

Version Comments

Version Release Date Description
17 2016-02-04 Updated to give access to MSigDB v5.1
16 2015-12-03 Updating the GSEA jar to deal with an issue with FTP access. Fixes an issue for GP@IU.
15 2015-06-16 Add built-in support for MSigDB v5.0, which includes new hallmark gene sets.
14 2013-06-14 Updated the gene sets database list and the GSEA Java library, and added support for continuous phenotypes.
13 2012-09-20 Updated and sorted the chip platforms list, changed default value of num permutations to 1000, and updated the GSEA Java library
12 2011-04-08 Fixed parsing of gene sets database file names which contain @ and # symbols and added gene sets containing entrez ids
11 2010-11-05 Fixed parsing of chip platform file names which contain @ and # symbols
10 2010-10-01 Updated selections for the gene sets database parameter to reflect those available in MSigDB version 3