Using ComparativeMarkerSelection for Differential Expression Analysis


In GenePattern, you use the ComparativeMarkerSelection module to identify the genes (if any) that are differentially expressed between two phenotype classes. Typically, this is a three-step process:

  1. Run the PreprocessDataset module to preprocess the expression data.
    PreprocessDataset removes platform noise and genes that have little variation. It takes an expression data file and generates a new, modified expression data file.
  2. Run the ComparativeMarkerSelection module to compute differential gene expression.
    For each gene, ComparativeMarkerSelection first uses a test statistic to calculate the difference in gene expression between the samples in the first class and the samples in the second class and then estimates the significance (p-value) of the test statistic score. Because testing tens of thousands of genes simultaneously increases the possibility of mistakenly identifying a non-marker gene as a marker gene, ComparativeMarkerSelection corrects for multiple hypothesis testing by computing both the false discovery rate (FDR) and family-wise error rate (FWER). ComparativeMarkerSelection takes an expression data file and generates a result (ODF) file.
  3. Run the ComparativeMarkerSelectionViewer module to view the results.
    For each gene, ComparativeMarkerSelectionViewer displays the test statistic score, its p-value, two FDR statistics, and three FWER statistics.
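
To illustrate what a multiple hypothesis testing correction does to a list of p-values, the following minimal sketch applies two standard corrections: Bonferroni (a family-wise error rate control) and Benjamini-Hochberg (a false discovery rate control). Whether these match the exact FDR and FWER statistics the module reports is an assumption, and the function names and example p-values are hypothetical.

```python
def bonferroni(p_values):
    """FWER-style correction: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR-style correction: Benjamini-Hochberg adjusted p-values, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
fwer_adjusted = bonferroni(raw)
fdr_adjusted = benjamini_hochberg(raw)
```

The Bonferroni values grow faster than the Benjamini-Hochberg values, which is why FWER-based thresholds are usually the more conservative of the two.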

Basic instructions

The GenePattern Differential Expression Analysis protocol provides example files and step-by-step instructions for running ComparativeMarkerSelection and its companion modules. If you are unfamiliar with differential expression analysis or ComparativeMarkerSelection, start here:

  1. Log in to the public GenePattern server at the Broad Institute.
    If you do not have a GenePattern account, you can register on the login page.
  2. Notice that the GenePattern protocols are listed in the center of the GenePattern home page.
  3. Click Differential Expression Analysis to display the protocol's step-by-step instructions.

Details and considerations

This section supplements the Differential Expression Analysis protocol and the ComparativeMarkerSelection documentation. It assumes that you have walked through the Differential Expression Analysis protocol as described in Basic instructions above.

Expression data

ComparativeMarkerSelection requires gene expression data in the GCT or RES file format.
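
As a rough illustration of the GCT layout (a version line, a line with the row and column counts, a header line beginning with Name and Description, then one tab-delimited row per gene), here is a minimal reader sketch. The function name and the example gene and sample names are hypothetical, and the sketch does not cover the RES format.

```python
def parse_gct(text):
    """Parse GCT-formatted text into (sample_names, {gene_name: [values]})."""
    lines = [line for line in text.splitlines() if line.strip()]
    if lines[0].strip() != "#1.2":
        raise ValueError("expected the GCT version line '#1.2'")
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])
    header = lines[2].split("\t")
    samples = header[2:]  # skip the Name and Description columns
    data = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        data[fields[0]] = [float(v) for v in fields[2:]]
    if len(samples) != n_cols or len(data) != n_rows:
        raise ValueError("row or column count does not match the declared size")
    return samples, data


# Hypothetical two-gene, three-sample data set.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tna\t10.5\t11.0\t9.8\n"
    "GENE_B\tna\t200\t180\t220\n"
)
samples, data = parse_gct(example)
```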

Phenotype classes

ComparativeMarkerSelection analyzes two phenotype classes at a time. If the expression data set includes samples from more than two classes, use the phenotype test parameter to analyze each class against all others (one-versus-all) or all class pairs (all pairs).
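
The two multi-class strategies differ in how many two-class comparisons they generate. The sketch below enumerates them for a hypothetical list of class names; it only illustrates the bookkeeping and is not the module's implementation.

```python
from itertools import combinations


def one_versus_all(class_names):
    """One comparison per class: that class against every other class pooled."""
    return [
        (name, [other for other in class_names if other != name])
        for name in class_names
    ]


def all_pairs(class_names):
    """One comparison per unordered pair of classes."""
    return list(combinations(class_names, 2))


classes = ["A", "B", "C", "D"]  # hypothetical class names
ova = one_versus_all(classes)   # 4 comparisons, one per class
pairs = all_pairs(classes)      # 6 comparisons, one per pair
```

With k classes, one-versus-all yields k comparisons and all pairs yields k(k-1)/2, so the second option grows quickly as classes are added.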

Confounding phenotype classes

If you are studying two variables and your data set contains a third variable that might distort the association between the variables of interest, you can use a confounding variable class file to correct for the effect of the third variable. For example, the data set in Lu, Getz, et al. (2005) contains tumor and normal samples from different tissue types. When studying the association between the tumor and normal samples, the authors use a confounding variable class file to correct for the effect of the different tissue types.

The phenotype class file identifies the tumor and normal samples:

75 2 1
# Normal Tumor
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1

The confounding variable class file identifies the tissue type of each sample:

75 6 1
# colon kidney prostate uterus human-lung breast
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5

Given these two class files, when performing permutations, ComparativeMarkerSelection shuffles the tumor/normal labels only among samples with the same tissue type.
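
The restricted shuffle described above can be sketched as follows. This is a minimal illustration that assumes the phenotype labels and confounding labels are given as two parallel lists, one entry per sample; the function name and the example values are hypothetical.

```python
import random
from collections import defaultdict


def permute_within_strata(labels, strata, rng):
    """Shuffle phenotype labels only among samples that share a stratum."""
    groups = defaultdict(list)
    for index, stratum in enumerate(strata):
        groups[stratum].append(index)
    permuted = list(labels)
    for indices in groups.values():
        shuffled = [labels[i] for i in indices]
        rng.shuffle(shuffled)
        for i, label in zip(indices, shuffled):
            permuted[i] = label
    return permuted


# Hypothetical data: 0 = normal, 1 = tumor, with two tissue types.
phenotype = [0, 0, 1, 1, 0, 1, 1, 1]
tissue = ["colon"] * 4 + ["kidney"] * 4
permuted = permute_within_strata(phenotype, tissue, random.Random(0))
```

Each tissue type keeps the same number of tumor and normal labels after the shuffle, so a difference between tissues cannot masquerade as a tumor/normal difference in the permuted data.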


Number of permutations

ComparativeMarkerSelection uses a permutation test to estimate the significance (p-value) of the test statistic score. If the data set includes at least 10 samples per class, use the default value of 1000 permutations to ensure sufficiently accurate p-values.

If the data set includes fewer than 10 samples in any class, permuting the samples cannot give an accurate p-value. Specify a value of 0 permutations to use asymptotic p-values instead. In this case, ComparativeMarkerSelection computes p-values assuming the test statistic scores follow Student's t-distribution (rather than using the test statistic to create an empirical distribution of the scores).
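
To make the permutation approach concrete, the sketch below estimates a two-sided empirical p-value for one gene by shuffling the class labels and counting how often the shuffled score is at least as extreme as the observed one. It uses a plain difference of class means as a stand-in score and a simple hits/permutations estimate; both are assumptions for illustration, not the module's exact procedure.

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Stand-in test statistic: mean of class 0 minus mean of class 1."""
    a = [v for v, label in zip(values, labels) if label == 0]
    b = [v for v, label in zip(values, labels) if label == 1]
    return mean(a) - mean(b)


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Fraction of label shuffles whose |score| is at least the observed |score|."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    return hits / n_permutations


# Hypothetical expression values for one gene: 10 samples per class.
labels = [0] * 10 + [1] * 10
separated = list(range(10)) + list(range(100, 110))
p_separated = permutation_p_value(separated, labels)
p_flat = permutation_p_value([5.0] * 20, labels)
```

The small-sample limitation follows from counting: with three samples per class there are only 20 distinct ways to split six samples into two groups of three, so the empirical distribution is too coarse to resolve small p-values.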

ComparativeMarkerSelection also provides two additional options:

Log transformed data

By default, ComparativeMarkerSelection expects data that are not log transformed. Some calculations, such as fold change, produce incorrect results when log-transformed data are supplied without being flagged. If your data are log transformed, set the log transformed data parameter to "yes".
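
A small arithmetic sketch shows why the flag matters for fold change. It assumes base-2 logarithms and treats fold change as a simple ratio of two class means; both are assumptions for illustration, and the values are hypothetical.

```python
import math


def fold_change_linear(mean_a, mean_b):
    """Fold change for values on the original (linear) scale."""
    return mean_a / mean_b


def fold_change_log2(log_mean_a, log_mean_b):
    """Fold change for log2 values: a difference of logs is a ratio on the linear scale."""
    return 2 ** (log_mean_a - log_mean_b)


linear_a, linear_b = 8.0, 2.0
log_a, log_b = math.log2(linear_a), math.log2(linear_b)  # 3.0 and 1.0

expected = fold_change_linear(linear_a, linear_b)  # 4.0
naive = fold_change_linear(log_a, log_b)           # 3.0: dividing log values is wrong
corrected = fold_change_log2(log_a, log_b)         # 4.0: matches the linear-scale result
```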

Test direction

By default, ComparativeMarkerSelection performs a two-sided test; that is, the test statistic score is calculated assuming that the differentially expressed gene can be up-regulated in either phenotype class. Optionally, use the test direction parameter to specify a one-sided test, where the differentially expressed gene must be up-regulated for class 0 or for class 1.

Test statistic

ComparativeMarkerSelection provides several methods of calculating differential expression. By default, the module uses the T-Test statistic. Optionally, you can use the signal-to-noise ratio (SNR) or the Paired T-Test statistic instead.

T-Test (default)

The T-Test computes the standardized mean difference between the two classes.
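
As a minimal sketch of a standardized mean difference, the function below divides the difference of the class means by a standard error built from the per-class sample variances. The unequal-variance (Welch) form and the use of sample variances are assumptions; consult the module documentation for the exact formula it uses.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(class_a, class_b):
    """Difference of class means divided by its estimated standard error."""
    standard_error = sqrt(
        variance(class_a) / len(class_a) + variance(class_b) / len(class_b)
    )
    return (mean(class_a) - mean(class_b)) / standard_error


# Hypothetical expression values for one gene in two classes.
t_score = t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

A negative score here means the gene is, on average, expressed at a lower level in the first class than in the second.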

ComparativeMarkerSelection also provides variations on the T-Test:

Signal-to-noise ratio (SNR)

Signal-to-noise ratio is computed by dividing the difference of class means by the sum of their standard deviations.
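
The definition above translates directly into a few lines of code. This sketch assumes sample standard deviations (n - 1 in the denominator), which the text does not specify; the example values are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Difference of class means divided by the sum of the class standard deviations."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


# Hypothetical expression values for one gene in two classes.
snr = signal_to_noise([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Unlike the T-Test score, this ratio does not divide the variability by the class sizes, so it does not grow merely because more samples are added.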

ComparativeMarkerSelection also provides variations on the signal-to-noise ratio:

Paired T-Test

The Paired T-Test can be used to analyze paired samples; for example, samples taken from patients before and after treatment. Use this test when the cross-class differences (e.g., the difference before and after treatment) are expected to be smaller than the within-class differences (e.g., the difference between two patients). For example, if you are measuring weight gain in a population of people, the weights may range from 90 lbs. to 300 lbs., while the weight gain or loss (the paired variable) may be on the order of 0-30 lbs. The cross-class difference ("before" versus "after") is therefore smaller than the within-class difference (person 1 versus person 2).

Where the standard T-Test compares the difference between the class means, the Paired T-Test takes the mean of the differences between pairs (for more information, refer to the Wikipedia article on the paired T-Test).

For the Paired T-Test, paired samples in the expression data file must be arranged by class, where the first samples in each class are paired, the second samples are paired, and so on. For example, sample pairs A1/B1, A2/B2 and A3/B3 would be ordered in an expression data file as A1, A2, A3, B1, B2, B3. Note that your data must contain the same number of samples in each class in order to use this statistic.
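
The required sample ordering can be sketched as follows: the first half of a gene's row holds class A, the second half holds class B, and position i in each half belongs to the same pair. The function computes the textbook paired t statistic (mean of the pair differences divided by its standard error); the function name and values are hypothetical, and this is not taken from the module's source.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(row):
    """Paired t statistic for one gene whose samples are ordered A1..An, B1..Bn."""
    if len(row) % 2 != 0:
        raise ValueError("paired data need the same number of samples in each class")
    n = len(row) // 2
    class_a, class_b = row[:n], row[n:]
    differences = [a - b for a, b in zip(class_a, class_b)]
    return mean(differences) / (stdev(differences) / sqrt(n))


# Hypothetical row ordered A1, A2, A3, B1, B2, B3.
paired_t = paired_t_statistic([10, 12, 9, 11, 14, 10])
```

Because the statistic is built from within-pair differences, large differences between individuals cancel out and only the consistent shift between the two classes remains.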


Updated on August 31, 2012 15:59