These brief articles and tutorials supplement the GenePattern documentation. They may be written in response to user questions or to describe new GenePattern features.
If you have a topic you would like to see included, please contact the GenePattern team at gp-help(at)broadinstitute.org.
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types including sequence alignments, microarrays, and genomic annotations. Until recently, this tool was available only outside of GenePattern, though it did accept GenePattern file formats. IGV can now be launched from a module available on the GenePattern Server or downloaded from the GenePattern Repository.
With this new development, users can pass their GenePattern result files directly to IGV through GenePattern.
The GenePattern IGV module launches the same application that is available from the IGV website. If you are a user of both the client IGV (either launching from the IGV website or your desktop) and GenePattern, this means you are using the same version of IGV complete with your preferences, home directory, saved genomes, and other such IGV saved presets. For all users, this means you are getting the latest version of IGV each time you run IGV, regardless of whether that is from GenePattern or from the IGV client.
As mentioned above, having IGV in GenePattern now allows you to pass your GenePattern data files directly to IGV in the same way you would use a result file as input file for any other module. For instance, you'll now notice that (on servers where IGV is installed) output from a run of GISTIC will have IGV as a next option in the dropdown for the result file.
IGV supports many of the common GenePattern file formats such as: CBS, CN, GCT*, RES*, GISTIC, SEG, and LOH files. For more information about supported IGV file types click here.
You can also upload any other IGV-supported file type as you would any other input file; i.e., via Upload or URL.
Note: In order to properly view GCT or RES files in IGV, some preprocessing is needed and will be discussed in detail shortly.
When you run IGV from within GenePattern you are provided with a few optional configuration parameters which will instruct IGV how to display your data.
Currently these two parameters are "genome" and "locus".
The "genome" parameter allows you to select the genome which corresponds to your data file. If you choose not to specify a genome, IGV will launch with hg19 if this is the first time you've run IGV. If you've run IGV before, it will launch with the last genome you were viewing.
The "locus" parameter allows you to specify a locus or range of interest for your data. For example, you could specify chr5:90,339,000-90,349,000 and IGV would launch with your data and that region of chromosome 5 displayed. If you instead wanted to look for the gene EGFR, you would simply type "EGFR" into the text box. If you choose not to specify a locus or gene and this is the first time you've run IGV, IGV will launch with chromosome 1 selected. If you've run IGV before, it will launch with the chromosome you last viewed.
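The two kinds of value accepted by the "locus" parameter (a coordinate range or a gene name) can be told apart with a small amount of parsing. The following Python sketch is only an illustration of the input format described above; `parse_locus` is a hypothetical helper and is not how IGV or the GenePattern module actually processes the value.

```python
import re

# Matches strings such as "chr5:90,339,000-90,349,000" (commas are optional).
LOCUS_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_locus(text):
    """Classify a locus string as a coordinate range or a gene name."""
    text = text.strip()
    match = LOCUS_RE.match(text)
    if match is None:
        # Anything that is not a range is treated as a gene symbol, e.g. "EGFR".
        return {"type": "gene", "name": text}
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return {"type": "range", "chrom": match.group("chrom"), "start": start, "end": end}


print(parse_locus("chr5:90,339,000-90,349,000"))
print(parse_locus("EGFR"))
```

A check like this can be useful before launching IGV from a script, since a mistyped range would otherwise be interpreted as a gene name.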
In order to properly view a GCT or RES file in IGV, some preprocessing is required.
The default display option for a GCT or RES file is the Heatmap. For the heatmap to make sense, the data must be row-centered, scaled and possibly have a threshold applied. Currently the workflow for this is as follows.
If the data contains negative (non-log transformed) values, run it through PreprocessDataset. The default threshold there is 20.
Currently IGVTools is a stand-alone utility providing a set of tools for pre-processing data files.
For the preprocessing of unscaled GCT and RES files an option called "formatExp" is provided. It takes a non-log expression file and performs the following steps. (Note that these are the steps used for our internal expression data prior to viewing in IGV.)
You can download this version of igvtools.
After unzipping, igvtools can be used on the command line to transform the RES or GCT file as described above. The command line follows. (If you are on a Windows platform use "igvtools.bat" instead of "igvtools".)
To run this on your preprocessed dataset, save the resulting .preprocessed file and provide it as "inputFile" in the command line.
Note: A GenePattern module which will include these preprocessing steps is planned for release in early 2011.
Take the output from IGVTools and provide it as input to IGV.
Once you have launched IGV, you can configure, drag, zoom, save, and so on, just as you normally would in IGV. For more information on how to use IGV, please visit the IGV website.
For questions and comments about IGV, please send an email to the IGV team.
For questions and comments about GenePattern, please send an email to the GenePattern team.
After aligning and/or assembling your RNA-seq data, it is important to take a closer look at the content of the result files before continuing with further analysis, in part because what you find may indicate how best to analyze your data.
Specifically in GenePattern, modules are provided to calculate such Quality Control (QC) metrics as: Depth of Coverage, Continuity of Coverage, Duplication Rate, Expression Rates, Strand Specificity, and GC content, among others.
Having these sorts of metrics can help to prevent or better understand common RNA-seq errors stemming from such sources as: read length, quality of data, sample prep, or number of reads in the data.
The following decision diagram illustrates a suggested workflow. This workflow is discussed in further detail in subsequent sections.
The input to the suggested RNA-seq QC workflow in GenePattern is an aligned, coordinate sorted BAM file with Read Group information (such as platform or sample) in the header. (SAM files can be converted to BAM format using the SortSam module.)
If your aligned BAM file does not contain Read Group Information you should run your data through AddOrReplaceReadGroups, as discussed next.
If the aligned BAM file is not coordinate sorted, run the data through SortSam, making sure the sort order is set to "coordinate" as discussed below.
More information about the SAM/BAM format can be found at the SAMtools website.
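Both input requirements (coordinate sort order and Read Group information) are recorded in the SAM/BAM header: the `@HD` line carries an `SO` tag and each read group has an `@RG` line with an `ID` tag. The Python sketch below checks header text for both. It assumes you have already extracted the header as text (for example with `samtools view -H`); `check_sam_header` is a hypothetical helper, not part of GenePattern.

```python
def check_sam_header(header_text):
    """Report whether a SAM/BAM header declares coordinate sorting and read groups."""
    sort_order = None
    read_group_ids = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        # Header fields after the record type look like TAG:value.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@HD":
            sort_order = tags.get("SO")
        elif fields[0] == "@RG" and "ID" in tags:
            read_group_ids.append(tags["ID"])
    return {
        "coordinate_sorted": sort_order == "coordinate",
        "read_group_ids": read_group_ids,
    }


example = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "@RG\tID:rg1\tSM:sample1\tPL:illumina\n"
)
print(check_sam_header(example))
```

If `coordinate_sorted` is false, the file needs SortSam; if `read_group_ids` is empty, it needs AddOrReplaceReadGroups.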
Note that if your data was aligned with TopHat, you will likely want to run your BAM file through the Picard tool MergeBamAlignment (soon to be available in GenePattern) to handle the fact that TopHat removes unaligned reads. This can throw off the total number of reads and any other metrics using that value in the RNAseqMetrics module.
Input for Picard.AddOrReplaceReadGroups is a BAM file which has been aligned.
The module will either add new (if none previously existed) or replace read groups as defined in the parameters. All reads in the file will be assigned to the specified read group.
Read Group information is required by Picard.MarkDuplicates and the RNAseqMetrics module. Specifically the RNAseqMetrics module requires a Read Group ID in the BAM header.
Full documentation for Picard.AddOrReplaceReadGroups, with parameter descriptions, is available here.
SortSam takes as input a BAM file and outputs a sorted and indexed file. In this step of the workflow, the input BAM should come from Picard.AddOrReplaceReadGroups in step 1.
Note that the module will only generate an index if the output file type is "BAM" and the sort order is "coordinate".
Full documentation for SortSam can be found here.
SAMtools.FastaIndex takes a reference FASTA (.fa) file and creates a .fai index file for it, which will be used by both Picard.ReorderSam, in step 5, and RNAseqMetrics, in step 8, to quickly locate and retrieve information from the reference sequence.
Full documentation for SAMtools.FastaIndex can be found here.
Next Picard.CreateSequenceDictionary takes a reference FASTA (.fa) and creates a SAM file containing a sequence dictionary (.dict extension). Sequence dictionaries contain the sequence name, length and genome assembly identifier and other information about sequences. The .dict file is required for both Picard.ReorderSam, in step 5, and RNAseqMetrics, in step 8.
The output FASTA file (.fa) from SAMtools.FastaIndex (step 3) can be passed as input to this module.
Full documentation for Picard.CreateSequenceDictionary can be found here.
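To make the idea of a sequence dictionary concrete, the following Python sketch derives the sequence name and length for each record in a FASTA file and formats them as SAM `@SQ` header lines. It is a simplified illustration only: the .dict file produced by Picard.CreateSequenceDictionary contains additional header content and tags (such as the genome assembly identifier mentioned above) that this sketch omits, and `sequence_dictionary_lines` is a hypothetical name.

```python
def sequence_dictionary_lines(fasta_text):
    """Return simplified @SQ header lines (name and length only) for FASTA text."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token after ">".
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    # Dictionaries preserve insertion order, so contigs stay in FASTA order.
    return [f"@SQ\tSN:{n}\tLN:{length}" for n, length in lengths.items()]


fasta = ">chr1 example contig\nACGT\nAC\n>chr2\nGG\n"
for entry in sequence_dictionary_lines(fasta):
    print(entry)
```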
Now that the .fai and .dict files have been created for the reference FASTA file (steps 3 and 4), Picard.ReorderSam can be run to order the reads in the BAM file according to the contigs of a reference FASTA file.
Picard.ReorderSam takes a BAM/BAI pair (for instance, as output by SortSam earlier in this workflow), a .dict file from Picard.CreateSequenceDictionary (step 4), and the .fa and .fai files from SAMtools.FastaIndex (step 3), and reorders the BAM file in accordance with the contigs in the reference FASTA file provided. The order is determined by exact name matching of contigs. Reads mapped to contigs absent in the reference file are dropped.
The resulting BAM file can next be sent to Picard.MarkDuplicates.
Full documentation for Picard.ReorderSam can be found here.
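The reordering rule described for Picard.ReorderSam (exact contig name matching, with reads on unknown contigs dropped) can be sketched on toy data as follows. This is a conceptual illustration in Python, not Picard's implementation; reads are represented as hypothetical `(contig, position, name)` tuples.

```python
def reorder_reads(reads, reference_contigs):
    """Order (contig, position, name) reads by reference contig order.

    Reads on contigs that are missing from the reference are dropped.
    """
    rank = {contig: i for i, contig in enumerate(reference_contigs)}
    kept = [read for read in reads if read[0] in rank]
    # sorted() is stable, so reads on the same contig keep their existing order.
    return sorted(kept, key=lambda read: rank[read[0]])


reads = [("chr2", 50, "r1"), ("chrUn", 5, "r2"), ("chr1", 300, "r3"), ("chr1", 900, "r4")]
print(reorder_reads(reads, ["chr1", "chr2"]))
```

Because matching is by exact name, a reference that names a contig "1" will not match reads aligned to "chr1"; in this sketch such reads would simply be dropped.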
Next, Picard.MarkDuplicates takes the coordinate sorted BAM file output by Picard.ReorderSam in step 5 (with read group information added by Picard.AddOrReplaceReadGroups in step 1).
This is an optional module that will mark duplicate reads in the BAM file and optionally remove them. To see metrics for Duplication rates in the results from RNAseqMetrics, run this module and do not remove the duplicate reads.
Full documentation for Picard.MarkDuplicates can be found here.
The last step before running RNAseqMetrics is to index the BAM file which resulted from the workflow above. To do this, run SortSam, selecting BAM as the output format.
RNAseqMetrics is the last step in the GenePattern RNA-seq QC workflow. The module calculates standard RNA-seq related metrics, such as depth of coverage, ribosomal RNA contamination, continuity of coverage, and GC bias. It takes the following as input:
*Note that in most cases these will all need to be specified separately.
Please read the RNAseqMetrics module documentation for complete information regarding optional input files, parameter settings and the various metrics which will or won't be output based on those settings.
*If using the output from SortSam the BAM and BAI are located in the same folder and only the BAM file need be passed as an input parameter.
The output of the RNAseqMetrics module (and thus this workflow) is a ZIP archive containing an HTML report of metrics stating the total number of reads, depth of coverage at the 3’ and 5’ end, etc. The report also links to a GCT file containing the calculated RPKM values for each transcript in each sample.
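For readers unfamiliar with RPKM (reads per kilobase of transcript per million mapped reads), the standard definition can be written as a short Python function. This is the textbook formula; exactly how RNAseqMetrics counts the reads that enter it is not described here, so treat the sketch as an illustration of the unit rather than of the module's internals.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling factors.
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# 500 reads on a 2,000 bp transcript in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```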
Other metrics calculated include:
Full documentation for the module and its output files can be found in the RNAseqMetrics module documentation.
This workflow, and specifically the RNAseqMetrics module, has been optimized for Eukaryotic RNA-seq data. Modules which comprise methods optimized for Prokaryotic data are currently not available.
In cancer genomics, copy number change is one of the hallmarks of the genetic instability common to most human cancers and loss of heterozygosity (LOH) of tumor suppressor genes is a crucial step in the development of sporadic and hereditary cancer (Monti, 2005). Using modules available in GenePattern, you can compute SNP copy number and LOH based on Affymetrix SNP chip data for paired target/normal samples and then view them in the Integrative Genomics Viewer (IGV). The following modules are used for this computation, with IGV at the end for viewing the results:
SNPFileCreator converts the .CEL files from an Affymetrix array into a GenePattern .SNP file. Raw data for the probes in each SNP probe set are converted to a single intensity value per SNP using one of four modeling algorithms: Average Difference, PM/MM Difference Model (dChip, the default), Median Probe, or Trimmed Mean. Note that processing times for this module can average upwards of 30 minutes, depending on the speed of the server, the size of the dataset, and available memory. At least 2GB of memory are needed to run most SNPFileCreator jobs.
For more information about SNPFileCreator please see the SNPFileCreator Documentation
For gender-specific samples, run the XChromosomeCorrect module on the output of SNPFileCreator to correct intensity values for SNPs on the X chromosome. For each sample from a male donor, the module doubles the intensity value for SNPs on the X chromosome.
The sample information file describes the SNP array and must be tab-delimited, include a column labeled Gender that contains a value of M or F for each sample and include target/normal paired samples for copy number and LOH determination. (More information on file formats can be found here)
For more information about XChromosomeCorrect please see the XChromosomeCorrect Documentation
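The correction itself is simple to state in code. The Python sketch below doubles the X-chromosome intensity values for samples whose Gender is M and leaves everything else unchanged. The data structures and the function name `correct_x_chromosome` are assumptions made for illustration; the module itself works on .SNP files and a sample information file.

```python
def correct_x_chromosome(intensities, snp_chromosomes, sample_genders):
    """Double X-chromosome intensities for samples whose gender is "M".

    intensities: dict of sample name -> list of per-SNP intensity values
    snp_chromosomes: list of chromosome labels, one per SNP
    sample_genders: dict of sample name -> "M" or "F"
    """
    corrected = {}
    for sample, values in intensities.items():
        factor = 2.0 if sample_genders[sample] == "M" else 1.0
        corrected[sample] = [
            value * factor if chrom == "X" else value
            for value, chrom in zip(values, snp_chromosomes)
        ]
    return corrected


intensities = {"s1": [100.0, 50.0], "s2": [100.0, 50.0]}
print(correct_x_chromosome(intensities, ["1", "X"], {"s1": "M", "s2": "F"}))
```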
CopyNumberDivideByNormals computes the raw copy number of each target SNP by dividing its intensity value by the mean intensity value of all normal SNPs. This calculation is referred to as copy number normalization or normalization with respect to normals.
For more information about CopyNumberDivideByNormals please see the CopyNumberDivideByNormals Documentation
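As a rough illustration of normalization with respect to normals, the Python sketch below divides each target intensity by the mean intensity of the normal samples. It assumes the mean is taken per SNP across the normal samples, which is one reading of the description above; any further scaling the module may apply is not covered. `divide_by_normals` is a hypothetical name.

```python
def divide_by_normals(target, normals):
    """Divide each target SNP intensity by the mean normal intensity for that SNP.

    target: list of per-SNP intensities for one target sample
    normals: list of normal samples, each a list of per-SNP intensities
    """
    ratios = []
    for i, value in enumerate(target):
        normal_mean = sum(sample[i] for sample in normals) / len(normals)
        ratios.append(value / normal_mean)
    return ratios


print(divide_by_normals([200.0, 100.0], [[100.0, 100.0], [100.0, 300.0]]))
```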
The LOHPaired module detects loss of heterozygosity (LOH). It takes as input a GenePattern .SNP file that contains paired normal-target samples with genotype calls. (LOHPaired accepts only non-allele-specific .SNP files; that is, .SNP files that contain one intensity value per probe.) It returns as output a GenePattern .LOH file that contains, for each probe, the LOH calls for each array pair.
LOH call values are as follows.
|L||LOH: AB in normal and A or B in tumor|
|R||Retention: AB in both normal and tumor or No Call in normal and AB in tumor|
|C||Conflict: A or B in normal and AB in tumor|
Non-informative call: A or B in normal
No call: No Call in normal or tumor
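The call rules listed above can be expressed as a small decision function. The Python sketch below returns descriptive labels rather than the single-letter codes, and it assumes the rules are applied in the order listed (so a homozygous normal paired with a No Call tumor is treated as non-informative). The genotype labels and the function name `loh_call` are assumptions for illustration, not the module's internal representation.

```python
def loh_call(normal, tumor):
    """Classify a paired genotype call using the rules listed above.

    Genotypes are "A", "B", "AB", or "NoCall" (labels assumed for this sketch).
    """
    homozygous = ("A", "B")
    if normal == "AB" and tumor in homozygous:
        return "LOH"
    if tumor == "AB" and normal in ("AB", "NoCall"):
        return "Retention"
    if normal in homozygous and tumor == "AB":
        return "Conflict"
    if normal in homozygous:
        return "Non-informative"
    return "No call"


for pair in [("AB", "A"), ("NoCall", "AB"), ("B", "AB"), ("A", "A"), ("AB", "NoCall")]:
    print(pair, loh_call(*pair))
```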
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types and provides easy access to genomes and datasets hosted by the Broad Institute.
|name||track label||Track name (ignored when used in the IGV file format)|
|description||center label||Currently ignored|
|visibility||full | dense | hide||Currently ignored|
|color||RRR,GGG,BBB||Color for positive values in all tracks|
|altColor||RRR,GGG,BBB||Color for negative values in all tracks|
|autoScale||on | off||Currently ignored; all tracks autoscale unless an explicit data range is defined (e.g., by including the viewlimits specifier).|
|gridDefault||on | off||Currently ignored|
|maxHeightPixels||max:default:min||Default and min are supported; max is currently ignored|
|graphType||bar | points | heatmap||Bar chart | scatter plot | heatmap. IGV only: The heatmap value is an IGV addition to the WIG specification.|
|midRange||x:y||Defines the neutral range for a three-color heatmap. Values in this range are rendered with the midColor value, which is white by default. Example: midRange=20:80. IGV only: This specifier is an IGV addition to the WIG specification.|
|midColor||RRR,GGG,BBB||Color to use in the "mid range" of a heatmap. Example: midColor=0,0,150. IGV only: This specifier is an IGV addition to the WIG specification.|
|viewLimits||lower:upper||Defines the data range|
|yLineOnOff||on | off||Currently ignored|
|windowingFunction||maximum | minimum | mean||Function that summarizes the values in a window of data represented by one pixel|
|smoothingWindow||off | [MATKC:2-16]||Currently ignored|
|coords||0 | 1||Indicate whether the file uses 0 or 1 based coordinates. The UCSC specification for WIG files uses 1 based coordinates and for BED files uses 0 based coordinates. If data looks off by one, check for a possible 0 vs 1 based coordinate issue. IGV only: This specifier is an IGV addition to the WIG specification.|
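Assuming the specifiers in the table above are used in a UCSC-style `track` line (space-separated `key=value` pairs, with values containing spaces enclosed in quotes), a line combining several of them could be assembled as in the Python sketch below. The helper `build_track_line` and the example values are hypothetical; check the IGV file format documentation for the exact prefix your file type expects.

```python
def build_track_line(**options):
    """Build a track line from keyword options, quoting values that contain spaces."""
    parts = ["track"]
    for key, value in options.items():
        text = str(value)
        if " " in text:
            text = f'"{text}"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


print(build_track_line(
    name="Tumor copy number",
    graphType="heatmap",
    color="255,0,0",
    altColor="0,0,255",
    midRange="20:80",
    midColor="0,0,150",
    viewLimits="0:100",
))
```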
To launch IGV and view your Copy Number and/or LOH data:
For more information on navigating or displaying data in IGV please see the IGV User Guide.
In GenePattern, you use the ComparativeMarkerSelection module to identify the genes (if any) that are differentially expressed between two phenotype classes. Typically, this is a three-step process:
The GenePattern Differential Expression Analysis protocol provides example files and step-by-step instructions for running ComparativeMarkerSelection and its companion modules. If you are unfamiliar with differential expression analysis or ComparativeMarkerSelection, start here:
The information provided in this section supplements the information provided in the Differential Expression Analysis protocol and the ComparativeMarkerSelection documentation. It assumes that you have walked through the Differential Expression Analysis protocol as described in the Basic Instructions above.
ComparativeMarkerSelection requires gene expression data in the GCT or RES file format.
ComparativeMarkerSelection analyzes two phenotype classes at a time. If the expression data set includes samples from more than two classes, use the phenotype test parameter to analyze each class against all others (one-versus-all) or all class pairs (all pairs).
If you are studying two variables and your data set contains a third variable that might distort the association between the variables of interest, you can use a confounding variable class file to correct for the effect of the third variable. For example, the data set in Lu, Getz, et al. (2005) contains tumor and normal samples from different tissue types. When studying the association between the tumor and normal samples, the authors use a confounding variable class file to correct for the effect of the different tissue types.
The phenotype class file identifies the tumor and normal samples:
The confounding variable class file identifies the tissue type of each sample:
Given these two class files, when performing permutations, ComparativeMarkerSelection shuffles the tumor/normal labels only among samples with the same tissue type.
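The restricted shuffling described above can be sketched in Python as follows. Labels are shuffled only within groups of samples that share the same confounding value, so each tissue type keeps its original number of tumor and normal labels. This is a conceptual illustration with made-up sample labels; `permute_within_strata` is a hypothetical name and not the module's code.

```python
import random
from collections import defaultdict


def permute_within_strata(labels, strata, rng):
    """Shuffle class labels only among samples that share a confounder value."""
    groups = defaultdict(list)
    for index, stratum in enumerate(strata):
        groups[stratum].append(index)
    permuted = list(labels)
    for indices in groups.values():
        shuffled = [labels[i] for i in indices]
        rng.shuffle(shuffled)
        for i, label in zip(indices, shuffled):
            permuted[i] = label
    return permuted


labels = ["tumor", "normal", "tumor", "normal", "tumor", "normal"]
tissue = ["colon", "colon", "colon", "lung", "lung", "lung"]
print(permute_within_strata(labels, tissue, random.Random(0)))
```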
ComparativeMarkerSelection uses a permutation test to estimate the significance (p-value) of the test statistic score. If the data set includes at least 10 samples per class, use the default value of 1000 permutations to ensure sufficiently accurate p-values.
If the data set includes fewer than 10 samples in any class, permuting the samples cannot give an accurate p-value. Specify a value of 0 permutations to use asymptotic p-values instead. In this case, ComparativeMarkerSelection computes p-values assuming the test statistic scores follow Student's t-distribution (rather than using the test statistic to create an empirical distribution of the scores).
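To show what an empirical (permutation-based) p-value is, the Python sketch below shuffles the class labels repeatedly and counts how often the shuffled statistic is at least as extreme as the observed one. It uses a plain difference of class means as a stand-in statistic, and the "+1" correction is a common convention that avoids reporting a p-value of exactly zero; neither detail is claimed to match ComparativeMarkerSelection's implementation.

```python
import random


def mean_difference(values, labels):
    """Difference between the mean of class 0 and the mean of class 1."""
    a = [v for v, label in zip(values, labels) if label == 0]
    b = [v for v, label in zip(values, labels) if label == 1]
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Two-sided empirical p-value for the difference of class means."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator avoids a p-value of exactly 0.
    return (hits + 1) / (n_permutations + 1)


values = list(range(1, 11)) + list(range(101, 111))
labels = [0] * 10 + [1] * 10
print(permutation_p_value(values, labels))
```

With very few samples per class there are only a handful of distinct label arrangements, which is why the permutation approach cannot resolve small p-values in that situation.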
ComparativeMarkerSelection also provides two additional options:
By default, ComparativeMarkerSelection expects non-log-transformed data. Some calculations, such as Fold Change, will produce incorrect results when log transformed data is provided and not indicated. To indicate that your data are log transformed, be sure to set the log transformed data parameter to "yes".
By default, ComparativeMarkerSelection performs a two-sided test; that is, the test statistic score is calculated assuming that the differentially expressed gene can be up-regulated in either phenotype class. Optionally, use the test direction parameter to specify a one-sided test, where the differentially expressed gene must be up-regulated for class 0 or for class 1.
ComparativeMarkerSelection provides several methods of calculating differential expression. By default, the module uses the t-test statistic. Optionally, you can choose to use the signal-to-noise ratio (SNR) or paired T-test statistic instead.
The T-Test computes the standardized mean difference between the two classes.
ComparativeMarkerSelection also provides variations on the T-Test:
Signal-to-noise ratio is computed by dividing the difference of class means by the sum of their standard deviations.
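The two statistics can be compared side by side in a few lines of Python. The sketch below assumes the T-Test divides the difference of class means by a standard error that allows unequal class variances, and that both statistics use the sample standard deviation; these details are assumptions for illustration, so consult the ComparativeMarkerSelection documentation for the exact formulas.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(a, b):
    """Difference of class means divided by its standard error (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))


def signal_to_noise(a, b):
    """Difference of class means divided by the sum of the class standard deviations."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


class_a = [10.0, 12.0, 14.0]
class_b = [4.0, 6.0, 8.0]
print(t_statistic(class_a, class_b))
print(signal_to_noise(class_a, class_b))
```

Note that the T-Test denominator shrinks as the number of samples grows, whereas the signal-to-noise denominator does not depend on sample count.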
ComparativeMarkerSelection also provides variations on the signal-to-noise ratio:
The Paired T-Test can be used to analyze paired samples; for example, samples taken from patients before and after treatment. This test is used when the cross-class differences (e.g. the difference before and after treatment) are expected to be smaller than the within-class differences (e.g., the difference between two patients). For example if you are measuring weight gain in a population of people, the weights may be distributed from 90 lbs. to say 300 lbs. and the weight gain/loss (the paired variable) may be on the order of 0-30 lbs. So the cross-class difference ("before" and "after") is less than the within-class difference (person 1 and person 2).
Where the standard T-Test takes the mean of the difference between classes, the Paired T-Test takes the mean of the differences between pairs (for more information, refer to the Wikipedia article on the paired T-Test).
For the Paired T-Test, paired samples in the expression data file must be arranged by class, where the first samples in each class are paired, the second samples are paired, and so on. For example, sample pairs A1/B1, A2/B2 and A3/B3 would be ordered in an expression data file as A1, A2, A3, B1, B2, B3. Note that your data must contain the same number of samples in each class in order to use this statistic.
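The sample ordering requirement is easiest to see in code. The Python sketch below takes one gene's values ordered as A1, A2, A3, B1, B2, B3, splits them into the two classes, and computes the standard paired t statistic from the per-pair differences. It illustrates the textbook formula and the ordering convention described above; `paired_t_statistic` is a hypothetical name, not the module's code.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(values, n_pairs):
    """Paired t statistic for values ordered A1..An, B1..Bn."""
    if len(values) != 2 * n_pairs:
        raise ValueError("each class must contain the same number of samples")
    class_a, class_b = values[:n_pairs], values[n_pairs:]
    # Pair i is the i-th sample of class A with the i-th sample of class B.
    differences = [a - b for a, b in zip(class_a, class_b)]
    return mean(differences) / (stdev(differences) / sqrt(n_pairs))


# Order: A1, A2, A3, B1, B2, B3
print(paired_t_statistic([5.0, 7.0, 9.0, 4.0, 5.0, 6.0], n_pairs=3))
```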
In this article we will go through the manual process of setting up your own module repository to connect to your GenePattern server. This may be desirable if you are developing many GenePattern modules and are looking for a way to control versions in both development and pre-release stages. This mechanism also allows for a centralized distribution of modules, both during development and at release.
The following steps describe the setup process:
The following tutorial shows you how to create a new GenePattern module (in GenePattern 3.4 and up). Only the GenePattern team can create or install modules on the GenePattern public server. Therefore, to create a module, you need to have a local GenePattern server installed (see the download and installation page). You may also be interested in the video tutorial: Create a module in GenePattern.
In this tutorial, you will create a module named log_transform. The module invokes a perl script, log_transform.pl, which log-transforms all positive values in a data set and sets all negative or zero values to zero. Before you begin, download the perl script and its documentation:
In GenePattern, to create the log_transform module:
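For reference, the transformation that log_transform.pl is described as performing (log-transform positive values, set zero and negative values to zero) can be sketched in Python as follows. This is not the Perl script itself, and the use of base 2 is an assumption, since the base is not stated in this tutorial.

```python
from math import log2


def log_transform(values):
    """Log-transform positive values; set zero and negative values to zero."""
    # Base 2 is assumed here; the tutorial does not state which base the script uses.
    return [log2(v) if v > 0 else 0.0 for v in values]


print(log_transform([8.0, 1.0, 0.0, -3.5]))
```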
caArray is an open-source, web and programmatically accessible array data management system. caArray guides the annotation and exchange of array data using a federated model of local installations whose results are shareable across the cancer Biomedical Informatics Grid (caBIG®). caArray furthers translational cancer research through acquisition, dissemination and aggregation of semantically interoperable array data to support subsequent analysis by tools and services on and off the Grid.
To facilitate the importing of data from caArray repositories in GenePattern, a module named caArray2.3.0Importer is provided.
The caArray2.3.0Importer imports data files from a caArray 2.x repository into GenePattern by connecting to the repository and then retrieving all files of a given extension for a named experiment. The retrieved files are then collected into a single ZIP archive, which is returned as the module's output.
(Note: 2.x is used here to refer to any 2.3 or higher version repository. The current version is 2.4.0 and is compatible with the caArray2.3.0Importer.)
A typical use case for this module would be to retrieve all .cel files from a given experiment in caArray and then pass the resulting zip file to the ExpressionFileCreator module for processing into a GenePattern GCT or RES file format. More information about possible next steps from the importer will be given below.
The following sections will discuss the details of the caArray2.3.0Importer parameters.
Note: This module is not compatible with earlier, now deprecated, caArray versions (i.e., 1.x).
Here you provide the URL to the caArray 2.x repository from which you wish to import data. By default GenePattern provides the public caArray instance hosted at the NCI (https://array.nci.nih.gov/caarray/home.action). You may, however, provide a URL to any caArray repository to which you have access.
The "experiment" refers to the title or public identifier of the experiment in caArray from which the data is to be imported.
If you don't already have a title or public ID, go to your repository of choice (i.e., the URL you have provided or will provide for the first parameter) and browse to or search for the experiment that contains the data you wish to import. (Note that the public repository hosted by NCI works best with Firefox 2.0.0 or higher.)
This parameter refers to the type of bioassay data to be retrieved; raw or derived. The default is raw and will likely be correct for most data imported from caArray.
This is an optional parameter which will specify the data file extension to be retrieved. If specified, only data files with this extension, in the experiment, will be imported into GenePattern. If this is not specified, all files of the specified type (raw or derived) will be retrieved.
In this parameter you can specify an output file name to be used for the resulting .zip file. If not specified, the experiment name, with any spaces replaced by underscores, will be used as the output file prefix.
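The extension filter and the default output name described in the last two parameters can be sketched in Python as follows. Both helpers are hypothetical and exist only to illustrate the documented behavior; in particular, case-insensitive extension matching and appending ".zip" to the prefix are assumptions, not confirmed details of the module.

```python
def select_files(file_names, extension=None):
    """Keep only files with the given extension; keep everything if none is given."""
    if extension is None:
        return list(file_names)
    # Case-insensitive matching is an assumption made for this sketch.
    suffix = "." + extension.lstrip(".").lower()
    return [name for name in file_names if name.lower().endswith(suffix)]


def output_zip_name(experiment, output_file=None):
    """Use the supplied name, else the experiment name with spaces as underscores."""
    prefix = output_file or experiment.replace(" ", "_")
    return prefix + ".zip"


print(select_files(["a.CEL", "b.txt", "c.cel"], "cel"))
print(output_zip_name("My Experiment 1"))
```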
These are the username and password for the caArray repository specified by the URL. You only need provide these if the data you are importing is private.
Depending on the data type imported from the caArray repository there are some common paths or workflows you may wish to follow to analyze your data. These will be discussed below.
If you imported raw expression data in the CEL file format from caArray, you can take the .zip output from caArray2.3.0Importer, containing those .cel files, and use it as input to the ExpressionFileCreator module. ExpressionFileCreator converts the .cel files into a matrix containing one intensity value per probe set, in the GCT file format. (Note that the newer CEL file formats require ExpressionFileCreator version 8 or later, which can be found on the GenePattern public server. To install on your local server, email gp-help(at)broadinstitute.org for further instruction.)
A common next step is to provide the .gct or .res file output from ExpressionFileCreator as input to ComparativeMarkerSelection, a module that computes significance values for features. For a more in-depth discussion of using ComparativeMarkerSelection for differential expression analysis, please see the article Using ComparativeMarkerSelection for Differential Expression Analysis.
If you imported raw Affymetrix SNP chip CEL files from caArray, you can take the .zip output containing those .cel files and use it as input to the SNPFileCreator module. SNPFileCreator performs normalization and probe-level summarization to generate a SNP file for the provided set of SNP chip CEL files. (Note that an updated version of SNPFileCreator, version 2, supporting SNP 6.0 will be available in the first quarter of 2012.)
A common workflow or pipeline for .snp data is to compute SNP copy number and loss of heterozygosity (LOH). A discussion of this pipeline can be found in the article Computing SNP Copy Number and Loss of Heterozygosity.
In summary, GenePattern provides a quick method for importing caArray experiment data sets, along with tools to convert the raw or derived data into common GenePattern input formats. The workflows described here are common paths once your data is in these formats, but they are by no means the only ones. GenePattern offers many other methods for data analysis and visualization, and the paths shown here can serve either as endpoints in themselves or as entry points into custom workflows built from GenePattern's analysis tools.