In-Depth Articles

About the In-Depth Articles

These brief articles and tutorials supplement the GenePattern documentation. They may be written in response to user questions or to describe new GenePattern features.

If you have a topic you would like to see included, please contact the GenePattern team at gp-help(at)broadinstitute.org.


Using IGV Through GenePattern

Overview

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types including sequence alignments, microarrays, and genomic annotations. Up until recently, this tool was only available outside of GenePattern, though it did accept GenePattern file formats. IGV can now be launched from a module available on the GenePattern Server or downloaded from the GenePattern Repository.

With this new development, users can pass their GenePattern result files directly to IGV through GenePattern.

IGV in GenePattern

The GenePattern IGV module launches the same application that is available from the IGV website. If you are a user of both the client IGV (either launching from the IGV website or your desktop) and GenePattern, this means you are using the same version of IGV complete with your preferences, home directory, saved genomes, and other such IGV saved presets. For all users, this means you are getting the latest version of IGV each time you run IGV, regardless of whether that is from GenePattern or from the IGV client.

As mentioned above, having IGV in GenePattern now allows you to pass your GenePattern data files directly to IGV in the same way you would use a result file as input file for any other module. For instance, you'll now notice that (on servers where IGV is installed) output from a run of GISTIC will have IGV as a next option in the dropdown for the result file.

Supported File Formats

IGV supports many of the common GenePattern file formats such as: CBS, CN, GCT*, RES*, GISTIC, SEG, and LOH files. For more information about supported IGV file types click here.

You can also upload any other IGV-supported file type as you would any other input file; i.e., via Upload or URL.

Note: In order to properly view GCT or RES files in IGV, some preprocessing is needed and will be discussed in detail shortly.

Configuring IGV in GenePattern

When you run IGV from within GenePattern you are provided with a few optional configuration parameters which will instruct IGV how to display your data.

Currently these two parameters are "genome" and "locus".

The "genome" parameter allows you to select the genome which corresponds to your data file. If you choose not to specify a genome, IGV will launch with hg19 if this is the first time you've run IGV. If you've run IGV before, it will launch with the last genome you were viewing.

The "locus" parameter allows you to specify a locus or range of interest for your data. For example, you could specify chr5:90,339,000-90,349,000 and IGV would launch with your data and that region of chromosome 5 displayed. If you instead wanted to look for the gene EGFR, you would simply type "EGFR" into the text box. If you choose not to specify a locus or gene and this is the first time you've run IGV, IGV will launch with chromosome 1 selected. If you've run IGV before, it will launch with the chromosome you last viewed.
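The two kinds of "locus" value described above (a coordinate range or a gene name) can be told apart mechanically. The following is a minimal sketch only; parse_locus is a hypothetical helper written for this article and is not part of GenePattern or IGV, which do their own parsing.

```python
import re

# Hypothetical helper for illustration; not part of GenePattern or IGV.
# Matches values such as "chr5:90,339,000-90,349,000".
_RANGE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_locus(text):
    """Return ("range", chrom, start, end) or ("gene", name)."""
    text = text.strip()
    if not text:
        raise ValueError("empty locus")
    match = _RANGE.match(text)
    if match:
        # Thousands separators are allowed in the coordinates.
        start = int(match.group("start").replace(",", ""))
        end = int(match.group("end").replace(",", ""))
        if start > end:
            raise ValueError("start must not exceed end: " + text)
        return ("range", match.group("chrom"), start, end)
    # Anything else is treated as a gene symbol (an assumption of this sketch).
    return ("gene", text)


print(parse_locus("chr5:90,339,000-90,349,000"))
print(parse_locus("EGFR"))
```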

Viewing GCT and RES files

In order to properly view a GCT or RES file in IGV, some preprocessing is required.

The default display option for a GCT or RES file is the Heatmap. For the heatmap to make sense, the data must be row-centered, scaled and possibly have a threshold applied. Currently the workflow for this is as follows.

1. Run PreprocessDataset in GenePattern

If the data contains negative (non-log transformed) values, run it through PreprocessDataset. The default threshold there is 20.

2. Run data through IGVTools

Currently IGVTools is a stand-alone utility providing a set of tools for pre-processing data files.

For the preprocessing of unscaled GCT and RES files an option called "formatExp" is provided. It takes a non-log expression file and performs the following steps. (Note that these are the steps used for our internal expression data prior to viewing in IGV.)

  1. take log2 of data
  2. compute median and subtract from each log2 probe value (i.e. center on the median)
  3. compute the MAD (mean absolute deviation)
  4. divide each log2 probe value by the MAD
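The four steps above can be sketched in a few lines of Python. This is an illustration, not the igvtools implementation: it assumes all values are positive, and it assumes the median and MAD are computed per row, in keeping with the row-centering requirement mentioned earlier. The function name format_exp_row is hypothetical.

```python
import math
import statistics


def format_exp_row(values):
    """Log2-transform, median-center and MAD-scale one row of expression values.

    Assumes every value is positive (for example after thresholding).
    """
    log_values = [math.log2(v) for v in values]          # step 1
    median = statistics.median(log_values)               # step 2
    centered = [x - median for x in log_values]
    # Step 3: mean absolute deviation around the median, as named in the text.
    mad = sum(abs(x) for x in centered) / len(centered)
    if mad == 0:
        return centered  # constant row; nothing to scale
    return [x / mad for x in centered]                   # step 4


row = [1.0, 2.0, 4.0, 8.0, 16.0]
print(format_exp_row(row))
```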

You can download this version of igvtools.

After unzipping, igvtools can be used on the command line to transform the RES or GCT file as described above. The command line follows. (If you are on a Windows platform use "igvtools.bat" instead of "igvtools".)

./igvtools formatExp inputFile outputFile

To run this on your preprocessed dataset, save the resulting .preprocessed file and provide it as "inputFile" in the command line.
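If you prefer to script this step, the command line above can be assembled and run from Python. This is a sketch under stated assumptions: the igvtools launcher is assumed to sit in igvtools_dir, the file names are placeholders, and the helper names are hypothetical.

```python
import os
import subprocess


def build_format_exp_command(input_file, output_file, igvtools_dir=".", windows=None):
    """Build the argument list for 'igvtools formatExp inputFile outputFile'."""
    if windows is None:
        windows = os.name == "nt"
    # On Windows the launcher is igvtools.bat, as noted in the text.
    launcher = "igvtools.bat" if windows else "igvtools"
    return [os.path.join(igvtools_dir, launcher), "formatExp", input_file, output_file]


def run_format_exp(input_file, output_file, igvtools_dir="."):
    """Run igvtools and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_format_exp_command(input_file, output_file, igvtools_dir), check=True)


# Placeholder file names; only the command is printed here, nothing is executed.
print(build_format_exp_command("data.preprocessed.gct", "data.scaled.gct", windows=False))
```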

Note: A GenePattern module which will include these preprocessing steps is planned for release in early 2011.

3. Run IGV with Scaled and Centered Data

Take the output from IGVTools and provide it as input to IGV.

Using IGV

Once you have launched IGV you may configure, drag, zoom, save, etc., as you would normally use IGV. For more information on how to use IGV, please visit the IGV website.

For questions and comments about IGV, please send an email to the IGV team.

For questions and comments about GenePattern, please send an email to the GenePattern team.


RNA-seq QC in GenePattern

Overview

After aligning and/or assembling your RNA-seq data, take a closer look at the content of the result files before continuing with further analysis, in part because the results of that investigation may point you toward how best to analyze your data.

Specifically in GenePattern, modules are provided to calculate such Quality Control (QC) metrics as: Depth of Coverage, Continuity of Coverage, Duplication Rate, Expression Rates, Strand Specificity, and GC content, among others.

Having these sorts of metrics can help to prevent or better understand common RNA-seq errors stemming from such sources as: read length, quality of data, sample prep, or number of reads in the data.

Modules in GP

The following decision diagram illustrates a suggested workflow. This workflow is discussed in further detail in subsequent sections.

Input

The input to the suggested RNA-seq QC workflow in GenePattern is an aligned, coordinate sorted BAM file with Read Group information (such as platform or sample) in the header. (SAM files can be converted to BAM format using the SortSam module.)

If your aligned BAM file does not contain Read Group Information you should run your data through AddOrReplaceReadGroups, as discussed next.

If the aligned BAM file is not coordinate sorted, run the data through SortSam, making sure the sort order is set to "coordinate" as discussed below.

More information about the SAM/BAM format can be found at the SAMtools website.

Note that if your data was aligned with TopHat, you will likely want to run your BAM file through the Picard tool MergeBamAlignment (soon to be available in GenePattern) to handle the fact that TopHat removes unaligned reads. This can throw off the total number of reads and any other metrics using that value in the RNAseqMetrics module. 
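The read group and sort order checks described in this section can be made by inspecting the BAM header. The sketch below works on the header as plain text lines (for example, as printed by a tool such as samtools view -H) and relies only on the standard SAM header tags @HD SO: and @RG ID:. The function name plan_preprocessing is hypothetical.

```python
def plan_preprocessing(header_lines):
    """Return the modules to run before the QC workflow, based on a SAM/BAM header."""
    has_read_group = False
    sort_order = "unknown"
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@RG" and any(f.startswith("ID:") for f in fields[1:]):
            has_read_group = True
        elif fields[0] == "@HD":
            for field in fields[1:]:
                if field.startswith("SO:"):
                    sort_order = field[3:]
    steps = []
    if not has_read_group:
        steps.append("Picard.AddOrReplaceReadGroups")
    if sort_order != "coordinate":
        steps.append("SortSam")
    return steps


header = ["@HD\tVN:1.0\tSO:unsorted", "@SQ\tSN:chr1\tLN:1000"]
print(plan_preprocessing(header))
```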

1) Picard.AddOrReplaceReadGroups

Input for Picard.AddOrReplaceReadGroups is a BAM file which has been aligned.

The module will either add new (if none previously existed) or replace read groups as defined in the parameters. All reads in the file will be assigned to the specified read group.

Read Group information is required by Picard.MarkDuplicates and the RNAseqMetrics module. Specifically the RNAseqMetrics module requires a Read Group ID in the BAM header.

Full documentation for Picard.AddOrReplaceReadGroups, with parameter descriptions, is available here.

2) SortSam

SortSam takes as input a BAM file and outputs a sorted and indexed file. In this step of the workflow, the input BAM should come from Picard.AddOrReplaceReadGroups in step 1.

Note that the module will only generate an index if the output file format is "BAM" and the sort order is "coordinate".

Full documentation for SortSam can be found here.

3) SAMtools.FastaIndex

SAMtools.FastaIndex takes a reference FASTA (.fa) file and creates a .fai index file for it, which will be used by both Picard.ReorderSam, in step 5, and RNAseqMetrics, in step 8, to quickly locate and retrieve information from the reference sequence.

Full documentation for SAMtools.FastaIndex can be found here.

4) Picard.CreateSequenceDictionary

Next Picard.CreateSequenceDictionary takes a reference FASTA (.fa) and creates a SAM file containing a sequence dictionary (.dict extension). Sequence dictionaries contain the sequence name, length and genome assembly identifier and other information about sequences. The .dict file is required for both Picard.ReorderSam, in step 5, and RNAseqMetrics, in step 8.
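To make the content of a sequence dictionary concrete, the following sketch derives sequence names and lengths from FASTA text and prints them as SAM @SQ header lines. It is an illustration only: dictionaries written by Picard carry additional fields that are not reproduced here, the input is assumed to be well-formed FASTA, and the function names are hypothetical.

```python
def sequence_lengths(fasta_lines):
    """Map each sequence name to its length, in file order."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after '>'.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def dict_lines(fasta_lines):
    """Format names and lengths as @SQ lines (SN = name, LN = length)."""
    return ["@SQ\tSN:%s\tLN:%d" % (name, length)
            for name, length in sequence_lengths(fasta_lines).items()]


fasta = [">chr1 example", "ACGT", "AC", ">chr2", "GGG"]
for line in dict_lines(fasta):
    print(line)
```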

The output FASTA file (.fa) from SAMtools.FastaIndex (step 3), can be passed as input to this module.

Full documentation for Picard.CreateSequenceDictionary can be found here.

5) Picard.ReorderSam

Now that the .fai and .dict files have been created for the reference FASTA file (steps 3 and 4), Picard.ReorderSam can be run to order the reads in the BAM file according to the contigs of the reference FASTA file.

Picard.ReorderSam takes a BAM/BAI pair (for instance, as output by SortSam earlier in this workflow), a .dict file from Picard.CreateSequenceDictionary (step 4), and the .fa and .fai files from SAMtools.FastaIndex (step 3), and reorders the BAM file in accordance with the contigs in the reference FASTA file provided. The order is determined by exact name matching of contigs. Reads mapped to contigs absent in the reference file are dropped.
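The reordering rule can be illustrated on a toy list of reads. This sketch is not Picard's implementation: reads are represented as hypothetical (contig, position, name) tuples, and the order of reads within a contig is simply kept as it was in the input.

```python
def reorder_reads(reads, reference_contigs):
    """Order reads by the contig order of the reference, dropping unknown contigs."""
    # Exact name matching: "chr1" and "1" are different contigs.
    rank = {name: index for index, name in enumerate(reference_contigs)}
    kept = [read for read in reads if read[0] in rank]
    # sorted() is stable, so the input order within each contig is preserved.
    return sorted(kept, key=lambda read: rank[read[0]])


reads = [("chr2", 5, "a"), ("chr1", 9, "b"), ("chrUn", 1, "c"), ("chr1", 12, "d")]
print(reorder_reads(reads, ["chr1", "chr2"]))
```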

The resulting BAM file can next be sent to Picard.MarkDuplicates. 

Full documentation for Picard.ReorderSam can be found here.

6) Picard.MarkDuplicates

Next, Picard.MarkDuplicates takes as input the coordinate sorted BAM file output by Picard.ReorderSam in step 5 (with read group information added by Picard.AddOrReplaceReadGroups in step 1).

This is an optional module that will mark duplicate reads in the BAM file and optionally remove them. To see metrics for Duplication rates in the results from RNAseqMetrics, run this module and do not remove the duplicate reads.

Full documentation for Picard.MarkDuplicates can be found here.

7)  SortSam

The last step before running RNAseqMetrics is to index the BAM file which resulted from the workflow above. To do this, run SortSam, selecting BAM as the output format.

8)  RNAseqMetrics

RNAseqMetrics is the last step in the GenePattern RNA-seq QC workflow. The module calculates standard RNA-seq related metrics, such as depth of coverage, ribosomal RNA contamination, continuity of coverage, and GC bias. It takes the following as input:

*Note that in most cases these will all need to be specified separately.

Please read the RNAseqMetrics module documentation for complete information regarding optional input files, parameter settings and the various metrics which will or won't be output based on those settings.

*If using the output from SortSam the BAM and BAI are located in the same folder and only the BAM file need be passed as an input parameter.

Output

The output of the RNAseqMetrics module (and thus this workflow) is a ZIP archive containing an HTML report of metrics stating the total number of reads, depth of coverage at the 3’ and 5’ end, etc. The report also links to a GCT file containing the calculated RPKM values for each transcript in each sample.
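For readers unfamiliar with RPKM (reads per kilobase of transcript per million mapped reads), the standard definition can be written as a one-line calculation. This is a sketch of the general formula only; how RNAseqMetrics counts reads and measures transcript length is described in the module documentation, and the function name is hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# 500 reads on a 2,000 bp transcript in a sample with 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```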

Other metrics calculated include:

Full documentation for the module and its output files can be found in the RNAseqMetrics module documentation.

Note
This workflow, and specifically the RNAseqMetrics module, has been optimized for Eukaryotic RNA-seq data. Modules which comprise methods optimized for Prokaryotic data are currently not available.

 


Computing SNP Copy Number and Loss of Heterozygosity

Overview

In cancer genomics, copy number change is one of the hallmarks of the genetic instability common to most human cancers and loss of heterozygosity (LOH) of tumor suppressor genes is a crucial step in the development of sporadic and hereditary cancer (Monti, 2005). Using modules available in GenePattern, you can compute SNP copy number and LOH based on Affymetrix SNP chip data for paired target/normal samples and then view them in the Integrative Genomics Viewer (IGV). The following modules are used for this computation, with IGV at the end for viewing the results:

SNPFileCreator

SNPFileCreator converts the .CEL files from an Affymetrix array into a GenePattern .SNP file. Raw data for the probes in each SNP probe set are converted to a single intensity value per SNP using one of four modeling algorithms: Average Difference, PM/MM Difference Model (dChip, the default), Median Probe, or Trimmed Mean. Note that processing times for this module can average upwards of 30 minutes, depending on the speed of the server, the size of the dataset, and available memory. At least 2GB of memory are needed to run most SNPFileCreator jobs.

SNPFileCreator Inputs, Parameters, and Considerations

For more information about SNPFileCreator please see the SNPFileCreator Documentation

XChromosomeCorrect

For gender-specific samples, run the XChromosomeCorrect module on the output of SNPFileCreator to correct intensity values for SNPs on the X chromosome. For each sample from a male donor, the module doubles the intensity value for SNPs on the X chromosome.
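The correction itself is simple and can be sketched as follows. The data layout (one list of intensities per sample, plus a parallel list of chromosome names per SNP) and the function name are assumptions made for this illustration; the module works on .SNP files and a sample information file.

```python
def x_chromosome_correct(intensities, snp_chromosomes, sample_genders):
    """Double X-chromosome intensities for samples whose gender is "M"."""
    corrected = {}
    for sample, values in intensities.items():
        is_male = sample_genders[sample] == "M"
        corrected[sample] = [
            value * 2 if (is_male and chrom == "X") else value
            for value, chrom in zip(values, snp_chromosomes)
        ]
    return corrected


chromosomes = ["1", "X", "X"]
data = {"s1": [100.0, 50.0, 60.0], "s2": [100.0, 110.0, 120.0]}
genders = {"s1": "M", "s2": "F"}
print(x_chromosome_correct(data, chromosomes, genders))
```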

XChromosomeCorrect Inputs, Parameters and Considerations

The sample information file describes the SNP array and must be tab-delimited, include a column labeled Gender that contains a value of M or F for each sample and include target/normal paired samples for copy number and LOH determination. (More information on file formats can be found here)

For more information about XChromosomeCorrect please see the XChromosomeCorrect Documentation

CopyNumberDivideByNormals

CopyNumberDivideByNormals computes the raw copy number of each target SNP by dividing its intensity value by the mean intensity value of all normal SNPs. This calculation is referred to as copy number normalization or normalization with respect to normals.
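As a rough illustration of this normalization, the sketch below divides each target intensity by the mean intensity of the same SNP across the normal samples. Reading "all normal SNPs" as a per-SNP mean over normal samples is an assumption of this sketch, as are the data layout and the function name; consult the module documentation for the exact calculation.

```python
def divide_by_normals(target, normals):
    """Divide each target SNP intensity by the mean of that SNP across normals.

    target  -- list of intensities for one target sample
    normals -- list of normal samples, each a list in the same SNP order
    """
    ratios = []
    for index, value in enumerate(target):
        normal_mean = sum(sample[index] for sample in normals) / len(normals)
        ratios.append(value / normal_mean)
    return ratios


print(divide_by_normals([200.0, 100.0], [[100.0, 100.0], [100.0, 300.0]]))
```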

CopyNumberDivideByNormals Inputs, Parameters, and Considerations

For more information about CopyNumberDivideByNormals please see the CopyNumberDivideByNormals Documentation

LOHPaired

The LOHPaired module detects loss of heterozygosity (LOH). It takes as input a GenePattern .SNP file that contains paired normal-target samples with genotype calls. (LOHPaired accepts only nonallele-specific .SNP files; .SNP files that contain one intensity value per probe.) It returns as output a GenePattern .LOH file that contains, for each probe, the LOH calls for each array pair.

LOH call values are as follows.

Call   Value
L      LOH: AB in normal and A or B in tumor
R      Retention: AB in both normal and tumor or No Call in normal and AB in tumor
C      Conflict: A or B in normal and AB in tumor
N      Non-informative call: A or B in normal
       No call: No Call in normal or tumor
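The call values listed above can be expressed as a small lookup function. This is a sketch of the rules as listed, not the LOHPaired source: genotypes are written "A", "B", "AB" or "NoCall", None stands for "no call", and the handling of a homozygous normal paired with a No Call tumor (treated here as no call) is an assumption.

```python
HOMOZYGOUS = ("A", "B")


def loh_call(normal, tumor):
    """Return "L", "R", "C", "N", or None (no call) for one normal/tumor pair."""
    if normal == "AB" and tumor in HOMOZYGOUS:
        return "L"  # LOH
    if tumor == "AB" and normal in ("AB", "NoCall"):
        return "R"  # retention
    if normal in HOMOZYGOUS and tumor == "AB":
        return "C"  # conflict
    if normal in HOMOZYGOUS and tumor in HOMOZYGOUS:
        return "N"  # non-informative
    return None     # No Call in normal or tumor


for pair in [("AB", "A"), ("AB", "AB"), ("NoCall", "AB"), ("A", "AB"), ("B", "B"), ("AB", "NoCall")]:
    print(pair, loh_call(*pair))
```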

LOHPaired Input, Parameters, and Considerations

IGV

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types and provides easy access to genomes and datasets hosted by the Broad Institute.

Adding a track line to view LOH data

Specifier          Value                     Description
name               track label               Track name (ignored when used in the IGV file format)
description        center label              Currently ignored
visibility         full | dense | hide       Currently ignored
color              RRR,GGG,BBB               Color for positive values in all tracks
altColor           RRR,GGG,BBB               Color for negative values in all tracks
priority           N                         Currently ignored
autoScale          on | off                  Currently ignored; all tracks autoscale unless an explicit data range is defined (e.g., by including the viewLimits specifier).
gridDefault        on | off                  Currently ignored
maxHeightPixels    max:default:min           Default and min are supported; max is currently ignored
graphType          bar | points | heatmap    Scatter plot | heatmap. IGV only: The heatmap value is an IGV addition to the WIG specification.
midRange           x:y                       Defines the neutral range for a three-color heatmap. Values in this range are rendered with the midColor value, which is white by default. Example: midRange=20:80 IGV only: This specifier is an IGV addition to the WIG specification.
midColor           RRR,GGG,BBB               Color to use in the "mid range" of a heatmap. Example: midColor=0,0,150 IGV only: This specifier is an IGV addition to the WIG specification.
viewLimits         lower:upper               Defines the data range
yLineMark          real-value                Currently ignored
yLineOnOff         on | off                  Currently ignored
windowingFunction  maximum | minimum | mean  Function that summarizes the values in a window of data represented by one pixel
smoothingWindow    off | [MATKC:2-16]        Currently ignored
coords             0 | 1                     Indicate whether the file uses 0 or 1 based coordinates. The UCSC specification for WIG files uses 1 based coordinates and for BED files uses 0 based coordinates. If data looks off by one, check for a possible 0 vs 1 based coordinate issue. IGV only: This specifier is an IGV addition to the WIG specification.
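A track line is a single line of specifier=value pairs such as those in the table above. The sketch below assembles one from keyword arguments. Several details are assumptions rather than facts taken from this article: the "#track" prefix, the quoting of values that contain spaces, and the helper name build_track_line. Check the IGV file format documentation for the exact syntax your file type requires.

```python
def build_track_line(prefix="#track", **specifiers):
    """Join specifier=value pairs into one track line.

    The default prefix and the quoting rule are assumptions of this sketch.
    """
    parts = [prefix]
    for key, value in specifiers.items():
        text = str(value)
        if " " in text:
            # Quote values containing spaces so they stay one token.
            text = '"%s"' % text
        parts.append("%s=%s" % (key, text))
    return " ".join(parts)


print(build_track_line(name="LOH calls", graphType="heatmap",
                       midRange="20:80", midColor="0,0,150"))
```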

Launching IGV and Viewing your data

To launch IGV and view your Copy Number and/or LOH data:

For more information on navigating or displaying data in IGV please see the IGV User Guide.


Using ComparativeMarkerSelection for Differential Expression Analysis

Overview

In GenePattern, you use the ComparativeMarkerSelection module to identify the genes (if any) that are differentially expressed between two phenotype classes. Typically, this is a three-step process:

  1. Run the PreprocessDataset module to preprocess the expression data.
    PreprocessDataset removes platform noise and genes that have little variation. It takes an expression data file and generates a new, modified expression data file.
  2. Run the ComparativeMarkerSelection module to compute differential gene expression.
    For each gene, ComparativeMarkerSelection first uses a test statistic to calculate the difference in gene expression between the samples in the first class and the samples in the second class and then estimates the significance (p-value) of the test statistic score. Because testing tens of thousands of genes simultaneously increases the possibility of mistakenly identifying a non-marker gene as a marker gene, ComparativeMarkerSelection corrects for multiple hypothesis testing by computing both the false discovery rate (FDR) and family-wise error rate (FWER). ComparativeMarkerSelection takes an expression data file and generates a result (ODF) file.
  3. Run the ComparativeMarkerSelectionViewer module to view the results.
    For each gene, ComparativeMarkerSelectionViewer displays the test statistic score, its p-value, two FDR statistics, and three FWER statistics.

Basic instructions

The GenePattern Differential Expression Analysis protocol provides example files and step-by-step instructions for running ComparativeMarkerSelection and its companion modules. If you are unfamiliar with differential expression analysis or ComparativeMarkerSelection, start here:

  1. Log in to the public GenePattern server at the Broad Institute.
    If you do not have a GenePattern account, you can register on the login page.
  2. Notice that the GenePattern protocols are listed in the center of the GenePattern home page.
  3. Click Differential Expression Analysis to display the protocol's step-by-step instructions.

Details and considerations

The information provided in this section supplements the information provided in the Differential Expression Analysis protocol and the ComparativeMarkerSelection documentation. It assumes that you have walked through the Differential Expression Analysis protocol as described in the Basic Instructions above.

Expression data

ComparativeMarkerSelection requires gene expression data in the GCT or RES file format.

Phenotype classes

ComparativeMarkerSelection analyzes two phenotype classes at a time. If the expression data set includes samples from more than two classes, use the phenotype test parameter to analyze each class against all others (one-versus-all) or all class pairs (all pairs).
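To see what the two multi-class options imply, the sketch below simply enumerates the comparisons each one produces for a list of class names. It illustrates the bookkeeping only and is not the module's code; the function name and the example class names are placeholders.

```python
from itertools import combinations


def comparisons(class_names, mode):
    """List the two-class comparisons implied by a phenotype test mode."""
    if mode == "one-versus-all":
        return [(name, "all others") for name in class_names]
    if mode == "all pairs":
        return list(combinations(class_names, 2))
    raise ValueError("unknown mode: " + mode)


classes = ["colon", "kidney", "prostate"]
print(comparisons(classes, "one-versus-all"))
print(comparisons(classes, "all pairs"))
```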

Confounding phenotype classes

If you are studying two variables and your data set contains a third variable that might distort the association between the variables of interest, you can use a confounding variable class file to correct for the effect of the third variable. For example, the data set in Lu, Getz, et al. (2005) contains tumor and normal samples from different tissue types. When studying the association between the tumor and normal samples, the authors use a confounding variable class file to correct for the effect of the different tissue types.

The phenotype class file identifies the tumor and normal samples:

75 2 1
# Normal Tumor
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1

The confounding variable class file identifies the tissue type of each sample:

75 6 1
# colon kidney prostate uterus human-lung breast
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5

Given these two class files, when performing permutations, ComparativeMarkerSelection shuffles the tumor/normal labels only among samples with the same tissue type.
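Restricting a permutation to samples that share a confounding class can be sketched as a shuffle within strata. This illustrates the idea only and is not ComparativeMarkerSelection's implementation; the function name and the toy labels are made up for the example.

```python
import random


def permute_within_strata(labels, strata, rng):
    """Shuffle phenotype labels only among samples with the same stratum."""
    permuted = list(labels)
    groups = {}
    for index, stratum in enumerate(strata):
        groups.setdefault(stratum, []).append(index)
    for indices in groups.values():
        shuffled = [labels[i] for i in indices]
        rng.shuffle(shuffled)
        for i, label in zip(indices, shuffled):
            permuted[i] = label
    return permuted


phenotype = [0, 1, 0, 1, 1, 1]   # e.g. normal = 0, tumor = 1
tissue = [1, 1, 1, 2, 2, 2]      # confounding class of each sample
print(permute_within_strata(phenotype, tissue, random.Random(0)))
```

Because labels never cross tissue boundaries, each tissue type keeps its original number of tumor and normal labels in every permutation.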

Permutations

ComparativeMarkerSelection uses a permutation test to estimate the significance (p-value) of the test statistic score. If the data set includes at least 10 samples per class, use the default value of 1000 permutations to ensure sufficiently accurate p-values.

If the data set includes fewer than 10 samples in any class, permuting the samples cannot give an accurate p-value. Specify a value of 0 permutations to use asymptotic p-values instead. In this case, ComparativeMarkerSelection computes p-values assuming the test statistic scores follow Student's t-distribution (rather than using the test statistic to create an empirical distribution of the scores).

ComparativeMarkerSelection also provides two additional options:

Log transformed data

By default, ComparativeMarkerSelection expects non-log-transformed data. Some calculations, such as Fold Change, will produce incorrect results when log transformed data is provided and not indicated. To indicate that your data are log transformed, be sure to set the log transformed data parameter to "yes".

Test direction

By default, ComparativeMarkerSelection performs a two-sided test; that is, the test statistic score is calculated assuming that the differentially expressed gene can be up-regulated in either phenotype class. Optionally, use the test direction parameter to specify a one-sided test, where the differentially expressed gene must be up-regulated for class 0 or for class 1.

Test statistic

ComparativeMarkerSelection provides several methods of calculating differential expression. By default, the module uses the t-test statistic. Optionally, you can choose to use the signal-to-noise ratio (SNR) or paired T-test statistic instead.

T-Test (default)

The T-Test computes the standardized mean difference between the two classes.

ComparativeMarkerSelection also provides variations on the T-Test:

Signal-to-noise ratio (SNR)

Signal-to-noise ratio is computed by dividing the difference of class means by the sum of their standard deviations.
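The definition above translates directly into code. In this sketch the direction of the difference (first class minus second class) and the use of the sample standard deviation are assumptions; the function name is hypothetical.

```python
import statistics


def signal_to_noise(class0, class1):
    """Difference of class means divided by the sum of class standard deviations."""
    mean_difference = statistics.mean(class0) - statistics.mean(class1)
    noise = statistics.stdev(class0) + statistics.stdev(class1)
    return mean_difference / noise


print(signal_to_noise([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```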

ComparativeMarkerSelection also provides variations on the signal-to-noise ratio:

Paired T-Test

The Paired T-Test can be used to analyze paired samples; for example, samples taken from patients before and after treatment. This test is used when the cross-class differences (e.g. the difference before and after treatment) are expected to be smaller than the within-class differences (e.g., the difference between two patients). For example if you are measuring weight gain in a population of people, the weights may be distributed from 90 lbs. to say 300 lbs. and the weight gain/loss (the paired variable) may be on the order of 0-30 lbs. So the cross-class difference ("before" and "after") is less than the within-class difference (person 1 and person 2).

Where the standard T-Test takes the mean of the difference between classes, the Paired T-Test takes the mean of the differences between pairs. (For more information, refer to the Wikipedia article on the paired T-Test.)

For the Paired T-Test, paired samples in the expression data file must be arranged by class, where the first samples in each class are paired, the second samples are paired, and so on. For example, sample pairs A1/B1, A2/B2 and A3/B3 would be ordered in an expression data file as A1, A2, A3, B1, B2, B3. Note that your data must contain the same number of samples in each class in order to use this statistic.
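The sample ordering described above determines how pairs are formed. The sketch below splits one row of values laid out as A1, A2, A3, B1, B2, B3 into pairs and computes the textbook paired t statistic (mean of the pair differences divided by their standard error). The sign convention (first class minus second class) and the function name are assumptions of this illustration.

```python
import math
import statistics


def paired_t(values, n_pairs):
    """Paired t statistic for values ordered as A1..An followed by B1..Bn."""
    first, second = values[:n_pairs], values[n_pairs:]
    if len(second) != n_pairs:
        raise ValueError("both classes must contain the same number of samples")
    differences = [a - b for a, b in zip(first, second)]
    standard_error = statistics.stdev(differences) / math.sqrt(n_pairs)
    return statistics.mean(differences) / standard_error


# A1, A2, A3, B1, B2, B3
print(paired_t([1.0, 2.0, 3.0, 2.0, 4.0, 7.0], 3))
```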


Setting Up a Module Repository

In this article we will go through the manual process of setting up your own module repository to connect to your GenePattern server. This may be desirable if you are developing many GenePattern modules and are looking for a way to control versions in both development and pre-release stages. This mechanism also allows for a centralized distribution of modules, both during development and at release.

The following steps describe the setup process:

  1. Download the following files:
  2. Put the properties file somewhere and remember the path.
  3. Take the WAR file and unzip it into a Tomcat/webapps directory (note that you need to unpack the file yourself; deploying it as a WAR will not work).
  4. Edit the WEB-INF/web.xml file for the webapp.
  5. Start Tomcat.
  6. Navigate to <your URL>/gpModuleRepository/uploadForm.jsp.
  7. Upload a module to the 'dev' environment.
  8. Check the directory you defined in the properties to see that the modules subdirectories were created and the unzipped module file is there.
  9. Navigate to <your URL>/gpModuleRepository?env=dev and check that the uploaded module is there.
  10. Open a browser window to your GenePattern server and navigate to Administration/Server Settings/Repositories.
  11. Enter the URL for your new module repository and 'Save'.
  12. Navigate to Modules & Pipelines>Install from Repository and check all of the filtering checkboxes at the top of the page.
  13. Verify that GenePattern can see all the modules you expect to be in your module repository.
  14. Select a module and test an installation.

Creating a GenePattern Module

The following tutorial shows you how to create a new GenePattern module (in GenePattern 3.4 and up). Only the GenePattern team can create or install modules on the GenePattern public server. Therefore, to create a module, you need to have a local GenePattern server installed (see the download and installation page). You may also be interested in the video tutorial: Create a module in GenePattern.

In this tutorial, you will create a module named log_transform. The module invokes a perl script, log_transform.pl, which log-transforms all positive values in a data set and sets all negative or zero values to zero. Before you begin, download the perl script and its documentation:

In GenePattern, to create the log_transform module:

  1. Click Modules & Pipelines>New Module. GenePattern opens the Module Integrator window.
  2. Enter the following information in the Details fields:
  3. Use the Support Files section to upload the perl program (the .pl file you downloaded before starting the tutorial):
    1. Click the Add files... button in the Support files field. GenePattern displays the File Upload window.
    2. Select the log_transform.pl file and click Open. This is the script that implements the module.

    The Module Integrator Details and Support Files sections should look like this now:

    You can click the blue arrows to the left of Details and Support Files to close those parts of the window and give you more room for working with your parameters.
  4. Enter the following text into the Command Line field, without the quotes: "<perl> <libdir>log_transform.pl -F <input.filename> -o <output.file>" Typically, you enter the command line as a combination of fixed text and variables defined by GenePattern. This allows the command line to be independent of the operating environment and allows different values to be specified at different invocations of the command. This command line uses the following variables:
  5. In the Parameters section, enter "2" and click Add Parameter.  This gives you 2 blank parameter fields to work with.
  6. Describe your two program parameters: input.filename and output.file. The parameter names and descriptions that GenePattern displays when a user runs your module are the parameter names and descriptions that you provide here. In the first parameter, enter the following information for input.filename:
  7. In the second parameter, enter the following information for output.file: The Parameters section should look like this:
  8. Review your command line. The resulting command line should match the command line you originally entered in Step 4.
  9. Click Save. GenePattern displays a message informing you that the module has been saved.
  10. Click Run to confirm that it has been added to the GenePattern server correctly.
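Two ideas from this tutorial can be illustrated in code: what the script does to the data, and how the variables in the command line are replaced with values when the module runs. The sketch below is written in Python for illustration and is not the log_transform.pl script or GenePattern's substitution code. The log base (2), the example paths and file names, and both function names are assumptions.

```python
import math
import re


def log_transform(values):
    """Log-transform positive values; set zero and negative values to zero.

    Base 2 is an assumption; the tutorial does not state the base.
    """
    return [math.log2(v) if v > 0 else 0.0 for v in values]


def substitute(command_line, values):
    """Replace each <name> in the command line with its value."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("no value for <%s>" % name)
        return values[name]
    return re.sub(r"<([^<>]+)>", replace, command_line)


template = "<perl> <libdir>log_transform.pl -F <input.filename> -o <output.file>"
example_values = {
    "perl": "/usr/bin/perl",            # hypothetical interpreter path
    "libdir": "/path/to/module/",       # hypothetical module directory
    "input.filename": "input.gct",      # hypothetical file names
    "output.file": "output.gct",
}
print(substitute(template, example_values))
print(log_transform([8.0, 0.0, -3.0, 1.0]))
```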

Importing Data from caArray to GenePattern

Overview

caArray is an open-source, web and programmatically accessible array data management system. caArray guides the annotation and exchange of array data using a federated model of local installations whose results are shareable across the cancer Biomedical Informatics Grid (caBIG®). caArray furthers translational cancer research through acquisition, dissemination and aggregation of semantically interoperable array data to support subsequent analysis by tools and services on and off the Grid.

To facilitate the importing of data from caArray repositories in GenePattern, a module named caArray2.3.0Importer is provided.

caArray2.3.0Importer

The caArray2.3.0Importer imports data files from a caArray 2.x repository into GenePattern by connecting to a caArray 2.x repository and then retrieving all files of a given extension for a named experiment. The retrieved files are then collected into a single ZIP file archive which is returned as the module's output.

(Note: 2.x is used here to refer to any 2.3 or higher version repository. The current version is 2.4.0 and is compatible with the caArray2.3.0Importer.)

A typical use case for this module would be to retrieve all .cel files from a given experiment in caArray and then pass the resulting zip file to the ExpressionFileCreator module for processing into a GenePattern GCT or RES file format. More information about possible next steps from the importer will be given below.

The following sections will discuss the details of the caArray2.3.0Importer parameters.

Note: This module is not compatible with earlier, now deprecated, caArray versions (i.e., 1.x).

URL

Here you provide the URL to the caArray 2.x repository from which you wish to import data. By default GenePattern provides the public caArray instance hosted at the NCI (https://array.nci.nih.gov/caarray/home.action). You may, however, provide a URL to any caArray repository to which you have access.

experiment

The "experiment" refers to the title or public identifier of the experiment in caArray from which the data is to be imported.

If you don't already have a title or public id, go to your repository (i.e., the URL you have provided, or will provide, for the first parameter) and browse to or search for the experiment which contains the data you wish to import. (Note that the public repository hosted by NCI works best with Firefox 2.0.0 or higher.)

type

This parameter refers to the type of bioassay data to be retrieved; raw or derived. The default is raw and will likely be correct for most data imported from caArray.

extension

This is an optional parameter which will specify the data file extension to be retrieved. If specified, only data files with this extension, in the experiment, will be imported into GenePattern. If this is not specified, all files of the specified type (raw or derived) will be retrieved.

zipFileName

In this parameter you can specify an output file name to be used for the resulting .zip file. If not specified the experiment name, with any spaces replaced by underscores, will be used as the output file prefix.
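The extension and zipFileName rules described above can be sketched as two small functions. These are illustrations, not the importer's code: the case-insensitive extension match, the way ".zip" is appended to the prefix, the example names, and both function names are assumptions.

```python
def select_files(file_names, extension=None):
    """Keep only files with the given extension; keep all if none is given."""
    if not extension:
        return list(file_names)
    suffix = "." + extension.lstrip(".").lower()
    # Case-insensitive matching is an assumption of this sketch.
    return [name for name in file_names if name.lower().endswith(suffix)]


def default_zip_name(experiment, zip_file_name=None):
    """Use the given name, else the experiment name with spaces as underscores."""
    if zip_file_name:
        return zip_file_name
    return experiment.replace(" ", "_") + ".zip"


print(select_files(["a.CEL", "b.cel", "notes.txt"], "cel"))
print(default_zip_name("My Example Experiment"))
```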

username and password

These are the username and password for the caArray repository specified by the URL. You only need provide these if the data you are importing is private.

Next Steps

Depending on the data type imported from the caArray repository there are some common paths or workflows you may wish to follow to analyze your data. These will be discussed below.

Convert to GCT file

If you imported raw expression data in the CEL file format from caArray, you can take the .zip output from caArray2.3.0Importer, containing those .cel files and use it as input to the ExpressionFileCreator module. ExpressionFileCreator converts the .cel files into a matrix containing one intensity value per probe set, in the GCT or RES file format. (Note that the newer CEL file formats require ExpressionFileCreator version 8 or later, which can be found on the GenePattern public server. To install on your local server, email gp-help(at)broadinstitute.org for further instruction.)

A common next step would be to provide the .gct or .res file output from ExpressionFileCreator as input to ComparativeMarkerSelection, a module that computes significance values for features. For a more in-depth discussion of using ComparativeMarkerSelection for differential expression analysis please see the article Using ComparativeMarkerSelection for Differential Expression Analysis.

Convert to SNP file

If you imported raw Affymetrix SNP chip CEL files from caArray, you can take the .zip output containing those .cel files and use it as input to the SNPFileCreator module. SNPFileCreator performs normalization and probe-level summarization to generate a SNP file for the provided set of SNP chip CEL files. (Note that an updated version of SNPFileCreator, version 2, supporting SNP 6.0 will be available in the first quarter of 2012.)

A common workflow or pipeline for .snp data is to compute SNP copy number and loss of heterozygosity (LOH). A discussion of this pipeline can be found in the article Computing SNP Copy Number and Loss of Heterozygosity.

Summary

In summary, GenePattern provides a quick method for importing caArray experiment data sets, and tools to convert the raw or derived data into common GenePattern input formats. The workflows described here are common paths to follow once your data is in these formats, but they are by no means the only ones. GenePattern offers many other methods for data analysis and visualization; the paths shown here can serve as endpoints in themselves or as entry points to custom workflows built from the analysis tools GenePattern provides.