ExpressionFileCreator (v12) BETA

This module is currently in beta release. The module and/or documentation may be incomplete.

Creates a RES or GCT file from a set of Affymetrix CEL files

Author: Joshua Gould, David Eby, Broad Institute

Contact:

gp-help@broadinstitute.org

Algorithm Version:

Summary

The ExpressionFileCreator module creates a gene expression dataset from a ZIP archive containing individual Affymetrix CEL files. The conversion is done using one of the following algorithms: MAS5, RMA, GCRMA, or dChip.
The result is a matrix containing one intensity value per probe set, in the GCT or RES file format.

Samples can be annotated by specifying a CLM file. A CLM file lets you rename the samples in the expression matrix, reorder the columns, select a subset of the scans in the input ZIP file, and create a class label file in the CLS format. By default, sample names are taken from the names of the CEL files in the ZIP file; a CLM file specifies them explicitly. The columns of the expression matrix are also reordered to match the order in which the scan names appear in the CLM file. For example, suppose the input ZIP file contains the files scan1.cel, scan2.cel, and scan3.cel. The CLM file could contain the following text:
scan3     sample3    tumor
scan1     sample1    tumor
scan2     sample2    normal
The column names in the expression matrix would be: sample3, sample1, sample2. Additionally, only scan names in the CLM file will be used to construct the GCT or RES file; scans not present in the CLM file will be ignored.
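
The renaming, reordering, and subsetting described above can be sketched in a few lines of Python. This is only an illustration, not the module's actual code: the function names parse_clm, apply_clm, and cls_text are hypothetical, the CLM text is assumed to be tab-delimited, and treating a CLM scan with no matching CEL file as an error is an assumption.

```python
def parse_clm(text):
    """Parse tab-delimited CLM text into (scan, sample, class) tuples, in file order."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        scan, sample, label = (field.strip() for field in line.split("\t"))
        rows.append((scan, sample, label))
    return rows


def apply_clm(cel_names, clm_rows):
    """Return (scans, column_names, labels) in CLM order; unlisted scans are dropped."""
    # Map "scan1.cel" -> "scan1" so CLM scan names can be matched to files.
    available = {name.rsplit(".", 1)[0] for name in cel_names}
    missing = [scan for scan, _, _ in clm_rows if scan not in available]
    if missing:
        # Assumption: a CLM entry with no matching CEL file is treated as an error.
        raise ValueError(f"scans not found in ZIP: {missing}")
    scans = [scan for scan, _, _ in clm_rows]
    columns = [sample for _, sample, _ in clm_rows]
    labels = [label for _, _, label in clm_rows]
    return scans, columns, labels


def cls_text(labels):
    """Build CLS content: a header line, a class-name line, and one index per sample."""
    classes = list(dict.fromkeys(labels))  # unique class names, first-seen order
    index = {name: i for i, name in enumerate(classes)}
    return "\n".join([
        f"{len(labels)} {len(classes)} 1",
        "# " + " ".join(classes),
        " ".join(str(index[label]) for label in labels),
    ]) + "\n"


clm = "scan3\tsample3\ttumor\nscan1\tsample1\ttumor\nscan2\tsample2\tnormal\n"
scans, columns, labels = apply_clm(
    ["scan1.cel", "scan2.cel", "scan3.cel", "scan4.cel"], parse_clm(clm)
)
cls = cls_text(labels)
```

Here scan4.cel is in the ZIP file but not in the CLM file, so it does not appear in the result; the columns come out as sample3, sample1, sample2.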
 
Note: A number of newer Affymetrix array types are not currently supported by ExpressionFileCreator, including the 1.1, 2.0, and 2.1 ST arrays, Exon arrays, and HTA 2.0 arrays. This is the case even if a CDF file is provided. The underlying R/Bioconductor technology used within ExpressionFileCreator was not designed for arrays where probes are shared between probe sets. See this Bioconductor mailing list thread for details.
 
We eventually plan to update ExpressionFileCreator or build a new module using a different package. In the meantime, our best suggestion is to use the Affymetrix GeneConsole tool to extract and normalize the data into a tab-delimited file. From there, you can manually convert the data into GCT/RES format, which can then be used for further analysis within GenePattern. Our File Formats Guide explains the GCT/RES formats and also gives a video tutorial on how to do the conversion.

References

Affymetrix. Affymetrix Microarray Suite User Guide, version 5. Santa Clara, CA: Affymetrix, 2001.

Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249-264.
 
Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001;98:31-36.
 
Li C, Wong WH. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology. 2001;2:research0032-research0032.11.

Parameters

Name                          Description
input file *                  A ZIP file of CEL files
method *                      The method to use. Note that dChip and MAS5 will not work with ST arrays.
quantile normalization        (GCRMA and RMA only) Whether to normalize data using quantile normalization
background correct            (RMA only) Whether to background correct using RMA background correction
compute present absent calls  Whether to compute Present/Absent calls
normalization method          (MAS5 only) The normalization method to apply after expression values are computed. The column having the median of the means is used as the reference unless the "value to scale to" parameter is given.
value to scale to             (median/mean scaling only) The value to scale to.
clm file                      A tab-delimited text file containing one scan, sample, and class per line
annotate probes *             Whether to annotate probes with the gene symbol and description
cdf file                      Custom CDF file. Leave blank to use the default internally provided CDF file (a custom CDF file is not implemented for GCRMA).
output file *                 The base name of the output file(s)

* - required
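
The reference-column rule described for the normalization method parameter can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the module's implementation: the function name scale_columns is hypothetical, and picking the lower-middle column when the number of samples is even is an assumption.

```python
from statistics import mean


def scale_columns(matrix, value_to_scale_to=None, summary=mean):
    """Scale every column (sample) so its summary statistic equals a common target.

    matrix is a list of rows; each row holds one value per sample.
    Pass summary=statistics.median for median scaling instead of mean scaling.
    """
    columns = list(zip(*matrix))
    stats = [summary(col) for col in columns]
    if value_to_scale_to is None:
        # Reference column: the one whose statistic is the median of all
        # column statistics (assumption: lower-middle one for an even count).
        target = sorted(stats)[(len(stats) - 1) // 2]
    else:
        target = value_to_scale_to
    factors = [target / s for s in stats]
    return [[value * f for value, f in zip(row, factors)] for row in matrix]


# Three samples with column means 20, 40, and 80; the middle one is the reference.
data = [[10, 20, 40],
        [30, 60, 120]]
scaled_to_reference = scale_columns(data)
scaled_to_100 = scale_columns(data, value_to_scale_to=100)
```

After scaling, every column has the same mean: 40 when the reference column is used, and 100 when a value to scale to is supplied.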

Output Files

  1. GCT file (if present/absent calls are NOT computed) or RES file (if present/absent 
    calls ARE computed)
  2. CLS file (if a CLM file is supplied)
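
For orientation, the layout of a GCT file can be sketched in Python as follows. This is a minimal illustration of the GCT 1.2 layout (version line, dimensions line, header line, then one row per probe set); the function name gct_text and the probe set names are hypothetical, and the RES format, which also carries the present/absent calls, is not shown.

```python
def gct_text(row_names, descriptions, sample_names, values):
    """Build GCT content from probe set names, descriptions, and a value matrix."""
    lines = [
        "#1.2",                                       # format version
        f"{len(row_names)}\t{len(sample_names)}",     # number of rows, number of samples
        "\t".join(["Name", "Description"] + list(sample_names)),
    ]
    for name, description, row in zip(row_names, descriptions, values):
        lines.append("\t".join([name, description] + [str(v) for v in row]))
    return "\n".join(lines) + "\n"


text = gct_text(
    row_names=["probe_1", "probe_2"],                 # hypothetical probe set names
    descriptions=["gene A", "gene B"],
    sample_names=["sample3", "sample1", "sample2"],
    values=[[120.5, 98.0, 310.2],
            [15.0, 22.75, 18.1]],
)
```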

Requirements

ExpressionFileCreator requires R 2.15.2 with the following packages:
boot_1.3-7 IRanges_1.16.2 spatial_7.3-5
class_7.3-5 Biobase_2.18.0 BiocGenerics_0.4.0
cluster_1.14.3 AnnotationDbi_1.20.1 affyio_1.26.0
foreign_0.8-51 zlibbioc_1.4.0 preprocessCore_1.20.0
KernSmooth_2.23-8 Matrix_1.0-9 affy_1.36.0
lattice_0.20-10 mgcv_1.7-21 Biostrings_2.26.2
MASS_7.3-22 nlme_3.1-105 gcrma_2.30.0
DBI_0.2-5 nnet_7.3-5 makecdfenv_1.36.0
RSQLite_0.11.2 rpart_3.1-55  
 
Each of these R packages has been bundled into a GenePattern plugin and will be automatically downloaded and installed when the module is installed. This process will take some time due to the size and number of these packages, so be patient during installation. R 2.15.2 must be installed and configured independently.

Notes

  • The MAS5 and dChip algorithms are based on their Bioconductor implementations. Therefore, the results obtained from these algorithms will differ slightly from their official implementations.
  • The GCRMA and RMA algorithms produce values on a log2 scale, but ExpressionFileCreator removes the log2 transformation before generating the result file.
  • ST 1.1+ and ST exon arrays are not currently supported.
  • The underlying Affymetrix R package used by ExpressionFileCreator v12 fixes a bug in the dChip algorithm implementation.  Unfortunately, this means that dChip expression files created with previous versions are not directly comparable with newly created dChip files.  It is our strong recommendation that you discard older dChip results and re-create the expression files with the new version.
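
As a small illustration of the note on GCRMA and RMA above, removing a log2 transformation means raising 2 to the power of each value. The Python sketch below shows only that arithmetic; the function name unlog2 is hypothetical and the module itself performs this step in R.

```python
import math


def unlog2(values):
    """Convert log2-scale expression values back to the linear intensity scale."""
    return [2.0 ** v for v in values]


log_values = [0.0, 1.0, 10.0]
linear_values = unlog2(log_values)

# Taking log2 again recovers the original log-scale values.
round_trip = [math.log2(v) for v in linear_values]
```

If a downstream analysis expects log2-scale data, the values in the result file therefore need to be log-transformed again.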

Arrays Supported:

For a list of arrays supported by R 2.15, please see http://bioconductor.org/packages/2.10/data/annotation/
Alternatively, you can provide a CDF file with your job to process other array types.

Common Errors

Check the GenePattern FAQ regarding errors you may encounter.

 

Platform Dependencies

Task Type:
Preprocess & Utilities

CPU Type:
any

Operating System:
any

Language:
R 2.15

Version Comments

Version  Release Date  Description
12       2013-10-31    Updated to R 2.15
11       2013-02-14    Updated to include Affy Annotation CSVs from Feb 2012
10       2012-04-06    Updated to use new csv, removed tiger versions, renamed leopard version, removed extraneous R scripts, edited to point to correct packages, updated Affyio package to one that's built for R 2.8
9        2012-01-26    Fixed memory corruption bug when reading some CDF files and with annotating probes when some annotations are missing
8        2008-10-29    Read latest Affymetrix CEL file format
6        2008-09-10    Added option to provide custom CDF file
5        2008-02-19    Added option to provide custom CDF file and updated for R 2.5.0
4        2006-11-13    Fixes scaling bug.
3        2006-07-20    Added gcRMA and dChip algorithms
2        2006-06-19    Added gcRMA and dChip algorithms
1        2005-09-16