File Formats Guide

Creating Input Files

When you run an analysis module, visualization module, or pipeline, GenePattern displays the parameters for the selected modules. Often, one or more of these parameters are input files, which must have a particular format; for example, you might need to supply a gct or res file. For more information about a particular file format, select it from the list at the right.

This section provides general information to help you create properly formatted files for GenePattern.

Several video tutorials are also available:

Creating GCT/RES Files

GenePattern provides several modules for creating GCT and/or RES files from gene expression data from various sources. For information about these modules:

  1. Log in to the public GenePattern server: http://genepattern.broadinstitute.org/gp/.
    If you do not have a GenePattern account, you can register on the login page.
  2. Review the GCT and RES Files page of the GenePattern protocols.

Transforming Tab-Delimited Text Files

Although different GenePattern modules require different file formats, all of the files are tab-delimited or space-delimited text files. Most of your gene expression data is already in tab-delimited text files or in spreadsheet and database programs, which have export features that allow you to export the data into tab-delimited text files. Therefore, creating input files for GenePattern is relatively easy:

  1. Start with a tab-delimited text file that contains the required gene expression data.
  2. Open the file in a text editor (or spreadsheet editor).
  3. Make the necessary format changes.
  4. Save the file as a tab-delimited text file with the appropriate file extension.

If Mac OS does not let you change the TXT extension to GCT or RES by renaming the file, right-click the file and select Get Info, expand the Name & Extension section, clear the Hide extension option, and then change the extension in the box provided.

Converting and Processing Files

The Modules page of the GenePattern web site provides a complete list of the modules and pipelines available from the Broad Institute. Modules in the Data Format Conversion category convert files from one format to another. Modules in the Preprocess & Utilities category provide methods for importing and working with data files.

Converting CDT to GCT Files

One common question from GenePattern users is how to convert a cdt file to a gct file. Following is a brief tutorial that walks you through this process by converting sample.cdt to sample.imputed.gct:

  1. Save the sample.cdt file to your local drive and open it in Microsoft Excel.
  2. Delete the CLID and GWEIGHT columns. The gct file format allows for only two columns of annotations.
  3. Delete the second row, which contains array identifiers (AID). The gct file format allows for only one row of identifiers.
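  4. Add two header rows at the top of the file: the version string on the first row, and the number of data rows and data columns on the second row (see GCT File Format).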
  5. Save the modified file as a text (tab delimited) file with the name sample.gct. 
  6. Verify that your new .gct file matches the requirements of a gct file in GenePattern.
  7. Your original cdt file contained cells that were missing data. Most GenePattern modules require that all cells in a gct file contain data. Use the GenePattern analysis module ImputeMissingValues.KNN to add the missing data to your gct file. The module will take sample.gct as the input file, impute the missing data, and generate a sample.imputed.gct file.
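
Steps 2 through 5 can also be scripted. The following Python sketch is an illustration only, not a GenePattern tool. The function name cdt_to_gct is hypothetical. The sketch assumes that the CDT header row labels the columns to drop as CLID and GWEIGHT, that the array-identifier row starts with AID, and that the GCT version string is #1.2 (this page does not spell the string out). Empty cells are passed through unchanged so that step 7 can impute them.

```python
def cdt_to_gct(cdt_text):
    """Convert tab-delimited CDT text to GCT text, mirroring the manual steps."""
    rows = [line.split("\t") for line in cdt_text.splitlines() if line.strip()]
    header = rows[0]
    # Step 2: keep every column except CLID and GWEIGHT (assumed labels).
    keep = [i for i, name in enumerate(header) if name not in ("CLID", "GWEIGHT")]
    # Step 3: drop the array-identifier row (assumed to start with "AID").
    rows = [row for row in rows if row[0] != "AID"]
    # Pad short rows with empty cells so every row has the same width.
    rows = [[row[i] if i < len(row) else "" for i in keep] for row in rows]
    n_genes = len(rows) - 1        # data rows, excluding the identifier row
    n_samples = len(rows[0]) - 2   # data columns, excluding the two annotation columns
    # Step 4: two header rows; "#1.2" is an assumed version string.
    out = ["#1.2", f"{n_genes}\t{n_samples}"]
    out += ["\t".join(row) for row in rows]
    return "\n".join(out) + "\n"
```

To use it, read sample.cdt as text, pass the contents to cdt_to_gct, and write the result to sample.gct.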

ATR

Note: If you are using Excel to edit GenePattern files, be sure to save the file as a tab-delimited text file and supply the correct file extension. You can specify the file name in quotes to prevent Excel from appending .txt to the file name. Also, note that Excel's auto-formatting can introduce errors in gene names, as described in Zeeberg, et al (2004).

ATR File Format

ATR files are created by the HierarchicalClustering module. This is a format defined at Stanford for their Hierarchical Clustering program. Note that the HierarchicalClustering module also generates a CDT file.

The ATR (array tree) file records the order in which the arrays (columns) were joined during clustering.


CDT

CDT File Format

CDT files are created by the HierarchicalClustering module. This is a format defined at Stanford for their Hierarchical Clustering program. The CDT (clustered data table) file contains the original data, reordered to reflect the clustering. Note that the HierarchicalClustering module also generates the ATR (array tree) and GTR (gene tree) files.

These tree files record how the clusters were built and can be used to reconstruct how the tree(s) should look.


CEL

CEL File Format

Affymetrix image analysis software generates CEL files, which store information about the probes on a chip and the intensity values for the probes. GenePattern modules, such as ExpressionFileCreator and SNPFileCreator, take CEL files as input and generate data files that can be read by subsequent GenePattern modules. More information on this format is available here.


CHIP

CHIP File Format

The CHIP file format contains annotations for a microarray (used with the GSEA module). It lists the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available). While this file is not used directly in the GSEA algorithm, it is used to annotate the output results and may also be used to collapse each probe set in the expression dataset to a single gene vector.

Chip annotation files can be specified in a tab-delimited file format (*.chip) or in a comma-separated file format (*.csv). The formats are identical other than the separation character (tab or comma). Typically, you use the tab-delimited (*.chip) file format.

The CHIP file format is organized as follows:

  1. The first line contains column headings that identify the content of each column in the remainder of the file. The file must contain three column headings: These three columns can appear in any order. The file may contain additional columns, which will be ignored.
  2. The rest of the data file contains data for each probe set ID used in the microarray.

Sample CHIP file: HG_U133A_annot.chip


ODF

ODF File Format

The Output Description Format (ODF) is similar to the RES or GCT file formats for datasets. The main difference is in the header. The body of data still contains the expression level values for each gene in each sample. Thus the main data block (after the header lines) is a matrix of values. The columns are defined by a name and optionally a description. The rows have a name (the name of the gene, for instance) and a description (a description of the gene). The columns contain the expression values for each gene in a sample. If the first gene in the data block is a particular tyrosine kinase, then the first row holds the expression value for that tyrosine kinase in each of the samples.

Note: This ODF format is specific to GenePattern. It is not an Open Document Format (ODF) for Office Applications as defined by the Organization for the Advancement of Structured Information Standards (OASIS).

ODF Header for Datasets

The following example shows the header lines of an ODF file. The first five lines are required. The line numbers are shown for easy reference; do not include them in your file.

1.   ODF 1.0
2.   HeaderLines=7
3.   Model= Dataset
4.   DataLines= 3
5.   COLUMN_TYPES: String String float float float *
6.   COLUMN_DESCRIPTIONS: Sample from DFCI Sample from UK Sample from Children's
7.   COLUMN_NAMES: Name Description Sample 1 Biopsy_2 Biopsy_4
8.   RowNamesColumn=0
9.   RowDescriptionsColumn=1

Lines 1 and 2 are required and must be the first and second lines of the file. They identify the file as ODF (version 1.0) and give the number of header lines that follow before the main data block (7 in this example). Line 3 is required somewhere in the header; it declares that the file contains Dataset data. Line 4 is also required somewhere in the header; it gives the number of data rows in the data block. Line 5 is required somewhere in the header of any ODF file that has a main data block; it defines the type of data in each column. Line 6 is a tab-delimited list of descriptions for each column. Line 7 is a tab-delimited list of names for the columns. Line 8 identifies the column that holds the row names, and line 9 identifies the column that holds the row descriptions.
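
The header rules above can be checked with a short script. The following Python sketch is a hedged illustration: the function name parse_odf_header is hypothetical, and the handling of the two separators ("Key=value" for single values, "KEY: list" for tab-delimited lists) is inferred from the example header rather than taken from a formal specification.

```python
def parse_odf_header(text):
    """Return (header dict, data rows) for ODF text laid out as described above."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("ODF"):
        raise ValueError("the first line must identify the file as ODF")
    key, _, value = lines[1].partition("=")
    if key.strip() != "HeaderLines":
        raise ValueError("the second line must be HeaderLines=<n>")
    n_header = int(value)
    header = {}
    for line in lines[2:2 + n_header]:
        eq, colon = line.find("="), line.find(":")
        if eq == -1 and colon == -1:
            raise ValueError(f"unrecognized header line: {line!r}")
        if colon != -1 and (eq == -1 or colon < eq):
            # "KEY: a<TAB>b<TAB>c" holds a tab-delimited list.
            header[line[:colon].strip()] = line[colon + 1:].strip().split("\t")
        else:
            # "Key=value" holds a single value.
            header[line[:eq].strip()] = line[eq + 1:].strip()
    for required in ("Model", "DataLines", "COLUMN_TYPES"):
        if required not in header:
            raise ValueError(f"missing required header line: {required}")
    start = 2 + n_header
    data = lines[start:start + int(header["DataLines"])]
    return header, [row.split("\t") for row in data]
```

The sketch returns header values as strings (or lists of strings) and leaves type conversion of the data block to the caller.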

Note: Following are a few notes about the ODF Header:

Main Data Block

The following example shows the first few lines of the main data block:

1000_at    X60188 HSERK1 Human ERK1 mRNA    145.3   240.37823    158.66888
1001_at    X60957 HSTIEMR Human tie mRNA    20.5    31.139397    14.053186
1002_f_at  X65962 HSCP450 H.sapiens mRNA    -9.6    118.06088    -8.287777

The main data block must be consistent with the header. The first COLUMN_NAMES element is "Name". This label is associated with the first column (values: 1000_at, 1001_at, and 1002_f_at). The second column's label is "Description" which is associated with the second column of the main data block. The next three columns are floating point numbers that represent the gene expression values for each of the samples.

Note: The first two columns are just text data, and next three columns only contain floating point values. This is consistent with the "String, String, float, float, float" elements in the COLUMN_TYPES: list.

Sample ODF file: all_aml_train.preprocessed0.odf


CLM

CLM File Format

The CLM file format is a tab-delimited file format that describes the samples in a zipped collection of CEL or IDAT files (used with the ExpressionFileCreator and IlluminaExpressionFileCreator modules, respectively).

Each row of the CLM format describes a CEL file in the zip file:

  1. The first column contains the CEL file name (file extension is optional).
  2. The second column contains a sample name for the CEL file data.
  3. The third column contains a phenotype class name for the CEL file data.

Sample CLM file: sample.clm


CLS

CLS File Format

The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. It uses spaces or tabs to separate the fields. The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes:

Note: Most GenePattern modules are intended for use with categorical phenotypes. Therefore, unless the module documentation explicitly states otherwise, a CLS file should define categorical labels.

Categorical labels

Categorical labels define discrete phenotypes (for example, normal vs tumor). For categorical labels, the CLS file format is organized as follows:

  1. The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.
  2. The second line in a CLS file contains names for the class numbers. The line should begin with a pound sign (#) followed by a space.
  3. The third line contains a class label for each sample. The class labels are sequential numbers beginning with zero. The first label used (0) is assigned to the first class named on the second line; the second unique label (1) is assigned to the second class named; and so on. (NOTE: While most GenePattern modules adhere to this rule of 0 as the first class, some modules (such as GSEA) do not. Check the documentation for the module you are using if you are unsure.)
    The number of class labels specified on this line should be the same as the number of samples specified in the first line. The number of unique class labels specified on this line should be the same as the number of classes specified in the first line.
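
The three-line layout and its consistency rules can be verified with a small script. This Python sketch is illustrative only: the function name parse_categorical_cls is hypothetical, it assumes the zero-based numeric labels described above (which some modules, such as GSEA, do not require), and it reads only the first two numbers on the first line.

```python
def parse_categorical_cls(text):
    """Parse a categorical CLS file and return (class_names, sample_classes)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = lines[0].split()
    n_samples, n_classes = int(counts[0]), int(counts[1])
    if not lines[1].startswith("#"):
        raise ValueError("the second line must begin with a pound sign (#)")
    class_names = lines[1][1:].split()
    labels = [int(label) for label in lines[2].split()]
    if len(labels) != n_samples:
        raise ValueError(f"expected {n_samples} labels, found {len(labels)}")
    if len(set(labels)) != n_classes:
        raise ValueError(f"expected {n_classes} classes, found {len(set(labels))}")
    # Label 0 maps to the first class name, label 1 to the second, and so on.
    return class_names, [class_names[label] for label in labels]
```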

CLS file for sample RES file (categorical labels): all_aml_test.cls

Continuous labels

Continuous phenotypes are used for time series experiments or to define the profile of a gene of interest (gene neighbors). A CLS file that defines continuous labels can contain one or more labels. The following example shows a CLS file that defines two continuous labels:

#numeric
#AFFX-BioB-5_st
206.0 31.0 252.0 -20.0 -169.0 -66.0 230.0 -23.0 67.0 173.0 -55.0 -20.0 469.0 -201.0 -117.0 
-162.0 -5.0 -86.0 350.0 74.0 -215.0 193.0 506.0 183.0 350.0 113.0 -17.0 29.0 247.0 -131.0 
358.0 561.0 24.0 524.0 167.0 -56.0 176.0 320.0
#AFFX-BioDn-5
75.0 142.0 32.0 109.0 -38.0 -80.0 62.0 39.0 196.0 -42.0 199.0 49.0 171.0 327.0 115.0 
-71.0 85.0 80.0 270.0 182.0 208.0 -94.0 292.0 233.0 34.0 0.0 59.0 233.0 48.0 466.0 -7.0 
-96.0 297.0 38.0 208.0 -15.0 30.0 357.0

  1. The first line contains the text "#numeric", which indicates that the file defines continuous labels.
  2. The remainder of the file defines the continuous phenotypes. For each phenotype:
    1. The first line defines the name of the phenotype; for example, #AFFX-BioB-5_st.
    2. The second line contains a value for each sample in the .gct file. A text editor typically wraps this line when it is long, as shown in the example.
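
A continuous CLS file can be read with a few lines of Python. This is a minimal sketch with a hypothetical function name (parse_numeric_cls). It assumes that the values for a phenotype may be spread over several physical lines, as in the wrapped example, and therefore collects values until the next line that starts with "#".

```python
def parse_numeric_cls(text):
    """Return {phenotype name: [values]} for a continuous (#numeric) CLS file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#numeric":
        raise ValueError('the first line must be "#numeric"')
    phenotypes = {}
    current = None
    for line in lines[1:]:
        if line.startswith("#"):
            # A new phenotype name starts a new list of values.
            current = line[1:].strip()
            phenotypes[current] = []
        elif current is None:
            raise ValueError("values found before any phenotype name")
        else:
            # Values may continue over several lines (assumption, see above).
            phenotypes[current].extend(float(v) for v in line.split())
    return phenotypes
```

Each returned list should contain one value per sample in the associated dataset; the sketch does not check this because it does not read the dataset.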

For a continuous phenotype label, the values for the samples define the phenotype profile. The relative change in the values defines the relative distance between points in the phenotype profile. In the example shown above, the phenotype profile is the expression profile for a gene: the sample values for the two phenotype labels are gene expression values. For a time series experiment, you would choose sample values that define the desired expression profile. The example shown below assumes that you have five samples taken at 30 minute intervals. The first phenotype label defines a phenotype profile that shows steadily increasing gene expression; the second defines a profile that shows an initial peak and then gradual decrease:

#numeric
#IncreasingProfile
30 60 90 120 150
#PeakProfile
5 20 15 10 5

POL

POL File Format

The POL file format represents a Parameter-Ordered List. This format is a tab-delimited file with 4 columns, consisting of the following:

  1. A ranking (an integer corresponding to the rank of the feature)
  2. Unique feature identifier
  3. Feature description
  4. Distance metric (value upon which the rank position is based)

If you don't have a distance metric, you can use the rank in column 4 as well. Lines from a sample pol file are shown below:

0	X59798_at	CCND1 Cyclin D1	0.0
1	S69272_s_at	Cytoplasmic antiproteinase	465.5319538
2	U37012_at	Cleavage and polyadenylation specificity factor	493.1997567
3	X69910_at	P63 mRNA for transmembrane protein	493.5331802
4	U53347_at	Neutral amino acid transporter B mRNA	515.3552173
5	M80899_at	AHNAK AHNAK nucleoprotein (desmoyokin)	539.9990741
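
The four-column layout is straightforward to read programmatically. The following Python sketch uses a hypothetical function name (parse_pol) and assumes that every non-empty line has exactly four tab-separated fields, as in the sample lines above.

```python
def parse_pol(text):
    """Return a list of (rank, feature_id, description, metric) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Exactly four tab-separated fields are expected per line.
        rank, feature_id, description, metric = line.split("\t")
        entries.append((int(rank), feature_id, description, float(metric)))
    return entries
```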

CN

CN File Format

This is a tab-delimited file format that contains SNP copy numbers. It contains one row for each SNP and one column for each SNP array; each cell holds the raw copy number value for that SNP on that array. It is organized as follows:

  1. The first line contains a list of labels identifying the SNP arrays.
  2. The rest of the SNP file contains one row of data for each SNP.

Note: Sort the SNPs by chromosome and physical position (low to high). Most GenePattern modules, as well as many external tools, require sorted data.

Sample CN file: mynah.sorted.cn


GCT

GCT File Format

The GCT file format is a tab-delimited file format that describes an expression dataset. The main differences between the RES and GCT file formats are that the RES file format (1) contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software and (2) does not allow missing expression values. Although the GCT file format allows missing values, only a few modules (such as CART, GSEA, and HierarchicalClustering) can be run against an expression dataset that is missing values. Most modules do not allow missing expression values.

The GCT file is organized as follows:

  1. The first line contains the version string and is always the same for this file format. Therefore, the first line must be as follows:
  2. The second line contains numbers indicating the size of the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
  3. The third line contains a list of identifiers for the samples associated with each of the columns in the remainder of the file.
  4. The remainder of the data file contains data for each of the genes. There is one line for each gene and one column for each of the samples. The first two fields in the line contain name and descriptions for the genes (names and descriptions can contain spaces since fields are separated by tabs). The number of lines should agree with the number of data rows specified on line 2.
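
The layout above can be produced from in-memory data with a short script. This Python sketch is illustrative only. The function name format_gct is hypothetical, and two details are assumptions because this page does not state them: the version string "#1.2" on the first line, and the labels "Name" and "Description" placed ahead of the sample identifiers on the third line. A value of None is written as an empty cell to represent a missing expression value.

```python
def format_gct(sample_ids, rows):
    """Build GCT text from sample identifiers and (name, description, values) rows."""
    lines = [
        "#1.2",                                # assumed version string
        f"{len(rows)}\t{len(sample_ids)}",     # data rows, then data columns
        # "Name" and "Description" labels are assumed, not taken from this page.
        "\t".join(["Name", "Description"] + list(sample_ids)),
    ]
    for name, description, values in rows:
        if len(values) != len(sample_ids):
            raise ValueError(f"{name}: expected {len(sample_ids)} values")
        cells = ["" if v is None else str(v) for v in values]
        lines.append("\t".join([name, description] + cells))
    return "\n".join(lines) + "\n"
```

Counting rows and columns in code, rather than typing them by hand, keeps the second line consistent with the data table.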

Occasionally, GCT files are organized in a transposed structure where the columns represent genes and the rows represent samples. The user should take care to check the organization of the file to ensure that the correct preprocessing is performed on the file. See sample *.gct files that come with the distribution for complete examples of the format.

Sample GCT file: allaml.dataset.gct


FPKM_tracking

FPKM_tracking File Format

FPKM_tracking files are output by the Cufflinks modules, and contain RNA-seq estimated expression values in Fragments Per Kilobase of transcript per Million mapped reads (FPKM).  For more information on FPKM, see the Cufflinks FAQ.

FPKM tracking files use a generic format to output estimated expression values. Each FPKM tracking file has the following format:

Column number  Column name      Example               Description
1              tracking_id      TCONS_00000001        A unique identifier describing the object (gene, transcript, CDS, primary transcript)
2              class_code       =                     The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present
3              nearest_ref_id   NM_008866.1           The reference transcript to which the class code refers, if any
4              gene_id          NM_008866             The gene_id(s) associated with the object
5              gene_short_name  Lypla1                The gene_short_name(s) associated with the object
6              tss_id           TSS1                  The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present
7              locus            chr1:4797771-4835363  Genomic coordinates for easy browsing to the object
8              length           2447                  The number of base pairs in the transcript, or "-" if not a transcript/primary transcript
9              coverage         43.4279               Estimate for the absolute depth of read coverage across the object
10             q0_FPKM          8.01089               FPKM of the object in sample 0
11             q0_FPKM_lo       7.03583               The lower bound of the 95% confidence interval on the FPKM of the object in sample 0
12             q0_FPKM_hi       8.98595               The upper bound of the 95% confidence interval on the FPKM of the object in sample 0
13             q0_status        OK                    Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.
14             q1_FPKM          8.55155               FPKM of the object in sample 1
15             q1_FPKM_lo       7.77692               The lower bound of the 95% confidence interval on the FPKM of the object in sample 1
16             q1_FPKM_hi       9.32617               The upper bound of the 95% confidence interval on the FPKM of the object in sample 1
17             q1_status        OK                    Quantification status for the object in sample 1. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.
4N + 10        qN_FPKM          7.34115               FPKM of the object in sample N
4N + 11        qN_FPKM_lo       6.33394               The lower bound of the 95% confidence interval on the FPKM of the object in sample N
4N + 12        qN_FPKM_hi       8.34836               The upper bound of the 95% confidence interval on the FPKM of the object in sample N
4N + 13        qN_status        OK                    Quantification status for the object in sample N. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution.
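
The explicitly numbered rows place q0_FPKM in column 10 and q1_FPKM in column 14, so each sample occupies four columns after the nine fixed columns, and qN_FPKM falls in column 4N + 10. The Python sketch below applies that arithmetic. The function names are hypothetical, and the sketch assumes the file is tab-delimited, which this page does not state explicitly.

```python
FIXED_COLUMNS = 9        # tracking_id through coverage
COLUMNS_PER_SAMPLE = 4   # FPKM, FPKM_lo, FPKM_hi, status

def sample_columns(n):
    """Return the 1-based column numbers of the four fields for sample n."""
    first = FIXED_COLUMNS + COLUMNS_PER_SAMPLE * n + 1   # equals 4n + 10
    return {"FPKM": first, "FPKM_lo": first + 1, "FPKM_hi": first + 2, "status": first + 3}

def sample_fpkm(line, n):
    """Extract the FPKM value for sample n from one tab-delimited data line."""
    fields = line.rstrip("\n").split("\t")
    # Convert the 1-based column number to a 0-based list index.
    return float(fields[sample_columns(n)["FPKM"] - 1])
```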

 


GMT

GMT File Format

The GMX and GMT file formats are tab-delimited file formats that describe gene sets (used with the GSEA module). In the GMX format, each column represents a gene set; in the GMT format, each row represents a gene set. The GMX format is convenient for storing a relatively small number of gene sets (<256) and is easier to edit. The GMT format is more convenient for storing larger databases of gene sets. The GMT format contains a row for each gene set:

  1. The first column contains the gene set name. Duplicate names are not allowed.
  2. The second column contains the gene set description. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is na, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.
  3. The remaining columns list the genes in the gene set.
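
The row layout above maps naturally onto a dictionary. This Python sketch is a minimal illustration with a hypothetical function name (parse_gmt). It enforces the rule that duplicate gene set names are not allowed and skips empty gene cells, which is an assumption about how trailing tabs should be treated.

```python
def parse_gmt(text):
    """Return {gene set name: (description, [genes])} for GMT text."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Column 1 is the name, column 2 the description, the rest are genes.
        name, description, *genes = line.split("\t")
        if name in gene_sets:
            raise ValueError(f"duplicate gene set name: {name}")
        gene_sets[name] = (description, [g for g in genes if g])
    return gene_sets
```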

Sample GMT file: export_gnf.GENE_SYMBOL.gmt


GLAD

GLAD File Format

This is a tab-delimited file format that contains the output results of the GLAD module. The GLAD module, a SNP analysis module, runs the R package Gain and Loss analysis of DNA (GLAD), which detects altered regions in the genomic pattern. The GLAD file format is organized as follows:

  1. The first line contains a list of labels identifying the columns.
  2. The rest of the file contains one row of data for each altered chromosomal region.

Sample GLAD file: mynah.glad


GMX

GMX File Format

The GMX and GMT file formats are tab-delimited file formats that describe gene sets (used with the GSEA module). In the GMX format, each column represents a gene set; in the GMT format, each row represents a gene set. The GMX format is convenient for storing a relatively small number of gene sets (<256) and is easier to edit. The GMT format is more convenient for storing larger databases of gene sets. The GMX format contains a column for each gene set:

  1. The first line contains the gene set name. Duplicate names are not allowed.
  2. The second line contains the gene set description. GSEA uses the description field to determine what hyperlink to provide in the report for the gene set description: if the description is na, GSEA provides a link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.
  3. The remaining lines list the genes in the gene set.
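
Because GMX stores each gene set in a column and GMT stores each gene set in a row, converting between them amounts to transposing the table. The Python sketch below illustrates this under stated assumptions: the function name gmx_to_gmt is hypothetical, and empty cells (which occur when gene sets differ in size) are dropped rather than preserved.

```python
from itertools import zip_longest

def gmx_to_gmt(gmx_text):
    """Transpose column-oriented GMX text into row-oriented GMT text."""
    rows = [line.split("\t") for line in gmx_text.splitlines() if line.strip()]
    gmt_lines = []
    # zip_longest pads short rows so that every column has the same length.
    for column in zip_longest(*rows, fillvalue=""):
        name, description, *genes = column
        genes = [g for g in genes if g]   # drop padding and empty cells
        gmt_lines.append("\t".join([name, description] + genes))
    return "\n".join(gmt_lines) + "\n"
```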

Sample GMX file: export_gnf.GENE_SYMBOL.gmx


GRP

GRP File Format

The GRP file format contains a single gene set in a simple newline-delimited text format. Typically, you use the GMT or GMX file formats to create gene sets, rather than using the GRP file format. The GRP format contains a line for each gene, one gene per line. Lines that start with a pound sign (#) are ignored.
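
Reading a GRP file takes only a few lines. This Python sketch uses a hypothetical function name (parse_grp) and additionally skips blank lines, which is an assumption not stated above.

```python
def parse_grp(text):
    """Return the list of genes in GRP text, ignoring comment lines."""
    genes = []
    for line in text.splitlines():
        line = line.strip()
        # Lines that start with a pound sign (#) are comments.
        if line and not line.startswith("#"):
            genes.append(line)
    return genes
```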

Sample GRP file: my_gene_set.grp


GTR

GTR File Format

GTR files are created by the HierarchicalClustering module. This is a format defined at Stanford for their Hierarchical Clustering program. Note that the HierarchicalClustering module also generates a CDT file.

The GTR (gene tree) file records the order in which the genes (rows) were joined during clustering.


LOH

LOH File Format

This is a tab-delimited file format that contains the output results of the LOH module. The LOH module, a SNP analysis module, detects loss of heterozygosity (LOH). The LOH file format is organized as follows:

  1. The first line contains a list of labels identifying the paired samples.
  2. The rest of the SNP file contains one row of data for each probe. LOH call values are:

Sample LOH file: mynah.loh


MAGE-TAB and MAGE-ML

MAGE-TAB and MAGE-ML File Formats

The MAGE-TAB and MAGE-ML file formats are defined by the Functional Genomics Data Society (FGED, formerly MGED) to create standards for the representation of microarray expression data to facilitate the exchange of microarray information between different data systems. More information about and sample files in the FGED-based formats are available here.


PCL

PCL File Format

PCL files may be used as input for the GSEA module. This is a format defined at Stanford for their cDNA expression data. More information on this format is available here.


RES

RES File Format

The RES file format is a tab-delimited file format that describes an expression dataset. The main differences between the RES and GCT file formats are that the RES file format (1) contains labels for each gene's absent (A) versus present (P) calls as generated by Affymetrix's GeneChip software and (2) does not allow missing expression values. The file is organized as follows:

  1. The first line contains a list of labels identifying the samples associated with each of the columns in the remainder of the file. Two tabs (\t\t) separate the sample identifier labels because each sample contains two data values (an expression value and a present/marginal/absent call).
  2. The second line contains a list of sample descriptions. Currently, GenePattern ignores these descriptions.
  3. The third line contains a number indicating the number of rows in the data table that is contained in the remainder of the file. Note that the name and description columns are not included in the number of data columns.
  4. The rest of the data file contains data for each of the genes. There is one row for each gene and two columns for each of the samples. The first two fields in the row contain the description and name for each of the genes (names and descriptions can contain spaces since fields are separated by tabs). The description field is optional but the tab following it is not. Each sample has two pieces of data associated with it: an expression value and an associated Absent/Marginal/Present (A/M/P) call. The A/M/P calls are generated by microarray scanning software (such as Affymetrix's GeneChip software) and are an indication of the confidence in the measured expression value. Currently, GenePattern ignores the Absent/Marginal/Present call.
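
The layout described above can be read with a few lines of Python. This is only a minimal sketch, not GenePattern's own parser: the function name parse_res and the header labels in the example data are hypothetical, and the code assumes every expression value is present and numeric (the format does not allow missing values).

```python
import io


def parse_res(handle):
    """Parse RES-formatted text into (sample_names, rows)."""
    lines = [line.rstrip("\r\n") for line in handle]

    # Line 1: two leading labels, then sample names separated by two tabs,
    # so every other field after the first two is empty.
    header = lines[0].split("\t")
    sample_names = [field for field in header[2:] if field.strip()]

    # Line 2 holds sample descriptions, which GenePattern ignores.
    # Line 3 holds the number of data rows.
    n_rows = int(lines[2].strip())

    rows = []
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        data = fields[2:2 + 2 * len(sample_names)]
        rows.append({
            "description": fields[0],  # may be empty, but its tab is required
            "name": fields[1],
            "values": [float(v) for v in data[0::2]],
            "calls": data[1::2],       # A/M/P calls
        })
    if len(rows) != n_rows:
        raise ValueError(f"expected {n_rows} data rows, found {len(rows)}")
    return sample_names, rows


# Made-up example data; the first two header labels are placeholders.
example = (
    "Description\tAccession\tS1\t\tS2\n"
    "\t\tfirst sample\t\tsecond sample\n"
    "2\n"
    "gene one\tG1\t10.5\tP\t3.0\tA\n"
    "\tG2\t7.25\tM\t8.0\tP\n"
)
sample_names, rows = parse_res(io.StringIO(example))
```

The second data row in the example leaves the description empty but keeps its tab, which is why the gene name still lands in the second field.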

Sample RES file: all_aml_test.res


Sample Information File


Sample Information File Format (.txt)

The sample information file is a tab-delimited format that describes a set of SNP arrays. The column labels in the first row define the information provided for each array; each subsequent row describes one SNP array. The sample information file is organized as follows:

  1. The first line contains the column labels. A sample information file can contain any number of columns and the column labels are arbitrary. However, SNP modules may require specific labels, as discussed below.
  2. The remainder of the sample information file contains a line of information for each SNP sample. Where data is unavailable, columns may be empty.

Although the column labels are otherwise arbitrary, a SNP analysis module may require a sample information file to include specific column labels. For example, the SNP module CopyNumberDivideByNormals requires a sample information file that includes two columns, Sample and Ploidy(numeric). Following is a list of commonly used column labels:

Note: When a SNP module requires a sample information file to include specific column labels, the module documentation lists the required column labels. Specify required column labels exactly: they are case-sensitive and space-sensitive.
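
Because required labels must match exactly, it can help to check the header row before running a module. The following is a minimal sketch of such a check; the function name missing_labels and the example data are hypothetical, and the required labels are the two that CopyNumberDivideByNormals is described as needing.

```python
import io


def missing_labels(handle, required):
    """Return the required column labels that are absent from the first line."""
    header = handle.readline().rstrip("\r\n").split("\t")
    # Exact comparison: labels are case-sensitive and space-sensitive.
    return [label for label in required if label not in header]


# Made-up example: "ploidy(numeric)" differs from the required label in case.
example = "Array\tSample\tploidy(numeric)\nA1\tS1\t2\n"
problems = missing_labels(io.StringIO(example), ["Sample", "Ploidy(numeric)"])
print(problems)
```

An empty result means every required label was found with exactly the expected spelling.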

Sample .txt file: 250K_Sampleinfofile.txt

Creating a Sample Information File

The following steps outline how to copy sample identifiers exactly from Excel data and transpose them from horizontal to vertical.

  1. In Excel, select the entire row containing the sample names and choose Copy. Open a new workbook and choose Paste Special>Transpose.
  2. If starting from a RES file, remove the blank rows: select the relevant column(s), click Edit>Go To>Special, choose the Blanks option, and click OK. The blank rows are selected. Choose Edit>Delete, select the Entire row option, and click OK.
  3. Label the column headings in row 1 exactly as specified for the module, fill in the cells, and save the file as tab-delimited text (.txt). For example, for the ComBat module, label the first three cells of row 1 “Array”, “Sample”, and “Batch”.
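
The same transposition can be scripted instead of done in Excel. This is a rough sketch only: both function names are hypothetical, the identifiers are assumed to follow two leading label fields in the header row, and placing them in the first column of the ComBat-style layout is an assumption to verify against the module documentation.

```python
def sample_ids_from_header(header_line, skip=2):
    """Return the sample identifiers from a tab-delimited header row.

    Empty fields are dropped, which also removes the blanks produced by
    the doubled tabs in a RES header.
    """
    fields = header_line.rstrip("\r\n").split("\t")
    return [field for field in fields[skip:] if field.strip()]


def sample_info_skeleton(sample_ids, labels=("Array", "Sample", "Batch")):
    """Build sample information text with one row per identifier.

    Only the first column is filled in (an assumption); the remaining
    cells are left empty to be completed by hand.
    """
    lines = ["\t".join(labels)]
    for sample_id in sample_ids:
        lines.append("\t".join([sample_id] + [""] * (len(labels) - 1)))
    return "\n".join(lines) + "\n"


# Made-up RES-style header row.
ids = sample_ids_from_header("Description\tAccession\tS1\t\tS2\t\n")
skeleton = sample_info_skeleton(ids)
```

Writing the resulting text to a .txt file gives a tab-delimited starting point with the identifiers copied exactly as they appear in the data file.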

 


TXT


TXT File Format

The TXT format is a tab-delimited file format that describes an expression dataset. It can be used with the GSEA module and is organized as follows:

  1. The first line contains the labels Name and Description followed by the identifiers for each sample in the dataset. The Description is optional.
  2. The remainder of the file contains data for each of the genes. There is one line for each gene. Each line contains the gene name, gene description, and a value for each sample in the dataset. If the first line contains the Description label, include a description for each gene. If the first line does not contain the Description label, do not include descriptions for any gene. Gene names and descriptions can contain spaces since fields are separated by tabs.
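
The optional Description column is the main thing a reader of this format has to handle. The sketch below shows one way to do that; the function name parse_txt is hypothetical, and the code assumes the Description label is spelled exactly as shown and that every value is present and numeric.

```python
import io


def parse_txt(handle):
    """Parse TXT-formatted expression data into (sample_names, genes)."""
    header = handle.readline().rstrip("\r\n").split("\t")
    # The Description column is optional; detect it from the first line.
    has_description = len(header) > 1 and header[1] == "Description"
    first_value = 2 if has_description else 1
    sample_names = header[first_value:]

    genes = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate trailing blank lines
        fields = line.split("\t")
        if len(fields) != len(header):
            raise ValueError(f"row for {fields[0]!r} has {len(fields)} fields, "
                             f"expected {len(header)}")
        description = fields[1] if has_description else None
        values = [float(v) for v in fields[first_value:]]
        genes.append((fields[0], description, values))
    return sample_names, genes


# Made-up examples, with and without the Description column.
with_desc = "Name\tDescription\tS1\tS2\ngene A\tfirst gene\t1.5\t2.0\n"
without_desc = "Name\tS1\tS2\ngene A\t1.5\t2.0\n"
samples_a, genes_a = parse_txt(io.StringIO(with_desc))
samples_b, genes_b = parse_txt(io.StringIO(without_desc))
```

Both examples yield the same sample names and values; only the description differs.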

 


SNP


SNP File Format

This is a tab-delimited file format that contains SNP array data. The SNPFileCreator module can create non-allele-specific SNP files or allele-specific SNP files. A SNP file contains one row for each SNP and two or three columns for each SNP array. Non-allele-specific files contain two columns for each SNP array: the intensity value and the call. Allele-specific files contain three columns for each SNP array: the intensity value for allele A, the intensity value for allele B, and the call. Note: Not all SNP modules accept allele-specific SNP files.

The first line of a SNP file contains a list of labels identifying the SNP arrays.

Non-allele-specific

SNP Chromosome PhysicalPosition array_1_name array_1_name Call ... array_N_name array_N_name Call

Example

SNP Chromosome PhysicalPosition 100H_primary_GBM_101N 100H_primary_GBM_101N Call 100H_primary_GBM_56 100H_primary_GBM_56 Call

Allele-specific

SNP Chromosome PhysicalPosition array_1_name_Allele_A array_1_name_Allele_B array_1_name Call ...

Example

SNP Chromosome PhysicalPosition 100H_primary_GBM_101N_Allele_A 100H_primary_GBM_101N_Allele_B 100H_primary_GBM_101N Call ...

The rest of the SNP file contains one row of data for each probe. Each row contains the probe intensity values and the SNP calls generated by the SNP microarray scanning software (such as Affymetrix's GeneChip software).

Non-allele-specific

probe chromosome position array_1_intensity array_1_call ... array_N_intensity array_N_call

Example

SNP_A-1718890 1 2103664 1701.022 AB 1879.798 BB 947.904 BB

Allele-specific

probe chromosome position array_1_intensity_Allele_A array_1_intensity_Allele_B array_1 Call ...

Example

SNP_A-1718890 1 2103664 794.811 898.022 AB ...

Sample SNP file: gistic_subset.snp, gistic_subset_allele_specific.snp
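
The difference between the two row layouts comes down to how many columns each array contributes. The following sketch splits a data row accordingly; the function name parse_snp_row is hypothetical, fields are assumed to be separated by tabs, and the physical position is assumed to be an integer. The example rows reuse the values shown above.

```python
def parse_snp_row(line, allele_specific=False):
    """Split one SNP data row into (probe, chromosome, position, arrays).

    Each entry of arrays is (intensities, call): one intensity for
    non-allele-specific files, two (allele A, allele B) for allele-specific.
    """
    fields = line.rstrip("\r\n").split("\t")
    probe, chromosome, position = fields[0], fields[1], int(fields[2])
    width = 3 if allele_specific else 2
    data = fields[3:]
    if len(data) % width != 0:
        raise ValueError(f"{probe}: expected {width} columns per array")
    arrays = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        intensities = tuple(float(v) for v in chunk[:-1])
        arrays.append((intensities, chunk[-1]))
    return probe, chromosome, position, arrays


row = "SNP_A-1718890\t1\t2103664\t1701.022\tAB\t1879.798\tBB\t947.904\tBB"
plain = parse_snp_row(row)

allele_row = "SNP_A-1718890\t1\t2103664\t794.811\t898.022\tAB"
allele = parse_snp_row(allele_row, allele_specific=True)
```

The chromosome is kept as text rather than converted to a number, since chromosome labels are not always numeric.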


XCN


XCN File Format

This is a tab-delimited file format that is similar to the SNP file format, but contains SNP copy numbers rather than SNP intensity values. It contains one row for each SNP and two columns for each SNP array: the raw copy number value and the call value. It is organized as follows:

  1. The first line contains a list of labels identifying the SNP arrays.
  2. The rest of the XCN file contains one row of data for each probe, including the raw copy number value and the SNP calls generated by SNP microarray scanning software (such as Affymetrix's GeneChip software). Some modules, such as the SNPViewer module, require the data to be sorted by chromosome and physical position.
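
Since some modules require sorted data, a quick check before upload can save a failed run. The sketch below rests on several assumptions that the format description does not spell out: the first three columns are the probe, chromosome, and physical position (as in the SNP format), numeric chromosome labels sort numerically ahead of non-numeric ones such as X and Y, and positions are integers. Both function names are hypothetical.

```python
import io


def chromosome_key(label):
    """Order numeric chromosomes numerically, then other labels alphabetically."""
    label = label.strip()
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label)


def is_sorted_by_position(handle):
    """Return True if data rows are ordered by chromosome, then position."""
    next(handle, None)  # skip the line of array labels
    previous = None
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # ignore blank or malformed lines in this sketch
        key = (chromosome_key(fields[1]), int(fields[2]))
        if previous is not None and key < previous:
            return False
        previous = key
    return True


# Made-up example data: chromosome 2 precedes 10, and X comes last.
header = "SNP\tChromosome\tPhysicalPosition\tA1\tA1 Call\n"
ordered = header + "p1\t2\t500\t2.1\tAB\np2\t10\t100\t1.9\tBB\np3\tX\t50\t2.0\tAA\n"
shuffled = header + "p1\t10\t100\t1.9\tBB\np2\t2\t500\t2.1\tAB\n"
ordered_ok = is_sorted_by_position(io.StringIO(ordered))
shuffled_ok = is_sorted_by_position(io.StringIO(shuffled))
```

If a module uses a different chromosome ordering, only chromosome_key needs to change.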

Sample XCN file: mynah.sorted.xcn