File Formats
This page describes the data file formats supported by GenePattern. In the
text, file formats likely to be used together are described near one another.
To find a particular file format, use the alphabetical index to the left. For information about creating GenePattern files,
see Creating Input Files.
Note: If you are using Excel to edit GenePattern files, be sure to
save the file as a tab-delimited text file and supply the correct file extension.
You can specify the file name in quotes to prevent Excel from appending .txt to the file name.
Also, note that Excel's auto-formatting can introduce errors in gene names, as described in
Zeeberg et al. (2004).
RES File Format
The RES file format is a tab-delimited file format that describes an expression dataset.
The main differences between the RES and GCT file formats are that the RES file format (1) contains labels
for each gene's absent (A) versus present (P) calls, as generated by Affymetrix's GeneChip software, and (2) does not allow missing expression values.
The file is organized as follows:
- The first line contains a list of labels identifying the samples associated with
each of the columns in the remainder of the file. Two tabs (\t\t) separate the
sample identifier labels because each sample contains two data values (an
expression value and a present/marginal/absent call).
- Line format: Description (tab) Accession (tab) (sample 1 name)
(tab) (tab) (sample 2 name) (tab) (tab) ... (sample N name)
- For example: Description Accession DLBC1_1 DLBC2_1 ... DLBC58_0
- The second line contains a list of sample descriptions. Currently,
GenePattern ignores these descriptions.
- Line format: (tab) (sample 1 description) (tab) (tab) (sample 2
description) (tab) (tab) ... (sample N description)
- For example, our RES file creation tool places the sample data file
name and scale factors in this row: MG2000062219AA
MG2000062256AA/scale factor=1.2172 ...
MG2000062211AA/scale factor=1.1214
- The third line contains a single number: the number of rows in the data table that
is contained in the remainder of the file.
- Line format: (# of data rows)
- For example: 7129
- The rest of the data file contains data for each of the genes. There
is one row for each gene and two columns for each of the samples. The
first two fields in the row contain the description and name for each of the
genes (names and descriptions can contain spaces since fields are
separated by tabs). The description field is optional but the tab following it is not.
Each sample has two pieces of data associated with
it: an expression value and an associated Absent/Marginal/Present (A/M/P) call.
The A/M/P calls are generated by microarray scanning
software (such as Affymetrix's GeneChip software) and are an indication
of the confidence in the measured expression value. Currently,
GenePattern ignores the Absent/Marginal/Present call.
- Line format: (gene description) (tab) (gene name) (tab) (sample 1
data) (tab) (sample 1 A/M/P call) (tab) (sample 2 data) (tab) (sample 2 A/M/P call)
(tab) ... (sample N data) (tab) (sample N A/M/P call)
- For example: AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -104
A -152 A ... -44 A
Sample RES file:
all_aml_test.res
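The RES layout described above can be read with a few lines of Python. This is a minimal sketch, not GenePattern's own parser: the function name `parse_res` and the shape of the returned data are illustrative assumptions, and the code assumes the file uses plain "\n" line endings.

```python
def parse_res(text):
    """Parse RES-formatted text into sample names and per-gene data."""
    lines = text.rstrip("\n").split("\n")

    # Line 1: Description, Accession, then sample names separated by two tabs,
    # so the names occupy every second field starting at index 2.
    header = lines[0].split("\t")
    samples = header[2::2]

    # Line 2 holds sample descriptions (ignored here); line 3 is the row count.
    n_rows = int(lines[2].strip())

    rows = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        description, name = fields[0], fields[1]
        values = [float(v) for v in fields[2::2]]  # expression values
        calls = fields[3::2]                       # A/M/P calls
        rows[name] = {"description": description, "values": values, "calls": calls}
    return samples, rows
```

Each sample contributes two fields per row, which is why the values and calls are picked out with a step of two.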
GCT File Format
The GCT file format is a tab-delimited file format that describes an expression dataset.
The main differences between the RES and GCT file formats are that the RES file format (1) contains labels
for each gene's absent (A) versus present (P) calls, as generated by Affymetrix's GeneChip software, and (2) does not allow missing expression values. Although the GCT file format allows missing values, only a few modules (such as CART, GSEA, and HierarchicalClustering) can be run against an expression dataset that is missing values. Most modules do not allow missing expression values.
The GCT file is organized as follows:
- The first line contains the version string and is always the same for this file
format. Therefore, the first line must be as follows: #1.2
- The second line contains numbers indicating the size of the data table that
is contained in the remainder of the file. Note that the name and
description columns are not included in the number of data columns.
- Line format: (# of data rows) (tab) (# of data columns)
- For example: 7129 58
- The third line contains a list of identifiers for the samples associated with each
of the columns in the remainder of the file.
- Line format: Name (tab) Description (tab) (sample 1 name) (tab)
(sample 2 name) (tab) ... (sample N name)
- For example: Name Description DLBC1_1 DLBC2_1 ... DLBC58_0
- The remainder of the data file contains data for each of the genes. There
is one line for each gene and one column for each of the samples. The
first two fields in the line contain name and descriptions for the genes
(names and descriptions can contain spaces since fields are separated by
tabs). The number of lines should agree with the number of data rows
specified on line 2.
- Line format: (gene name) (tab) (gene description) (tab) (col 1 data)
(tab) (col 2 data) (tab) ... (col N data)
- For example: AFFX-BioB-5_at AFFX-BioB-5_at (endogenous
control) -104 -152 -158 ... -44
Occasionally, GCT files are organized in a transposed structure where the
columns represent genes and the rows represent samples. The user should take
care to check the organization of the file to ensure that the correct preprocessing
is performed on the file. See sample *.gct files that come with the distribution for
complete examples of the format.
Sample GCT file:
allaml.dataset.gct
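The following Python sketch reads a GCT file and checks that the counts on line 2 agree with the rest of the file. It is illustrative only: `parse_gct` is a hypothetical helper, and treating an empty field as a missing value (returned as `None`) is an assumption about how missing values are written.

```python
def parse_gct(text):
    """Parse GCT-formatted text and validate the declared dimensions."""
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Line 1: version string.
    if lines[0].strip() != "#1.2":
        raise ValueError("line 1 must be the version string #1.2")

    # Line 2: number of data rows, then number of data columns.
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])

    # Line 3: Name, Description, then one identifier per sample.
    samples = lines[2].split("\t")[2:]
    if len(samples) != n_cols:
        raise ValueError(f"expected {n_cols} samples, found {len(samples)}")

    rows = []
    for line in lines[3:]:
        fields = line.split("\t")
        # Assumption: an empty field marks a missing expression value.
        values = [float(v) if v.strip() else None for v in fields[2:]]
        rows.append((fields[0], fields[1], values))

    if len(rows) != n_rows:
        raise ValueError(f"expected {n_rows} data rows, found {len(rows)}")
    return samples, rows
```

Running this check before submitting a file to a module catches the most common hand-editing mistake, a row or column count on line 2 that no longer matches the data.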
TXT File Format
The TXT format is a tab delimited file format that describes an expression dataset. It can be
used with the GSEA module and is organized as follows:
- The first line contains the labels Name and Description followed by the identifiers for each sample in the dataset. The Description is optional.
- Line format: Name(tab)Description(tab)(sample 1 name)(tab)(sample 2 name) (tab) ... (sample N name)
- Example: Name Description DLBC1_1 DLBC2_1 ... DLBC58_0
- The remainder of the file contains data for each of the genes. There is one line for each gene. Each line contains the gene name, gene description, and a value for each sample in the dataset. If the first line contains the Description label, include a description for each gene. If the first line does not contain the Description label, do not include descriptions for any gene.
Gene names and descriptions can contain spaces since fields are separated by tabs.
- Line format: (gene name) (tab) (gene description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)
- Example: AFFX-BioB-5_at AFFX-BioB-5_at (endogenous control) -104 -152 -158 ... -44
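Because the Description column is optional, a reader has to inspect the first line before it can locate the data columns. The Python sketch below shows one way to do that; `parse_txt` is a hypothetical name, and the code assumes every data field is numeric.

```python
def parse_txt(text):
    """Parse a TXT expression dataset with an optional Description column."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")

    # The Description label, when present, is the second field of line 1.
    has_description = len(header) > 1 and header[1] == "Description"
    first_data = 2 if has_description else 1
    samples = header[first_data:]

    rows = []
    for line in lines[1:]:
        fields = line.split("\t")
        description = fields[1] if has_description else None
        values = [float(v) for v in fields[first_data:]]
        rows.append((fields[0], description, values))
    return samples, rows
```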
CLS File Format
The CLS file format defines phenotype (class or template) labels and associates each
sample in the expression data with a label. It uses spaces or tabs to separate the fields.
The CLS file format differs somewhat depending on whether you are defining categorical
or continuous phenotypes:
- Categorical labels define discrete phenotypes (for example, normal vs tumor).
- Continuous phenotypes are used for time series experiments or to define the profile of a gene of interest (gene neighbors).
Note: Most GenePattern modules are intended for use with categorical phenotypes.
Therefore, unless the module documentation explicitly states otherwise, a CLS
file should define categorical labels.
Categorical labels
Categorical labels define discrete phenotypes (for example, normal vs tumor).
For categorical labels, the CLS file format is organized as follows:
- The first line of a CLS file contains numbers indicating the number of
samples and number of classes. The number of samples should
correspond to the number of samples in the associated RES or GCT data
file.
- Line format: (number of samples) (space) (number of classes) (space) 1
- For example: 58 2 1
- The second line in a CLS file contains names for the class numbers. The
line should begin with a pound sign (#) followed by a space.
- Line format: # (space) (class 0 name) (space) (class 1 name)
- For example: # cured fatal/ref
- The third line contains a class label for each sample.
The class labels are sequential numbers beginning with zero.
The first label used (0) is assigned to the first class named on the second line;
the second unique label (1) is assigned to the second class named; and so on.
The number of class labels specified on this line should be the same as the number of
samples specified in the first line. The number of unique class labels specified on this line
should be the same as the number of classes specified in the first line.
- Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)
- For example: 0 0 ... 1
CLS file for sample RES file (categorical labels):
all_aml_test.cls
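The consistency rules for a categorical CLS file (label count equals sample count, unique label count equals class count) are easy to check in code. The Python sketch below is an illustration, not GenePattern's parser; `parse_cls` is a hypothetical name, and the code assumes the labels are the integers 0, 1, ... in the order the classes are named on line 2.

```python
def parse_cls(text):
    """Parse a categorical CLS file and return one class name per sample."""
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Line 1: (number of samples) (number of classes) 1
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])

    # Line 2: "#" followed by the class names.
    if not lines[1].startswith("#"):
        raise ValueError("line 2 must begin with a pound sign (#)")
    class_names = lines[1][1:].split()
    if len(class_names) != n_classes:
        raise ValueError("number of class names does not match line 1")

    # Line 3: one numeric class label per sample.
    labels = [int(x) for x in lines[2].split()]
    if len(labels) != n_samples:
        raise ValueError("number of labels does not match number of samples")
    if len(set(labels)) != n_classes:
        raise ValueError("number of unique labels does not match number of classes")

    return [class_names[label] for label in labels]
```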
Continuous labels
Continuous phenotypes are used for time series experiments or to define the profile of a gene of interest (gene neighbors).
A CLS file that defines continuous labels can contain one or more labels.
The following example shows a CLS file that defines two continuous labels:
#numeric
#AFFX-BioB-5_st
206.0 31.0 252.0 -20.0 -169.0 -66.0 230.0 -23.0 67.0 173.0 -55.0 -20.0 469.0 -201.0 -117.0 -162.0 -5.0 -86.0 350.0 74.0 -215.0 193.0 506.0 183.0 350.0 113.0 -17.0 29.0 247.0 -131.0 358.0 561.0 24.0 524.0 167.0 -56.0 176.0 320.0
#AFFX-BioDn-5
75.0 142.0 32.0 109.0 -38.0 -80.0 62.0 39.0 196.0 -42.0 199.0 49.0 171.0 327.0 115.0 -71.0 85.0 80.0 270.0 182.0 208.0 -94.0 292.0 233.0 34.0 0.0 59.0 233.0 48.0 466.0 -7.0 -96.0 297.0 38.0 208.0 -15.0 30.0 357.0
- The first line contains the text "#numeric" which indicates that the file defines continuous labels.
- The remainder of the file defines the continuous phenotypes. For each phenotype:
- The first line defines the name of the phenotype; for example, #AFFX-BioB-5_st.
- The second line contains a value for each sample in the .gct file. Typically, your word processor wraps the second line of the phenotype definition, as shown in the example.
For a continuous phenotype label, the values for the samples define the phenotype profile.
The relative change in the values defines the relative distance between points in the phenotype profile.
In the example shown above, the phenotype profile is the expression profile for a gene:
the sample values for the two phenotype labels are gene expression values.
For a time series experiment, you would choose sample values that define the desired expression profile.
The example shown below assumes that you have five samples taken at 30 minute intervals.
The first phenotype label defines a phenotype profile that shows steadily increasing gene expression;
the second defines a profile that shows an initial peak and then gradual decrease:
#numeric
#IncreasingProfile
30 60 90 120 150
#PeakProfile
5 20 15 10 5
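A continuous CLS file can be read as a sequence of named profiles. The Python sketch below is a minimal illustration; `parse_numeric_cls` is a hypothetical name, and the code assumes that any line starting with "#" after the first begins a new phenotype and that value lines may wrap across several physical lines.

```python
def parse_numeric_cls(text):
    """Parse a continuous (#numeric) CLS file into {phenotype name: values}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines[0] != "#numeric":
        raise ValueError('first line must be "#numeric"')

    profiles = {}
    current = None
    for line in lines[1:]:
        if line.startswith("#"):
            # A "#" line names the next phenotype.
            current = line[1:].strip()
            profiles[current] = []
        else:
            if current is None:
                raise ValueError("values found before any phenotype name")
            # Value lines may be wrapped, so keep appending to the current profile.
            profiles[current].extend(float(x) for x in line.split())
    return profiles
```

Every profile should end up with one value per sample in the associated dataset, which a caller can verify by comparing list lengths against the sample count.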
ODF File Format
The Output Description Format (ODF) is similar to the RES or GCT file formats for datasets. The main difference is in the
header. The body of data still contains the expression level values for each gene in each sample. Thus the main data
block (after the header lines) is a matrix of values. The columns are defined by a name and optionally a description. The rows have a name (the name of the gene, for instance) and a description (the description of the gene).
The columns contain the expression values for each gene in a sample.
For example, if the first gene in the data block is a particular tyrosine kinase, then the first row holds the
expression values for that tyrosine kinase in each of the samples (columns).
Note: This ODF format is specific to GenePattern. It is not an Open Document Format (ODF) for Office Applications
as defined by
the Organization for the Advancement of Structured Information Standards (OASIS).
ODF Header for Datasets
The following example shows the header lines of an ODF file. The first five lines are required.
The line numbers are shown for easy reference; they should not be included in your file.
1. ODF 1.0
2. HeaderLines=7
3. Model= Dataset
4. DataLines= 3
5. COLUMN_TYPES: String String float float float *
6. COLUMN_DESCRIPTIONS: Sample from DFCI Sample from UK Sample from Children's
7. COLUMN_NAMES: Name Description Sample 1 Biopsy_2 Biopsy_4
8. RowNamesColumn=0
9. RowDescriptionsColumn=1
Lines 1 and 2 are required and must be the first and second lines of the file, in that order.
They signify that this is an ODF formatted file (of type 1.0) and indicate the number of
header lines that follow before the main data block (in this case 7 more). Line 3, required to be somewhere in the header of an ODF file,
defines this ODF file as containing Dataset data. Line 4 is required somewhere in the header file. It indicates the number of data
rows present in the data block.
Line 5 is required somewhere in the header file for any ODF file that
has a main data block. It defines the type of data in each column.
Line 6 is a tab-delimited list of descriptions for each column. Line 7 is a tab-delimited list of names for the columns.
Line 8 defines which column will have the row names, and Line 9 defines which column will contain the row descriptions.
Note: Following are a few notes about the ODF Header:
- The first element of each header line is a keyword. The keyword describes what kind of
metadata is found on the rest of that line.
- A remark is a human readable comment that is skipped by the parser. This line starts with the "#" character and
can contain any type of text since it is not parsed. Note that remarks are not counted as header lines and the
user can insert them "by hand".
- The number of data lines can be quite large. For example, the U133A chip measures the expression values for about
20,000 human genes. A Dataset created from several samples using the U133A gene chip could have a large value for
the DataLines tag.
- The COLUMN_NAMES:, COLUMN_DESCRIPTIONS:, and COLUMN_TYPES: lists must have the same number of elements. Also the
number of elements must be equal to the number of columns in the main data block.
- The COLUMN_NAMES:, and COLUMN_DESCRIPTIONS: could be empty, that is simply contain the proper number of tabs but no
text.
Main Data Block
The following example shows the first few lines of the main data block:
1000_at X60188 HSERK1 Human ERK1 mRNA 145.3 240.37823 158.66888
1001_at X60957 HSTIEMR Human tie mRNA 20.5 31.139397 14.053186
1002_f_at X65962 HSCP450 H.sapiens mRNA -9.6 118.06088 -8.287777
The main data block must be consistent with the header. The first COLUMN_NAMES element is "Name". This label is
associated with the first column (values: 1000_at, 1001_at, and 1002_f_at). The second column's label is
"Description" which is associated with the second column of the main data block. The next three columns
are floating point numbers that represent the gene expression values for each of the samples.
Note: The first two columns are just text data, and next three columns only contain floating point values. This is
consistent with the "String, String, float, float, float" elements in the COLUMN_TYPES: list.
Sample ODF file:
all_aml_train.preprocessed0.odf
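The header rules above (a declared header line count, remarks that are not counted, and a declared number of data lines) can be sketched in Python as follows. This is an illustration under stated assumptions, not GenePattern's parser: `parse_odf` is a hypothetical name, and the code assumes that the COLUMN_ keywords are followed by a colon and a tab-delimited list, that other header lines have the form key=value, and that remarks appear only in the header.

```python
def parse_odf(text):
    """Parse an ODF dataset into a header dictionary and a list of data rows."""
    lines = text.splitlines()
    if not lines[0].startswith("ODF"):
        raise ValueError("line 1 must identify the file as ODF")

    # Line 2: HeaderLines=<number of header lines that follow>
    n_header = int(lines[1].split("=", 1)[1])

    header = {}
    pos, counted = 2, 0
    while counted < n_header:
        line = lines[pos]
        pos += 1
        if line.startswith("#"):
            continue  # remarks are not counted as header lines
        counted += 1
        if line.startswith("COLUMN_"):
            # e.g. "COLUMN_NAMES: Name<tab>Description<tab>..."
            key, rest = line.split(":", 1)
            header[key] = rest.lstrip(" ").split("\t")
        else:
            # e.g. "Model= Dataset"
            key, value = line.split("=", 1)
            header[key.strip()] = value.strip()

    # The main data block follows the header and has DataLines rows.
    n_data = int(header["DataLines"])
    rows = [ln.split("\t") for ln in lines[pos:pos + n_data]]
    return header, rows
```

A caller could go on to check that each row has as many fields as `header["COLUMN_TYPES"]` has elements, which is the consistency requirement stated for the main data block.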
POL File Format
The POL file format represents a Parameter-Ordered List. This format is a tab-delimited file with 4 columns, consisting of the following:
- A ranking (an integer corresponding to the rank of the feature)
- Unique feature identifier
- Feature description
- Distance metric (value upon which the rank position is based)
If you don't have a distance metric, you can repeat the rank in column 4. Lines from a sample POL file are shown below:
0 X59798_at CCND1 Cyclin D1 0.0
1 S69272_s_at Cytoplasmic antiproteinase 465.5319538
2 U37012_at Cleavage and polyadenylation specificity factor 493.1997567
3 X69910_at P63 mRNA for transmembrane protein 493.5331802
4 U53347_at Neutral amino acid transporter B mRNA 515.3552173
5 M80899_at AHNAK AHNAK nucleoprotein (desmoyokin) 539.9990741
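As a rough sketch, a POL file can be generated from a list of scored features. The helper name `to_pol` is hypothetical, and two details are assumptions taken from the sample lines rather than stated rules: ranks start at 0, and features are ordered by increasing distance.

```python
def to_pol(features):
    """Build POL text from (identifier, description, distance) tuples."""
    # Assumption: smaller distances rank first, and ranks start at 0.
    ranked = sorted(features, key=lambda feature: feature[2])
    lines = [
        f"{rank}\t{identifier}\t{description}\t{distance}"
        for rank, (identifier, description, distance) in enumerate(ranked)
    ]
    return "\n".join(lines)
```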
CLM File Format
The CLM file format is a tab-delimited file format that describes the samples in a zipped collection of CEL files
(used with the ExpressionFileCreator module).
Each row of the CLM format describes a CEL file in the zip file:
- Line format: (CEL file name) (tab) (sample name) (tab) (class)
- For example: cat_a.CEL sample_cat_a tumor
- The first column contains the CEL file name (file extension is optional).
- The second column contains a sample name for the CEL file data.
- The third column contains a phenotype class name for the CEL file data.
Sample CLM file:
sample.clm
CEL File Format
Affymetrix image analysis software generates CEL files, which store information about the probes on a chip and the intensity values for the probes. GenePattern modules, such as ExpressionFileCreator and SNPFileCreator, take CEL files as input and generate data files that can be read by subsequent GenePattern modules. More information on this format is available here.
GMX File Format
The GMX and GMT file formats are tab-delimited file formats that describe gene sets (used with the GSEA module).
In the GMX format, each column represents a gene set; in the GMT format, each row represents a gene set. The GMX format is convenient for
storing a relatively small number of gene sets (<256) and is easier to edit. The GMT format is more convenient for
storing larger databases of gene sets. The GMX format contains a column for each gene set:
- Column format, reading down the column: (gene set name) (newline) (description) (newline) (gene 1) (newline) (gene 2) (newline) ... (gene N)
- For example, one column reads from top to bottom: GNF2_SPTA1 na ALS2CR3 KLF1 SLC6A8 ... CA1
- The first line contains the gene set name. Duplicate names are not allowed.
- The second line contains the gene set description. GSEA uses the description field to determine
what hyperlink to provide in the report for the gene set description: if the description is
na, GSEA provides a
link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.
- The remaining lines list the genes in the gene set.
Sample GMX file:
export_gnf.GENE_SYMBOL.gmx
GMT File Format
The GMX and GMT file formats are tab-delimited file formats that describe gene sets (used with the GSEA module).
In the GMX format, each column represents a gene set; in the GMT format, each row represents a gene set. The GMX format is convenient for
storing a relatively small number of gene sets (<256) and is easier to edit. The GMT format is more convenient for
storing larger databases of gene sets.
The GMT format contains a row for each gene set:
- Line format: (gene set name) (tab) (description) (tab) (gene 1) (tab) (gene 2) (tab) ... (gene N)
- For example: GNF2_SPTA1 na ALS2CR3 KLF1 SLC6A8 ... CA1
- The first column contains the gene set name. Duplicate names are not allowed.
- The second column contains the gene set description. GSEA uses the description field to determine
what hyperlink to provide in the report for the gene set description: if the description is
na, GSEA provides a
link to the named gene set in MSigDB; if the description is a URL, GSEA provides a link to that URL.
- The remaining columns list the genes in the gene set.
Sample GMT file:
export_gnf.GENE_SYMBOL.gmt
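Since GMX stores one gene set per column and GMT stores one per row, converting between them is a transpose of the tab-delimited table. The Python sketch below illustrates this; the function names are hypothetical, and padding shorter gene sets with empty cells in the GMX output is an assumption about how unequal set sizes are written.

```python
from itertools import zip_longest


def gmt_to_gmx(gmt_text):
    """Transpose GMT rows (one gene set per row) into GMX columns."""
    rows = [ln.split("\t") for ln in gmt_text.splitlines() if ln.strip()]
    # Shorter gene sets are padded with empty cells so every column lines up.
    columns = zip_longest(*rows, fillvalue="")
    return "\n".join("\t".join(cells) for cells in columns)


def gmx_to_gmt(gmx_text):
    """Transpose GMX columns (one gene set per column) into GMT rows."""
    rows = [ln.split("\t") for ln in gmx_text.splitlines() if ln.strip()]
    gene_sets = zip_longest(*rows, fillvalue="")
    lines = []
    for gene_set in gene_sets:
        name, description = gene_set[0], gene_set[1]
        genes = [g for g in gene_set[2:] if g]  # drop padding cells
        lines.append("\t".join([name, description] + genes))
    return "\n".join(lines)
```

Converting a GMT file to GMX and back returns the original rows, because the padding added for the column layout is removed again on the way back.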
GRP File Format
The GRP file format contains a single gene set in a simple newline-delimited text format.
Typically, you use the GMT or GMX file formats to create gene sets, rather than using the GRP file format.
The GRP format contains a line for each gene, one gene per line. Lines that start with a pound sign (#) are ignored.
Sample GRP file:
my_gene_set.grp
CHIP File Format
The CHIP file format contains annotation about a microarray (used with the GSEA module).
It lists the features (i.e., probe sets) used in the microarray along with their mapping to gene symbols (when available).
While this file is not used directly in the GSEA algorithm, it is used to annotate the output results and may also be
used to collapse each probe set in the expression dataset to a single gene vector.
Chip annotation files can be specified in a tab-delimited file format (*.chip) or in a comma-separated file format (*.csv).
The formats are identical other than the separation character (tab or comma). Typically, you use the tab-delimited (*.chip) file format.
The CHIP file format is organized as follows:
- The first line contains column headings that identify the content of
each column in the remainder of the file. The file must contain three column headings:
- Probe Set ID
- Gene_Symbol
- Gene_Title
These three columns can appear in any order. The file may contain additional columns, which will be ignored.
- For example: Probe Set ID Gene_Symbol Gene_Title
- The rest of the data file contains data for each probe set ID used in the microarray.
- Line format: (probe set id) (tab) (gene symbol) (tab) (gene title)
- For example: 205699_at MAP2K6 mitogen-activated protein kinase kinase 6
Sample CHIP file:
HG_U133A_annot.chip
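Because the three required columns can appear in any order and extra columns are ignored, it is convenient to read a CHIP file by column heading rather than by position. The Python sketch below uses the standard `csv` module; `parse_chip` is a hypothetical helper, and disabling quote handling is an assumption that gene titles contain no quoted delimiters.

```python
import csv
import io

REQUIRED_COLUMNS = {"Probe Set ID", "Gene_Symbol", "Gene_Title"}


def parse_chip(text, delimiter="\t"):
    """Map each probe set ID to its (gene symbol, gene title)."""
    # Pass delimiter="," for the comma-separated (*.csv) variant.
    reader = csv.DictReader(
        io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE
    )
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column headings: {sorted(missing)}")
    return {
        row["Probe Set ID"]: (row["Gene_Symbol"], row["Gene_Title"])
        for row in reader
    }
```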
PCL File Format
PCL files may be used as input for the GSEA module. This is a format defined at Stanford for their
cDNA expression data. More information on this format is available here.
CDT File Format
CDT files will be created by the Hierarchical Clustering Algorithm. This is a format defined at Stanford for their
Hierarchical Clustering program. More information on this format is available here.
GTR File Format
GTR files will be created by the Hierarchical Clustering Algorithm. This is a format defined at Stanford for their
Hierarchical Clustering program. More information on this format is available here.
ATR File Format
ATR files will be created by the Hierarchical Clustering Algorithm. This is a format defined at Stanford for their
Hierarchical Clustering program. More information on this format is available here.
SNP File Format
This is a tab-delimited file format that contains SNP array data. The SNPFileCreator module can create non-allele-specific SNP files or allele-specific SNP files. A SNP file contains one row for each SNP and two or three columns for each SNP array. Non-allele-specific files contain two columns for each SNP array: the intensity value and the call. Allele-specific files contain three columns for each SNP array: the intensity value for allele A, the intensity value for allele B, and the call.
Note:
Not all SNP modules accept allele-specific SNP files.
The first line of a SNP file contains a list of labels identifying the SNP arrays.
Non-allele-specific
| SNP | Chromosome | PhysicalPosition | array_1_name | array_1_name Call | ... | array_N_name | array_N_name Call |
Example
| SNP | Chromosome | PhysicalPosition | 100H_primary_GBM_101N | 100H_primary_GBM_101N Call | 100H_primary_GBM_56 | 100H_primary_GBM_56 Call |
Allele-specific
| SNP | Chromosome | PhysicalPosition | array_1_name_Allele_A | array_1_name_Allele_B | array_1_name Call | ... |
Example
| SNP | Chromosome | PhysicalPosition | 100H_primary_GBM_101N_Allele_A | 100H_primary_GBM_101N_Allele_B | 100H_primary_GBM_101N Call | ... |
The rest of the SNP file contains one row of data for each probe. Each row contains the probe intensity values and the SNP calls generated by the SNP microarray scanning software (such as Affymetrix's GeneChip software).
Non-allele-specific
| probe | chromosome | position | array_1_intensity | array_1_call | array_N_intensity | array_N_call |
Example
| SNP_A-1718890 | 1 | 2103664 | 1701.022 | AB | 1879.798 | BB | 947.904 | BB |
Allele-specific
| probe | chromosome | position | array_1_intensity_Allele_A | array_1_intensity_Allele_B | array_1 Call | ... |
Example
| SNP_A-1718890 | 1 | 2103664 | 794.811 | 898.022 | AB | ... |
Sample SNP file:
gistic_subset.snp,
gistic_subset_allele_specific.snp
SNP Sample Information File Format (.txt)
The sample information file is a tab-delimited format that describes a set of SNP arrays.
The column labels in the first row define the information provided for each array; each subsequent row
describes one SNP array.
The sample information format is organized as follows:
- The first line contains the column labels. A sample information file can contain any number of columns
and the column labels are arbitrary. However, SNP modules may require specific labels, as discussed below.
- Line format: Label-1 (tab) Label-2 (tab) ... Label-n
- For example: Array (tab) Sample (tab) Type (tab) Ploidy(numeric) (tab)
Gender (tab) Paired (tab) Platform
- The remainder of the sample information file contains a line of information for each SNP sample.
Where data is unavailable, columns may be empty.
- Line format: Col-1-data (tab) Col-2-data (tab) ... Col-N-data
- For example: S004274N_250S_123005 (tab) S004274N (tab) Normal (tab) 2
(tab) (tab) (tab) 250K_Sty
A sample information file can contain any number of columns and the column labels
are arbitrary. A SNP analysis module, however, may require a sample information file to include
specific column labels. For example, the SNP module CopyNumberDivideByNormals requires a sample
information file that includes two columns, Sample and
Ploidy(numeric). Following is a list of commonly used column labels:
- Array: Identifier for the SNP array.
- Sample: Identifier for the biological sample used to generate the SNP array data.
- Type: Brief description of the biological sample.
- Ploidy(numeric): Integer value, where ploidy=2 indicates a normal sample.
- Gender: Identifier that indicates the gender of the biological sample donor.
For a sample from a male donor, Gender=M; from a female donor, Gender=F.
- Paired: Value that identifies normal-target pairs. For the normal sample, Paired=Yes; for the target
sample, Paired is set to the sample name of the paired normal sample.
- Platform: SNP chip used to generate the array.
Note: When a SNP module requires a sample information file to include specific column labels, the module
documentation lists the required column labels. Specify required column labels exactly: they are
case-sensitive and space-sensitive.
Sample .txt file:
250K_Sampleinfofile.txt
CN File Format
This is a tab-delimited file format that contains SNP copy numbers. It contains one row for each SNP and
one column for each SNP array: the raw copy number value. It is organized as follows:
- The first line contains a list of labels identifying the SNP arrays.
- Line format: SNP (tab) Chromosome (tab) PhysicalPosition (tab) (array_1_name)
(tab) ... (array_N_name)
- For example: SNP (tab) Chromosome (tab) PhysicalPosition (tab)
MYNAH_p_Affy_plate_9_Mapping250K_Sty_A01_49068 (tab) ... MYNAH_p_Affy_plate_9_Mapping250K_Sty_A01_49084
- The rest of the CN file contains one row of data for each SNP.
- Line format: (snp) (tab) (chromosome) (tab) (position) (tab) (array_1_cn) (tab) ... (array_N_cn)
- For example: SNP_A-4249904 (tab) 17 (tab) 41420045 (tab) 2.265 (tab) ... 1.735
Note: Sort the SNPs by
chromosome and physical position (low to high). Most GenePattern modules, as well as many external tools, require sorted data.
Sample CN file:
mynah.sorted.cn
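The sorting requirement can be met with a short script. The Python sketch below sorts CN rows by chromosome and then by physical position (low to high); `sort_cn` is a hypothetical name, and placing non-numeric chromosomes (such as X and Y) after the numeric ones, in alphabetical order, is an assumption rather than a documented rule.

```python
def sort_cn(text):
    """Return CN text with data rows sorted by chromosome and position."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], [ln.split("\t") for ln in lines[1:]]

    def sort_key(row):
        chromosome, position = row[1], int(row[2])
        if chromosome.isdigit():
            chromosome_key = (0, int(chromosome), "")
        else:
            # Assumption: non-numeric chromosomes sort after numeric ones.
            chromosome_key = (1, 0, chromosome)
        return (chromosome_key, position)

    rows.sort(key=sort_key)
    return "\n".join([header] + ["\t".join(row) for row in rows])
```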
XCN File Format
This is a tab-delimited file format that is similar to the SNP file format, but contains SNP copy numbers rather than
SNP intensity values. It contains one row for each SNP and
two columns for each SNP array: the raw copy number value and the call value.
It is organized as follows:
- The first line contains a list of labels identifying the SNP arrays.
- Line format: SNP (tab) Chromosome (tab) PhysicalPosition (tab) (array_1_name)
(tab) (array_1_name) Call (tab) ... (array_N_name) (tab) (array_N_name) Call
- For example: SNP (tab) Chromosome (tab) PhysicalPosition (tab)
MYNAH_p_Affy_plate_9_Mapping250K_Sty_A01_49068 (tab) MYNAH_p_Affy_plate_9_Mapping250K_Sty_A01_49068 Call (tab) ...
MYNAH_p_Affy_plate_9_Mapping250K_Sty_A01_49084 (tab) MYNAH_p_Affy_plate_9_Mapping250K_Sty_A01_49084 Call
- The rest of the XCN file contains one row of data for each probe, including the raw copy number value
and the SNP calls generated by
SNP microarray scanning software (such as Affymetrix's GeneChip software).
Some modules, such as the SNPViewer module, require the data sorted by chromosome and physical position.
- Line format: (snp) (tab) (chromosome) (tab) (position) (tab) (array_1_cn)
(tab) (array_1_call) (tab) ... (array_N_cn) (tab) (array_N_call)
- For example: SNP_A-4249904 (tab) 17 (tab) 41420045 (tab) 2.265 (tab) AB (tab) ... 1.735 (tab) AA
Sample XCN file:
mynah.sorted.xcn
GLAD File Format
This is a tab-delimited file format that contains the output results of the GLAD module.
The GLAD module, a SNP analysis module, runs the R package Gain and Loss analysis of DNA (GLAD),
which detects altered regions in the genomic pattern. The GLAD file format is organized as follows:
- The first line contains a list of labels identifying the columns.
- Line format: Sample (tab) Chromosome (tab) Start.bp (tab) End.bp (tab) Num.SNPs (tab) Seg.CN
- The rest of the file contains one row of data for each altered chromosomal region.
- Line format: (sample) (tab) (chromosome) (tab) (startPosition) (tab) (endPosition) (tab) (numberOfSNPs) (tab) (regionCN)
- For example: MYNAH_p_Affy_plate_9_Mapping250K_Sty_A02_49084 (tab) 17 (tab) 41419603 (tab) 36581538 (tab) 6427 (tab) 2.06
Sample GLAD file:
mynah.glad
LOH File Format
This is a tab-delimited file format that contains the output results of the LOH module.
The LOH module, a SNP analysis module, detects loss of heterozygosity (LOH).
The LOH file format is organized as follows:
- The first line contains a list of labels identifying the paired samples.
- Line format: SNP (tab) Chromosome (tab) PhysicalPosition (tab) (pair_1_name) (tab) ... (pair_N_name)
- For example: SNP (tab) Chromosome (tab) PhysicalPosition (tab) SM-12VZ (tab) SM-12W1
- The rest of the LOH file contains one row of data for each probe.
- Line format: (snp) (tab) (chromosome) (tab) (position) (tab) (pair_1_loh)
(tab) ... (pair_N_loh)
- For example: SNP_A-1855068 (tab) 17 (tab) 41089766 (tab) R (tab) R
LOH call values are:
- L (LOH): AB in normal and A or B in tumor
- R (Retention): AB in both normal and tumor or No Call in normal and AB in tumor
- C (Conflict): A or B in normal and AB in tumor
- N (Non-informative call): A or B in normal, or No Call in normal or tumor
Sample LOH file:
mynah.loh
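The call values listed above can be expressed as a small decision function. This Python sketch is only an interpretation of that list, not the LOH module's implementation: the function name, the string encodings of the genotype calls ("A", "B", "AA", "BB", "AB", and "NoCall"), and the order in which overlapping rules are applied are all assumptions.

```python
HOMOZYGOUS = {"A", "B", "AA", "BB"}  # assumed encodings of "A or B"
HETEROZYGOUS = "AB"
NO_CALL = "NoCall"                   # assumed encoding of "No Call"


def loh_call(normal, tumor):
    """Classify a normal/tumor genotype pair as L, R, C, or N."""
    if normal == HETEROZYGOUS and tumor in HOMOZYGOUS:
        return "L"  # LOH: AB in normal and A or B in tumor
    if tumor == HETEROZYGOUS and normal in (HETEROZYGOUS, NO_CALL):
        return "R"  # Retention: AB in both, or No Call in normal and AB in tumor
    if normal in HOMOZYGOUS and tumor == HETEROZYGOUS:
        return "C"  # Conflict: A or B in normal and AB in tumor
    return "N"      # Non-informative: every remaining combination
```

The more specific rules are tested first, so a pair such as No Call in normal with AB in tumor is reported as R rather than N.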
Creating Input Files
When you run an analysis module, visualization module, or pipeline, GenePattern
displays the parameters for the selected modules. Often, one or more of these
parameters are input files, which must have a particular format; for example,
you might need to supply a gct or res file.
Create GCT/RES Files
GenePattern provides several modules for creating GCT and/or RES files from gene expression data from various sources.
For information about these modules:
- Log in to the public GenePattern server: http://genepattern.broadinstitute.org/gp/.
If you do not have a GenePattern account, you can register on the login page.
- Review the GCT and RES Files page of the GenePattern protocols.
Transform Tab-Delimited Text Files
Although different GenePattern modules require different file formats, all of the files are tab-delimited or space-delimited text files. Most of your gene expression data is already in tab-delimited text files or in spreadsheet and database programs, which have export features that allow you to export the data into tab-delimited text files. Therefore, creating input files for GenePattern is relatively easy:
- Start with a tab-delimited text file that contains the required gene expression data.
- Open the file in a text editor (or spreadsheet editor).
- Make the necessary format changes.
- Save the file as a tab-delimited text file with the appropriate file extension.
Convert and Process Files
The Modules page of the GenePattern web site provides a complete list of the modules and pipelines available from the Broad Institute. Modules in the Data Format Conversion category convert files from one format to another. Modules in the Preprocess & Utilities category provide methods for importing and working with data files.
Convert CDT to GCT Files
One common question from GenePattern users is how to convert a
cdt file to a gct file. Following is a brief tutorial that walks you through this process
by converting sample.cdt to
sample.imputed.gct:
- Save the sample.cdt file to
your local drive and open it in Microsoft Excel.
- Delete the CLID and GWEIGHT columns. The gct file format allows for only two columns of annotations.
- Delete the second row, which contains array identifiers (AID). The gct file format allows for only one row of identifiers.
- Add two header rows at the top of the file:
- In the first row, first cell, enter: #1.2
- In the second row, first cell, enter the number of data rows: 1553
- In the second row, second cell, enter the number of data columns: 44
- Save the modified file as a text (tab delimited) file with the name sample.gct.
- Your original cdt file contained cells that were missing data. Most GenePattern
modules require that all cells in a gct file contain data.
Use the GenePattern analysis module ImputeMissingValues.KNN to add the missing data
to your gct file. The module will take sample.gct as the input file, impute the missing data, and generate a sample.imputed.gct file.
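The manual steps above can also be scripted. The Python sketch below mirrors them: it drops the CLID and GWEIGHT columns, removes the AID row, and prepends the two GCT header lines with the counts computed from the data. It is an illustration only: `cdt_to_gct` is a hypothetical helper, it identifies the columns and the AID row by their labels (an assumption about the CDT layout), and it does not impute missing values.

```python
def cdt_to_gct(cdt_text):
    """Convert CDT text to GCT text by following the manual steps."""
    rows = [ln.split("\t") for ln in cdt_text.splitlines() if ln.strip()]
    header = rows[0]

    # Delete the CLID and GWEIGHT columns, leaving two annotation columns.
    keep = [i for i, label in enumerate(header) if label not in ("CLID", "GWEIGHT")]

    # Delete the row of array identifiers (assumed to start with "AID").
    body = [row for row in rows[1:] if row[0] != "AID"]

    table = [
        [row[i] if i < len(row) else "" for i in keep]
        for row in [header] + body
    ]

    n_rows = len(table) - 1   # data rows, excluding the identifier row
    n_cols = len(keep) - 2    # data columns, excluding the two annotation columns
    lines = ["#1.2", f"{n_rows}\t{n_cols}"] + ["\t".join(row) for row in table]
    return "\n".join(lines) + "\n"
```

Computing the counts from the data avoids typing them by hand; the result still needs to be run through ImputeMissingValues.KNN if it contains empty cells.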
Documentation Update History
| Version | Release Date | Comments |
| 3.2 | June, 2009 | Add CEL file format. Update Creating Input Files. |
| 3.1.1 | December, 2008 | Updated .cn file format: SNPs should be sorted. |
| 3.1 | December, 2007 | GenePattern 3.1 Release. No file format changes. |
| 3.0 | May 16, 2007 | Updated .cls file format |
| 3.0 | April 2007 | GenePattern 3.0 Release |