Note: If you are using Excel to edit GenePattern files, save the file as a tab-delimited text file and supply the correct file extension. Enclosing the file name in quotes prevents Excel from appending .txt to it. Also note that Excel's auto-formatting can introduce errors in gene names, as described in Zeeberg et al. (2004).
ODF File Format
The Output Description Format (ODF) is similar to the RES and GCT file formats for datasets; the main difference is in the header. The main data block (after the header lines) is still a matrix that holds the expression level values for each gene in each sample. Each column has a name and, optionally, a description. Each row has a name (for instance, the name of the gene) and a description (the description of the gene). Each sample column contains the expression values for every gene in that sample. For example, if the first gene in the data block is a particular tyrosine kinase, the first row holds the expression value of that tyrosine kinase in each sample.
Note: This ODF format is specific to GenePattern. It is not an Open Document Format (ODF) for Office Applications as defined by the Organization for the Advancement of Structured Information Standards (OASIS).
ODF Header for Datasets
The following example shows the header lines of an ODF file. The first five lines are required. The line numbers are shown for easy reference; do not include them in your file.
1. ODF 1.0
2. HeaderLines= 7
3. Model= Dataset
4. DataLines= 3
5. COLUMN_TYPES: String String float float float *
6. COLUMN_DESCRIPTIONS: Sample from DFCI Sample from UK Sample from Children's
7. COLUMN_NAMES: Name Description Sample 1 Biopsy_2 Biopsy_4
Lines 1 and 2 are required and must be the first and second lines of the file. Line 1 identifies the file as ODF (version 1.0), and line 2 gives the number of header lines that follow before the main data block (in this case, 7 more). Line 3, required somewhere in the header, declares that this ODF file contains Dataset data. Line 4, also required somewhere in the header, gives the number of data rows in the data block. Line 5 is required somewhere in the header of any ODF file that has a main data block; it defines the type of data in each column. Line 6 is a tab-delimited list of descriptions for each column. Line 7 is a tab-delimited list of names for the columns. Line 8 defines which column holds the row names, and line 9 defines which column holds the row descriptions.
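The header layout described above can be read with a small parser. The following Python sketch is illustrative only and is not GenePattern's own reader. The function name parse_odf_header is hypothetical, and the sketch assumes that line 2 uses the key HeaderLines, that "key= value" lines and "KEY: list" lines can be told apart by whichever separator appears first, and that list elements are separated by tabs.

```python
def parse_odf_header(lines):
    """Parse ODF header lines; return (header dict, index of first data line)."""
    version_line = lines[0].strip()
    if not version_line.startswith("ODF"):
        raise ValueError("line 1 must identify the file as ODF")
    header = {"version": version_line[len("ODF"):].strip()}

    key, _, value = lines[1].partition("=")
    if key.strip() != "HeaderLines":  # assumed key name for line 2
        raise ValueError("line 2 must give the number of header lines")
    remaining = int(value)

    i = 2
    while remaining > 0:
        line = lines[i].rstrip("\r\n")
        i += 1
        if line.startswith("#"):
            continue  # remarks are skipped and not counted as header lines
        remaining -= 1
        colon, equals = line.find(":"), line.find("=")
        if colon != -1 and (equals == -1 or colon < equals):
            # "KEY: a<TAB>b<TAB>c" holds a tab-delimited list
            header[line[:colon]] = line[colon + 1:].lstrip(" ").split("\t")
        else:
            # "key= value" holds a single value
            name, _, single = line.partition("=")
            header[name.strip()] = single.strip()
    return header, i


# A shortened, made-up header with four counted lines and one remark.
example = [
    "ODF 1.0",
    "HeaderLines= 4",
    "# remarks are skipped and not counted",
    "Model= Dataset",
    "DataLines= 3",
    "COLUMN_TYPES: String\tString\tfloat\tfloat\tfloat",
    "COLUMN_NAMES: Name\tDescription\tSample 1\tBiopsy_2\tBiopsy_4",
]
header, first_data_line = parse_odf_header(example)
```

Values are kept as strings; a caller would convert DataLines to an integer before using it to read the data block.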
Note the following about the ODF header:
The first element of each header line is a keyword. The keyword describes the kind of metadata found on the rest of that line.
A remark is a human-readable comment that the parser skips. A remark line starts with the "#" character and can contain any text, since it is not parsed. Remarks are not counted as header lines, so you can insert them by hand.
The number of data lines can be quite large. For example, the U133A chip measures the expression values for about 20,000 human genes. A Dataset created from several samples using the U133A gene chip could have a large value for the DataLines tag.
The COLUMN_NAMES:, COLUMN_DESCRIPTIONS:, and COLUMN_TYPES: lists must have the same number of elements, and that number must equal the number of columns in the main data block.
The COLUMN_NAMES: and COLUMN_DESCRIPTIONS: lists can be empty; that is, they can contain the proper number of tabs but no text.
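The consistency rules for the column lists can be checked mechanically. The sketch below is one possible check, not part of GenePattern: the function name check_columns is hypothetical, and the inputs are assumed to be lists already split on tabs plus the raw tab-delimited data rows. It also shows that a list made up only of tabs still yields the proper number of (empty) elements.

```python
def check_columns(names, descriptions, types, data_rows):
    """Raise ValueError if the column lists and data rows disagree; return the column count."""
    counts = {
        "COLUMN_NAMES": len(names),
        "COLUMN_DESCRIPTIONS": len(descriptions),
        "COLUMN_TYPES": len(types),
    }
    if len(set(counts.values())) != 1:
        raise ValueError(f"column lists differ in length: {counts}")
    width = len(types)
    for number, row in enumerate(data_rows, start=1):
        found = len(row.split("\t"))
        if found != width:
            raise ValueError(f"data row {number} has {found} columns, expected {width}")
    return width


types = ["String", "String", "float", "float", "float"]
names = "Name\tDescription\tSample 1\tBiopsy_2\tBiopsy_4".split("\t")
empty_descriptions = "\t\t\t\t".split("\t")  # four tabs give five empty elements
rows = ["1000_at\tHuman ERK1 mRNA\t145.3\t240.37823\t158.66888"]
width = check_columns(names, empty_descriptions, types, rows)
```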
Main Data Block
The following example shows the first few lines of the main data block:
1000_at X60188 HSERK1 Human ERK1 mRNA 145.3 240.37823 158.66888
1001_at X60957 HSTIEMR Human tie mRNA 20.5 31.139397 14.053186
1002_f_at X65962 HSCP450 H.sapiens mRNA -9.6 118.06088 -8.287777
The main data block must be consistent with the header. The first COLUMN_NAMES: element is "Name". This label is associated with the first column (values: 1000_at, 1001_at, and 1002_f_at). The second label, "Description", is associated with the second column of the main data block. The next three columns are floating-point numbers that represent the gene expression values for each of the samples.
Note: The first two columns contain only text, and the next three columns contain only floating-point values. This is consistent with the "String, String, float, float, float" elements in the COLUMN_TYPES: list.
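The link between the COLUMN_TYPES: list and the data block can be illustrated by converting each field of a row according to its declared type. This is a minimal sketch under stated assumptions: parse_data_row and CONVERTERS are hypothetical names, only the two type names that appear in the example (String and float) are handled, and the fields are assumed to be separated by tabs, with everything between the probe name and the first number treated as the description.

```python
# Only the type names shown in the example header are covered here.
CONVERTERS = {"String": str, "float": float}


def parse_data_row(line, column_types):
    """Split one tab-delimited data row and convert each field by its column type."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(column_types):
        raise ValueError(
            f"row has {len(fields)} fields, header declares {len(column_types)}"
        )
    return [CONVERTERS[kind](field) for kind, field in zip(column_types, fields)]


column_types = ["String", "String", "float", "float", "float"]
row = parse_data_row(
    "1002_f_at\tX65962 HSCP450 H.sapiens mRNA\t-9.6\t118.06088\t-8.287777",
    column_types,
)
```

A row whose field count differs from the header, or whose float column holds non-numeric text, raises ValueError, which is one way to surface an inconsistency between the header and the data block.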