Note: If you use Excel to edit GenePattern files, save the file as a tab-delimited text file with the correct file extension. Enclosing the file name in quotes prevents Excel from appending .txt to it. Also note that Excel's auto-formatting can introduce errors in gene names, as described in Zeeberg et al. (2004).
CLS File Format
The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. It uses spaces or tabs to separate the fields. The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes:
Categorical labels define discrete phenotypes (for example, normal vs tumor).
Continuous phenotypes are used for time series experiments or to define the profile of a gene of interest (gene neighbors).
Note: Most GenePattern modules are intended for use with categorical phenotypes. Therefore, unless the module documentation explicitly states otherwise, a CLS file should define categorical labels.
Categorical labels define discrete phenotypes (for example, normal vs tumor). For categorical labels, the CLS file format is organized as follows:
The first line of a CLS file contains numbers indicating the number of samples and number of classes. The number of samples should correspond to the number of samples in the associated RES or GCT data file.
(number of samples) (space) (number of classes) (space) 1
58 2 1
The second line in a CLS file contains a name for each class, listed in class-number order. The line should begin with a pound sign (#) followed by a space.
The third line contains a class label for each sample. The class labels are sequential numbers beginning with zero. The first label used (0) is assigned to the first class named on the second line; the second unique label (1) is assigned to the second class named; and so on. Note: Most GenePattern modules follow this convention of 0 as the first class, but some modules (such as GSEA) do not. If you are unsure, check the documentation for the module you are using.
The number of class labels specified on this line should be the same as the number of samples specified in the first line. The number of unique class labels specified on this line should be the same as the number of classes specified in the first line.
(sample 1 class) (space) (sample 2 class) (space) ... (sample N class)
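The three-line layout described above can be checked mechanically. The following Python code is a minimal sketch, not part of GenePattern: the function name parse_categorical_cls and the six-sample file content are hypothetical, and the check that the second line names exactly one class per declared class is an assumption rather than a stated rule. The sketch also enforces the 0-based label convention, which some modules (such as GSEA) do not follow.

```python
def parse_categorical_cls(text):
    """Parse categorical CLS text and return (class_names, labels).

    Hypothetical helper for illustration; raises ValueError when the
    counts on the first line do not match the rest of the file.
    """
    # Ignore blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        raise ValueError("a categorical CLS file needs three lines")

    # Line 1: <number of samples> <number of classes> 1
    header = lines[0].split()
    if len(header) != 3:
        raise ValueError("first line must contain three numbers")
    n_samples, n_classes = int(header[0]), int(header[1])

    # Line 2: '#' followed by the class names (spaces or tabs between fields).
    if not lines[1].startswith("#"):
        raise ValueError("second line must begin with '#'")
    class_names = lines[1][1:].split()

    # Line 3: one numeric class label per sample.
    labels = [int(tok) for tok in " ".join(lines[2:]).split()]

    if len(labels) != n_samples:
        raise ValueError("number of labels differs from number of samples")
    if len(set(labels)) != n_classes:
        raise ValueError("number of unique labels differs from number of classes")
    # Assumption: the second line names exactly one class per declared class.
    if len(class_names) != n_classes:
        raise ValueError("number of class names differs from number of classes")
    # Convention from the text: labels are sequential numbers starting at 0.
    if sorted(set(labels)) != list(range(n_classes)):
        raise ValueError("labels must be sequential numbers beginning with zero")
    return class_names, labels


# Hypothetical six-sample file with two classes.
example = "6 2 1\n# normal tumor\n0 0 0 1 1 1\n"
names, labels = parse_categorical_cls(example)
print([names[i] for i in labels])
```

Because label 0 maps to the first class name and label 1 to the second, indexing the list of names by each label recovers the class of every sample.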
Continuous phenotypes are used for time series experiments or to define the profile of a gene of interest (gene neighbors). A CLS file that defines continuous labels can contain one or more labels. The following example shows a CLS file that defines two continuous labels:
The first line contains the text "#numeric", which indicates that the file defines continuous labels.
The remainder of the file defines the continuous phenotypes. For each phenotype:
The first line defines the name of the phenotype; for example, #AFFX-BIOB-5_st.
The second line contains a value for each sample in the .gct file. Typically, your word processor wraps the second line of the phenotype definition, as shown in the example.
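A reader for this layout could be sketched in Python as follows. This is an illustrative sketch only, not GenePattern code: the function name parse_continuous_cls is hypothetical. Two behaviors are assumptions rather than stated rules: the sketch accepts a phenotype's values even if they are split across several physical lines, and it requires every phenotype to list the same number of values (one per sample).

```python
def parse_continuous_cls(text):
    """Parse continuous CLS text into {phenotype_name: [values]}.

    Hypothetical helper for illustration; raises ValueError on
    malformed input.
    """
    # Ignore blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#numeric":
        raise ValueError('first line must be "#numeric"')

    phenotypes = {}
    current = None
    for ln in lines[1:]:
        if ln.startswith("#"):
            # A line beginning with '#' names the next phenotype.
            current = ln[1:].strip()
            phenotypes[current] = []
        elif current is None:
            raise ValueError("values found before any phenotype name")
        else:
            # Assumption: values may continue over several lines.
            phenotypes[current].extend(float(v) for v in ln.split())

    # Assumption: each phenotype has one value per sample, so all
    # phenotypes must have the same number of values.
    lengths = {len(values) for values in phenotypes.values()}
    if len(lengths) > 1:
        raise ValueError("phenotypes have differing numbers of values")
    return phenotypes
```

The function takes the file content as a string, so a caller would read the CLS file first and pass its text. The returned dictionary keeps phenotypes in file order, and the length of any value list can then be compared with the number of samples in the .gct file.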
For a continuous phenotype label, the values for the samples define the phenotype profile. The relative change in the values defines the relative distance between points in the phenotype profile. In the example shown above, the phenotype profile is the expression profile for a gene: the sample values for the two phenotype labels are gene expression values. For a time series experiment, you would choose sample values that define the desired expression profile. The example shown below assumes that you have five samples taken at 30-minute intervals. The first phenotype label defines a phenotype profile that shows steadily increasing gene expression; the second defines a profile that shows an initial peak and then a gradual decrease: