Supplemental data for Staunton, et. al.


This supplement supports Staunton, et. al. PNAS 2001 98: 10787-10792; Sep 11 2001, which is available at http://www.pnas.org/cgi/content/full/98/19/10787
The expression data were generated using Affymetrix Hu6800 chips.

An earlier version of the expression data, generated on the lower density Affymetrix 6800 4-chip set, was analyzed by Butte, et. al., PNAS 2000 97: 12182-12186.
The paper is available at http://www.pnas.org/cgi/content/full/97/22/12182
The supplemental data is available at http://www-genome.wi.mit.edu/MPR/data_set_relevance_networks.html


RAW DATA (tab-deliminated text files)

Drug sensitivity data are provided as GI50s (concentrations required to inhibit growth by 50%) (GI50_RAW). The first row of the text file identifies cell lines, separated by tabs. Subsequent rows contain compound number, compound name, and GI50 for each cell line, each separated by a tab.

Expression data are provided as AFFYMETRIX "average difference" units (nci-60.expression.scfrs.txt). The data from each scan were scaled to minimize differences due to global variation in scan brightness. The first row identifies cell lines (same order as drug data file). The second row indicates for each scan the linear scaling factor that was used to transform raw average difference values to the values shown in the file. Subsequent rows contain AFFYMETRIX gene description, AFFYMETRIX accession number, and the (scaled) AFFYMETRIX "average difference" values for each cell line, each separated by a tab.

Cell lines and RNA samples

The NCI-60 cell lines include the following cell lines (tissue of origin is shown in bold):

LUNG: NCI-H23, NCI-H522, A549-ATCC, EKVX, NCI-H226, NCI-H332M, H460, H0P62, HOP92

COLON: HT29, HCC-2998, HCT116, SW620, COLO205, HCT15, KM12

BREAST: MCF7, MCF7ADRr, MDAMB231, HS578T, MDAMB435, MDN, BT549, T47D

OVARIAN: OVCAR3, OVCAR4, OVCAR5, OVCAR8, IGROV1, SKOV3

LEUKEMIA: CCRFCEM, K562, MOLT4, HL60, RPMI8266, SR

RENAL: UO31, SN12C, A498, CAKI1, RXF393, 7860, ACHN, TK10

MELANOMA: LOXIMVI, MALME3M, SKMEL2, SKMEL5, SKMEL28, M14, UACC62, UACC257

PROSTATE: PC3, DU145

CNS: SNB19, SNB75, U251, SF268, SF295, SM539

RNA samples were prepared at the NCI (http://dtp.nci.nih.gov).

Drug sensitivity data

The NCI-60 cell lines were exposed to thousands of compounds at the National Cancer Institute (NCI) as a part of a drug discovery project. Growth inhibitory effects of each compound were measured for each cell line and reported as GI50the concentration that inhibits growth by 50% (for details, http://dtp.nci.nih.gov). We analyzed only compounds that had a relatively broad and balanced range of effects across the 60 cell lines. We took the log of each GI50 value and for each compound, and normalized log10(GI50) values across all cell lines to obtain a mean of zero and standard deviation of one. Cell lines with log10(GI50) values at least 0.8 standard deviations greater than the mean for a compound were defined as resistant to the compound; those 0.8 standard deviations below the mean were defined as sensitive. Log10(GI50) values within a window of 1.6 standard deviations around the mean were considered indeterminate and were eliminated from analysis. Prediction analysis was performed for compounds whose GI50 data met the following criteria: a total of at least 30 sensitive and resistant lines, at least 10 each sensitive and resistant lines, and a 1.6 standard deviation window at least one order of magnitude in raw GI50 values. Of 5084 compounds in the initial data set, 232 met our criteria.

Selection of training samples based on drug sensitivity data

For each of the 232 compounds, sensitive and resistant cell lines were divided into a training set and a test set on the basis of drug sensitivity data alone. Training set cell lines were chosen in a manner designed to normalize for tissue of origin (see cell lines above). For each compound, one sensitive and one resistant cell line from each tissue of origin were chosen for training. The most sensitive and the most resistant samples were chosen from each tissue type; in the case of a tie, the first in the series (as ordered above) was chosen. A tissue type was used in training only if it included at least one sensitive and one resistant cell line for the compound. Training set size varied from 6 to 14 cell lines. For each compound, the test set was comprised of all sensitive and resistant lines not used in training. Test sets varied in size from 15 to 35 cell lines.

Training and test sets for each compound are available for browsing (train.testlines.xls). For each compound, normalized log(GI50) values are shown for each cell line. Sensitive cell lines are blue; resistant are pink. A heavy outline designates that the cell line was used for training for the compound.

Gene expression data

1.5 g of poly-A selected RNA from each cell line was used to prepare biotinylated cRNA target and essentially as previously described, with minor modifications [L. Wodicka, H. Dong, M. Mittmann, M. Ho, D. Lockhart, Nature Biotechnol. 15, 1359 (1997)]. Targets were hybridized to AFFYMETRIX hu6800 chips, washed, stained, and signal amplified; for complete description of procedures, see http://www-genome.wi.mit.edu/MPR/protocol_index.html. Samples were excluded if they yielded less than 15 g of biotinylated cRNA, if the hybridization was weak or if there were visible defects, such as scratches, in the array. Gene expression values were floored to 100 AFFYMETRIX "average difference" units.

Preprocessing of genes and selection of classifier genes

Each classifier is composed of a set of genes and associated weights. For each classifier, the training cell lines were used to filter and rank genes. A variation filter was imposed on genes prior to selection of marker genes. For each compound, we used only genes whose expression values changed by at least 5-fold and 500 units across training cell lines and 2-fold within each pair of training cell lines from a single tissue type. Although 1200 genes pass a 5-fold, 500 unit filter imposed across all cell lines, the number of filtered genes for a given compound was a function of training set composition. For most of the compounds, fewer than 350 genes passed the variation filter. Prediction results were not changed significantly by using a 2-fold filter (2x.5x.comp.ppt).

For each compound, genes were ranked for use as marker genes based on the correlation between their expression values and the profile of sensitivity and resistance in the training cell lines. We used a measure of correlation, P(g,c), described previously (Golub, et. al., Science 1999). Let [1(g),1(g)] and [2(g),2(g)] denote the means and SDs of the log of the expression levels of gene g for the samples in class 1 and class 2, respectively. Let P(g,c) = [1(g) 2(g)]/[1(g) + 2(g)], which reflects the difference between the classes relative to the SD within the classes. Large values of P(g,c) indicate a strong correlation between the gene expression and the class distinction, while the sign of P(g,c) being positive or negative corresponds to g being more highly expressed in class 1 or class 2. Unlike a standard Pearson correlation coefficient, P(g,c) is not confined to the range [-1, +1]. This correlation coefficient was used to weight votes for each gene (see below).

Classification by weighted voting

We used a weighted voting scheme to classify each cell line as sensitive or resistant on the basis of gene expression data (Golub, et. al., Science 1999). Each classifier uses a set of correlated marker genes to vote on the class of each cell line; genes are chosen from the list of ranked genes (described above) and the number of genes used in each classifier is determined by cross-validation (see below). Each gene in the classifier contributes a vote that is the product of the expression level in the cell line that is to be classified and the correlation from the training cell lines. The vote for each gene can be expressed as the weighted difference between the normalized log expression in the cell line to be classified and the average of the sensitive and resistance class mean expression levels, where weighting is determined by the correlation P(g,c) from the training set. The class of the cell line is determined by the sum of votes for all marker genes used in a classifier.

Cross validation on training set

Classifier models with up to 200 marker genes were used for training set cross validation to find the best number of marker genes for each compound. For each model, cross validation was performed with the entire training set: one cell line was removed, the model was trained on the remaining cell lines and tested on the withheld cell line. This was repeated for each cell line in the training set. The model that resulted in the lowest error rate in cross validation was chosen as the optimized classifier for that compound. The number of genes in each classifier varied from 5 to 200. Stability of the cross-validation models varied for different compounds. Results for each classifier model tested in cross-validation are provided (xval.accuracy.xls). The first row shows the number of genes 1 to 200used in cross-validation; subsequent rows show the compound number and name and the percent cross-validation accuracy for the given number of genes. "NA" indicates the specific number of genes was not tested. For most compounds, cross-validation included models with 1-10 genes, and models with up to 200 genes. The number of genes used generally fell in 10 gene increments, but exceptions were made for training sets with fewer than 200 genes that passed the variation filter.

Final evaluation of optimized classifier on test set

The optimized classifier set was evaluated without further modification on the independent test samples. The vote for each classifier gene in the classifier was determined as above, using the expression value from the test cell line. The size and composition of each test set varied. Classifier accuracy was scored in the test set as the average of the accuracy rates for each of the two classes, sensitive and resistant. This was designed to normalize for any bias that would be introduced by uneven test sets. Our test sets are relatively balanced, however, so this adjustment had little effect, and the average accuracy rates are quite similar to the overall accuracy rate for the test set (correct calls/total test cell lines).

Significance of each classifier accuracy

For each compound, we compared classifier accuracy to the probability of seeing the same prediction accuracy by chance if each prediction were the result of a fair coin flip. For example, consider a compound with 26 cell lines in the test set, where our classifier predicted 21 of the 26 cell lines correctly. The probability of doing at least this well by chance is 0.00125 (derivation below), so this result would be significant at the 0.05 (and even the 0.005) level. The accuracy rate and significance of each classifier performance is represented (significance.levels.xls). Of 232 compounds, a surprising 88 classifiers performed accurately with a significance of p<= 0.05 (shaded in pink in table); only 12 such compounds (5% of 232) would be expected by chance.

Of the classifiers that performed poorly, only 13 (5.6%) reached a similar level of significance (shaded in blue).

[derivation of Pr(21 correct predictions):]

Pr(>= 21 heads out of 26 fair coin flips) =

sum_{i=21 to 26} (26 choose i) (1/2)^{26-i} (1/2)^{i} =

1/(2^26) * sum_{i=21 to 26} (26 choose i) =

83,656/(2^26) = 0.00125

 

Histogram of classifier accuracies

Classifier accuracy is described as the average percent accuracy of predicting the sensitive and resistant classes. Accuracies were binned in units of 5 for viewing as a histogram in figure 2. As a control, we ran 1000 replicates of a simulation in which the 232 test sets were classified by a random coin-flip. The control histogram shows the average distribution of results from the simulation. In order to compare the distribution of the observed prediction frequencies with a control distribution derived from random coin flips, we used the Kolmogorov-Smirnov test (Numerical Recipes in C: The Art of Scientific Computing, 2nd Edition. Press, W.H., Teukolsky, S.A., Vetterling, W.T., and Flannery, B.P. Cambridge University Press, Cambridge, 1992.). This test can be used to compare two cumulative distributions, both of which vary from 0 to 1. The Kolmogorov-Smirnov test is based on the maximum absolute difference between the functions, that is the difference between the two functions at the point along the x axis that maximizes this value. After converting the curves to cumulative distributions, the maximum distance was at 34% accuracy. In addition to the difference between the functions, D, the only other parameter is N, the effective number of data points, in this case the number of compounds (232).

Determining the significance of an observation requires calculation of the following sum:

QKS(l) = 2 S (-1)(j-1) e -2j2l2

j=1

The significance level for an observed value of D and N was calculated by determining Q for the following function:

Probability = QKS ([ N +0.12 + 0.11/ N] D)

For the comparison made, the probability that the two curves were drawn from the same distribution was less than

10-24.

 

Marker genes used in optimized classifiers

The genes used for each classifier and the weights associated with them are available as a tab-deliminated text file (genes.weights_used.xls). There are two rows of values for each compound; the first contains gene identifiers, and the second contains the weights associated with each gene.