Starting with a curated list of 380 cancer-associated genes (Brentani, Caballero et al. 2003, Proc. Natl. Acad. Sci. USA 100, 13418-13423), the authors (Subramanian, Tamayo et al. 2005) mined four expression compendia for correlated gene sets. Gene neighborhoods with fewer than 25 genes at a Pearson correlation threshold of 0.8 were omitted, yielding 427 sets.
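The neighborhood construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `expression` dictionary layout (gene name mapped to an expression profile), and the choice to exclude the seed gene from its own neighborhood are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as 0.
    return cov / (sx * sy) if sx and sy else 0.0

def neighborhood(seed, expression, threshold=0.8):
    """Genes whose profile correlates with the seed's at >= threshold.

    Assumption: the seed itself is not counted as a member.
    """
    ref = expression[seed]
    return {
        gene
        for gene, profile in expression.items()
        if gene != seed and pearson(ref, profile) >= threshold
    }

def build_neighborhoods(seeds, expression, threshold=0.8, min_size=25):
    """Keep only neighborhoods with at least `min_size` genes."""
    gene_sets = {}
    for seed in seeds:
        if seed not in expression:
            continue  # seed gene not measured in this compendium
        members = neighborhood(seed, expression, threshold)
        if len(members) >= min_size:
            gene_sets[seed] = members
    return gene_sets
```

In this sketch each compendium would be processed separately, with one candidate neighborhood per seed gene per compendium; how the original analysis combined or named the resulting sets is not specified here.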
Gene sets in this collection are derived from the Gene Ontology (GO) project (www.geneontology.org). The gene sets are based on GO terms (gene_ontology_edit.obo, downloaded 1/25/2008) and their associations to human genes (gene2go, downloaded 1/22/2008).
Each GO term belongs to one of the three ontologies: molecular function (MF), cellular component (CC) or biological process (BP). A gene product might be associated with or located in one or more cellular components. It is active in one or more biological processes, during which it performs one or more molecular functions. Each ontology captures a unique aspect of the gene product.
A GO annotation consists of a GO term associated with a specific reference that describes the work or analysis upon which the association between a specific GO term and gene product is based. Each annotation must also include an evidence code to indicate how the annotation to a particular term is supported (www.geneontology.org/GO.evidence.shtml). Only associations with the following evidence codes are included in MSigDB gene sets: IDA, IPI, IMP, IGI, IEP, ISS, TAS.
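As an illustration of the evidence-code filter, the sketch below groups genes by GO term while discarding annotations whose evidence code is not one of the seven listed. The `(gene, go_id, evidence)` tuple layout and the function name are assumptions for the example; a real annotation file would first need to be parsed into this form.

```python
from collections import defaultdict

# The seven evidence codes accepted for MSigDB GO gene sets.
ALLOWED_EVIDENCE = {"IDA", "IPI", "IMP", "IGI", "IEP", "ISS", "TAS"}

def build_gene_sets(annotations, allowed=ALLOWED_EVIDENCE):
    """Map each GO term to the set of genes annotated with an allowed code.

    `annotations` is an iterable of (gene, go_id, evidence) tuples
    (a hypothetical, already-parsed representation).
    """
    gene_sets = defaultdict(set)
    for gene, go_id, evidence in annotations:
        if evidence in allowed:
            gene_sets[go_id].add(gene)
    return dict(gene_sets)
```

For example, an annotation supported only by IEA (inferred from electronic annotation) would not contribute its gene to any set under this filter.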
GO gene sets for very broad categories, such as Biological Process, have been omitted from MSigDB. GO gene sets with fewer than 10 genes have also been omitted. Gene sets with the same members have been resolved based on the GO tree structure: if a parent term has only one child term and their gene sets have the same members, the child gene set is omitted; if the gene sets of sibling terms have the same members, the sibling gene sets are omitted.
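The size filter and the two redundancy rules can be sketched as below. Several points are assumptions for illustration: the GO structure is simplified to a `children` mapping from each parent term to a list of its child terms, the size filter is applied before the redundancy rules, and "the sibling gene sets are omitted" is read literally as dropping every sibling in a group of identical sets.

```python
def resolve_gene_sets(gene_sets, children, min_size=10):
    """Apply the minimum-size filter and the parent/child and sibling rules.

    gene_sets: dict mapping GO term -> set of genes
    children:  dict mapping parent term -> list of child terms
               (hypothetical simplified view of the GO structure)
    """
    # Drop gene sets with fewer than `min_size` genes.
    kept = {term: genes for term, genes in gene_sets.items()
            if len(genes) >= min_size}

    drop = set()
    for parent, kids in children.items():
        # Rule 1: a parent with a single child and identical members
        # keeps the parent set and omits the child set.
        if len(kids) == 1:
            kid = kids[0]
            if parent in kept and kid in kept and kept[parent] == kept[kid]:
                drop.add(kid)

        # Rule 2: siblings with identical members are omitted.
        by_members = {}
        for kid in kids:
            if kid in kept:
                by_members.setdefault(frozenset(kept[kid]), []).append(kid)
        for group in by_members.values():
            if len(group) > 1:
                drop.update(group)

    return {term: genes for term, genes in kept.items() if term not in drop}
```

Grouping siblings by a `frozenset` of their members makes the identical-set comparison a dictionary lookup rather than a pairwise scan.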
We first capture relevant microarray datasets published in the immunology literature that have raw data deposited in the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo). For each published study, the relevant comparisons are identified (e.g. WT vs. KO; pre- vs. post-treatment) and brief, biologically meaningful descriptions are created. All data are processed and normalized in the same way to identify the gene sets, which correspond to the top or bottom genes (FDR < 0.25 or a maximum of 200 genes) ranked by mutual information for each assigned comparison.
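One possible reading of the selection rule is sketched below: genes passing the FDR cutoff are ranked, and the top and bottom of the ranking are each capped at 200 genes. This is an interpretation, not the documented procedure. The `(gene, signed_score, fdr)` record layout is hypothetical, and the signed score (a mutual-information value carrying the direction of change) is an assumption introduced so that "top" and "bottom" are well defined.

```python
def select_gene_sets(records, fdr_cutoff=0.25, max_genes=200):
    """Return (up, down) gene lists for one comparison.

    records: iterable of (gene, signed_score, fdr) tuples, where a positive
    score means higher in the first condition (assumed convention).
    """
    # Keep only genes that pass the FDR cutoff.
    significant = [r for r in records if r[2] < fdr_cutoff]
    # Rank from most positive to most negative score.
    ranked = sorted(significant, key=lambda r: r[1], reverse=True)
    # Top genes: strongest positive scores, capped at max_genes.
    up = [gene for gene, score, _ in ranked if score > 0][:max_genes]
    # Bottom genes: strongest negative scores, capped at max_genes.
    down = [gene for gene, score, _ in reversed(ranked) if score < 0][:max_genes]
    return up, down
```

Each comparison would then yield up to two gene sets, one for each direction.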
This resource is generated as part of the Human Immunology Project Consortium (HIPC; http://www.immuneprofiling.org).