Data Primer

The general TCGA data primer can be found here and should be considered an all-inclusive reference.


Barcoding Scheme for TCGA Samples

The TCGA sample ID is a series of fields separated by dashes. The first four letters are either “TCGA” or the four-letter code for the cancer type (e.g., “LUSC” for lung squamous cell). The next field is a unique two-character code for the tissue source site (TSS) from which the sample came. The field after that is a unique four-character identifier for a specific sample from that TSS. Taken together, these three fields should uniquely identify the geographical source (clinic) and the suspected cancer type, as determined by pathology, of a sample from a single patient.

The remaining fields identify subsets of that sample: for instance, when multiple samples were taken from a single cancer and placed in separate vials, or when aliquots were later derived from these biopsies and processed on specific plates at specific centers.
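The field layout described above can be sketched as a small parser. This is a minimal illustration rather than an official TCGA tool: the names `SampleId`, `parse_sample_id`, and `patient_key` are hypothetical, the regular expression assumes exactly the field widths stated in the text (four letters, two characters, four characters), and the trailing fields are kept as opaque strings because their meaning is not spelled out here. The barcode used in the usage note is only an illustrative value.

```python
import re
from typing import NamedTuple, Tuple

# Assumed layout, taken from the description above:
#   <4 letters>-<2-char TSS>-<4-char sample>[-<further fields>...]
_SAMPLE_ID_RE = re.compile(
    r"^(?P<prefix>[A-Z]{4})"
    r"-(?P<tss>[A-Z0-9]{2})"
    r"-(?P<sample>[A-Z0-9]{4})"
    r"(?:-(?P<rest>.+))?$"
)


class SampleId(NamedTuple):
    """Hypothetical container for the parsed fields."""
    prefix: str                 # "TCGA" or a four-letter cancer type code
    tss: str                    # two-character tissue source site code
    sample: str                 # four-character identifier within that TSS
    subfields: Tuple[str, ...]  # vial / aliquot / plate / center fields, unparsed


def parse_sample_id(barcode: str) -> SampleId:
    """Split a sample ID into its leading fields and any trailing subfields."""
    match = _SAMPLE_ID_RE.match(barcode.strip().upper())
    if match is None:
        raise ValueError(f"not a recognizable sample ID: {barcode!r}")
    rest = match.group("rest")
    subfields = tuple(rest.split("-")) if rest else ()
    return SampleId(match.group("prefix"), match.group("tss"),
                    match.group("sample"), subfields)


def patient_key(barcode: str) -> str:
    """Return the first three fields, which identify source, cancer type and patient."""
    parsed = parse_sample_id(barcode)
    return "-".join(parsed[:3])
```

For example, `patient_key("TCGA-02-0001-01C-01D-0182-01")` keeps only the first three fields, so aliquots derived from the same sample can be grouped under one key.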

Data Types

  • Clinical Data: clinical data derived from patient charts from a physician at the TSS or derived from pathology.
  • Copy Number CGH: regions of statistically significant copy number change across samples from the CGH platform.
  • Copy Number SNP: regions of statistically significant copy number change across samples from the SNP platform.
  • LOH SNP: statistically significant LOH from all samples using the SNP platform.
  • SNP: unique combination of SNPs associated with the sample from the SNP platform.
  • Methylation: statistically significant methylated genes across samples.
  • Expression Exon: statistically significant exons present across samples.
  • Expression Gene: statistically significant genes expressed across samples.
  • Expression miRNA: statistically significant miRNA expression across samples.
  • Mutation: significant mutations across samples, usually from a sequencing platform (whole genome or exome).

Data Levels

What each level means for a given data type is described on the TCGA Wiki. Please note that permission must be requested and granted for access to lower-level (level 1 and 2) data. Level 3 and 4 data are freely available from the public links elsewhere on this site and/or from the dbGaP and SRA archives. None of these data are downloadable directly from the Broad; they are served from a third-party centralized storage source.

Data Level   Level Type   Description
1            Raw          Low-level data for single sample
                          Not normalized
2            Processed    Normalized single sample data
                          Interpreted for presence or absence of specific molecular abnormalities
3            Segmented    Aggregate of processed data from single sample
                          Grouped by probed loci to form larger contiguous regions (in some cases)
4            Summary      Regions of Interest (ROI)
                          Quantified association across classes of samples
                          Associations based on two or more of: molecular abnormalities,
                          sample characteristics, clinical variables
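
The level numbering and the access rule stated in this section (levels 1 and 2 need granted permission, levels 3 and 4 are public) can be captured in a few lines. This is only a sketch for illustration; `DataLevel` and `requires_permission` are hypothetical names, not part of any TCGA or Broad software.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """The four data levels, named after their level type."""
    RAW = 1
    PROCESSED = 2
    SEGMENTED = 3
    SUMMARY = 4


def requires_permission(level: DataLevel) -> bool:
    """Levels 1 and 2 are controlled access; levels 3 and 4 are public."""
    return level <= DataLevel.PROCESSED


if __name__ == "__main__":
    for level in DataLevel:
        access = "controlled" if requires_permission(level) else "public"
        print(f"Level {level.value} ({level.name.title()}): {access}")
```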
