ABSOLUTE Documentation, v1

Description: Extracts absolute copy numbers per cancer cell from a mixed DNA population. Use this module for the per-sample processing step in the workflow (usually after HAPSEG).

Author: Scott Carter, Matthew Meyerson, Gad Getz

Algorithm Version: ABSOLUTE 1.0.6

Contact:

absolute-help@broadinstitute.org, http://www.broadinstitute.org/cancer/cga/cga_forums, gp-help@broadinstitute.org

Summary

The ABSOLUTE module takes copy number data segmented by haplotype, such as the output of HAPSEG, and determines possible models for absolute copy numbers per cancer cell from a mixed DNA population.  This output should be used as input for the ABSOLUTE.summarize module.

Introduction

The human genome typically consists of a set of chromosome pairs.  Each chromosome in a pair, known as a homolog, is derived from one parent.  This paired state is referred to as diploid, whereas the set of chromosomes from a single parent is the haploid genome.  For a given gene on a given chromosome, there is a comparable, if not identical, copy of the gene on the other chromosome in the pair; each version is known as an allele.

Cancer cells frequently have large structural alterations in their chromosomes that change the number of copies of affected genes on those chromosomes.  Thus, instead of having a homologous pair of alleles for a given gene, there may be deletions or duplications of those genes.  At a heterozygous marker, where the two alleles differ, this can lead to an unequal contribution from one allele relative to the other, altering the copy number of a given allele.

Variations in copy number, as reflected in the ratio of cancer cell copy number to normal cell copy number (the relative copy number), can be informative regarding the structure and history of the cancer, and are relatively easy to determine.  However, when DNA is extracted from an admixed population of cancer and normal cells, the information regarding absolute copy number per cancer cell is lost in the mixing and must be inferred.

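Inferring absolute copy number is difficult for three reasons:

  • Cancer cells are nearly always intermixed with an unknown fraction of normal cells (tumor purity)
  • The actual DNA content of the cancer cells after gross numerical and structural chromosomal changes is unknown (tumor ploidy)
  • The cancer cell population may be heterogeneous, possibly because of ongoing subclonal evolution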

Let's briefly discuss the biases introduced by purity and ploidy in cancer samples.

Purity

Tumor tissue usually consists of a mixture of many tumor clones — that is, cell lines originating from different sets of chromosomal rearrangements, or subclones — and normal diploid cells.  Determining the degree of normal cell contamination is important because as the percentage of normal cells in a tumor sample increases, the ability to extract meaningful data from the sample regarding copy number and gene expression decreases.

A common method of determining the degree of contamination is direct microscopic review of the tumor pathology in tissue blocks taken from a tumor specimen.  In many cases, however, pathological review is carried out on tissue blocks from tumor regions physically distant from the tissue block used for DNA extraction.  Because of irregularities in the tumor shape and variability of normal cell contamination, this review is less accurate than could be desired.  As a result, many researchers prefer to assay the DNA sample directly and estimate its purity mathematically.

Ploidy

The variable ploidy of cancer cells affects the preprocessing used for many copy number algorithms.  When DNA is seeded to a microarray, a fixed mass of DNA is used.  This means that for non-tumor samples derived from diploid cells, each microarray well represents a similar amount of DNA, and likewise, a similar number of cells.  The signals derived from each single nucleotide polymorphism (SNP) on the microarray are directly proportional to the allelic copy number for all samples.

When the amount of DNA delivered to each microarray well is controlled by mass for cancer cells, the number of cells represented can vary.  For example, in a tetraploid cancer sample, which has twice as many chromosomes in each cancer cell, a microarray well will represent half as many cells as it would for a diploid sample.  If the tetraploid sample has a region with the same copy number as the same region in a diploid sample, wells designed to hybridize in this region will produce half the signal in the tetraploid sample that they produce in the diploid sample.  The signal is no longer proportional to copy number.
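The arithmetic in the tetraploid example can be checked with a short sketch.  The function name and the unit DNA mass are illustrative assumptions and are not part of ABSOLUTE; the sketch also assumes a pure tumor sample and a signal that is proportional to the number of copies of a locus loaded in the well.

```python
def locus_signal(copies_at_locus, ploidy, dna_mass=1.0):
    """Signal at one locus when a fixed DNA mass is loaded (arbitrary units).

    Assumption: DNA mass per cell is proportional to ploidy, so the number
    of cells loaded is dna_mass / ploidy.
    """
    cells_loaded = dna_mass / ploidy
    return copies_at_locus * cells_loaded


# Same region, same copy number (2), but different overall ploidy.
diploid = locus_signal(copies_at_locus=2, ploidy=2)
tetraploid = locus_signal(copies_at_locus=2, ploidy=4)

print(tetraploid / diploid)  # 0.5
```

The same copy number yields half the signal in the tetraploid sample, which is why the raw signal cannot be read as copy number without knowing ploidy.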

Algorithm

The purpose of ABSOLUTE is to extract the absolute copy number per cancer cell from the mixed DNA population.  It does this in three steps:

  1. Estimating the tumor purity and ploidy from observed relative copy profiles and, optionally, from point mutation data, if available
  2. Using a large and diverse sample collection to help resolve ambiguous cases, since purity and ploidy cannot always be fully determined from a single sample
  3. Attempting to account for copy number alterations and point mutations in tumor subclones

The ABSOLUTE module accepts segmented copy number data as input, together with pre-computed models of recurrent cancer karyotypes and, optionally, allelic fraction values for somatic point mutations.  The output of ABSOLUTE then estimates the absolute cellular copy number of local DNA segments and, for point mutations, the number of mutated alleles.
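As a rough illustration of why purity and ploidy must be estimated together, the sketch below uses the mixing relationship described in the ABSOLUTE paper, in which a locus with integer copy number q in the cancer cells has relative copy number (purity × q + 2 × (1 − purity)) / D, with D = purity × ploidy + 2 × (1 − purity).  The function names are hypothetical, and this is not the module's own code.

```python
def relative_copy_number(q, purity, ploidy):
    """Relative copy number of a locus with q copies per cancer cell."""
    mixed_copies = purity * q + 2.0 * (1.0 - purity)      # tumor + normal cells
    mean_ploidy = purity * ploidy + 2.0 * (1.0 - purity)  # sample-wide average
    return mixed_copies / mean_ploidy


def absolute_copy_number(r, purity, ploidy):
    """Invert the relationship above for a known purity and ploidy."""
    mean_ploidy = purity * ploidy + 2.0 * (1.0 - purity)
    return (r * mean_ploidy - 2.0 * (1.0 - purity)) / purity


# Round trip for one made-up segment.
r = relative_copy_number(q=3, purity=0.6, ploidy=2.8)
print(round(absolute_copy_number(r, purity=0.6, ploidy=2.8), 6))  # 3.0

# Ambiguity: a pure diploid sample with 1 copy and a pure tetraploid
# sample with 2 copies produce the same relative copy number.
print(relative_copy_number(1, 1.0, 2.0), relative_copy_number(2, 1.0, 4.0))
```

The last line shows two different purity/ploidy models that explain the same relative value, which is the kind of ambiguity the second step above is meant to resolve.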

Workflow

The common workflow is to process SNP data with the HAPSEG GenePattern module and pass the results to ABSOLUTE.  Alternatively, you can supply a tab-delimited segmentation file (e.g., from array CGH or massively parallel sequencing experiments); this file must contain the columns "Chromosome", "Start", "End", "Num_Probes", and "Segment_Mean". Your file may contain other columns besides these, but at a minimum, these columns must be specified. To run with a file other than those produced by HAPSEG, you must also select "total" for the copy number type parameter. See the Example Data link below for sample HAPSEG output. For a quick look at the workflow, see the overview page.
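Before uploading a tab-delimited segmentation file, it may help to confirm that the required columns are present.  The following is a minimal sketch that uses only the Python standard library; the helper name and the example row are made up for illustration, and the check is not part of the module.

```python
import csv
import io

REQUIRED_COLUMNS = ["Chromosome", "Start", "End", "Num_Probes", "Segment_Mean"]


def missing_columns(handle):
    """Return the required column names absent from the file's header row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Illustrative in-memory file; the values are invented.
example = io.StringIO(
    "Chromosome\tStart\tEnd\tNum_Probes\tSegment_Mean\tNote\n"
    "1\t61735\t1510801\t226\t0.0855\textra columns are allowed\n"
)
print(missing_columns(example))  # []
```

For a file on disk, the same function can be called on a handle opened with open(path, newline="").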

Multiple ABSOLUTE results can be summarized using the ABSOLUTE.summarize module and final solutions chosen – after analyst review – with the ABSOLUTE.review module.

More information about ABSOLUTE, its parameters, and its use is available from the website of the Broad Institute Cancer Genome Analysis (CGA) group.  Note that the site describes the ABSOLUTE algorithm in terms of its software functions; this GenePattern module executes the per-sample RunAbsolute function.

References

Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, Beroukhim R, Pellman D, Levine DA, Lander ES, Meyerson M, Getz G.  Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413-21.

Parameters

Name Description
seg dat file * A HAPSEG output file (<plate.name>_<array.name>.segdat.RData) or other segmented copy number data file.  If you supply a tab-delimited segmentation file, see the Input Files section for file details.
output file name base * If specified, provides a base filename for all output files. The default value is the sample name parameter.
sigma p * Provisional value of excess sample level variance used for mode search.  Default: 0
max sigma h * Maximum value of excess sample level variance.  For more details, see equation 6 in the ABSOLUTE paper.  Default: 0.015
min ploidy * Specifies the minimum ploidy value for the algorithm to consider; models implying lower ploidy values will be discarded.  Default: 0.95
max ploidy * Specifies the maximum ploidy value for the algorithm to consider; models implying greater ploidy values will be discarded.  Default: 10
primary disease * Primary disease of the sample.  This is used for display and reporting purposes only.
platform * The chip type used.  Supported chips are:
  • SNP_250K_STY
  • SNP_6.0 (default)
  • Illumina_WES
sample name * The name of the sample.  This is used for display and reporting purposes only.
max as seg count * Maximum number of allelic segments. Samples with a higher segment count will be flagged as 'failed'.  Default: 1500
max neg genome * Because of noise in the data, ABSOLUTE may sometimes model the fraction of the genome attributed to tumor subclones as less than zero.  This parameter specifies the maximum fraction of the genome that can be modeled as less than zero without discarding a given solution.  Default: 0.005
max non clonal * Maximum genome fraction that may be modeled as non-clonal — that is, as being derived from tumor subclones. Solutions implying greater values will be discarded.  Default: 0.05
copy number type * The copy number type to assess.  Options include:
  • allelic (default)
  • total
If you are supplying a tab-delimited segmentation file (e.g., from array comparative genomic hybridization [CGH] or massively parallel sequencing experiments), you must select "total" for this parameter.
maf file If available, a Mutation Annotation Format (MAF) file (see Input Files for more details). This specifies the data for somatic point mutations to be used by ABSOLUTE.
min mut af If specified, a minimum mutation allelic fraction; that is, the fraction of alleles at a site that show the mutation. Mutations with lower allelic fractions will be filtered out before analysis. Note that if maf file is specified, min mut af must also be specified.

* - required
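The way the discard thresholds combine can be sketched as a simple filter over candidate solutions.  The dictionary keys and the candidate values below are hypothetical and chosen only for illustration; they do not reflect ABSOLUTE's internal data structures, and the limits are the documented defaults.

```python
# Documented default thresholds from the parameter table above.
DEFAULT_LIMITS = {
    "min_ploidy": 0.95,
    "max_ploidy": 10.0,
    "max_neg_genome": 0.005,
    "max_non_clonal": 0.05,
}


def keep_solution(solution, limits=DEFAULT_LIMITS):
    """Return True if a candidate solution passes every threshold.

    `solution` is a hypothetical dict with keys 'ploidy', 'neg_genome'
    (fraction of genome modeled below zero) and 'non_clonal' (fraction of
    genome modeled as subclonal).
    """
    return (
        limits["min_ploidy"] <= solution["ploidy"] <= limits["max_ploidy"]
        and solution["neg_genome"] <= limits["max_neg_genome"]
        and solution["non_clonal"] <= limits["max_non_clonal"]
    )


# Made-up candidates, for illustration only.
candidates = [
    {"ploidy": 2.1, "neg_genome": 0.000, "non_clonal": 0.01},
    {"ploidy": 0.8, "neg_genome": 0.000, "non_clonal": 0.01},  # ploidy too low
    {"ploidy": 3.9, "neg_genome": 0.020, "non_clonal": 0.01},  # too much negative genome
    {"ploidy": 2.0, "neg_genome": 0.001, "non_clonal": 0.20},  # too subclonal
]
print([keep_solution(c) for c in candidates])  # [True, False, False, False]
```

Loosening a limit, for example raising max non clonal, would admit solutions that the defaults reject.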

Input Files

  1. <seg.dat.file>

A HAPSEG output file or tab-delimited segmentation file. If you supply a tab-delimited segmentation file (e.g., from array comparative genomic hybridization [CGH] or massively parallel sequencing experiments) not generated by HAPSEG, this file must contain the columns "Chromosome", "Start", "End", "Num_Probes", and "Segment_Mean". Your file may contain other columns besides these, but at a minimum, these columns must be specified.

  2. <maf.file>

If available, a Mutation Annotation Format (MAF) file that specifies the data for somatic point mutations to be used by ABSOLUTE.  Note that the MAF specification has changed over time, and no particular version is required, but this file must contain at least the following columns:

  • one of i_t_ref_count, t_ref_count
  • one of i_t_alt_count, t_alt_count
  • dbSNP_Val_Status
  • Start_position
  • Tumor_Sample_Barcode
  • Hugo_Symbol
  • Chromosome
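The effect of the min mut af parameter on such a file can be sketched as follows.  This assumes that the allelic fraction is the alternate read count divided by the sum of the alternate and reference read counts, which is a common definition but is not confirmed here as the module's exact computation.  The helper names and the example rows are hypothetical.

```python
def first_present(row, *names):
    """Return the value of the first listed column that exists in the row."""
    for name in names:
        if name in row and row[name] != "":
            return row[name]
    raise KeyError("none of the columns %s are present" % (names,))


def allelic_fraction(row):
    """Assumed definition: alt reads / (alt reads + ref reads)."""
    ref = float(first_present(row, "t_ref_count", "i_t_ref_count"))
    alt = float(first_present(row, "t_alt_count", "i_t_alt_count"))
    total = ref + alt
    return alt / total if total > 0 else 0.0


def filter_mutations(rows, min_mut_af):
    """Keep mutations whose allelic fraction is at least min_mut_af."""
    return [row for row in rows if allelic_fraction(row) >= min_mut_af]


# Made-up rows using the two accepted column spellings.
rows = [
    {"Hugo_Symbol": "GENE_A", "t_ref_count": "95", "t_alt_count": "5"},
    {"Hugo_Symbol": "GENE_B", "i_t_ref_count": "70", "i_t_alt_count": "30"},
]
kept = filter_mutations(rows, min_mut_af=0.1)
print([row["Hugo_Symbol"] for row in kept])  # ['GENE_B']
```

Rows read with csv.DictReader from a tab-delimited MAF file have the same dictionary shape and could be passed to filter_mutations directly.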

Output Files

  1. <output.file.name.base>_plot.pdf

Plot showing the purity/ploidy values and the solutions

  2. <output.file.name.base>.RData

An R data file containing an object, ‘seg.dat’, that provides all of the information used to generate the plot.

Example Data

A set of HAPSEG example data from the CGA group is available at:

ftp://ftp.broadinstitute.org/pub/genepattern/example_files/HAPSEG_1.1.1/paper_example.zip

This can be run through HAPSEG and the output supplied to ABSOLUTE.  Note that there is a README file in the ZIP archive that provides the filenames and parameters you will need to run this example data through HAPSEG, ABSOLUTE, ABSOLUTE.summarize, and ABSOLUTE.review.

Requirements

ABSOLUTE can only be used on the GenePattern public server, as it requires a specialized installation process that prevents distribution via the repository.  Please contact the authors listed above if you have an interest in installing ABSOLUTE locally. 

Acceptance of the module license is required for its use.  A copy of the license text is available here: http://www.broadinstitute.org/cancer/cga/sites/default/files/images/ABSOLUTE_HAPSEG_license_2013.pdf

The ABSOLUTE module runs only on GenePattern 3.4.2 or above and requires R 2.15.2 with the following packages:
  • numDeriv_2012.9-1
  • getopt_1.17
  • optparse_0.9.5

Each of these R packages will be automatically downloaded and installed when the module is installed.  R 2.15.2 must be installed and configured independently.

Platform Dependencies

Module Type: SNP Analysis
CPU Type: any
OS: any
Language: R 2.15.2

GenePattern Module Version Notes

Version   Release Date   Description
1         2013-06-30     Initial version.