Overview


Oncotator is a tool to annotate point mutations and indels with functional data relevant to cancer researchers. Annotations include gene names, functional consequence (e.g. Missense), PolyPhen-2 predictions, and cancer-specific annotations from resources such as COSMIC, Tumorscape, and published MutSig results.



GAF Reference Set

To determine mutation consequence, Oncotator uses a set of 73,671 reference transcripts derived from the UCSC Genome Browser’s UCSC Genes track and from microRNAs in miRBase release 153, as provided in the TCGA General Annotation Files (GAF) library. More information about the GAF can be found here.



Latest stable Oncotator version and resources used




The input is a tab-delimited file with five required fields. A header line is optional, but if supplied, it must contain the five headers listed below.


The required fields are:

  1. 'chr' - Chromosome, ‘chr’ prefix is optional (e.g. ‘chr10’ and ‘10’ are both valid).
  2. 'start' - 1-based start coordinate of reference allele.
  3. 'end' - 1-based end coordinate of reference allele.
  4. 'reference_allele' - Positive strand reference allele at positions given above.
  5. 'observed_allele' - Observed allele.
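
As a minimal sketch, the five-field layout above can be produced with Python's standard `csv` module. The function name `write_input` and the in-memory buffer are illustrative assumptions, not part of Oncotator; only the header names come from the list above.

```python
import csv
import io

# Header names exactly as listed in the required fields above.
FIELDS = ["chr", "start", "end", "reference_allele", "observed_allele"]

def write_input(rows, handle, header=True):
    """Write mutation rows as tab-delimited text (hypothetical helper).

    The header line is optional; when written, all five headers are emitted.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(FIELDS)
    for row in rows:
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}: {row!r}")
        writer.writerow(row)

# Example: one point mutation written to an in-memory buffer.
buffer = io.StringIO()
write_input([("chr4", 150, 150, "A", "T")], buffer)
text = buffer.getvalue()
```

Writing to a real file works the same way: pass a handle opened with `open(path, "w", newline="")` instead of the buffer.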
Representing different types of alterations

point mutations
chr4 150 150 A T

insertions

Use ‘-’ in the reference_allele field. The start and end coordinates must indicate the two adjacent bases between which the insertion occurs.

chr4 150 151 - T

deletions

Use ‘-’ in the observed_allele field to denote deletion of the given reference allele.

chr4 150 150 A -

VCF-style formatting

Indels can also be represented with VCF-style formatting. For example, the insertion and deletion above can be represented as follows:

chr4 150 150 A AT
chr4 150 151 AG A
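
The relationship between the two notations can be sketched as a small conversion. This is an illustration under stated assumptions, not Oncotator's own logic: it assumes a VCF-style indel shares exactly one leading anchor base between the two alleles, and that a VCF-style deletion removes the bases following that anchor. The function name `vcf_to_dash` is hypothetical.

```python
def vcf_to_dash(chrom, start, end, ref, alt):
    """Convert a VCF-style indel to the '-' notation (hypothetical helper).

    Assumes ref and alt share a single leading anchor base; any other
    record (e.g. a point mutation) is returned unchanged.
    """
    if len(ref) == 1 and len(alt) > 1 and alt[0] == ref:
        # Insertion: the new bases sit between the anchor base and the next base.
        return (chrom, start, start + 1, "-", alt[1:])
    if len(alt) == 1 and len(ref) > 1 and ref[0] == alt:
        # Deletion: the removed bases begin one position after the anchor base.
        return (chrom, start + 1, start + len(ref) - 1, ref[1:], "-")
    return (chrom, start, end, ref, alt)

insertion = vcf_to_dash("chr4", 150, 150, "A", "AT")
deletion = vcf_to_dash("chr4", 150, 151, "AG", "A")
```

Under these assumptions the VCF-style insertion maps back to `chr4 150 151 - T`, matching the insertion example above.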


Oncotator outputs annotated mutations in tab-delimited Mutation Annotation Format (MAF).

  • First line begins with '##' and provides Oncotator and resource version information.
  • If multiple values exist within a field, a pipe “|” character will be used as a delimiter.
  • Column indices 1-32 are dictated by TCGA MAF specification 2.2. Details for column indices 33-80 are provided in the table below.
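
A minimal sketch of reading such output with Python's standard library follows. It assumes the line after the '##' version line is the column header row, which the text above does not state explicitly. The function names and the two-column sample are hypothetical and only illustrate the layout.

```python
import csv
import io

def read_maf(handle):
    """Return (version_line, rows) from an Oncotator MAF stream (hypothetical helper).

    Assumes the first line starts with '##' and the next line holds column headers.
    """
    version_line = handle.readline().rstrip("\n")
    if not version_line.startswith("##"):
        raise ValueError("expected a '##' version line first")
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return version_line, list(reader)

def split_values(field):
    """Split a multi-valued field on the pipe delimiter; empty fields give []."""
    return field.split("|") if field else []

# Hypothetical, heavily truncated sample: real output has many more columns.
sample = io.StringIO(
    "## placeholder version line\n"
    "Hugo_Symbol\tGO_Cellular_Component\n"
    "EGFR\tcytosol|nucleus\n"
)
version, rows = read_maf(sample)
components = split_values(rows[0]["GO_Cellular_Component"])
```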

Column Header Descriptions
Index MAF Column Header Description of Values Example
33 Genome_Change String describing '+' strand genomic coordinates and alleles. g.chr7:55227009T>G
34 Annotation_Transcript UCSC transcript ID of transcript used for annotation. uc003tqk.1
35 Transcript_Strand Strand orientation of the above transcript. +
36 Transcript_Exon Indicates the exon number of the reference transcript that the mutation affects. 9
37 Transcript_Position Describes absolute start and end coordinates (separated by an underscore character) with respect to the reference transcript given in the “Annotation_Transcript” column. Note that these coordinates will differ from the coding region coordinates used in the “cDNA_Change” and “Codon_Change” columns. Only one number will be provided if the start and end coordinates are the same. 2099_2100
38 cDNA_Change Coding position and alleles. Coordinates are coding sequence coordinates. c.2573T>G
39 Codon_Change String describing transcript coordinates and alleles in context of codon sequences involved. c.(2572-2574)CTG>CGG
40 Protein_Change Protein position and alleles involved. p.L858R
41 Other_Transcripts HUGO symbol, UCSC transcript ID, variant classification, and protein change of other transcripts overlapping the mutation. Use the "ALL" transcripts output option to see detailed annotations for each of the transcripts in this field. EGFR_uc010kzg.1_Missense_Mutation_p.L813R
42 Refseq_mRNA_Id RefSeq transcript ID. NM_005228
43 Refseq_prot_Id RefSeq protein ID. NP_005219
44 SwissProt_acc_Id UniProt accession ID. P00533
45 SwissProt_entry_Id UniProt entry name ID. EGFR_HUMAN
46 Description If available, description text for transcript. epidermal growth factor receptor isoform a
47 UniProt_AApos UniProt protein position used to derive position-specific annotations. This can differ from the protein position listed in the 'Protein_Change' field if the UCSC and UniProt protein sequences differ. 858
48 UniProt_Region Overlapping UniProt regions of interest (e.g. functional domain or repeat region). Cytoplasmic (Potential).|Protein kinase.
49 UniProt_Site Overlapping UniProt single amino acid sites of interest (e.g. cleavage or inhibitory sites for proteases). ATP (By similarity).
50 UniProt_Natural_Variants Overlapping UniProt natural variants (e.g. disease-associated mutations or RNA editing events). S -> C (in Beare-Stevenson cutis gyrata syndrome).
51 UniProt_Experimental_Info Overlapping UniProt sites with experimental data (e.g. mutagenesis data leading to protein activity inhibition). D->A: Loss of kinase activity.
52 GO_Biological_Process Gene Ontology terms describing pathways and processes UniProt protein is involved in. anoikis|cell cycle arrest|energy reserve metabolic process
53 GO_Cellular_Component Gene Ontology terms describing localization of given UniProt protein. cytosol|nucleus
54 GO_Molecular_Function Gene Ontology terms describing molecular activity of given UniProt protein. ATP binding|magnesium ion binding|protein serine/threonine kinase activity
55 COSMIC_overlapping_mutations Protein changes of overlapping alterations. Number of samples in COSMIC with said mutation is in parentheses. p.V617F(27905)|p.V617_C618>FR(2)|p.V617I(1)
56 COSMIC_fusion_genes Gene symbols of fusion events involving gene in COSMIC. Number of samples in COSMIC with said mutation is in parentheses. PCM1/JAK2(30)|PAX5/JAK2(18)|ETV6/JAK2(11)
57 COSMIC_tissue_types_affected Tissue type summary of tumor samples involving gene in COSMIC. Number of samples in COSMIC is in parentheses. haematopoietic_and_lymphoid_tissue(28274)|lung(5)|breast(4)
58 COSMIC_total_alterations_in_gene Total number of records for gene in COSMIC. 28285
59 Tumorscape_Amplification_Peaks Overlapping significant GISTIC amplification focal peaks from Tumorscape. The number of genes in the peak and the peak q-value are given in parentheses. Only peak regions with a q-value <= 0.20 are reported. all_cancers(1;1.57e-46)|all_epithelial(1;5.62e-37)|Lung NSC(1;9.29e-25)
60 Tumorscape_Deletion_Peaks Overlapping significant GISTIC deletion focal peaks from Tumorscape. The number of genes in the peak and the peak q-value are given in parentheses. Only peak regions with a q-value <= 0.20 are reported. Lung NSC(174;0.0841)|all_lung(145;0.106)|all_neural(114;0.107)
61 TCGAscape_Amplification_Peaks Overlapping significant GISTIC amplification focal peaks from TCGAscape. The number of genes in the peak and the peak q-value are given in parentheses. Only peak regions with a q-value <= 0.20 are reported. GBM - Glioblastoma multiforme(1;0)|all cancers(1;2.19e-314)|LUSC - Lung squamous cell carcinoma(13;0.000168)
62 TCGAscape_Deletion_Peaks Overlapping significant GISTIC deletion focal peaks from TCGAscape. The number of genes in the peak and the peak q-value are given in parentheses. Only peak regions with a q-value <= 0.20 are reported. all cancers(201;9.73e-05)|GBM - Glioblastoma multiforme(135;0.0845)
63 DrugBank Listing of compounds from DrugBank known to interact with genes (DrugBank compound ID in parentheses). Sunitinib(DB01268)
64 PPH2_Class PolyPhen-2 probabilistic binary classifier outcome ('deleterious' or 'neutral'). deleterious
65 PPH2_Prob PolyPhen-2 classifier probability of the variation being damaging. 0.926
66 PPH2_FDR PolyPhen-2 classifier model False Discovery Rate at the above probability. 0.171
67 PPH2_MSA_dScore PolyPhen-2 difference of multiple sequence alignment PSIC scores for two amino acid residue variants (Score1-Score2). 1.875
68 PPH2_MSA_Score1 PolyPhen-2 multiple sequence alignment PSIC score for wild type amino acid residue (aa1). 1.523
69 PPH2_MSA_Score2 PolyPhen-2 multiple sequence alignment PSIC score for mutant amino acid residue (aa2). -0.352
70 PPH2_MSA_Nobs PolyPhen-2 number of residues observed at the substitution position in multiple sequence alignment (without gaps). 39
71 CCLE_ONCOMAP_overlapping_mutations Protein change of overlapping mutations in CCLE Oncomap dataset. Cell line name and lineage are provided in parentheses. R130G(OV56_OVARY)|R130G(KMBC2_URINARY_TRACT)
72 CCLE_ONCOMAP_total_mutations_in_gene Total number of mutations in CCLE Oncomap data for this gene. 31
73 CGC_Mutation_Type Type of mutations reported for this gene in Cancer Gene Census. See abbreviations here. D, Mis, N, F, S
74 CGC_Translocation_Partner Known translocation partner gene as reported in Cancer Gene Census. ALK
75 CGC_Tumor_Types_Somatic Tumor types with somatic alterations in this gene as reported in Cancer Gene Census. See abbreviations here. MDS, CML
76 CGC_Tumor_Types_Germline Tumor types with germline alterations in this gene as reported in Cancer Gene Census. See abbreviations here. T-PLL
77 CGC_Other_Diseases Other diseases/syndromes with alterations in this gene as reported in Cancer Gene Census. type=REGULATORY REGION|TFbs=CTCF|Dataset=CTCF ChIP-chip sites (Ren lab)
78 DNARepairGenes_Role Known DNA repair roles for this gene as reported in Wood et al. NER|Involved_in_tolerance_or_repair_of_DNA_crosslinks
79 FamilialCancerDatabase_Syndromes Familial cancer syndromes with alteration in this gene as reported in the Familial Cancer Database. Wiskott-Aldrich_syndrome
80 MUTSIG_Published_Results Published MutSig analyses with gene in significant results. Gene rank and q-value are provided in parentheses. TCGA GBM(2;<1E-8)|TSP Lung(26;0.18)
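
Several columns above (for example COSMIC_overlapping_mutations and COSMIC_fusion_genes) pack a label and a sample count into each pipe-delimited item. A minimal sketch for unpacking them follows; the function name `parse_counts` is hypothetical, and the code assumes every item ends with a single integer in parentheses, as in the table's examples.

```python
def parse_counts(field):
    """Split 'label(count)|label(count)' into (label, count) pairs (hypothetical helper).

    Assumes each item ends with one integer wrapped in parentheses.
    """
    if not field:
        return []
    pairs = []
    for item in field.split("|"):
        # Split on the last '(' so labels containing other symbols stay intact.
        label, _, count = item.rpartition("(")
        pairs.append((label, int(count.rstrip(")"))))
    return pairs

mutations = parse_counts("p.V617F(27905)|p.V617_C618>FR(2)|p.V617I(1)")
fusions = parse_counts("PCM1/JAK2(30)|PAX5/JAK2(18)|ETV6/JAK2(11)")
```

Fields whose parentheses hold two values, such as the peak columns with gene count and q-value, need a different split and are not handled by this sketch.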


A REST-like interface is available for obtaining detailed annotations in JSON format for genes, transcripts, and mutations.


Example API Queries

specific gene annotations
http://www.broadinstitute.org/oncotator/gene/EGFR/

gene annotations across a given genomic range (hg19 coordinates)

Provide "chr", "start", and "end" parameters delimited by an underscore character ("_").

http://www.broadinstitute.org/oncotator/genes/chr4_50164411_60164411/

specific transcript annotations
http://www.broadinstitute.org/oncotator/transcript/uc003hal.2/

transcript annotations across a given genomic range (hg19 coordinates)

Provide "chr", "start", and "end" parameters delimited by an underscore character ("_").

http://www.broadinstitute.org/oncotator/transcripts/chr4_50164411_60164411/

specific mutation annotations

Provide "chr", "start", "end", "reference_allele", and "observed_allele" parameters delimited by an underscore character ("_").

http://www.broadinstitute.org/oncotator/mutation/7_55259515_55259515_T_G/
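
The URL patterns above can be assembled programmatically. The sketch below is illustrative only: the helper names are hypothetical, the base URL is copied from the examples, and `fetch_json` assumes the service is reachable and returns JSON as described, which this sketch does not verify.

```python
import json
import urllib.request

BASE_URL = "http://www.broadinstitute.org/oncotator"  # taken from the examples above

def gene_url(symbol):
    return f"{BASE_URL}/gene/{symbol}/"

def transcript_url(transcript_id):
    return f"{BASE_URL}/transcript/{transcript_id}/"

def range_url(kind, chrom, start, end):
    """kind is 'genes' or 'transcripts'; coordinates are hg19."""
    if kind not in ("genes", "transcripts"):
        raise ValueError("kind must be 'genes' or 'transcripts'")
    return f"{BASE_URL}/{kind}/{chrom}_{start}_{end}/"

def mutation_url(chrom, start, end, reference_allele, observed_allele):
    # Parameters are joined with underscores, in the order given above.
    parts = (chrom, start, end, reference_allele, observed_allele)
    return f"{BASE_URL}/mutation/" + "_".join(str(p) for p in parts) + "/"

def fetch_json(url, timeout=30):
    """Fetch and decode a JSON response (assumes the endpoint is available)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)

url = mutation_url(7, 55259515, 55259515, "T", "G")
```

Calling `fetch_json(url)` would then return the decoded annotations, provided the endpoint responds.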