GenomicAnnotator
From GSA
Warning: the material on this page is considered out of date by the GSA team.
Contents |
This Walker is Deprecated
The GenomicAnnotator was replaced by Adding Genomic Annotations Using SnpEff and VariantAnnotator in GATK release 1.2.
Background
GenomicAnnotator is a GATK tool for annotating variant calls with data from arbitrary tabular data sources.
NOTE: The VariantAnnotator is another general-purpose annotation tool within the GATK. Its design allows users to write plugins that compute specific annotations. GenomicAnnotator's relationship to this is that it's a VariantAnnotator plugin which, instead of computing annotations on the fly like most other plugins, generates them by retrieving data from user-provided tabular files.
Introduction
The GenomicAnnotator serves as the core of a 3-part annotation process.
The first part is pre-processing. Here, arbitrary data sources are converted into tab-delimited text files (called "tabular files" from here on). This is done by converter applications, with each converter being responsible for its own data source or file format. Some converters may need to do little more than normalize the data's coordinate system and sort it into reference order (as is the case for the dbSNP converter). More complicated converters may have to substantially transform the dataset in order to produce a table indexed by genomic coordinates (This is true for the RefSeq converter). These converters are meant to be run infrequently - ideally only when a new version of a dataset becomes available. In the mean time, the converted files are stored in a commonly-accessible directory to be shared by the user community. One advantage of this design is that converted files can be created for different versions of a dataset, and users can then explicitly choose which version they want to annotate with.
In the second part, the GenomicAnnotator is used to annotate variants by finding matching records in one or more tabular files. The matching is done as follows: For each variant to be annotated, GenomicAnnotator finds all records in the tabular files that overlap the variant's position in the genome. In the simplest case, these records are directly used as annotations in the form: fileName1.columnName1=fieldValue1, fileName1.columnName2=fieldValue2, fileName2.columnNameX=fieldValueY ... etc. If a tabular file contains columns with the special names "haplotypeReference", "haplotypeAlternate", (and optionally "haplotypeStrand"), then GenomicAnnotator will only pull annotations from those records whose reference and alternate alleles match those of the variant. Another special case is when more than one record in a tabular file overlaps the variant. Handling of this case is discussed in the docs for the -m command-line arg below. One key aspect of the GenomicAnnotator's design is that it does not know biology. Beyond simple comparisons of reference and alternate alleles, its does all operations without knowing anything about the data its operating on. It's role is to perform database-like operations similar to JOINs. This design makes it so new annotations can be added without modifying the GenomicAnnotator. Instead, the new annotations are implemented either by creating a new tabular file with the annotations pre-computed, or by writing a post-processing step that dynamically computes them based on annotations coming from the GenomicAnnotator. An concrete example of the output produced for a given set of inputs can be found here: detailed example.
The final post-processing part serves as the main point of extension for adding complex new annotations. Here, analysts will be able to write post-processing applications that take annotations produced by GenomicAnnotator and dynamically compute new ones (or modify existing ones). One example would be to annotate each SNP with likely_benign, known_deleterious, or likely_deleterious using GenomicAnnotator-produced annotations such as synonymous / nonsynonymous, cross-species conservation scores, and dbSNP fields.
Although the post-processing applications can be written using any toolkits or programming languages, GATK is recommended because it provides built in support for parsing and traversing VCF files and for parallelizing the applications.
Limitations:
- GenomicAnnotator doesn't currently phasing information.
- GenomicAnnotator has very limited support for insertions and deletions (basically, just whether an indel is frameshift, inframe, or non-coding).
- Input VCF files are only allowed to have one alternate allele per record. Although the VCF format allows multiple alternate alleles per record, this creates a problem for the annotator as to how to store annotations for each allele inside the VCF INFO field. It is assumed that most input VCF files will have one allele per record, or be easily convertable into this form by splitting each multiple-allele record into several single-allele records. If this limitation becomes burdensome, a scheme could be devised to annotate multiple alleles, or perform the multi-allele to single-allele conversion on the fly (where possible).
Example
This should be run at the top level of your GATK checkout:
java -Xmx1g -jar $GATK/dist/GenomeAnalysisTK.jar \
-T GenomicAnnotator \
-l info \
-R /broad/1KG/reference/human_b36_both.fasta \
-B:variant,vcf /humgen/gsa-hpprojects/GATK/data/Annotations/examples/CEU_hapmap_nogt_23_subset.vcf \
-B:refseq,AnnotatorInputTable /humgen/gsa-hpprojects/GATK/data/Annotations/refseq/refGene-big-table-b36.txt \
-m \
-o output.vcf \
-BTI variant \
-s dbsnp.name,dbsnp.refUCSC,dbsnp.strand,dbsnp.observed,dbsnp.avHet
To see a more detailed example including inputs and outputs, see: detailed example.
Command Line Args
- -B:<bindingName>,<bindingType> </path/to/file>: GenomicAnnotator expects one -B input to have bindingName = 'variant'. This file has to contain the variant calls to be annotated. All other -B args are used to specify one or more data tables to be used as sources of annotations - their bindingType must be ' AnnotatorInputTable' (this determines the file format). The bindingNames for these can be arbitrary.
- -J <bindingName>,</path/to/file>,<bindingName.columnName=bindingName2.columnName2>: Specifies a table that should be LEFT-JOINED with the data from the -B args or other -J args. The bindingName is an arbitrary label, but must be unique across all binding names. bindingName.columnName=bindingName2.columnName2 specifies what columns to join on. The column from the -B arg doesn't have to contain unique keys (see docs for the -m arg). For an example, see this page.
- --out, -o </path/to/output_file>: Specifies the output path of the annotated .vcf file.
- --rodToIntervalTrackName, -BTI variant: Turns on an optimization that significantly reduces execution time.
- --oneToMany, -m: If more than one record from the same file matches a particular locus (for example, multiple dbSNP records with the same position), create multiple entries in the ouptut VCF file - one for each match. If a particular tabular file has J matches, and another tabular file has K matches for a given locus, then J*K output VCF records will be generated - one for each pair of K, J. If this flag is not provided, the multiple records are still generated, but they are stored in the INFO field of a single output VCF record, with their annotation keys differentiated by appending '_i' with i varying from 1 to K*J.
- --select, -s <columnNames>: Optionally limits what columns from what files are used for annotations. For example, -B:mydbsnp,AnnotatorInputTable /path/to/mydbsnp.txt -B:mytable,AnnotatorInputTable /path/mytable.txt -s mydbsnp.avHet,mydbsnp.name,mytable.column5 will cause annotations to only be generated from the 3 columns given to -s. That is: avHet, name, column5.
Data
Ready-for-use tabular data files are accessible
internally on a shared drive:
/humgen/gsa-hpprojects/GATK/data/Annotations
externally via FTP:
ftp://gatk-ftp:PH5UH7Pa@ftp.broadinstitute.org
Here are the details on the tables currently stored there, including how they were generated:
- RefSeq
- ./examples/*.vcf - These are examples of input .vcf files that can be annotated with the GenomicAnnotator
NOTE: Currently, two versions of each data file are generated (for example: snp130-hg18-only-the-SNPs.txt and snp130-b36-only-the-SNPs.txt). This is done to deal with the difference in chromosome naming/ordering conventions between the NCBI and UCSC references. The "hg18" versions of the data files will have chromosome names like 'chr1', 'chr2', 'chrM', etc. The "b36" versions will have chromosome names like '1', '2', 'MT', etc. The two versions of each data file should be identical except for the chromosome names and ordering.
Data Formats
There are 2 types of data files that can be passed to the GenomicAnnotator to serve as a source of annotations:
1: Files indexed by genomic position or interval:
These are specified on the command-line with -B and bindingType 'AnnotatorInputTable'.
They must have the following format:
- Tab-delimited fields.
- Comment lines start with "#".
- The first 3 columns describe the position. Its format is "chromosome start stop". The coordinates must be position-based rather than offset-based, (aka. 1-based inclusive).
- Records must be sorted in reference order by their start coordinate. The chromosomal order depends on what reference sequence you will be using. If its a UCSC reference sequence (eg. hg18), then chromosome names must start with 'chr' and be in the order: chrM,chr1-22,chrX,chrY, chrM_random,chr1-22_random,chrX_random,chrY_random. If you will be using an NCBI reference sequence (eg. b36), then chromosomes must not start with 'chr', and must be in the order: 1-22,X,Y,MT.
- The first non-comment, non-empty line is treated as the header. The header must have tab-separated column names for all the columns. The column names are used as the keys for annotations. The annotations will look like: bindingName.columnName=value. (The bindingName is specified as part of the -B command-line args).
- There are two special column names: haplotypeReference, haplotypeAlternate. These columns are optional. If a file does have one or more of them, then GenomicAnnotator will compare their values to each variant's reference and alternate alleles. Only records that match a given variant's alleles will be used to annotate that variant.
- The haplotypeReference column can have values A,C,G,T and is matched against the variant's reference allele using a simple base-to-base comparison. If a particular record has the value * (star) for haplotypeReference, then it will match any variant reference allele. The haplotypeReference is on the "+" strand.
- The same is true for haplotypeAlternate, except that, instead of specifying one allele, this column can optionally specify multiple comma-separated alleles (eg. "A,G,T"). If more than one allele is specified in a given record, this record will be used for annotations if any one of its alternate alleles matches the variant's alternate allele. For convenience, other characters beside "," can be used to separate alleles - including "\", ":", "/", "|". The haplotypeAlternate is on the "+" strand.
- A record can have empty values (eg. value is "") for missing data, etc. In that case, that column in that record won't be used for annotation even if other columns in the record are used.
2: Files indexed by an arbitrary value:
These are specified on the command-line with -J.
They must have the following format:
- Tab-delimited fields.
- Comment lines start with "#".
- The first non-comment, non-empty line is treated as the header. The header must have tab-separated column names for all the columns. The column names are used as the keys for annotations. The annotations will look like: bindingName.columnName=value. (The bindingName is specified as part of the -B and -J command-line args).
- A record can have empty values (eg. value is "") for missing data, etc. In that case, that column in that record won't be used for annotation even if other columns in the record are used.
- The values in the column used as the join key do NOT have to be unique.
Preprocessing
Please note that we are not supporting generation of tables into the format used by the Genomic Annotator and instead expect users to download our pre-generated large tables. However, we do recognize that some power users will want to generate their own tables and can gladly point you in the right direction below - but please be aware that we don't support any of the code/scripts mentioned. We're more than happy to address bugs (please report them on our support forum), but won't provide any other guidance on how to use the tools other than what follows below.
One is required to perform an initial phase of pre-processing in order to convert the raw data from an arbitrary source to a tabular format such that:
- the fields are all tab-delimited
- either the 1st column specifies the position in chr:start-stop format
- or the 1st column is the chromosome, the 2nd is the start position, and the 3rd column is the stop position
- with other data following in subsequent columns
The second phase of pre-processing then converts the tabular data from the first phase into the data format required by the Genomic Annotator. The tool that performs the conversion is called TranscriptToGenomicInfo. The following is an example command-line:
java -Xmx2g -jar $GATK/dist/GenomeAnalysisTK.jar \
-T TranscriptToGenomicInfo \
-R reference.fasta \
-B:transcripts,AnnotatorInputTable converted_table.txt \
-o output_table.txt \
-n name,proteinID
It's important to note that the output of TranscriptToGenomicInfo is NOT guaranteed to be sorted; however, the Genomic Annotator requires sorted files. You can see $GATK/perl/createTranscriptToGenomicInfoTables.pl for one easy way to sort these files.
Post-processing
The annotator is designed to perform routine database operations, and does not perform any biologically-aware computations (such as computing whether a SNP is deleterious). The biological knowledge must either be encoded in the data tables, or implemented in post-processing steps. Each post-processing step is an application that take the VCF file produced by the annotator, adds or modifies annotations in some way, and then output these in a new VCF file. Post processing steps can be implemented using any programming language or toolkit. Several possibilities are:
- Java/GATK - Implement a GATK walker. The GATK takes care of VCF parsing, and makes it extremely easy to parallelize the application.
- Python
What has been written so far:
- 6/15/2010 - Python - Steve Hershman has written post-processing steps including a "uniquify" operation which selects the most biologically-interesting set of annotations for those variants that have multiple annotation sets associated with them due to the -m command-line option.
Support Forum
Questions regarding this tool should be posted to this forum: http://getsatisfaction.com/gsa
