You specified -A <some annotation> in a command line invoking one of the annotation-capable tools (HaplotypeCaller, MuTect2, UnifiedGenotyper and VariantAnnotator), but that annotation did not show up in your output VCF.
Keep in mind that all annotations that are necessary to run our Best Practices are annotated by default, so you should generally not need to request annotations unless you're doing something a bit special.
There can be several reasons why this happens, depending on the tool, the annotation, and your data. These are the four we see most often; if you encounter another that is not listed here, let us know in the comments.
For example, you're running MuTect2 but requested an annotation that is specific to HaplotypeCaller. There should be an error message to that effect in the output log. It's not possible to override this; but if you believe the annotation should be available to the tool, let us know in the forum and we'll consider putting in a feature request.
For example, you're running HaplotypeCaller and you want InbreedingCoefficient, but you didn't specify a pedigree file. There should be an error message to that effect in the output log. The solution is simply to provide the missing input file. Another example: you're running VariantAnnotator and you want to annotate Coverage, but you didn't specify a BAM file. The tool needs to see the read data in order to calculate the annotation, so again, you simply need to provide the BAM file.
For example, you're looking at RankSumTest annotations, which require heterozygous sites in order to perform the necessary calculations, but you're running on haploid data so you don't have any het sites. There is no workaround; the annotation is not applicable to your data. Another example: you requested InbreedingCoefficient, but your population includes fewer than 10 founder samples, which are required for the annotation calculation. There is no workaround; the annotation is not applicable to your data.
For example, you requested Coverage from HaplotypeCaller, which already annotates this by default. There is currently a bug that causes some default annotations to be dropped from the list if specified on the command line. This will be addressed in an upcoming version. For now, the workaround is to check which annotations are applied by default and NOT request those explicitly with -A.
For annotation behavior in -ERC GVCF mode, please see this companion document.
VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion have been taken over by the Global Alliance for Genomics and Health Data Working Group file format team. The full format spec can be found in the samtools/hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.
VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.
That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.
Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:
Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.
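One way to apply the streaming approach is to read the file line by line instead of loading it whole. The sketch below (in Python; the function name is ours) works for plain and gzip-compressed VCFs, since the BGZF compression commonly used for VCFs is gzip-compatible:

```python
import gzip

def iter_vcf_records(path):
    """Stream data lines from a VCF (plain or gzipped) without loading it all.

    Header lines (starting with '#') are skipped; each record is yielded
    as a list of tab-separated column values.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            yield line.rstrip("\n").split("\t")
```

Because it yields one record at a time, you can filter or count on the fly without ever holding the whole dataset in memory.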
NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)
A valid VCF file is composed of two main parts: the header, and the variant call records.
The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also include the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so as it is not required by the VCF specification. For more information about the header, see the next section.
The actual data lines will look something like this:
[HEADER LINES]
#CHROM  POS     ID          REF  ALT  QUAL     FILTER   INFO           FORMAT          NA12878
1       873762  .           T    G    5231.78  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1       877664  rs3828047   A    G    3931.66  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1       899282  rs28548431  C    T    71.77    PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1       974165  rs9442391   T    C    29.84    LowQual  [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255
After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.
You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.
The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!
##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4 . . .
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta
We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.
The first line:
tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.
The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:
Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).
These lines define the annotations contained in the
INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.
The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.
These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!
[todo: FAQ on genome builds]
For each site record, the information is structured into columns (also called fields) as follows:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...]
The first 8 columns of the VCF records (up to and including
INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.
Sample-specific information such as genotype and individual sample-level annotation values are contained in the
FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!
These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie
. to serve as a placeholder).
CHROM and POS : The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.
ID: An optional identifier for the variant, assigned based on the contig and position of the call and on whether a record exists at this site in a reference database such as dbSNP.
REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.
QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic. Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
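The Phred arithmetic above can be checked directly. A minimal sketch (the function name is ours):

```python
def phred_to_error_prob(qual):
    """Probability that the call is wrong, given a Phred-scaled quality.

    Inverts the Phred scale: QUAL = -10 * log10(P(error)).
    """
    return 10 ** (-qual / 10.0)

# QUAL = 10 -> 1 in 10 chance of error; QUAL = 100 -> 1 in 10^10
```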
FILTER: PASS if the variant passed all filters, otherwise the ID(s) of the filter(s) it failed. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.
This next field does not have to be present in the VCF.
The annotations in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, i.e. =, and pairs are separated by semicolons, i.e. ;, as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.
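The tag-value structure makes the INFO field straightforward to split programmatically. A minimal sketch (in Python; the function name is ours), which also handles flag annotations such as DB that have no value:

```python
def parse_info(info_field):
    """Split a VCF INFO field into a dict; flag annotations (no '=') map to True."""
    annotations = {}
    if info_field == ".":   # empty INFO field is a lone dot placeholder
        return annotations
    for entry in info_field.split(";"):
        if "=" in entry:
            tag, value = entry.split("=", 1)
            annotations[tag] = value
        else:
            annotations[entry] = True   # e.g. the DB membership flag
    return annotations

# parse_info("MQ=99.00;MQ0=0;QD=17.94")
# -> {"MQ": "99.00", "MQ0": "0", "QD": "17.94"}
```

Note that values are kept as strings here; whether a tag holds an integer, float, or list is declared in the ##INFO header lines.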
At this point you've met all the fields up to INFO in this lineup:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...]
All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the
FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the
SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.
The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.
Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:
1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26
Looking at that last column, here is what the tags mean:
GT : The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

0/0 - the sample is homozygous reference
0/1 - the sample is heterozygous, carrying one copy of each of the REF and ALT alleles
1/1 - the sample is homozygous alternate
AD and DP : Allele depth and depth of coverage. These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller's filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another. DP is the filtered depth, at the sample level. This gives you the total number of reads at the site that passed the variant caller's filters; you can check the variant caller's documentation to see which filters are applied by default. However, unlike the AD calculation, uninformative reads are included in DP. See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage).
PL : Normalized Phred-scaled likelihoods of the possible genotypes. For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are normalized so that the PL of the most likely genotype (assigned in the GT field) is 0 on the Phred scale (meaning its P = 1.0 on the regular scale). The other values are scaled relative to this most likely genotype. Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "likelihood of the genotype", we mean it is "the probability that the genotype is not correct". That's why the smaller the value, the better it is. [todo: PL details doc]
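As a sketch of this normalization (in Python; the function name is ours), the PLs can be mapped back to probabilities by inverting the Phred scale and rescaling so they sum to 1:

```python
def pl_to_genotype_probs(pls):
    """Convert Phred-scaled genotype likelihoods to normalized probabilities.

    The most likely genotype (PL = 0) gets the highest probability;
    the results are rescaled to sum to 1.
    """
    raw = [10 ** (-pl / 10.0) for pl in pls]
    total = sum(raw)
    return [p / total for p in raw]
```

For PLs of 103,0,26 in genotype order 0/0, 0/1, 1/1, this gives the 0/1 genotype nearly all of the probability mass, with 1/1 a distant second.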
With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.
1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26
At this site, the called genotype is
GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by
GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (
DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by
AD=1,3). The lack of certainty is evident in the PL field, where
PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is
PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since
PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.
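The arithmetic in this worked example can be replayed directly (a sketch; the GQ derivation as the difference between the two smallest PLs, capped at 99, follows the usual GATK convention):

```python
# PLs for NA12878 at 1:899282, in genotype order 0/0, 0/1, 1/1
pls = [103, 0, 26]

# Back to the regular scale: likelihood = 10^(-PL/10)
likelihoods = [10 ** (-pl / 10.0) for pl in pls]
# likelihoods[1] (0/1) is 1.0; likelihoods[2] (1/1) is ~0.0025;
# likelihoods[0] (0/0) is ~10^(-10.3), effectively ruled out

# GQ is the difference between the two smallest PLs, capped at 99
best, second_best = sorted(pls)[0], sorted(pls)[1]
gq = min(second_best - best, 99)   # 26, matching the GQ in the record
```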
No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.
Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.
(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)
I am trying to filter variant and non-variant sites together. The only reasonable way I see to do it is to filter by the per-sample DP. However, I noticed that a substantial fraction of called sites (~10%) have a missing value in the DP field (. instead of 0 or some other value). Although many of the sites with missing DP values do have low coverage with bad mapping and are called ./., some sites have what looks to me like good coverage. When I check the BAM file (the -bamout result) in IGV, I see many reads mapped to those sites with good quality (MQ=60). The genotype is usually called correctly, but for a reason that is unclear to me, GATK doesn't report DP values. The AD values are usually 0,0 at such sites. Interestingly, the nearby sites have DP values reported. When I check the GVCF files, these sites usually have DP=0. Assuming a coverage of 0 I could discard these sites, but I see in the BAM file that it is not 0, so I do not want to throw away 10% of my data. I notice that such sites tend to be homozygous for the ALT allele. I provide an example of such a site below; see the 1st sample at position 13388742. (1/1:0,0:.:3:1|1:13388738_T_C:45,3,0) I generated the data using HaplotypeCaller in GVCF mode and then GenotypeGVCFs. Could you please tell me why this happens?
scaffold_1 13388742 . A G 8529.41 . AC=35;AF=0.833;AN=42;BaseQRankSum=-1.730e-01;ClippingRankSum=-6.940e-01;DP=247;ExcessHet=0.4083;FS=9.162;InbreedingCoeff=0.1415;MLEAC=37;MLEAF=0.881;MQ=38.83;MQRankSum=2.51;QD=32.26;ReadPosRankSum=1.56;SOR=0.043 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,0:.:3:1|1:13388738_T_C:45,3,0 ./.:0,0:0 ./.:2,0:2 ./.:4,0:4 ./.:0,0:0 ./.:3,0:3 0/0:20,0:20:0:.:.:0,0,577 0/1:0,1:1:11:0|1:13388692_G_T:81,0,11 1/1:0,10:10:33:1|1:13388724_G_A:495,33,0 ./.:0,0:0 1/1:1,19:20:27:1|1:13388742_A_G:979,27,0 1/1:0,20:20:35:1|1:13388724_G_A:944,35,0 1/1:0,1:1:6:.:.:90,6,0 1/1:0,7:7:24:1|1:13388724_G_A:360,24,0 0/1:0,2:2:30:0|1:13388692_G_T:165,0,30 1/1:0,5:5:15:1|1:13388742_A_G:225,15,0 1/1:0,20:20:60:1|1:13388724_G_A:900,60,0 1/1:0,8:8:30:1|1:13388724_G_A:450,30,0 1/1:0,15:15:48:.:.:720,48,0 0/1:3,28:31:42:0|1:13388742_A_G:1167,0,42 ./.:3,0:3 1/1:0,1:1:3:1|1:13388687_C_T:45,3,0 1/1:0,2:2:12:1|1:13388692_G_T:180,12,0 ./.:4,0:4 ./.:6,0:6 1/1:0,6:6:18:1|1:13388724_G_A:270,18,0 ./.:4,0:4 1/1:0,14:14:45:1|1:13388724_G_A:675,45,0 1/1:0,3:3:9:1|1:13388692_G_T:135,9,0 0/0:21,0:21:0:.:.:0,0,533 1/1:0,14:14:45:1|1:13388724_G_A:655,45,0
scaffold_1 13388740 . C <NON_REF> . . END=13388741 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,162
scaffold_1 13388742 . A G,<NON_REF> 31.82 . DP=0;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;RAW_MQ=0.00 GT:GQ:PGT:PID:PL:SB 1/1:3:0|1:13388738_T_C:45,3,0,45,3,45:0,0,0,0
scaffold_1 13388743 . A <NON_REF> . . END=13388743 GT:DP:GQ:MIN_DP:PL 0/0:5:15:5:0,15,214
Screenshot of a BAM:
The formula and the description given for the OND annotation seem to be contradictory (see: https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AlleleBalance.php). The formula implies that a true diploid variant would have a non-allele value of zero and therefore have an OND=1. However, the description "reads that support something other than the genotyped alleles (called "non-alleles") will be counted in the OND tag, which represents the overall fraction of data that diverges from the diploid hypothesis." suggests that a higher fraction is more divergent from diploid. Can you please clarify (e.g., confirm the true formula should be 1-alleles/(alleles+non-alleles) and that an ideal diploid variant would have an OND of zero)?
Additionally, we have noticed a lot of missing OND values (not multiallelic or indels). Can you explain when/why these may be missing?
Thanks so much!
Hi GATK !
I am using DiagnoseTargets on exome data with the 1000 Genomes human v37 ref genome and Illumina exome interval .bed file. It seems that DiagnoseTargets (unlike DepthOfCoverage) doesn't accept the -genelist argument.
1) What would be a possible alternative for adding the gene name to the output interval stat .vcf file (and optionally the --missing intervals)? Would VariantAnnotator (with the --comp argument) work?
2) What annotation file should I use?
The sortByRef.pl script (mentioned here) not being available anymore, did it only (i) discard records on contigs other than Chr1-22/X/Y/M, and (ii) sort by Chr?
Does VariantAnnotator automatically adjust from zero-based half-open intervals (UCSC standard) to one-based closed intervals or should I modify the file during the previous steps as well?
I'm using VariantAnnotator to add annotations to variants from a bunch of sources. One issue that I have is that for some variants, there are multiple annotations in a supplied resource. In the docs, I read
"Note that if there are multiple records in the resource file that overlap the given position, one is chosen randomly."
Can this behaviour be altered? I need to output all annotations for a record, either on a single line, or on multiple.
In the case i'm working on, one line has the annotation "CLNSIG=5" (i.e. a known pathogenic variant) and the other (likely older record) is "CLNSIG=1" i.e. a variant of unknown significance. I need to output both so I can filter downstream (using SelectVariants) to select those where "CLNSIG=5".
The documentation on the HaplotypeScore annotation reads:
HaplotypeCaller does not output this annotation because it already evaluates haplotype segregation internally. This annotation is only informative (and available) for variants called by Unified Genotyper.
The annotation used to be part of the best practices:
I will include it in the VQSR model for UG calls from low coverage data. Is this an unwise decision? I guess this is for myself to evaluate. I thought I would ask, in case I have missed something obvious.
I have been trying to find documentation for understanding the annotations and numbers in the VCF file. Something like below, can someone guide me how to understand/interpret the numbers? How is the quality of the variant calling for this particular indel?
AC=2; AF=0.100; AN=20; BaseQRankSum=6.161; ClippingRankSum=-2.034; DP=313; FS=5.381; InbreedingCoeff=-0.1180; MLEAC=2; MLEAF=0.100; MQ=58.49; MQ0=0; MQRankSum=-0.456; QD=1.46; ReadPosRankSum=-4.442; VQSLOD=0.348; topculprit=ReadPosRankSum
Could you tell me how to encourage GATK to annotate my genotype columns (i.e. add annotations to the FORMAT and PANC_R columns in the following file):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PANC_R
chrX 259221 . GA G 136.74 . AC=2;AF=1.00;AN=2;DP=15;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=8.82;MQ0=1;QD=3.04 GT:AD:GQ:PL 1/1:0,2:6:164,6,0
The file was generated with HaplotypeCaller. I used a command line similar to this one to no effect:
java -jar $GATKROOT/GenomeAnalysisTK.jar -T VariantAnnotator -R hg19_random.fa -I chr7_recalibrated.bam -V chr7.vcf --dbsnp dbSNP135_chr.vcf -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -o chr7_annotated-again.vcf
Does anyone have any suggestions? Thanks in advance!
I had a few questions about the haplotype score.
In the technical documentation it states that "Higher scores are indicative of regions with bad alignments, often leading to artifactual SNP and indel calls. Note that the Haplotype Score is only calculated for sites with read coverage."
How is the haplotype group for each variant site determined? e.g. Does it take the closest two variants to the query site and then treat the query variant + closest two variants as the haplotype group?
Also, in the case of multiallelic sites (>2 alleles), the haplotype score is inappropriate since it only looks at whether a site can be explained by the segregation of two and only two haplotypes, correct? So will multiallelic SNPs be assigned poor haplotype scores, OR will these sites not be annotated at all? If we have a case where there is a truly biallelic SNP and a couple of samples have some reads that erroneously call a third allele, this variant site will be assigned a poor haplotype score overall, correct?
Just in the process of updating our pipeline from v2.3-4-gb8f1308 Lite to v2.4-7-g5e89f01 Academic and have run into a small issue. The command line:
-T UnifiedGenotyper -glm SNP -R /lustre/scratch111/resources/ref/Homo_sapiens/1000Genomes_hs37d5/hs37d5.fa -I /lustre/scratch111/projects/helic/vcf-newpipe/lists/chr1-pooled.list --alleles /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.snps.vcf.gz -L 1:1-1000000 -U LENIENT_VCF_PROCESSING -baq CALCULATE_AS_NECESSARY -gt_mode GENOTYPE_GIVEN_ALLELES -out_mode EMIT_ALL_SITES --standard_min_confidence_threshold_for_calling 4.0 --standard_min_confidence_threshold_for_emitting 4.0 -l INFO -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A DepthOfCoverage -o /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.asnps.vcf.gz.part.vcf.gz
This worked in 2.3.4. But now gives:
Invalid command line: Argument annotation has a bad value: Class DepthOfCoverage is not found; please check that you have specified the class name correctly
I've looked at the release notes but it's not giving me a clue as to what has changed. Has the DepthOfCoverage annotation now been dropped? I've checked and I can reproduce this on the latest nightly (nightly-2013-03-11-g184e5ac)
Hi the GATK team;
I use the UnifiedGenotyper the following way:
java -jar GenomeAnalysisTK-2.1-13-g1706365/GenomeAnalysisTK.jar \
    -R /human_g1k_v37.fasta \
    -T UnifiedGenotyper \
    -glm BOTH \
    -S SILENT \
    -L ../align/capture.bed \
    -I myl.bam \
    --dbsnp broadinstitute.org/bundle/1.5/b37/dbsnp_135.b37.vcf.gz \
    -o output.vcf
When I look at the generated VCF, the variation 18:55997929 (CTTCT/C) is said to be rs4149608
18 55997929 rs4149608 CTTCT C (...)
but in the dbsnp_135.b37.vcf.gz, you can see that the right rs## should be rs144384654
$ gunzip -c broadinstitute.org/bundle/1.5/b37/dbsnp_135.b37.vcf.gz | grep -E -w '(rs4149608|rs144384654)'
18 55997929 rs4149608 CT C,CTTCT (...)
18 55997929 rs144384654 CTTCT C (...)
does UnifiedGenotyper use the first rs## it finds at a given position? Or should I use another method/tool to get the 'right' rs##?
Please look at lines 1 and 2 taken from a VCF file: they have the same chromosome and position, and one of the ALT alleles is the same in both lines, but they have different allele counts and different rsIDs.
1 1229111 rs70949568 A ACGCCCCTGCCCTGGAGGCCCCGCCCCTGCCCTGGAGGCCC,C 2629.32 TruthSensitivityTranche99.50to99.90;TruthSensitivityTranche99.30to99.50 AC=80,31;AF=0.1273;AN=284;BaseQRankSum=1.124;DB;DP=426;Dels=0.00;FS=4.620;HRun=1;HaplotypeScore=0.2101;InbreedingCoeff=-0.0029;MQ0=0;MQ=58.46;MQRankSum=1.211;QD=5.26;ReadPosRankSum=-5.748;SB=-36.94;SF=0f,1f;SNPEFF_EFFECT=DOWNSTREAM;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ACAP3;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000379037;VQSLOD=-2.3894;culprit=MQ GT:DP:GQ:AD:PL
1 1229111 . A C 89.94 TruthSensitivityTranche99.00to99.30 AC=7;AF=0.0614;AN=114;BaseQRankSum=0.801;DP=175;Dels=0.00;FS=1.668;HRun=1;HaplotypeScore=0.2276;InbreedingCoeff=-0.0538;MQ0=0;MQ=57.90;MQRankSum=0.501;QD=4.28;ReadPosRankSum=-4.531;SB=-15.19;SF=0f;SNPEFF_EFFECT=DOWNSTREAM;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ACAP3;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000379037;VQSLOD=-1.4433;culprit=MQ GT:DP:GQ:AD:PL
Could you tell me when we will be able to use the new version of SnpEff with GATK?
I am getting some errors:
caused by exception org.broadinstitute.sting.gatk.walkers.annotator.interfaces.ExperimentalAnnotation.
I don't know if I forgot some other options linked to these annotations. These options are important to me, so I have removed them for now, but somebody else may want to use them ...
I'm curious about the experience of the community at large with VQSR, and specifically with which sets of annotations people have found to work well. The GATK team's recommendations are valuable, but my impression is that they have fairly homogenous data types - I'd like to know if anyone has found it useful to deviate from their recommendations.
For instance, I no longer include InbreedingCoefficient with my exome runs. This was spurred by a case where previously validated variants were getting discarded by VQSR. It turned out that these particular variants were homozygous alternate in the diseased samples and homozygous reference in the controls, yielding an InbreedingCoefficient very close to 1. We decided that the all-homozygous case was far more likely to be genuinely interesting than a sequencing/variant calling artifact, so we removed the annotation from VQSR. In order to catch the all-heterozygous case (which is more likely to be an error), we add a VariantFiltration pass for 'InbreedingCoefficient < -0.8' following ApplyRecalibration.
In my case, I think InbreedingCoefficient isn't as useful because my UG/VQSR cohorts tend to be smaller and less diverse than what the GATK team typically runs (and to be honest, I'm still not sure we're doing the best thing). Has anyone else found it useful to modify these annotations? It would be helpful if we could build a more complete picture of these metrics in a diverse set of experiments.
Broad recommends using snpEff to add annotations to VCF files created by GATK. This gives annotations about the effect of a given variant: is it in a coding region? Does it cause a frameshift? What transcripts are impacted? etc. However, snpEff does not provide other annotations you might want, such as 1000 genomes minor allele frequency, SIFT scores, phyloP conservation scores, and so on. I've previously used annovar to get those sorts of things, and that worked well enough, though I did not find it to be especially user-friendly.
So my question is, what other ways have users found of getting this sort of annotation information? I'm interested specifically in human exomes, but I am sure other users reading this Ask the Community post will be interested in answers for other organisms as well. I'm looking for recommendations on what's quick, simple, easy to use, and has been used successfully with VCFs produced by GATK. I'm open to answers in the form of other software tools or sources of raw data that I can easily manipulate on my own.
Thanks in advance.
I am doing human exome sequencing with hg19 as a reference, and I want UnifiedGenotyper to give me whatever annotations are available and I will worry later about which ones are useful and which are not.
I am confused about the behavior of the --annotation option in UnifiedGenotyper. The default value is listed as , implying that unless we explicitly list what annotations we want, we get no annotations at all? Is that correct? Then in order to get a list of available annotations, we are directed to the VariantAnnotator --list option but it appears that it is not possible to just run:
java -Xmx2g -jar GenomeAnalysisTK.jar \
    -R ref.fasta \
    -T VariantAnnotator \
    --list
In order to get a list of annotations. Instead, one not only needs to include a --variants flag, but the vcf file you point to actually has to be well-formatted, etc., otherwise you get errors like this
##### ERROR MESSAGE: Argument with name '--variant' (-V) is missing.
##### ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
So, that having failed, is anyone able to just provide me with a list of possible arguments to the UnifiedGenotyper --annotation option?