
This document describes "regular" (variants-only) VCF files. For information on the gVCF format produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. See this page for detailed specifications.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.

2. Basic structure of a VCF file

The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:

##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873762  .       T   G   5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255


It seems a bit complex, but the structure of the file is actually quite simple:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
chr1    873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255


After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.
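The column layout above maps directly onto a simple parser. Here is a minimal sketch in Python (the function name and the tab-separated example line are ours, modeled on the first record shown above):

```python
def parse_vcf_line(line):
    """Split one tab-separated VCF data line into its fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),          # 1-based position
        "ID": vid,                # "." means no dbSNP identifier
        "REF": ref,
        "ALT": alt,
        "QUAL": float(qual),
        "FILTER": flt,
        "INFO": info,
    }
    if len(fields) > 9:           # FORMAT column plus one or more sample columns
        record["FORMAT"] = fields[8]
        record["samples"] = fields[9:]
    return record

# First record from the excerpt above (INFO shortened for readability)
line = "chr1\t873762\t.\tT\tG\t5231.78\tPASS\tAC=1;AF=0.50;AN=2\tGT:AD\t0/1:173,141"
rec = parse_vcf_line(line)
```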

3. How variation is represented

The first 7 columns of the VCF, which represent the observed variation, are easy to understand because each has a single, well-defined meaning.

• CHROM and POS: The contig and position at which the variant occurs. For indels, POS is actually the base preceding the event, due to how indels are represented in a VCF.

• ID: The dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.

• REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).

• QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data. Because the Phred scale is -10 * log10(1 - p), a value of 10 indicates a 1 in 10 chance of error, while a value of 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.

• FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call variants. Unfortunately, we are still far from that ideal, so we have to use orthogonal approaches, independent of QUAL, to determine which called sites are machine errors and which are real SNPs. Whatever filtering approach is used, the VCFs produced by the GATK carry both the passing records (PASS in the FILTER field) and the failing ones (anything in the FILTER field other than PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the record, meaning that all records will be used for analysis but without any being explicitly marked as PASS. You should avoid such a situation by always filtering raw variant calls before analysis.
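Since QUAL is Phred-scaled, converting it back to an error probability is a one-liner; a quick sketch (the helper name is ours):

```python
def phred_to_error_prob(q):
    """Probability of error implied by a Phred-scaled value q = -10 * log10(p_error)."""
    return 10 ** (-q / 10.0)

# QUAL = 10  -> 1 in 10 chance the site is not really variant
# QUAL = 100 -> 1 in 10^10 chance
```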

In the excerpt shown above, here is how we interpret the line corresponding to each variant:

• chr1:873762 is a novel T/G polymorphism, found with very high confidence (QUAL = 5231.78)
• chr1:877664 is a known A/G SNP (named rs3828047), found with very high confidence (QUAL = 3931.66)
• chr1:899282 is a known C/T SNP (named rs28548431), but has relatively low confidence (QUAL = 71.77)
• chr1:974165 is a known T/C SNP, but the statistical evidence for it in our data is so weak that, although we write out a record for it (for bookkeeping, really), we filter it out as a bad site, as indicated by the "LowQual" annotation.

4. How genotypes are represented

The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

chr1    873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26


Looking at that last column, here is what the tags mean:

• GT : The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

• 0/0 - the sample is homozygous reference
• 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
• 1/1 - the sample is homozygous alternate

In the three examples above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.

• GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one given in GT. In the diploid case, if GT is 0/1, this confidence is based on the normalized probability L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood of the data if the sample is 0/0, 0/1, or 1/1 under the model built for the NGS dataset. In practice, GQ is simply the PL of the second most likely genotype minus the PL of the most likely genotype; because the PL of the most likely genotype is always 0, GQ equals the second-lowest PL. GQ is capped at 99, so if the second-lowest PL is greater than 99, we still report a GQ of 99.

• AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).

• PL: This field provides the normalized, Phred-scaled likelihoods of the possible genotypes (here, 0/0, 0/1, and 1/1), without priors. To be concrete, for the heterozygous case, this is L(data | the true genotype is 0/1). The most likely genotype (given in the GT field) is scaled so that P = 1.0 (0 when Phred-scaled), and the other values reflect their Phred-scaled likelihoods relative to this most likely genotype.

With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.

chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26


At this site, the called genotype is GT = 0/1, which is C/T. The confidence indicated by GQ = 25.92 isn't so good, largely because there were only 4 reads total at this site (DP = 4), 1 of which was REF (= had the reference base) and 3 of which were ALT (= had the alternate base), as indicated by AD=1,3. The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0). There's a chance that the subject is "hom-var" (= homozygous for the variant allele), since PL(1/1) = 26, which corresponds to 10^(-2.6), or about 0.0025. Either way, it's clear that the subject is definitely not "hom-ref" (= homozygous for the reference allele), since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.
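The arithmetic in this walkthrough can be checked mechanically. The sketch below (helper names are ours) derives GQ from the PLs and converts a PL back to a relative likelihood:

```python
def gq_from_pl(pls):
    """GQ = second-lowest PL minus lowest PL (0 after normalization), capped at 99."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], 99)

def pl_to_relative_likelihood(pl):
    """Relative (non-Phred) likelihood corresponding to a Phred-scaled PL."""
    return 10 ** (-pl / 10.0)

# chr1:899282 has PL = 103,0,26 -> GQ = 26 (reported as 25.92 before the PLs were rounded)
```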

5. Understanding annotations

Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.

chr1    873762  [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
chr1    877664  [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
chr1    899282  [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148
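Because INFO is just semicolon-separated tag=value pairs, with bare tags such as DB acting as boolean flags, it can be unpacked in a few lines (a sketch, not a full VCF parser):

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flag-type tags (no '=') map to True."""
    annotations = {}
    for entry in info.split(";"):
        if "=" in entry:
            tag, value = entry.split("=", 1)
            annotations[tag] = value
        else:
            annotations[entry] = True   # flag annotation, e.g. DB
    return annotations

# INFO field of the chr1:899282 record above (shortened)
info = parse_info("AC=1;AF=0.50;AN=2;DB;DP=4;QD=17.94")
```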


Here are some commonly used built-in annotations and what they mean:

• AC, AF, AN: See the Technical Documentation for Chromosome Counts.
• DB: If present, the variant is in dbSNP.
• DP: See the Technical Documentation for Coverage.
• DS: Were any of the samples downsampled because of too much coverage?
• Dels: See the Technical Documentation for SpanningDeletions.
• MQ and MQ0: See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero.
• BaseQualityRankSumTest: See the Technical Documentation for Base Quality Rank Sum Test.
• MappingQualityRankSumTest: See the Technical Documentation for Mapping Quality Rank Sum Test.
• HRun: See the Technical Documentation for Homopolymer Run.
• HaplotypeScore: See the Technical Documentation for Haplotype Score.
• QD: See the Technical Documentation for Qual By Depth.
• VQSLOD: Only present when using variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
• FS: See the Technical Documentation for Fisher Strand.
• SB: How much evidence is there for strand bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias, and are therefore more likely to indicate false positive calls.

liftOverVCF.pl

Introduction

This script converts a VCF file from one reference build to another. It runs the 3 steps within our toolkit that are necessary for lifting over a VCF:
1. Run the LiftoverVariants walker.
2. Run sortByRef.pl to sort the lifted-over file.
3. Filter out records whose REF field no longer matches the new reference.

Obtaining the Script

The liftOverVCF.pl script is available in our public source repository under the 'perl' directory. Instructions for pulling down our source are available here.

Example

./liftOverVCF.pl -vcf calls.b36.vcf \
-out calls.hg19.vcf \
-gatk /humgen/gsa-scr1/ebanks/Sting_dev \
-newRef /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19 \
-oldRef /humgen/1kg/reference/human_b36_both


Usage

Running the script with no arguments will show the usage:

Usage: liftOverVCF.pl
-vcf        <input vcf>
-gatk       <path to gatk trunk>
-chain      <chain file>
-newRef     <path to new reference prefix; we will need newRef.dict, .fasta, and .fasta.fai>
-oldRef     <path to old reference prefix; we will need oldRef.fasta>
-out        <output vcf>
-tmp        <temp file location; defaults to /tmp>

• The 'tmp' argument is optional. It specifies the location to write the temporary file from step 1 of the process.

Chain files

Chain files from b36/hg18 to hg19 are located here within the Broad:

   /humgen/gsa-hpprojects/GATK/data/Liftover_Chain_Files/


External users can get them off our ftp site:

   location: ftp.broadinstitute.org
path:     Liftover_Chain_Files


SelectVariants

Introduction

SelectVariants is a GATK tool used to subset a VCF file by many arbitrary criteria, listed in the command-line options below. The output VCF will have the AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) annotations updated as necessary to accurately reflect the file's new contents.

SelectVariants operates on VCF files (ROD tracks) provided on the command line using the GATK's built-in --variant option. You can provide multiple tracks to SelectVariants, but at least one must be named 'variant'; this is the file on which all of your analysis will be based. Other tracks can be named as you please. Options requiring a reference to a ROD track name (e.g. --discordance / --concordance) will use the track name provided in the -B option to refer to the correct VCF file. All other analysis will be done on the 'variant' track.

Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or displaying just a few samples in a browser like IGV). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, e.g. "DP > 1000" (depth of coverage greater than 1000x) or "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented in the FAQ article on JEXL expressions; it is particularly important to note the section on working with complex expressions.

Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.

How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50


In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice these numbers won't necessarily add up perfectly, because of some read filters we apply when calling, but they are approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30


The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.


with AN=2, AC=1, AF=0.5, and DP=20.
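The recalculation walked through above can be sketched as a small function (illustrative only; this is not the actual SelectVariants implementation):

```python
def recompute_info(samples):
    """Recompute (AN, AC, AF, DP) from 'GT:DP' sample strings, skipping no-calls ('.')."""
    an = ac = dp = 0
    for sample in samples:
        gt, depth = sample.split(":")
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":                  # no-call alleles don't count toward AN
                an += 1
                if allele != "0":              # non-reference alleles count toward AC
                    ac += 1
        if depth != ".":
            dp += int(depth)
    af = ac / an if an else 0.0
    return an, ac, af, dp
```

For the three cases above this yields (6, 3, 0.5, 100), (4, 1, 0.25, 50), and (2, 1, 0.5, 20) respectively.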

Subsetting by sample and ALT alleles

SelectVariants now keeps (as of r5832) the ALT allele, even if a record is AC=0 after subsetting the site down to the selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942


Although NA12878 is 0/0 at the first two sites, the ALT allele is preserved in the VCF records. This is the correct behavior: subsetting samples shouldn't change the character of the site, only the AC within the subpopulation. This is related to the tricky distinction between isPolymorphic() and isVariant().

• isVariant => is there an ALT allele?

• isPolymorphic => does at least one sample carry a non-ref allele?

This is complicated in part by the semantics of sites-only VCFs, where ALT = "." is used to mean not polymorphic. Unfortunately, there is no consistent convention right now, but it might be worth adopting a single approach to handling this at some point.
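The two predicates can be stated precisely in code; a sketch with hypothetical helper names mirroring isVariant() and isPolymorphic():

```python
def is_variant(alt):
    """Is any ALT allele declared at the site? (ALT = '.' means none.)"""
    return alt != "."

def is_polymorphic(genotypes):
    """Does at least one sample actually carry a non-reference allele?"""
    return any(allele not in ("0", ".")
               for gt in genotypes
               for allele in gt.replace("|", "/").split("/"))

# First OMNI record after subsetting: ALT=G is kept (is_variant), but NA12878
# is 0/0, so the site is not polymorphic in the subset.
```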

For clarity: in previous versions of SelectVariants, the first two monomorphic sites lost the ALT allele, because NA12878 is hom-ref at those sites, resulting in a VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942


If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.

Known issues

Some VCFs may have repeated header entries with the same key name, for instance:

##fileformat=VCFv3.3
##FILTER=ABFilter,"AB > 0.75"
##FILTER=HRunFilter,"HRun > 3.0"
##FILTER=QDFilter,"QD < 5.0"
##UG_bam_file_used=file1.bam
##UG_bam_file_used=file2.bam
##UG_bam_file_used=file3.bam
##UG_bam_file_used=file4.bam
##UG_bam_file_used=file5.bam
##source=UnifiedGenotyper
##source=VariantFiltration
##source=AnnotateVCFwithMAF
...


Here, the "UG_bam_file_used" and "source" header lines appear multiple times. When SelectVariants is run on such a file, the program will emit warnings that these repeated header lines are being discarded, resulting in only the first instance of such a line being written to the resulting VCF. This behavior is not ideal, but expected under the current architecture.

For information on how to construct regular expressions for use with this tool, see the "Summary of regular-expression constructs" section here.

CombineVariants

1. Introduction

This tool combines VCF records from different sources. Any (unique) name can be used to bind your rod data, and any number of sources can be input. This tool currently supports two different combination types for each of variants (the first 8 fields of the VCF) and genotypes (the rest).

For a complete, detailed argument reference, refer to the GATK document page here.

2. Logic for merging records across VCFs

CombineVariants will include a record at every site present in any of your input VCF files, and will annotate, in the set attribute of the INFO field (see below), which input ROD bindings each record is present, passing, or filtered in. In effect, CombineVariants always produces a union of the input VCFs. However, any part of the Venn diagram of the N merged VCFs can be extracted using JEXL expressions on the set attribute with SelectVariants. If you want to extract just the records in common between two VCFs, you would first run CombineVariants on the two files to produce a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.
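As an alternative to the SelectVariants step, the set attribute can also be queried with a trivial script; this sketch (function name and demo lines are ours) keeps only the records marked Intersection:

```python
def select_by_set(lines, value="Intersection", key="set"):
    """Yield header lines plus VCF data lines whose INFO field contains key=value."""
    for line in lines:
        if line.startswith("#"):
            yield line                                   # keep all header lines
            continue
        info = line.rstrip("\n").split("\t")[7]          # INFO is the 8th column
        if "%s=%s" % (key, value) in info.split(";"):
            yield line

# Two data lines modeled on a CombineVariants union output
lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t990839\tSNP1-980702\tC\tT\t.\tPASS\tAC=150;set=Intersection\n",
    "1\t990882\tSNP1-980745\tC\tT\t.\tPASS\tCR=99.79873;set=omni\n",
]
kept = list(select_by_set(lines))
```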

3. Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the Tech Doc page for the tool.

4. Understanding the set attribute

The set INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

• set=Intersection : occurred in both call sets, not filtered out

• set=NAME : occurred in the call set NAME only

• set=NAME1-filteredInNAME2 : occurred in both call sets, but was unfiltered in NAME1 and filtered in NAME2

• set=filteredInAll : occurred in both call sets, but was filtered out of both

For combinations of three or more call sets, you may see values like NAME1-NAME2, indicating that a variant occurred in both NAME1 and NAME2 but not in all of the sets.

5. Changing the set key

You can use -setKey foo to change the set=XXX tag to foo=XXX in your output. Additionally, -setKey null stops the set tag=value pair from being emitted at all.

6. Minimal VCF output

Add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. The only fields emitted will be GT:GQ for genotypes and the keySet for INFO.

An even more extreme output format is -sites_only, a general engine capability, where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

7. Combining Variant Calls with a minimum set of input sites

Add the -minN (or --minimumN) argument, followed by an integer, if you want to output only records present in at least N input files. This is useful, for example, when combining several data sets where you only want to keep sites present in at least 2 of them (in which case -minN 2 should be added to the command line).

8. Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection"' -o intersect.vcf


This results in two VCF files, which look like:

==> union.vcf <==
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

==> intersect.vcf <==
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection


VariantFiltration

For a complete, detailed argument reference, refer to the GATK document page here.

The documentation for Using JEXL expressions within the GATK contains very important information about limitations of the filtering that can be done; in particular please note the section on working with complex expressions.

Filtering Individual Genotypes

You can now filter individual samples/genotypes in a VCF based on information from the FORMAT field: VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag). This is still a work in progress and isn't yet as flexible and powerful as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods so that you can filter out hets (isHet == 1), refs (isHomRef == 1), or homs (isHomVar == 1).
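To illustrate the idea, marking an individual genotype with an FT value based on its GQ might look like this (a sketch of the concept with our own helper name and filter name; this is not VariantFiltration's actual code):

```python
def add_genotype_filter(format_keys, sample, gq_threshold=5.0, filter_name="lowGQ"):
    """Append an FT value to one sample's data: filter_name if GQ < threshold, else PASS."""
    fields = dict(zip(format_keys.split(":"), sample.split(":")))
    gq = float(fields.get("GQ", "0"))
    ft = filter_name if gq < gq_threshold else "PASS"
    return format_keys + ":FT", sample + ":" + ft

# A genotype with GQ below the threshold gets the sample-level filter tag;
# the record-level FILTER column is left untouched.
```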

IMPORTANT NOTE: This document is out of date and will be replaced soon. In the meantime, you can find accurate information on how to run SnpEff in a compatible way with GATK in the SnpEff documentation, and instructions on what steps are necessary in the presentation on Functional Annotation linked in the comments below.

Our testing has shown that not all combinations of snpEff/database versions produce high-quality results. Be sure to read this document completely to familiarize yourself with our recommended best practices BEFORE running snpEff.

Introduction

Until recently we were using an in-house tool for genomic annotation, but the burden of keeping its database current and its inability to annotate indels led us to adopt a third-party tool instead. After reviewing many external tools (including annoVar, VAT, and Oncotator), we decided that SnpEff best meets our needs, as it accepts VCF files as input, can annotate a full exome callset (including indels) in seconds, and provides continually updated transcript databases. We have implemented support in the GATK for parsing the output of SnpEff and annotating VCFs with the information it provides.

SnpEff Setup and Usage

Download the SnpEff core program. If you want to be able to run VariantAnnotator on the SnpEff output, you'll need to download a version of SnpEff that VariantAnnotator supports from this page (currently supported versions are listed below). If you just want the most recent version of SnpEff and don't plan to run VariantAnnotator on its output, you can get it from here.

After unzipping the core program, open the file snpEff.config in a text editor, and change the "database_repository" line to the following:

database_repository = http://sourceforge.net/projects/snpeff/files/databases/


Then download the annotation database you want to use:

java -jar snpEff.jar download GRCh37.64


You can find a list of available databases here. The human genome databases have GRCh or hg in their names. You can also download the databases directly from the SnpEff website, if you prefer.

The download command by default puts the databases into a subdirectory called data within the directory containing the SnpEff jar file. If you want the databases in a different directory, you'll need to edit the data_dir entry in the file snpEff.config to point to the correct directory.

Run SnpEff on the file containing your variants, and redirect its output to a file. SnpEff supports many input file formats including VCF 4.1, BED, and SAM pileup. Full details and command-line options can be found on the SnpEff home page.

Supported SnpEff Versions

If you want to take advantage of SnpEff integration in the GATK, you'll need to run SnpEff version 2.0.5. Note: newer versions are currently unsupported by the GATK, as we haven't yet had the resources to test them.

Current Recommended Best Practices When Running SnpEff

These best practices are based on our analysis of various snpEff/database versions, as described in detail in the Analyses of SnpEff Annotations Across Versions section below.

• We recommend using only the GRCh37.64 database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations due to a regression in the ENSEMBL Release 65 GTF file used to build the database. This regression has been acknowledged by ENSEMBL and is supposedly fixed as of 1-30-2012; however as we have not yet tested the fixed version of the database we continue to recommend using only GRCh37.64 for now.

• We recommend always running with -onlyCoding true with human databases (e.g., the GRCh37.* databases). Setting -onlyCoding false causes snpEff to report all transcripts as if they were coding (even if they're not), which can lead to nonsensical results. The -onlyCoding false option should only be used with databases that lack protein coding information.

• Do not trust annotations from versions of snpEff prior to 2.0.4. Older versions of snpEff (such as 2.0.2) produced many incorrect annotations due to the presence of a certain number of nonsensical transcripts in the underlying ENSEMBL databases. Newer versions of snpEff filter out such transcripts.

Analysis of SnpEff Annotations Across Versions

See our analysis of the SNP annotations produced by snpEff across various snpEff/database versions here.

• Both snpEff 2.0.2 + GRCh37.63 and snpEff 2.0.5 + GRCh37.65 produce an abnormally high Missense:Silent ratio, with elevated levels of Missense mutations across the entire spectrum of allele counts. They also have a relatively low (~70%) level of concordance with the 1000G Gencode annotations when it comes to Silent mutations. This suggests that these combinations of snpEff/database versions incorrectly annotate many Silent mutations as Missense.

• snpEff 2.0.4 RC3 + GRCh37.64 and snpEff 2.0.5 + GRCh37.64 produce a Missense:Silent ratio in line with expectations, and have a very high (~97%-99%) level of concordance with the 1000G Gencode annotations across all categories.

See our comparison of SNP annotations produced using the GRCh37.64 and GRCh37.65 databases with snpEff 2.0.5 here.

• The GRCh37.64 database gives good results on the condition that you run snpEff with the -onlyCoding true option. The -onlyCoding false option causes snpEff to mark all transcripts as coding, and so produces many false-positive Missense annotations.

• The GRCh37.65 database gives results that are as poor as those you get with the -onlyCoding false option on the GRCh37.64 database. This is due to a regression in the ENSEMBL release 65 GTF file used to build snpEff's GRCh37.65 database. The regression has been acknowledged by ENSEMBL and is due to be fixed shortly.

See our analysis of the INDEL annotations produced by snpEff across snpEff/database versions here.

• snpEff's indel annotations are highly concordant with those of a high-quality set of genomic annotations from the 1000 Genomes project. This is true across all snpEff/database versions tested.

Example SnpEff Usage with a VCF Input File

Below is an example of how to run SnpEff version 2.0.5 with a VCF input file and have it write its output in VCF format as well. Notice that you need to explicitly specify the database you want to use (in this case, GRCh37.64). This database must be present in a directory of the same name within the data_dir as defined in snpEff.config.

java -Xmx4G -jar snpEff.jar eff -v -onlyCoding true -i vcf -o vcf GRCh37.64 1000G.exomes.vcf > snpEff_output.vcf


In this mode, SnpEff aggregates all effects associated with each variant record together into a single INFO field annotation with the key EFF. The general format is:

EFF=Effect1(Information about Effect1),Effect2(Information about Effect2),etc.


And here is the precise layout with all the subfields:

EFF=Effect1(Effect_Impact|Effect_Functional_Class|Codon_Change|Amino_Acid_Change|Gene_Name|Gene_BioType|Coding|Transcript_ID|Exon_ID),Effect2(etc...


It's also possible to get SnpEff to output in a (non-VCF) text format with one Effect per line. See the SnpEff home page for full details.
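As a rough illustration of the EFF layout above, the annotation can be unpacked with a short script. This is a minimal sketch only: the function name parse_eff and the example values are ours, not part of SnpEff or the GATK.

```python
# Minimal sketch of parsing a SnpEff EFF annotation into individual effects.
# parse_eff and the example below are illustrative, not SnpEff/GATK code.
SUBFIELDS = [
    "Effect_Impact", "Effect_Functional_Class", "Codon_Change",
    "Amino_Acid_Change", "Gene_Name", "Gene_BioType", "Coding",
    "Transcript_ID", "Exon_ID",
]

def parse_eff(eff_value):
    """Split 'Effect1(a|b|...),Effect2(...)' into (effect, subfields) pairs."""
    effects = []
    for entry in eff_value.split(","):
        name, _, rest = entry.partition("(")
        parts = rest.rstrip(")").split("|")
        # pad so that trailing subfields missing from the annotation stay empty
        info = dict(zip(SUBFIELDS, parts + [""] * (len(SUBFIELDS) - len(parts))))
        effects.append((name, info))
    return effects

eff = ("SYNONYMOUS_CODING(LOW|SILENT|||SAMD11|protein_coding|CODING"
       "|ENST00000342066|exon_1_874655_874840)")
print(parse_eff(eff)[0][0])  # SYNONYMOUS_CODING
```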

Once you have a SnpEff output VCF file, you can use the VariantAnnotator walker to add SnpEff annotations based on that output to the input file you ran SnpEff on.

There are two different options for doing this:

Option 1: Annotate with only the highest-impact effect for each variant

NOTE: This option works only with supported SnpEff versions as explained above. VariantAnnotator run as described below will refuse to parse SnpEff output files produced by other versions of the tool, or which lack a SnpEff version number in their header.

The default behavior when you run VariantAnnotator on a SnpEff output file is to parse the complete set of effects resulting from the current variant, select the most biologically-significant effect, and add annotations for just that effect to the INFO field of the VCF record for the current variant. This is the mode we plan to use in our Production Data-Processing Pipeline.

When selecting the most biologically-significant effect associated with the current variant, VariantAnnotator does the following:

• Prioritizes the effects according to the categories (in order of decreasing precedence) "High-Impact", "Moderate-Impact", "Low-Impact", and "Modifier", and always selects one of the effects from the highest-priority category. For example, if there are three moderate-impact effects and two high-impact effects resulting from the current variant, the annotator will choose one of the high-impact effects and add annotations based on it. See below for a full list of the effects arranged by category.

• Within each category, ties are broken using the functional class of each effect (in order of precedence: NONSENSE, MISSENSE, SILENT, or NONE). For example, if there is both a NON_SYNONYMOUS_CODING (MODERATE-impact, MISSENSE) and a CODON_CHANGE (MODERATE-impact, NONE) effect associated with the current variant, the annotator will select the NON_SYNONYMOUS_CODING effect. This is to allow for more accurate counts of the total number of sites with NONSENSE/MISSENSE/SILENT mutations. See below for a description of the functional classes SnpEff associates with the various effects.

• Effects that are within a non-coding region are always considered lower-impact than effects that are within a coding region.
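The selection rules above amount to a simple ranking: order candidate effects first by impact category, then break ties by functional class. The sketch below is our own illustration of that logic, assuming the orderings stated in this section; it is not the GATK's internal code, and for brevity it omits the coding/non-coding tie-break.

```python
# Sketch of the effect-selection logic described above: rank by impact
# category, then by functional class. Names are illustrative assumptions.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}
CLASS_RANK = {"NONSENSE": 0, "MISSENSE": 1, "SILENT": 2, "NONE": 3}

def most_significant(effects):
    """effects: list of (effect_name, impact, functional_class) tuples."""
    # the coding vs non-coding tie-break mentioned above is omitted here
    return min(effects, key=lambda e: (IMPACT_RANK[e[1]], CLASS_RANK[e[2]]))

effects = [
    ("CODON_CHANGE", "MODERATE", "NONE"),
    ("NON_SYNONYMOUS_CODING", "MODERATE", "MISSENSE"),
    ("DOWNSTREAM", "MODIFIER", "NONE"),
]
print(most_significant(effects)[0])  # NON_SYNONYMOUS_CODING
```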

Example Usage:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-A SnpEff \
--variant 1000G.exomes.vcf \        (file to annotate)
--snpEffFile snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf


VariantAnnotator adds some or all of the following INFO field annotations to each variant record:

• SNPEFF_EFFECT - The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)
• SNPEFF_IMPACT - Impact of the highest-impact effect resulting from the current variant (HIGH, MODERATE, LOW, or MODIFIER)
• SNPEFF_FUNCTIONAL_CLASS - Functional class of the highest-impact effect resulting from the current variant (NONE, SILENT, MISSENSE, or NONSENSE)
• SNPEFF_CODON_CHANGE - Old/New codon for the highest-impact effect resulting from the current variant
• SNPEFF_AMINO_ACID_CHANGE - Old/New amino acid for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_NAME - Gene name for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_BIOTYPE - Gene biotype for the highest-impact effect resulting from the current variant
• SNPEFF_TRANSCRIPT_ID - Transcript ID for the highest-impact effect resulting from the current variant
• SNPEFF_EXON_ID - Exon ID for the highest-impact effect resulting from the current variant

Example VCF records annotated using SnpEff and VariantAnnotator:

1   874779  .   C   T   279.94  . AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.4493;InbreedingCoeff=-0.0045;
SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;
SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066

1   874816  .   C   CT  2527.52 .   AC=15;AF=0.0484;AN=310;BaseQRankSum=-11.876;DP=4718;FS=48.575;HRun=1;HaplotypeScore=91.9147;InbreedingCoeff=-0.0520;
SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000342066


Option 2: Annotate with all effects for each variant

VariantAnnotator also has the ability to take the EFF field from the SnpEff VCF output file containing all the effects aggregated together and copy it verbatim into the VCF to annotate.

Here's an example of how to do this:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-E resource.EFF \
--variant 1000G.exomes.vcf \      (file to annotate)
--resource snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf


Of course, in this case you can also use the VCF output by SnpEff directly, but if you are using VariantAnnotator for other purposes anyway, the above might be useful.

List of Genomic Effects

Below are the possible genomic effects recognized by SnpEff, grouped by biological impact. Full descriptions of each effect are available on this page.

High-Impact Effects

• SPLICE_SITE_ACCEPTOR
• SPLICE_SITE_DONOR
• START_LOST
• EXON_DELETED
• FRAME_SHIFT
• STOP_GAINED
• STOP_LOST

Moderate-Impact Effects

• NON_SYNONYMOUS_CODING
• CODON_CHANGE (note: this effect is used by SnpEff only for MNPs, not SNPs)
• CODON_INSERTION
• CODON_CHANGE_PLUS_CODON_INSERTION
• CODON_DELETION
• CODON_CHANGE_PLUS_CODON_DELETION
• UTR_5_DELETED
• UTR_3_DELETED

Low-Impact Effects

• SYNONYMOUS_START
• NON_SYNONYMOUS_START
• START_GAINED
• SYNONYMOUS_CODING
• SYNONYMOUS_STOP
• NON_SYNONYMOUS_STOP

Modifiers

• NONE
• CHROMOSOME
• CUSTOM
• CDS
• GENE
• TRANSCRIPT
• EXON
• INTRON_CONSERVED
• UTR_5_PRIME
• UTR_3_PRIME
• DOWNSTREAM
• INTRAGENIC
• INTERGENIC
• INTERGENIC_CONSERVED
• UPSTREAM
• REGULATION
• INTRON

Functional Classes

SnpEff assigns a functional class to certain effects, in addition to an impact:

• NONSENSE: assigned to point mutations that result in the creation of a new stop codon
• MISSENSE: assigned to point mutations that result in an amino acid change, but not a new stop codon
• SILENT: assigned to point mutations that result in a codon change, but not an amino acid change or new stop codon
• NONE: assigned to all effects that don't fall into any of the above categories (including all events larger than a point mutation)

The GATK prioritizes effects with functional classes over effects of equal impact that lack a functional class when selecting the most significant effect in VariantAnnotator. This is to enable accurate counts of NONSENSE/MISSENSE/SILENT sites.

[Figures: 2 SNPs with significant strand bias; several SNPs with excessive coverage]

For a complete, detailed argument reference, refer to the GATK document page here.

Introduction

In addition to true variation, variant callers emit a number of false positives. Some of these false positives can be detected and rejected by various statistical tests. VariantAnnotator provides a way of annotating variant calls in preparation for executing these tests.


Examples of Available Annotations

The list below is not comprehensive. Please use the --list argument to get a list of all possible annotations available. Also, see the FAQ article on understanding the Unified Genotyper's VCF files for a description of some of the more standard annotations.

Note that technically VariantAnnotator does not require reads (from a BAM file) to run; if no reads are provided, only those annotations that don't use reads (e.g. Chromosome Counts) will be added. However, most annotations do require reads. When running the tool, we recommend that you add the -L argument with the variant rod to your command line for efficiency and speed.

For a complete, detailed argument reference, refer to the technical documentation page.

Modules

You can find detailed information about the various modules here.

Stratification modules

• AlleleFrequency
• AlleleCount
• CompRod
• Contig
• CpG
• Degeneracy
• EvalRod
• Filter
• FunctionalClass
• JexlExpression
• Novelty
• Sample

Evaluation modules

• CompOverlap
• CountVariants

Note that the GenotypeConcordance module has been rewritten as a separate walker tool (see its Technical Documentation page).

A useful analysis using VariantEval

We in GSA often find ourselves performing an analysis of two different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have two filtered VCF callsets named 'foo.vcf' and 'bar.vcf', there are two quick steps.

Combine the VCFs

java -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T CombineVariants \
-V:FOO foo.vcf \
-V:BAR bar.vcf \
-priority FOO,BAR \
-o merged.vcf


Run VariantEval

java -jar GenomeAnalysisTK.jar \
-T VariantEval \
-R ref.fasta \
-D dbsnp.vcf \
-select 'set=="Intersection"' -selectName Intersection \
-select 'set=="FOO"' -selectName FOO \
-select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \
-select 'set=="BAR"' -selectName BAR \
-select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \
-select 'set=="FilteredInAll"' -selectName FilteredInAll \
-o merged.eval.gatkreport \
-eval merged.vcf \
-l INFO


Checking the possible values of 'set'

It is wise to check the actual values for the set names present in your file before writing complex VariantEval commands. An easy way to do this is to extract the value of the set fields and then reduce that to the unique entries, like so:

java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V merged.vcf -F set -o fields.txt
grep -v 'set' fields.txt | sort | uniq -c


This will provide you with a list of all of the possible values for 'set' in your VCF so that you can be sure to supply the correct select statements to VariantEval.
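If you prefer, the same check can be done in a few lines of Python instead of VariantsToTable plus grep/sort/uniq. This sketch is not a GATK tool and assumes a plain-text (uncompressed) merged VCF:

```python
# Sketch: count the distinct values of the 'set' INFO key in a merged VCF.
# Assumes an uncompressed VCF; this is our own helper, not a GATK tool.
from collections import Counter

def count_set_values(vcf_path):
    counts = Counter()
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip header lines
            info = line.rstrip("\n").split("\t")[7]  # INFO is column 8
            for field in info.split(";"):
                key, _, value = field.partition("=")
                if key == "set":
                    counts[value] += 1
    return counts

# for value, n in count_set_values("merged.vcf").items():
#     print(n, value)
```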

The VariantEval output is formatted as a GATKReport.

Understanding Genotype Concordance values from Variant Eval

The VariantEval genotype concordance module emits information about the relationship between the eval calls and genotypes and the comp calls and genotypes. Three key metrics provide insight into call sensitivity and concordance between genotypes.

##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats : the concordance statistics summary for each sample
GenotypeConcordance.sampleSummaryStats  CompRod   CpG      EvalRod  JexlExpression  Novelty  percent_comp_ref_called_var  percent_comp_het_called_het  percent_comp_het_called_var  percent_comp_hom_called_hom  percent_comp_hom_called_var  percent_non-reference_sensitivity  percent_overall_genotype_concordance  percent_non-reference_discrepancy_rate
GenotypeConcordance.sampleSummaryStats  compOMNI  all      eval     none            all      0.78                         97.65                        98.39                        99.13                        99.44                        98.80                              99.09                                 3.60


The key outputs:

• percent_overall_genotype_concordance
• percent_non-reference_sensitivity
• percent_non-reference_discrepancy_rate

All are defined below.
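As a rough illustration of what such metrics capture, the sketch below computes similar percentages from paired comp/eval genotypes. This is our own simplified reading of the metric names; VariantEval's exact definitions may differ.

```python
# Rough sketch of concordance-style metrics from (comp, eval) genotype
# pairs. Our own simplified definitions, not the GATK implementation.
def concordance_metrics(pairs):
    """pairs: list of (comp_gt, eval_gt) strings such as '0/0', '0/1', '1/1'."""
    total = len(pairs)
    matches = sum(1 for c, e in pairs if c == e)
    # sites where at least one of comp/eval is non-reference
    nonref = [(c, e) for c, e in pairs if c != "0/0" or e != "0/0"]
    nonref_discordant = sum(1 for c, e in nonref if c != e)
    # sites that are variant in comp, and how many were called variant in eval
    comp_var = [(c, e) for c, e in pairs if c != "0/0"]
    comp_var_called_var = sum(1 for c, e in comp_var if e != "0/0")
    return {
        "overall_genotype_concordance": 100.0 * matches / total,
        "non_reference_sensitivity": 100.0 * comp_var_called_var / len(comp_var),
        "non_reference_discrepancy_rate": 100.0 * nonref_discordant / len(nonref),
    }

m = concordance_metrics([("0/1", "0/1"), ("1/1", "1/1"),
                         ("0/0", "0/0"), ("0/1", "0/0")])
print(m["overall_genotype_concordance"])  # 75.0
```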

Introduction

Three-stage procedure:

• Create a master set of sites from your N batch VCFs that you want to genotype in all samples. At this stage you need to determine how you want to resolve disagreements among the VCFs. This is your master sites VCF.

• Take the master sites VCF and genotype each sample BAM file at these sites

• (Optionally) Merge the single sample VCFs into a master VCF file

Creating the master set of sites: SNPs and Indels

The first step of batch merging is to create a master set of sites that you want to genotype in all samples. To make this problem concrete, suppose I have two VCF files:

Batch 1:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891
20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30
20      10000000        .       T       G       .       PASS    .       GT:GQ   0/1:30
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000211        .       C       T       .       PASS    .       GT:GQ   0/1:30
20      10001436        .       A       AGG     .       PASS    .       GT:GQ   1/1:30


Batch 2:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000211        .       C       T       .       FAIL    .       GT:GQ   0/1:30
20      10000598        .       T       A       .       PASS    .       GT:GQ   1/1:30
20      10001436        .       A       AGGCT   .       PASS    .       GT:GQ   1/1:30


In order to merge these batches, I need to make a variety of bookkeeping and filtering decisions, as outlined in the merged VCF below:

Master VCF:

20      9999996     .       A       ATC     .       PASS    .       GT:GQ   0/1:30  [pass in both]
20      10000000        .       T       G       .       PASS    .       GT:GQ   0/1:30  [only in batch 1]
20      10000117        .       C       T       .       FAIL    .       GT:GQ   0/1:30  [fail in both]
20      10000211        .       C       T       .       FAIL    .       GT:GQ   0/1:30  [pass in 1, fail in 2, choice is unclear]
20      10000598        .       T       A       .       PASS    .       GT:GQ   1/1:30  [only in batch 2]
20      10001436        .       A       AGGCT   .       PASS    .       GT:GQ   1/1:30  [A/AGG in batch 1, A/AGGCT in batch 2, including this site may be problematic]


These issues fall into the following categories:

• For sites present in all VCFs where the alleles agree and the site is PASS in every batch (20:9999996 above), the site can obviously be considered "PASS" in the master VCF
• Some sites may be PASS in one batch, but absent in others (20:10000000 and 20:10000598), which occurs when the site is polymorphic in one batch but all samples are reference or no-called in the other batch
• Similarly, sites that are fail in all batches in which they occur can be safely filtered out, or included as failing filters in the master VCF (20:10000117)

There are two difficult situations that must be addressed according to the needs of the project merging the batches:

• Some sites may be PASS in some batches but FAIL in others. This might indicate that either:
• The site is actually truly polymorphic, but due to limited coverage, poor sequencing, or other issues it is flagged as unreliable in some batches. In these cases, it makes sense to include the site
• The site is actually a common machine artifact, but just happened to escape standard filtering in a few batches. In these cases, you would obviously like to filter out the site
• Even more complicated, it is possible that in the PASS batches you have found a reliable allele (C/T, for example) while in others there's no alt allele but actually a low-frequency error, which is flagged as failing. Ideally, here you could filter out the failing allele from the FAIL batches, and keep the pass ones
• Some sites may have multiple segregating alleles in each batch. Such sites are often errors, but in some cases may be actual multi-allelic sites, in particular for indels.

Unfortunately, we cannot determine which is actually the correct choice, especially given the goals of the project. We leave it up to the project bioinformatician to handle these cases when creating the master VCF. We are hopeful that at some point in the future we'll have a consensus approach to handle such merging, but until then this will be a manual process.

The GATK tool CombineVariants can be used to merge multiple VCF files, and parameter choices will allow you to handle some of the above issues. With tools like SelectVariants one can slice-and-dice the merged VCFs to handle these complexities as appropriate for your project's needs. For example, the above master merge can be produced with the following CombineVariants:

java -jar dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R human_g1k_v37.fasta \
-V:one,VCF combine.1.vcf -V:two,VCF combine.2.vcf \
--sites_only \
-minimalVCF \
-o master.vcf


producing the following VCF:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
20      9999996     .       A       ATC         .       PASS    set=Intersection
20      10000000        .       T       G           .   PASS    set=one
20      10000117        .       C       T           .       FAIL    set=FilteredInAll
20      10000211        .       C       T           .       PASS    set=filterIntwo-one
20      10000598        .       T       A           .       PASS    set=two
20      10001436        .       A       AGG,AGGCT       .       PASS    set=Intersection
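
The set labels in this output can be thought of as a simple function of each site's per-batch status. The sketch below is our own reconstruction of that logic, not CombineVariants' code; in particular, the component ordering in mixed pass/fail labels is an assumption and may differ from the actual output.

```python
# Sketch of deriving a CombineVariants-style 'set' label for one site.
# status maps batch name -> 'PASS', 'FAIL', or None (site absent in batch).
# Our own reconstruction; ordering of mixed labels is an assumption.
def set_label(status, order):
    present = [(name, status[name]) for name in order if status.get(name)]
    if all(s == "PASS" for _, s in present) and len(present) == len(order):
        return "Intersection"          # PASS in every batch
    if all(s == "FAIL" for _, s in present):
        return "FilteredInAll"         # fails everywhere it occurs
    return "-".join(name if s == "PASS" else "filterIn" + name
                    for name, s in present)

print(set_label({"one": "PASS", "two": "PASS"}, ["one", "two"]))  # Intersection
print(set_label({"one": "PASS", "two": None}, ["one", "two"]))    # one
print(set_label({"one": "FAIL", "two": "FAIL"}, ["one", "two"]))  # FilteredInAll
```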


Genotyping your samples at these sites

Having created the master set of sites to genotype, along with their alleles, as in the previous section, you now use the UnifiedGenotyper to genotype each sample independently at the master set of sites. This GENOTYPE_GIVEN_ALLELES mode of the UnifiedGenotyper will jump into the sample BAM file, and calculate the genotype and genotype likelihoods of the sample at the site for each of the genotypes available for the REF and ALT alleles. For example, for site 10000211, the UnifiedGenotyper would evaluate the likelihoods of the CC, CT, and TT genotypes for the sample at this site, choose the most likely configuration, and generate a VCF record containing the genotype call and the likelihoods for the three genotype configurations.
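The genotype selection step amounts to picking the configuration with the best likelihood. As a minimal sketch, using Phred-scaled PLs where smaller means more likely (the helper name is ours, not a GATK API):

```python
# Sketch: given Phred-scaled likelihoods (PL) for the three diploid
# configurations of a biallelic site, pick the genotype with the
# smallest PL. Illustrative only; not the UnifiedGenotyper's code.
def call_genotype(pl, ref, alt):
    genotypes = [ref + "/" + ref, ref + "/" + alt, alt + "/" + alt]
    return genotypes[pl.index(min(pl))]

# PL values as in the site 10000211 example discussed above
print(call_genotype([888, 0, 870], "C", "T"))  # C/T
```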

As a concrete example command line, you can genotype the master.vcf file using the bundle sample NA12878 with the following command:

java -Xmx2g -jar dist/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R bundle/b37/human_g1k_v37.fasta \
-I bundle/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam \
-alleles master.vcf \
-L master.vcf \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-out_mode EMIT_ALL_SITES \
-stand_call_conf 0.0 \
-glm BOTH \
-G none


The -L master.vcf argument tells the UG to only genotype the sites in the master file. If you don't specify this, the UG will genotype the master sites in GGA mode, but it will also genotype all other sites in the genome in regular mode.

The last item, -G none, prevents the UG from computing annotations you don't need. This command produces something like the following output:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
20      9999996     .       A       ACT         4576.19 .       .   GT:DP:GQ:PL     1/1:76:99:4576,229,0
20      10000000        .       T       G           0       .       .       GT:DP:GQ:PL     0/0:79:99:0,238,3093
20      10000211        .       C       T       857.79  .       .   GT:AD:DP:GQ:PL  0/1:28,27:55:99:888,0,870
20      10000598        .       T       A           1800.57 .       .   GT:AD:DP:GQ:PL  1/1:0,48:48:99:1834,144,0
20      10001436        .       A       AGG,AGGCT       1921.12 .       .   GT:DP:GQ:PL     0/2:49:84.06:1960,2065,0,2695,222,84


Several things should be noted here:

• As the genotype likelihoods calculation evolves, especially for indels, the exact results of this command will change.
• The command will emit sites that are hom-ref in the sample, but the -stand_call_conf 0.0 argument should be provided so that they aren't tagged as "LowQual" by the UnifiedGenotyper.
• The filtered site 10000117 in the master.vcf is not genotyped by the UG, as it doesn't pass filters and so is considered bad by the GATK UG. If you want to determine the genotypes for all sites, independent of filtering, you must unfilter all of your records in master.vcf, and if desired, restore the filter string for these records later.

This genotyping command can be performed independently per sample, and so can be parallelized easily on a farm with one job per sample, as in the following:

foreach sample in samples:
    run UnifiedGenotyper command above with -I $sample.bam -o $sample.vcf
end


(Optional) Merging the sample VCFs together

You can use a command similar to the CombineVariants command above to merge your single-sample genotyping runs back together. Suppose all of my UnifiedGenotyper jobs have completed, and I have VCF files named sample1.vcf, sample2.vcf, through sampleN.vcf. The single command:

java -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V:sample1 sample1.vcf -V:sample2 sample2.vcf [repeat until] -V:sampleN sampleN.vcf -o combined.vcf


General notes

• Because the GATK uses dynamic downsampling of reads, it is possible for truly marginal calls to change likelihoods from discovery (processing the BAM incrementally) vs. genotyping (jumping into the BAM). Consequently, do not be surprised to see minor differences in the genotypes for samples from discovery and genotyping.
• More advanced users may want to consider grouping several samples together for genotyping. For example, 100 samples could be genotyped in 10 groups of 10 samples, resulting in only 10 VCF files. Merging the 10 VCF files may be faster (or just easier to manage) than merging 100 individual VCFs.
• Sometimes, using this method, a site that is monomorphic within a batch will be identified as polymorphic in one or more samples within that same batch. This is because the UnifiedGenotyper applies a frequency prior to determine whether a site is likely to be monomorphic. If the site is monomorphic, it is either not output, or if EMIT_ALL_SITES is used, reference genotypes are output. If the site is determined to be polymorphic, genotypes are assigned greedily (as of GATK-v1.4). Calling on a single sample reduces the effect of the prior, so sites which were considered monomorphic within a batch could be considered polymorphic within a sub-batch.

Note: As of version 4, BEAGLE reads and outputs VCF files directly, and can handle multiallelic sites. We have not yet evaluated what this means for the GATK-BEAGLE interface; it is possible that some of the information provided below is no longer applicable as a result.

Introduction

BEAGLE is a state of the art software package for analysis of large-scale genetic data sets with hundreds of thousands of markers genotyped on thousands of samples. BEAGLE can

• phase genotype data (i.e. infer haplotypes) for unrelated individuals, parent-offspring pairs, and parent-offspring trios.
• infer sporadic missing genotype data.
• impute ungenotyped markers that have been genotyped in a reference panel.
• perform single marker and haplotypic association analysis.
• detect genetic regions that are homozygous-by-descent in an individual or identical-by-descent in pairs of individuals.

The GATK provides an experimental interface to BEAGLE. Currently, the only use cases supported by this interface are a) inferring missing genotype data from call sets (e.g. for lack of coverage in low-pass data), and b) genotype inference for unrelated individuals.

Workflow

The basic workflow for this interface is as follows:

• After variants are called and possibly filtered, the GATK walker ProduceBeagleInput takes the resulting VCF as input and produces a likelihood file in BEAGLE format.
• The user then runs BEAGLE with this likelihood file specified as input.
• After BEAGLE runs, the user must unzip the resulting output files (.gprobs, .phased) containing posterior genotype probabilities and phased haplotypes.
• The user can then run the GATK walker BeagleOutputToVCF to produce a new VCF with updated data. The new VCF will contain updated genotypes as well as updated annotations.

Producing Beagle input likelihoods file

Before running BEAGLE, we need to first take an input VCF file with genotype likelihoods and produce the BEAGLE likelihoods file using the walker ProduceBeagleInput, as described in detail in its documentation page.

For each variant in inputvcf.vcf, ProduceBeagleInput will extract the genotype likelihoods, convert from log to linear space, and produce a BEAGLE input file in Genotype likelihoods file format (see BEAGLE documentation for more details). Essentially, this file is a text file in tabular format, a snippet of which is pasted below:

marker    alleleA alleleB NA07056 NA07056 NA07056 NA11892 NA11892 NA11892
20:60251    T        C     10.00   1.26    0.00     9.77   2.45    0.00
20:60321    G        T     10.00   5.01    0.01    10.00   0.31    0.00
20:60467    G        C      9.55   2.40    0.00     9.55   1.20    0.00


Note that BEAGLE only supports biallelic sites. Markers can have an arbitrary label, but they need to be in chromosomal order. Sites that are not genotyped in the input VCF (i.e. which are annotated with a "./." string and have no Genotype Likelihood annotation) are assigned a likelihood value of (0.33, 0.33, 0.33).
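The log-to-linear conversion mentioned above can be sketched as follows for a Phred-scaled PL triplet. This is purely illustrative; ProduceBeagleInput's exact scaling and normalization may differ.

```python
# Sketch of converting Phred-scaled likelihoods (PL) to normalized
# linear-space likelihoods via 10**(-PL/10). Illustrative only; this
# may not match ProduceBeagleInput's exact output scaling.
def pl_to_linear(pl):
    raw = [10 ** (-x / 10.0) for x in pl]
    total = sum(raw)
    return [x / total for x in raw]

# An uncalled site (PL 0,0,0) yields the uniform (0.33, 0.33, 0.33)
# distribution mentioned above.
print(pl_to_linear([0, 0, 0]))
```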

IMPORTANT: Due to BEAGLE memory restrictions, it's strongly recommended that BEAGLE be run on a chromosome-by-chromosome basis. In the current use case, BEAGLE uses RAM in a manner approximately proportional to the number of input markers. After BEAGLE is run and an output VCF is produced as described below, CombineVariants can be used to combine the resulting VCFs, using the "-variantMergeOptions UNION" argument.

Running Beagle

We currently support only a subset of BEAGLE functionality - only unphased, unrelated input likelihood data is supported. To run an imputation analysis, run, for example:

java -Xmx4000m -jar path_to_beagle/beagle.jar like=path_to_beagle_output/beagle_output out=myrun


Extra BEAGLE arguments can be added as required.

Empirically, Beagle can handle up to roughly 800,000 markers with 4 GB of RAM. Larger chromosomes require additional memory.

Processing BEAGLE output files

BEAGLE will produce several output files. The following shell commands unzip the output files in preparation for their being processed, and put them all in the same place:

# unzip gzip'd files, force overwrite if existing
gunzip -f path_to_beagle_output/myrun.beagle_output.gprobs.gz
gunzip -f path_to_beagle_output/myrun.beagle_output.phased.gz
# also rename the Beagle likelihood file to maintain consistency
mv path_to_beagle_output/beagle_output path_to_beagle_output/myrun.beagle_output.like


Creating a new VCF from BEAGLE data with BeagleOutputToVCF

Once BEAGLE files are produced, we can update our original VCF with BEAGLE's data. This is done using the BeagleOutputToVCF tool.

The walker looks for the files specified with the -B(type,BEAGLE,file) triplets as above for the output posterior genotype probabilities, the output r^2 values and the output phased genotypes. The order in which these are given in the command line is arbitrary, but all three must be present for correct operation.

The output VCF has the new genotypes that Beagle produced, and several annotations are also updated. By default, the walker will update the per-genotype annotations GQ (Genotype Quality), the genotypes themselves, as well as the per-site annotations AF (Allele Frequency), AC (Allele Count) and AN (Allele Number).
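For reference, the per-site annotations relate to the genotypes in a simple way: AN counts all called alleles, AC counts the ALT alleles, and AF = AC/AN. A small sketch (the helper name is ours, not a GATK API):

```python
# Sketch: recompute AC (ALT allele count), AN (called allele number) and
# AF (allele frequency) from genotype strings. Illustrative only.
def allele_counts(genotypes):
    alleles = []
    for gt in genotypes:
        for a in gt.replace("|", "/").split("/"):
            if a != ".":            # no-call alleles don't count toward AN
                alleles.append(a)
    an = len(alleles)
    ac = sum(1 for a in alleles if a != "0")
    af = ac / an if an else 0.0
    return ac, an, af

print(allele_counts(["0/1", "1|1", "0/0", "./."]))  # (3, 6, 0.5)
```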

The resulting VCF can now be used for further downstream analysis.

Merging VCFs broken up by chromosome into a single genome-wide file

Assuming you have broken up your calls into Beagle by chromosome (as recommended above), you can use the CombineVariants tool to merge the resulting VCFs into a single callset.

java -jar /path/to/dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R reffile.fasta \
--out genome_wide_output.vcf \
-V:input1 beagle_output_chr1.vcf \
-V:input2 beagle_output_chr2.vcf \
.
.
.
-V:inputX beagle_output_chrX.vcf \
-type UNION -priority input1,input2,...,inputX


This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article.

As a complement to this document, we encourage you to watch the workshop videos available on our Events webpage.

Slides that explain the VQSR methodology in more detail as well as the individual component variant annotations can be found here in the GSA Public Drop Box.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.

Introduction

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds of being a true variant versus being false under the trained Gaussian mixture model.
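Conceptually, the VQSLOD of a variant is the log odds of its annotation values under the "good" model versus the "bad" model. The sketch below is a one-dimensional illustration with hand-specified Gaussian mixtures; the real model is multivariate and trained adaptively, and the log base here is a presentational choice, so treat this as a conceptual aid rather than GATK's implementation.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def vqslod(x, good_model, bad_model):
    """VQSLOD-style score for a single annotation value x: log odds of x
    under a 'good' versus a 'bad' mixture, each a list of (weight, mean, var)."""
    p_good = sum(w * gaussian_pdf(x, m, v) for w, m, v in good_model)
    p_bad = sum(w * gaussian_pdf(x, m, v) for w, m, v in bad_model)
    return math.log10(p_good / p_bad)
```

A value near the heart of the good mixture gets a positive score; a value near the bad mixture gets a negative one.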

Variant recalibration evaluates variants in a two-step process, each step performed by a distinct tool:

• VariantRecalibrator
Create a Gaussian mixture model by looking at the annotation values over a high-quality subset of the input call set, then evaluate all input variants against that model. This step produces a recalibration file.

• ApplyRecalibration
Apply the model parameters to each variant in the input VCF files, producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step filters the calls based on the new LOD score, writing a tranche name to the FILTER field for variants that don't meet the specified LOD threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.

How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.

How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD, and variants in your training sets were ranked by VQSLOD as well. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), the program finds the VQSLOD value above which 99.9% of the variants in the training callset fall, and uses that value as a threshold to filter your variants. Variants above the threshold pass the filter, so the FILTER field will contain PASS. Variants below the threshold are filtered out; they are still written to the output file, but the FILTER field contains the name of the tranche they belong to. So VQSRTrancheSNP99.90to100.00 means that the variant fell in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.
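The threshold-finding logic described above can be sketched as follows. This is an illustration of the idea, not GATK's implementation; the tranche name is taken from the example in the text and the function names are hypothetical.

```python
import math

def vqslod_threshold(training_vqslods, sensitivity):
    """Find the VQSLOD cutoff that retains `sensitivity` (e.g. 0.999) of the
    training-set variants: sort scores high to low and take the score at the
    position that covers the requested fraction."""
    scores = sorted(training_vqslods, reverse=True)
    k = max(1, math.ceil(sensitivity * len(scores)))
    return scores[k - 1]

def apply_filter(vqslod, threshold, tranche_name="VQSRTrancheSNP99.90to100.00"):
    """PASS a variant at or above the cutoff; otherwise label it with the
    tranche name, as ApplyRecalibration writes to the FILTER field."""
    return "PASS" if vqslod >= threshold else tranche_name
```

With ten training scores and a 90% sensitivity target, the cutoff lands at the ninth-highest score, so nine of the ten training variants pass.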

Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!

Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.
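Selecting records by tranche then amounts to filtering on the FILTER column. A minimal sketch, where the tranche name strings are examples and the helper is hypothetical:

```python
def keep_records(vcf_lines, use_tranches=("PASS",)):
    """Keep VCF records whose FILTER value is in use_tranches. To 'dip into'
    a lower tranche, add its name, e.g. 'VQSRTrancheSNP99.00to99.90'.
    Header lines (starting with '#') are always kept."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        filt = line.split("\t")[6]  # FILTER is the 7th VCF column
        if filt in use_tranches:
            kept.append(line)
    return kept
```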

The first tranche (from the bottom, with lowest values) is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels.

Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

• The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
• The truth sensitivity (TS) approach is conceptually cleaner than deciding on a novel Ti/Tv target for your dataset
• The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used HapMap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other high-quality sets of sites (~99% truly variable in the population) should work just as well. In our experience with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.

Finally, a couple of Frequently Asked Questions

- Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

- Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible on your environment PATH. Rscript is the command-line version of R that gets installed alongside it. We also make use of the ggplot2 library, so please be sure to install that package as well.


Hi,

I used CalculateGenotypePosteriors with the supporting file called ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf, obtained from 1000 Genomes. It contains both indels and SNPs, and I was able to use the file for the first step of the Genotype Refinement Workflow. My question is: is it an issue that I didn't remove the indels from the supporting file? I would presume not since first of all, both the indels and the SNPs from Phase 3 of 1000G should be high confidence, and second, my recalibrated vcf file includes both indels and snps, so it should be in my interest to have as much information as possible, actually, so I should consider indels as well.

Just want to check whether my reasoning is correct.

Thanks, Alva

Hi, I'm running the UnifiedGenotyper (Version=3.3-0-g37228af) on a pooled sample. The ploidy is set to 32. I'm trying to get allele frequency information and to filter sites based on depth of coverage, but the DP value is not consistent within a locus.

The DP at the start says there is a depth of 37, but at the end it says its 13. Judging from neighbouring sites, the actual depth seems to be 37, but I'm confused why it seems to only be using 13 of them. At other sites the scores match.

Dear all,

I need to generate vcf file with GATK and I need to have TI (transcript information) in INFO field and VF and GQX in FORMAT field.

Could you help me please with arguments.

My arguments to call variants are:

ERROR ------------------------------------------------------------------------------------------

Current error: INFO 13:49:09,583 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21 INFO 13:49:09,583 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:49:09,583 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:49:09,588 HelpFormatter - Program Args: --log_to_file /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.gatk.realigned.fixed.log --performanceLog /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.gatk.realigned.fixed.perflog --keep_program_records -T BaseRecalibrator -R /home/mano/media/NAS1/PFlab/Mano/Genome/reference.fa -I /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/H3K4me1_run22_F3_5.sorted.marked.intervals.realigned.fixed.bam -o /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.intervals.realigned.after_recal_data.table --knownSites /home/mano/media/NAS1/PFlab/Mano/Fwithoutchrsorted.vcf --knownSites /home/mano/media/NAS1/PFlab/Mano/SNPs_only_withoutCHR_sort.vcf --deletions_default_quality 45 --insertions_default_quality 45 --low_quality_tail 2 --solid_nocall_strategy LEAVE_READ_UNRECALIBRATED --solid_recal_mode SET_Q_ZERO --lowMemoryMode --no_standard_covs --bqsrBAQGapOpenPenalty 30.0 INFO 13:49:09,591 HelpFormatter - Executing as mano@balrog04 on Linux 3.2.0-26-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. 
INFO 13:49:09,592 HelpFormatter - Date/Time: 2014/06/02 13:49:09 INFO 13:49:09,592 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:49:09,592 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:49:10,461 GenomeAnalysisEngine - Strictness is SILENT INFO 13:49:10,538 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 13:49:10,547 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 13:49:10,574 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 INFO 13:50:17,644 RMDTrackBuilder - Writing Tribble index to disk for file /home/mano/media/NAS1/PFlab/Mano/SNPs_only_withoutCHR_sort.vcf.idx INFO 13:50:29,885 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------

Please note that the .idx file is created, but it is only a few kilobytes.

Hi GATK team,

I ran HaplotypeCaller on a bam file that I pre-processed according to your Best Practices. When looking at this bam in a genome browser, I encountered a position that seems to have a variant, but the variant doesn't appear in the vcf file.

This is the line from the vcf file (the variant I'm talking about is at position 75680995): 3 75680992 . C . . END=75680995 GT:DP:GQ:MIN_DP:PL 0/0:155:0:147:0,0,0

Attached is the picture from the genome browser - At the same position (75680995) there is a variant with the allele 'A', while the reference is 'C'. The coverage at this position is 195, and 171 of those reads are for the 'A' allele.

Why does this happen?

Thank you, Maya

Hi,

I'm doing a variant analysis of genomic DNA from 2 related samples. I followed the up-to-date Best practices using HaplotypeCaller in GVCF mode for both samples followed by GenotypeGVCF to compute a common vcf of variant loci. I'm looking at variants that would be sample2-specific (present in sample2 but not in sample1)

Here is a line of this file:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 chrIII 91124 . A AATAAGAGGAATTAGGCT 1132.42 . AC=2;AF=0.500;AN=4;DP=47;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=58.85;MQ0=0;QD=7.99 GT:AD:DP:GQ:PL 1/1:0,25:25:55:1167,55,0 0/0:.:22:33:0,33,495

In the genotype field, sample2.AD is a . (dot), meaning that no reads passed the quality filters. However, sample2.DP=22, meaning that 22 reads covered this position. This line suggests that this variation is specific to sample1 (genotype HomVar 1/1) and is not present in sample2 (HomRef 0/0). But given the biological relationship between sample1 and 2 (the way they were generated), I doubt that this variation is sample1-specific: it is very likely to be present in sample2 as well, i.e. a false negative.

I have 416 loci like this. For the vast majority of them, sample1 and 2 likely share the same variation. But since it is not impossible that a very few of them are really sample1=HomVar and sample2=HomRef, could you suggest a way to detect those? What about comparing sample1.PL(1/1) and sample2.PL(0/0)? For example, could you suggest a rule of thumb for their ratio?
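One possible rule of thumb, sketched below under the poster's framing: treat the second-smallest PL as a phred-scaled confidence in the called genotype (this is essentially what GQ reports, capped at 99), and flag loci where sample1's HomVar call is strongly supported while sample2's HomRef call is weakly supported. The min_conf cutoff and both function names are arbitrary and for illustration only; this is not an official GATK recommendation.

```python
def genotype_confidence(pl):
    """Phred-scaled confidence in the called genotype: the gap between the
    second-smallest and smallest PL (with a best PL of 0, simply the
    second-smallest PL)."""
    s = sorted(pl)
    return s[1] - s[0]

def likely_shared_variant(pl_sample1, pl_sample2, min_conf=40):
    """Flag loci where sample1's call (HomVar in the poster's case) is strong
    but sample2's HomRef call is weak -- candidates for a variant shared by
    both samples but missed in sample2. min_conf is an arbitrary cutoff."""
    return (genotype_confidence(pl_sample1) >= min_conf
            and genotype_confidence(pl_sample2) < min_conf)
```

For the example line, sample1's PL of 1167,55,0 gives a confidence of 55, while sample2's PL of 0,33,495 gives only 33, so with min_conf=40 this locus would be flagged.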

Hi!

I tried to run GATK's SelectVariants.

But the error message was generated.

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 37: unparsable vcf record with allele X

My vcf file has X in the ALT column.

Do I need to change the format?

I made this with Bowtie2 -> Samtools -> Bcftools

Here are the first lines of my vcf file.

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT pe_repeat_sort.bam

NODE_2_length_24922_cov_5660.89_ID_3 1 . T X . DP=5656;I16=5430,120,0,0,186033,6369057,0,0,230246,9637234,0,0,117493,2780997,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 2 . G X . DP=5959;I16=5703,150,0,0,195867,6688005,0,0,243427,10211483,0,0,119037,2816345,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 3 . T X . DP=6026;I16=5772,149,0,0,198149,6767651,0,0,246413,10342791,0,0,120850,2853280,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 4 . G T,X . DP=6418;I16=6110,193,2,0,210264,7157322,53,1517,263032,11066766,65,2257,122440,2884596,50,1250;QS=0.999747,0.000253,0.000000,0.000000;VDB=7.840000e-02;RPB=2.061348e+00 PL 0,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 5 . T X . DP=6471;I16=6156,194,0,0,211750,7207204,0,0,265091,11157299,0,0,124426,2918388,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 6 . T X . DP=6590;I16=6231,229,0,0,214810,7294948,0,0,269745,11355423,0,0,126359,2952585,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 7 . A T,X . DP=6799;I16=6344,308,3,0,221029,7499167,78,2178,277797,11694735,132,5808,128237,2985471,17,97;QS=0.999645,0.000355,0.000000,0.000000;VDB=2.063840e-02;RPB=-1.796363e+00 PL 0,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 8 . A X . DP=6940;I16=6453,374,0,0,229907,7902269,0,0,285040,11996704,0,0,131453,3049447,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 9 . G T,A,X . DP=7067;I16=6506,397,8,0,232521,8000427,275,9515,288326,12138944,347,15061,132809,3068393,146,3280;QS=0.998813,0.000894,0.000294,0.000000;VDB=4.127108e-03;RPB=-7.938902e-01 PL 0,255,255,255,255,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 10 . C X . 
DP=7122;I16=6564,398,0,0,235379,8123921,0,0,290896,12250986,0,0,135314,3114576,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 11 . T X . DP=7156;I16=6590,398,0,0,237130,8213824,0,0,292077,12304019,0,0,137422,3153354,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 12 . G T,X . DP=7392;I16=6719,483,13,0,244199,8451047,409,13477,301084,12684774,529,21969,139336,3191190,254,5850;QS=0.998319,0.001681,0.000000,0.000000;VDB=2.994939e-02;RPB=1.872494e-02 PL 0,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 13 . A G,T,X . DP=7491;I16=6768,524,2,1,246463,8505451,102,3486,304841,12842749,116,4528,141728,3238198,53,1145;QS=0.999584,0.000265,0.000151,0.000000;VDB=5.361253e-02;RPB=2.817384e-01 PL 0,255,255,255,255,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 14 . T C,X . DP=7508;I16=6779,530,2,0,248025,8592981,47,1325,305546,12872218,84,3528,144261,3290147,50,1250;QS=0.999810,0.000190,0.000000,0.000000;VDB=6.720000e-02;RPB=1.155135e+00 PL 0,255,255,255,255,255

Thank you.

Won

Hello

I am following the GATK Best Practices guide, using the 2-stage BaseRecalibrator process. The first recalibration goes well, but the second one produces the error shown below. I find this strange, since it is not mentioned in the documentation (nor the tutorials) that BaseRecalibrator requires a VCF input.

Here is my command

java -Xmx2g -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R GRCh37-lite.fa \
-I SA495-Tumor.sorted.realigned.bam \
-BQSR SA495-Tumor.sorted.realigned.grp \
-o SA495-Tumor.sorted.post_recal.grp2

here is the error message

INFO 14:03:38,425 GATKRunReport - Uploaded run statistics report to AWS S3 ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8): ERROR ERROR This means that one or more arguments or inputs in your command are incorrect. ERROR The error message below tells you what is the problem. ERROR ERROR If the problem is an invalid argument, please check the online documentation guide ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ERROR ERROR Visit our website and forum for extensive documentation and answers to ERROR commonly asked questions http://www.broadinstitute.org/gatk ERROR ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ERROR ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.

Any idea?

Thanks

I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCF can generate files where some samples have fewer fields than the FORMAT column declares.

For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCF: 1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 ..............

The second-to-last sample entry is ./.:.:1 (3 fields), while the FORMAT entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one piece of software that I use will not parse VCF files when 1 < # sample fields < # format fields. If sample entries were extended with ":." for every missing FORMAT field (unless only . or ./. is present in the sample column), it would make the file parsable.

It's not too hard of a manual fix, but it might be nice to add the functionality into the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields.
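The padding described above can be done as a small post-processing step. The sketch below follows the poster's suggested rule; pad_sample is a hypothetical helper, not part of GATK.

```python
def pad_sample(format_field, sample_field):
    """Pad a VCF sample entry with ':.' until it has as many sub-fields as
    the FORMAT column declares. Bare '.' or './.' entries are left alone,
    as the poster suggests."""
    n_format = len(format_field.split(":"))
    parts = sample_field.split(":")
    if sample_field in (".", "./.") or len(parts) >= n_format:
        return sample_field
    return ":".join(parts + ["."] * (n_format - len(parts)))
```

For the example line, pad_sample("GT:AD:DP:GQ:PL", "./.:.:1") yields "./.:.:1:.:." with the full five sub-fields.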

Is there a Walker to select a certain number of random variants from a VCF file? I have checked SelectVariants, but the only similar option is --select_random_fraction, which selects a fraction of the variants rather than an exact number.

Hi,

I'm trying to use the RealignerTargetCreator as a test with 1 known file: 1000G_phase1.indels.b37.vcf. At first the contigs didn't match with my .BAM file (chr1/chr2 vs 1/2), so I adjusted that. Now when running it for the first time with java -jar [path to GenomeAnalysisTK.jar] -T RealignerTargetCreator -R [.fasta] -I [.bam] -o [.intervals] -known [path to 1000g_phase1_adjusted.indels.b37.vcf] it gives the following error: "I/O error loading or writing tribble index file for [path to 1000g]". When running it a second time, I get the error "Problem detecting index type", because the .idx file is not correctly created.

What am I doing wrong?

Hi, I started working with IGV, but I have some doubts about how to identify a good SNP in this program. First I downloaded the new soybean genome from Phytozome (Gmax_275_v2.0.fa and Gmax_275_Wm82.a2.v1.gene.gff3 files), and then I uploaded my files (sample.vcf, sample.bam and sample.bam.bai) into the program. I indexed the files the program needed, so that's OK! But my doubt is: which parameters should I consider for a good SNP? For example, what should I look at under Alleles, Genotypes and Variant Attributes? See the example below.

Chr: Chr06 Position: 35170948 ID: . Reference: C* Alternate: T Qual: 160 Type: SNP Is Filtered Out: No

Alleles: No Call: 0 Allele Num: 2 Allele Count: 4 Allele Frequency: 1

Minor Allele Fraction: 1

Genotypes: Non Variant: 0 - No Call: 0 - Hom Ref: 0 Variant: 1 - Het: 0 - Hom Var: 1

Variant Attributes AF1: 1 RPB: 5.557190e-01 VDB: 1.587578e-01 Depth: 18 FQ: -54 DP4: [1, 1, 6, 8] AC1: 2 Mapping Quality: 25 PV4: [1, 0.22, 1, 0.24]

Hi, I've just been having a go with the new HaplotypeCaller method and I've been getting some odd or malformed lines in the VCF file.

For example, the FORMAT field declares 4 sub-fields for each sample record, but instead some samples have only 2.

Two examples are shown here:

See Block 1
GT:DP:GQ:PL 1/1:.:3:32,3,0 ./.:0


Or where the format block declares 5 fields and we get 3 instead:

GT:AD:DP:GQ:PL  0/0:1,1,0,0,0,0,0,0,0:2:3:0,3,24,3,24,24        ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,40,3,41,43        ./.:.:3


Full blocks, Block 1

chr1    3489714 .       G       A       9667.23 .       AC=154;AF=1.00;AN=154;DP=0;FS=0.000;InbreedingCoeff=-0.0391;MLEAC=154;MLEAF=1.00;MQ=0.00;MQ0=0;EFF=INTRAGENIC(MODIFIER|||||TIAM1||CODING|||1),INTRON(MODIFIER||||649|TIAM1|protein_coding|CODING|ENSBTAT00000064124|1|1);CSQ=A|ENSBTAG00000017839|ENSBTAT00000064124|Transcript|intron_variant||||||||1/5||1|TIAM1|HGNC|      GT:DP:GQ:PL     1/1:.:3:32,3,0  1/1:.:18:207,18,0     1/1:.:6:78,6,0  1/1:.:12:140,12,0       1/1:.:9:101,9,0 ./.:0   1/1:.:9:97,9,0  1/1:.:9:96,9,0  1/1:.:21:244,21,0       1/1:.:12:138,12,0       1/1:.:12:124,12,0       1/1:.:9:105,9,0 1/1:.:15:164,15,0     1/1:.:15:153,15,0       1/1:.:27:265,27,0       1/1:.:12:125,12,0       1/1:.:18:214,18,0       ./.:0   1/1:.:9:108,9,0 1/1:.:15:169,15,0       ./.:0   1/1:.:6:76,6,0  1/1:.:6:66,6,0  1/1:.:12:140,12,0     1/1:.:3:28,3,0  1/1:.:3:10,3,0  1/1:.:12:128,12,0       ./.:0   1/1:.:18:181,18,0       1/1:.:9:98,9,0  1/1:.:15:161,15,0       1/1:.:15:185,15,0       1/1:.:12:133,12,0       1/1:.:15:175,15,0     1/1:.:18:178,18,0       1/1:.:12:133,12,0       1/1:.:9:105,9,0 1/1:.:12:141,12,0       1/1:.:15:166,15,0       1/1:.:9:108,9,0 1/1:.:15:160,15,0       1/1:.:27:267,27,0       1/1:.:21:218,21,0    1/1:.:9:107,9,0  1/1:.:3:28,3,0  1/1:.:9:80,9,0  1/1:.:6:46,6,0  ./.:0   1/1:.:6:61,6,0  1/1:.:21:241,21,0       1/1:.:15:161,15,0       1/1:.:6:82,6,0  1/1:.:12:143,12,0       1/1:.:9:109,9,0 1/1:.:21:249,21,0     1/1:.:6:40,6,0  1/1:.:9:94,9,0  1/1:.:15:185,15,0       1/1:.:12:129,12,0       1/1:.:12:132,12,0       ./.:0   1/1:.:21:207,21,0       1/1:.:12:136,12,0       1/1:.:12:109,12,0       1/1:.:18:192,18,0     ./.:0   1/1:.:9:68,9,0  1/1:.:12:138,12,0       1/1:.:6:73,6,0  1/1:.:9:105,9,0 1/1:.:9:98,9,0  1/1:.:6:65,6,0  ./.:0   1/1:.:6:65,6,0  ./.:0   1/1:.:6:58,6,0  1/1:.:12:131,12,0       ./.:0   ./.:01/1:.:3:38,3,0   1/1:.:3:37,3,0  1/1:.:21:227,21,0       1/1:.:12:131,12,0       1/1:.:6:66,6,0  1/1:.:9:100,9,0 1/1:.:21:209,21,0       
1/1:.:6:63,6,0  1/1:.:6:69,6,0


Block 2

chr1    55248   .       ACCC    A,CCCC  179.69  .       AC=13,6;AF=0.100,0.046;AN=130;BaseQRankSum=0.736;ClippingRankSum=0.736;DP=347;FS=0.000;InbreedingCoeff=0.2231;MLEAC=10,4;MLEAF=0.077,0.031;MQ=53.55;MQ0=0;MQRankSum=0.736;QD=4.99;ReadPosRankSum=0.736;EFF=INTERGENIC(MODIFIER||||||||||1),INTERGENIC(MODIFIER||||||||||2);CSQ=-||||intergenic_variant|||||||||||||     GT:AD:DP:GQ:PL  0/0:1,1,0,0,0,0,0,0,0:2:3:0,3,24,3,24,24        ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,40,3,41,43        0/0:1,0,0,0,0,0,0,0,0:1:2:0,2,44,3,45,46        0/0:.:5:0:0,0,103,0,103,103     0/0:3,0,1,0,0,0,0,0,0:4:9:0,9,81,9,81,81        0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2   ./.:.:1 0/0:.:3:0:0,0,41,0,41,41        0/0:1,0,0,0,0,0,0,0,0:1:9:0,9,73,9,73,73        0/1:1,0,0,1,0,0,0,0,0:2:22:28,0,73,28,22,46     0/0:1,0,0,0,0,0,0,0,0:1:4:0,4,25,4,25,25        0/0:2,0,0,0,0,0,0,0,0:2:21:0,21,273,21,273,273  0/1:5,0,0,1,0,0,0,0,0:6:11:11,0,235,26,158,175  ./.:0,0,0,0,1,0,0,0,0:1 0/0:.:4:0:0,0,46,0,46,46        ./.:.:3 ./.:.:2 0/1:1,0,1,1,0,0,0,0,0:3:21:28,0,44,31,21,52     0/0:.:4:0:0,0,77,0,77,77        0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2   0/0:.:2:6:0,6,51,6,51,51        0/0:.:6:2:0,2,151,2,151,151     0/0:2,0,1,0,0,0,0,0,0:3:7:0,7,74,7,74,74        ./.:.:0 0/0:2,0,0,0,0,0,0,0,0:2:5:0,7,59,5,37,35        1/2:0,0,0,1,0,0,0,0,0:1:1:27,1,40,26,0,25       0/0:2,0,0,0,0,0,0,0,0:2:9:0,9,83,9,83,83        ./.:.:0 2/2:0,0,0,0,0,3,0,0,0:3:9:59,59,59,9,9,0        ./.:.:16        0/0:2,0,0,0,0,0,0,0,0:2:7:0,7,59,7,59,59        0/0:2,0,0,0,0,0,0,0,0:2:9:0,9,83,9,83,83        0/0:2,0,0,0,0,0,0,0,0:2:11:0,11,68,11,68,68     0/2:6,0,0,0,0,2,0,0,0:8:18:18,39,384,0,346,340  ./.:.:1 0/1:3,0,0,1,0,0,0,0,0:4:25:25,0,96,34,100,134   0/0:2,0,0,0,0,0,0,0,0:2:12:0,12,105,12,105,105  0/1:1,0,0,1,0,0,0,0,0:2:17:17,0,94,21,26,44     0/0:.:2:6:0,6,64,6,64,64        0/0:1,0,0,0,0,0,0,0,0:1:2:0,2,24,3,25,27        0/0:.:2:6:0,6,48,6,48,48        0/0:.:2:6:0,6,63,6,63,63        0/0:0,0,1,0,0,0,0,0,0:1:0:0,0,1,0,1,1   ./.:.:0 
0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,6,3,8,9   0/0:1,0,0,0,0,0,0,0,0:1:15:0,15,124,15,124,124  ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:4:0,4,19,4,19,19        0/1:2,0,0,1,0,0,0,0,0:3:28:28,0,66,34,70,104    0/0:.:4:0:0,0,18,0,18,18        0/2:2,0,0,0,0,4,0,0,0:6:57:68,74,143,0,69,57    ./.:.:0 0/0:4,0,0,0,0,0,0,0,1:5:15:0,15,109,15,109,109  0/0:.:10:0:0,0,182,0,182,182    0/0:.:1:3:0,3,31,3,31,31        0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2   0/0:2,0,0,0,0,0,0,0,0:2:5:0,5,76,6,78,79        0/0:1,1,0,0,0,0,0,0,0:2:4:0,4,28,4,28,28        ./.:.:1 1/1:0,0,0,2,0,0,0,0,0:2:6:67,6,0,67,6,67        ./.:.:1 0/0:.:2:0:0,0,14,0,14,14        ./.:.:0 ./.:.:1 0/0:.:5:1:0,1,120,1,120,120     0/0:.:4:0:0,0,8,0,8,8   0/0:.:6:0:0,0,94,0,94,94        0/1:1,0,0,2,0,0,0,0,0:3:1:27,0,1,29,6,35        0/0:.:2:0:0,0,3,0,3,3   ./.:.:6 ./.:.:0 ./.:.:0 0/2:3,0,0,0,0,2,0,0,0:5:27:46,27,89,0,62,81     ./.:.:0 0/0:.:1:0:0,0,6,0,6,6   0/0:.:9:0:0,0,86,0,86,86        ./.:.:1 ./.:.:0 1/1:.:.:0:1,1,0,1,0,0   0/0:0,0,0,0,0,0,0,0,0:0:3:0,3,26,3,26,26        0/1:4,0,0,2,0,0,0,0,0:6:49:49,0,93,61,100,160   0/0:.:3:0:0,0,14,0,14,14        0/0:.:8:0:0,0,60,0,60,60        0/0:2,1,0,0,0,0,0,0,0:3:6:0,6,37,6,37,37        ./.:.:2 0/0:.:9:0:0,0,81,0,81,81        0/0:0,0,0,0,0,0,0,0,0:0:1:0,1,2,1,2,2


Any idea what the issue is?

Hi,

Thanks to previous replies, I can now run Queue and the relevant walker on a distributed computing server. My question is: if I define my scala script to require an argument for the output file, using the -o parameter like so:

            // Required arguments.  All initialized to empty values.
....

@Input(doc="Output file.", shortName="o")
var outputFile: File = _


How do I direct the output to a specified directory? Currently I have this code: genotyper.out = swapExt(qscript.bamFile, "bam", outputFile, "unfiltered.vcf")

Currently, when I include the string -o /path/to/my/output/files/MyResearch.vcf

the script creates a series of folders within the directory from which I execute Queue. In this case my results were sent to: /Queue-2-8-1-g932cd3a/MyResearch./path/to/my/output/files/MyResearch.unfiltered.vcf

when all I wanted was the output to appear in the path: /path/to/my/output/files/MyResearch.unfiltered.vcf

As always any help is much appreciated.

RodWalkers are really fast and great. Is there any way to connect a BAM file to one and get the pileup at given positions? If not, what is the appropriate way to get pileups for a small set of selected positions?

Dear GATK team,

Could you please help me interpret the genotype call in my vcf file? I have Illumina data and the variant caller was GATK. My alternate variant frequency (VF) is 99.7%, with DP = 4622 (AD = 16,4606), so I would expect this sample to be homozygous for the alternate allele. But when I check the PL value, which is PL = 1655,0,323, and calculate the likelihoods:

REF= G ALT= A

P(D|GG) = 10 ^ -165.5 = small
P(D|AG) = 10 ^ 0  = 1
P(D|AA) = 10 ^ -32.3 = small


we can see it is heterozygous. Can anybody help me interpret this result? How is it possible that the likelihoods indicate heterozygous while the coverage and VF indicate homozygous?

Here is part of my vcf file:

chr13 32899193 . G A 1625.01 PASS AC=1;AF=0.5;AN=2;DP=4622;QD=0.35;TI=NM_000059;GI=BRCA2;FC=Silent GT:AD:DP:GQ:PL:VF:GQX 0/1:16,4606:5000:99:1655,0,323:0.997:99

Thank you for any explanation.

Paul.
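In reply: the PL arithmetic is mechanical once you remember that PLs are Phred-scaled likelihoods (PL = -10·log10 L) normalized so the most likely genotype gets PL = 0. A minimal sketch of that conversion in plain Python (not GATK code; the genotype ordering for a biallelic site is ref/ref, ref/alt, alt/alt):

```python
def pl_to_likelihoods(pls):
    """Convert Phred-scaled, normalized PLs back to relative likelihoods."""
    return [10 ** (-pl / 10.0) for pl in pls]

# PL=1655,0,323 with REF=G, ALT=A, in the order GG, GA, AA
pls = [1655, 0, 323]
likelihoods = pl_to_likelihoods(pls)
genotypes = ["G/G", "G/A", "A/A"]
best = genotypes[likelihoods.index(max(likelihoods))]
print(best)  # -> G/A: the genotype with PL == 0 is always the most likely
```

So the caller genuinely does consider the het genotype most likely here, which is exactly the puzzle Paul raises: the likelihoods disagree with the raw allele depths.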

Dear all, I am writing to understand why FastaAlternateReferenceMaker is missing a deletion in the final consensus. I am using GATK v. 2.8-1 and this command line:

output from gatk was:

INFO 08:57:46,123 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,126 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15
INFO 08:57:46,126 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 08:57:46,126 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 08:57:46,130 HelpFormatter - Program Args: -R ref.fa -T FastaAlternateReferenceMaker -o reads_vs_ref_gatk_consensus.fa --variant reads_vs_ref_var.vcf
INFO 08:57:46,130 HelpFormatter - Date/Time: 2014/02/20 08:57:46
INFO 08:57:46,131 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,131 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,137 ArgumentTypeDescriptor - Dynamically determined type of reads_vs_ref_var.vcf to be VCF
INFO 08:57:46,856 GenomeAnalysisEngine - Strictness is SILENT
INFO 08:57:46,978 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 08:57:47,010 RMDTrackBuilder - Creating Tribble index in memory for file reads_vs_ref_var.vcf
INFO 08:57:47,079 RMDTrackBuilder - Writing Tribble index to disk for file /home/aprea/test/consensus/reads_vs_ref_var.vcf.idx
INFO 08:57:47,223 GenomeAnalysisEngine - Preparing for traversal
INFO 08:57:47,224 GenomeAnalysisEngine - Done preparing for traversal
INFO 08:57:47,225 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 08:57:47,226 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 08:57:47,546 ProgressMeter - done 1.05e+04 0.0 s 30.0 s 99.0% 0.0 s 0.0 s
INFO 08:57:47,547 ProgressMeter - Total runtime 0.32 secs, 0.01 min, 0.00 hours
INFO 08:57:51,452 GATKRunReport - Uploaded run statistics report to AWS S3

There are 3 variants in the vcf file: the first is a SNP and is reported, the second is an insertion and is reported, but the third, a deletion, is missing. Could you please tell me where the problem is? I have attached both the reference and the vcf file.

Many thanks.

I believe I may have found an issue with the CombineVariants tool of GATK that manifests when there is a repeated sample ID in a given VCF. For us, the reason to have repeated sample IDs in a VCF file is to detect inconsistencies by calling variants on 2 different DNA samples and then checking the concordance. Our current process is:

1) Generate a VCF containing unique IDs (using GATK CallVariants)
2) Replace the VCF header with potentially non-unique IDs (using tabix -r)
3) Merge a single VCF to uniqify the IDs (using GATK CombineVariants)

It seems that the genotypes in the merged VCF are off by one column. I've attached 3 files that demonstrate the issue: "combined" which is the result of step 1, "combined.renamed", which is the output of step 2, and "combined.renamed.merged", which is the output of step 3.

The relevant lines are as follows:
combined:

HG00421@123910725 HG00422 HG00422@123910706 HG00423@123910701 NA12801 NA12802
0/0:300           0/0:127 0/0:292           0/0:290           0/0:127 0/0:127
0/0:299           0/0:127 0/0:299           0/0:293           0/0:127 0/0:127


combined.renamed:

HG00421 HG00422 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127


combined.renamed.merged:

HG00421 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127


Looking at the depth values, we can see that in the merged dataset NA12801 has depths 290,293, whereas in the original and renamed datasets its depths were 127,127. The 290,293 depths correspond to HG00423, the column before.

I have confirmed this behavior in both GATK 2.7-4 and 2.8-1. If there's any more information that you need, please let me know, and I would be happy to provide it. Also, if you might know where this issue arises, I would be happy to try to provide a patch.

Thanks,

John Wallace

Is there a way to include only variant sites and no-calls in the final vcf? I know that during SNP calling you can emit only variants, only confident sites, or all sites. However, is there a way to reduce the vcf at the end to only variant sites (VQSR-passed) and positions where no call could be made? The resulting vcf would then contain only variant sites and missing data, and every position not listed in the vcf is reference. I need such a file for merging with other vcf files, so that every position absent from the vcfs being merged can be called ref.

So far I have called SNPs with emit-all and done VQSR. I now want to reduce the vcfs in size by excluding NO_VARIATION sites (but keep the information on "missing" sites).
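As a stopgap outside GATK, an emit-all VCF can be post-filtered with a small script that keeps header lines, true variant records, and records where every genotype is a no-call, and drops the rest. A rough sketch in plain Python (my own helper, not a GATK-supported path; adjust the no-call spellings to your data):

```python
def keep_record(vcf_line):
    """Keep headers, variant sites, and non-variant sites where all genotypes are no-calls."""
    if vcf_line.startswith("#"):
        return True
    fields = vcf_line.rstrip("\n").split("\t")
    if fields[4] != ".":
        return True  # a variant site (has a real ALT allele)
    # non-variant site: keep only if every sample genotype is missing
    genotypes = [sample.split(":")[0] for sample in fields[9:]]
    return all(gt in ("./.", ".", ".|.") for gt in genotypes)

# usage:
# with open("emit_all.vcf") as fin, open("reduced.vcf", "w") as fout:
#     fout.writelines(line for line in fin if keep_record(line))
```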

I am using the UnifiedGenotyper to call SNPs in a pooled sample of 30 diploid individuals (i.e., I am setting the ploidy to 60). Does this mean that if the coverage is < 60 at a given variant site, the vcf file will read "./." for all alleles at that site? In other words, does it require the coverage to be >= the ploidy, or will it not produce a called variant at that site? I'm just trying to make sure I am interpreting the vcf file correctly: if there is a called genotype at a given variant site, can I interpret it as the estimated allele frequency in that pool?

Best, Jon

Hi,

I have annotated my vcf file of 20 samples from Unified genotyper using the following steps.

Unified genotyper->Variantrecalibration->Applyrecalibration->VariantAnnotator

My question is: how should I proceed to select rare variants (MAF < 1%) in my candidate genes, for each of these 20 samples?

Hi, I'm trying to understand how you calculate the GQ of a SNP. I understand the model used to calculate the likelihood of all the genotypes (AA, AC, AG, AT, CC, CG, CT, GG, GT, TT). Once all the likelihoods have been calculated (correct me if I'm wrong), you normalize the likelihood of the best genotype to 1 and scale all the other likelihoods accordingly. So the PL field in the VCF should be the Phred-scaled values of those normalized likelihoods. But it is not clear to me how you finally calculate the GQ value. Which values do you use to calculate that quality (normalized or Phred-scaled)? And what is the right formula? I've tried to debug the code but it turns out to be really tricky. I really hope you can help me; I would be really thankful.
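As far as the convention goes, GQ is the Phred-scaled confidence in the called genotype: with the PLs normalized so the best genotype sits at 0, GQ is the gap to the second-best PL, capped at 99. A sketch of that formula (my paraphrase of the convention, not GATK source code):

```python
def genotype_quality(pls):
    """GQ: gap between the best and second-best Phred-scaled likelihoods, capped at 99."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], 99)

print(genotype_quality([1655, 0, 323]))  # -> 99 (gap of 323, capped)
print(genotype_quality([36, 3, 0]))      # -> 3 (a very shaky call)
```

Both examples match entries seen elsewhere in this thread (PL=1655,0,323 with GQ=99; PL=36,3,0 with GQ=3).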

I just wanted to select variants from a VCF with 42 samples. After 3 hours I got the following error. How can I fix this? Please advise. I had the same problem when I used VQSR. Thanks.

INFO 20:28:17,247 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:28:17,250 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
INFO 20:28:17,250 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 20:28:17,251 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 20:28:17,255 HelpFormatter - Program Args: -T SelectVariants -rf BadCigar -R /groups/body/JDM_RNA_Seq-2012/GATK/bundle-2.3/ucsc.hg19/ucsc.hg19.fasta -V /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf -L chr1 -L chr2 -L chr3 -selectType SNP -o /hms/scratch1/mahyar/Danny/data/Filter/extract_SNP_only3chr.vcf
INFO 20:28:17,256 HelpFormatter - Date/Time: 2014/01/20 20:28:17
INFO 20:28:17,256 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:28:17,256 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:28:17,305 ArgumentTypeDescriptor - Dynamically determined type of /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf to be VCF
INFO 20:28:18,053 GenomeAnalysisEngine - Strictness is SILENT
INFO 20:28:18,167 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 20:28:18,188 RMDTrackBuilder - Creating Tribble index in memory for file /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf
INFO 23:15:08,278 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR stack trace

java.lang.NegativeArraySizeException
        at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:97)
        at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:116)
        at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:84)
        at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:73)
        at net.sf.samtools.util.AbstractIterator.next(AbstractIterator.java:57)
        at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:46)
        at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:24)
        at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73)
        at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35)
        at org.broad.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40)
        at org.broad.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:428)
        at org.broad.tribble.index.IndexFactory$FeatureIterator.next(IndexFactory.java:390)
        at org.broad.tribble.index.IndexFactory.createIndex(IndexFactory.java:288)
        at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:278)
        at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:388)
        at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:274)
        at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:211)
        at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:140)
        at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208)
        at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:964)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:758)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:284)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------

How can I remove SNPs with low quality or out of Hardy-Weinberg equilibrium from my VCF file? In column 7 of my VCF file I have only "LowQual" or ".". There is no "PASS" entry in this column. Is there something wrong with my VCF, or a problem with my UnifiedGenotyper procedure? Thanks.

Hi,

I must be doing something silly, because all of my VCF files are empty (header only, no results) after calling with UnifiedGenotyper. I'm using GATK version 2.7-2-g6bda569 with Java 1.7.00_40. The FASTQs were aligned with Novoalign.

java -Xmx4g -Djava.io.tmpdir=tmp -jar GenomeAnalysisTK.jar -R hg19_random.fa -D dbsnp_137.hg19.vcf -T UnifiedGenotyper -I data/gatk.realigned.recal.chr1-1-249250621.bam -o data/gatk.snps.raw.chr1-1-249250621.vcf -metrics data/gatk.snps.raw.chr1-1-249250621.vcf.metrics -L chr1:1-249250621 --computeSLOD -stand_call_conf 30 -stand_emit_conf 1 -nt 8 -K GATK_public.key -et NO_ET

Here is a snippet from my BAM file:

H801:144:C39TBACXX:7:2113:7016:95747    73      chr1    671881  20      101M    =       671881  0       GCGGAGGCTGCCGTGACGTAGGGTATGGGCCTAAATAGGCCATTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTCGGCCTGGAGAG YYYZZZZZZZZZZZZYZ]ZZ]]]YYYZ]]]ZZ]]]Z]Z]Z]]]]]Z]]]YZZZ]]]]ZYZZZZZZZZZYYYZZZZYYZZYZYZZZYZYZZZZZZZZZWXWW   BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]      MD:Z:101        PG:Z:novoalign  RG:Z:SWID:testing:0     BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]      NM:i:0  UQ:i:0  AS:i:0
H801:144:C39TBACXX:7:2302:11261:7162    153     chr1    971492  70      101M    =       971492  0       GCTGTCTGTGAGGGCTGTGCTGAGGCCTTCCTGACCAGCACATGGGGTGGGAAGGACGACCTGGGGAATCCTGAAGTGATCTGAAGACAGAGCCCTGGGCT   WYWXYZZZZZYZZXZYZZZYZZZZZZYYZZZZZZZZZZZZZZY]]]]ZZYYYZZZYYZZZZ]]]ZZZYZXZ]ZZZZZZYZ]]]]ZZYZZZZYZZZZZZYYY   BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]      MD:Z:101        PG:Z:novoalign  RG:Z:SWID:testing:0     BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]      NM:i:0  UQ:i:0  AS:i:0
H801:144:C39TBACXX:7:2302:13699:15898   153     chr1    971492  70      101M    =       971492  0       GCTGTCTGTGAGGGCTGTGCTGAGGCCTTCCTGACCAGCACATGGGGTGGGAAGGACGACCTGGGGAATCCTGAAGTGATCTGAAGACAGAGCCCTGGGCT   XYXZZYZZZZYZZZYYXYZZYZZZZZYYZZZZYYZZYYZZZYY]]]]]ZZZ]]]ZZZYZ]]Z]]Z]ZZZ]]]]]]ZZYZZ]]ZZYYYZYZXZZZZZZZYYX   BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]      MD:Z:101        PG:Z:novoalign  RG:Z:SWID:testing:0     BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]      NM:i:0  UQ:i:0  AS:i:0
H801:144:C39TBACXX:7:2206:12702:96307   97      chr1    987905  70      101M    =       988021  217     GTGTGCATATGGGTCCATGTATGTGTGTGTATATGAGGGAGACACGCAGGTGTGTGTCTGAGTGTGTGCGCACATGGGTCCATGTATGTGTGTGTATAGGT   YXYZZZZZZZZZZZZ]]]]ZZZ]ZZZYYZZZZ]]]]]]]Z]ZZ]Z]]]]ZZZZZZZYZZ]ZZXYYYWYYZWXWXYWWWYZZYZYZZZYZZZZWXWWYZZYW   BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]      MD:Z:101        PG:Z:novoalign  RG:Z:SWID:testing:0     BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]      NM:i:0  MQ:i:70 UQ:i:0  AS:i:0
H801:144:C39TBACXX:7:2206:12702:96307   145     chr1    988021  70      101M    =       987905  -217    GTGTGTGTCCGTGTGTGTGCATGGGTCCATGTGTGTATAGTGTGTACACATGGGTCCATGTATGTGTGTGTATATGAGGGAGACACGCAGGTGTGTGTCCG   WWWWWZXZXXWWWXZYWWWYWWWWWWWXWXZYXYWWYYWX]]]ZYXZZYZZ]]]]]]]Z]Z]]]]]]]]]]]]]Z]]]]]]]]]Z]]]ZZZZZZZZZZYYY   BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]


And this is the output on the command line when running:

INFO  12:44:55,654 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:44:55,656 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-2-g6bda569, Compiled 2013/08/28 16:30:29
INFO  12:44:55,656 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  12:44:55,660 HelpFormatter - Program Args: -R hg19_random.fa -D dbsnp_137.hg19.vcf -T UnifiedGenotyper -I data/gatk.realigned.recal.chr1-1-249250621.bam -o data/gatk.snps.raw.chr1-1-249250621.vcf -metrics data/gatk.snps.raw.chr1-1-249250621.vcf.metrics -et NO_ET -L chr1:1-249250621 --computeSLOD -stand_call_conf 30 -stand_emit_conf 1 -nt 8 -K GATK_public.key
INFO  12:44:55,660 HelpFormatter - Date/Time: 2014/01/16 12:44:55
INFO  12:44:55,661 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:44:55,661 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:44:55,755 ArgumentTypeDescriptor - Dynamically determined type of dbsnp_137.hg19.vcf to be VCF
INFO  12:44:56,282 GenomeAnalysisEngine - Strictness is SILENT
INFO  12:44:56,370 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO  12:44:56,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:56,433 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05
INFO  12:44:56,490 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:44:56,736 IntervalUtils - Processing 249250621 bp from intervals
INFO  12:44:56,749 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 24 processors available on this machine
INFO  12:44:56,799 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  12:44:56,992 GenomeAnalysisEngine - Done preparing for traversal
INFO  12:44:56,992 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  12:44:56,992 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  12:44:57,085 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,090 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO  12:44:57,095 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,101 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  12:44:57,105 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,108 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:44:57,112 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  12:44:57,114 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,124 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  12:44:57,129 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,140 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  12:44:57,168 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,177 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  12:44:57,186 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,191 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO  12:44:57,285 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:44:57,482 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:44:57,655 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:44:57,827 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:44:57,996 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:44:58,301 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:45:26,995 ProgressMeter -   chr1:93237277        9.18e+07   30.0 s        0.0 s     37.4%        80.0 s    50.0 s
INFO  12:45:56,999 ProgressMeter -  chr1:200739385        1.99e+08   60.0 s        0.0 s     80.5%        74.0 s    14.0 s
INFO  12:46:12,672 ProgressMeter -            done        2.49e+08   75.0 s        0.0 s    100.0%        75.0 s     0.0 s
INFO  12:46:12,673 ProgressMeter - Total runtime 75.68 secs, 1.26 min, 0.02 hours
INFO  12:46:12,673 MicroScheduler - 5225 reads were filtered out during the traversal out of approximately 500463 total reads (1.04%)
INFO  12:46:12,673 MicroScheduler -   -> 5225 reads (1.04% of total) failing BadMateFilter
INFO  12:46:12,673 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  12:46:12,673 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  12:46:12,674 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  12:46:12,674 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  12:46:12,674 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO  12:46:12,674 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter


Can anyone tell me what might be wrong? It reliably fails for every chromosome on reasonably sized data (this BAM is 20M with ~500k reads).

One of my samples has the entry "1/1:0,1:1:3:36,3,0" in the genotype field, and from my understanding the genotype call is homozygous variant, because it has 1/1. However, I do not understand why: since it has 0 REF reads and only 1 ALT read, how does GATK decide this variant is homozygous?
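The call comes from the PL field rather than the raw depths: in "1/1:0,1:1:3:36,3,0", PL=36,3,0 means the hom-alt genotype has the best (lowest) Phred-scaled likelihood, even though only one read supports it, and the GQ of 3 flags how weak the call is. A quick check in plain Python (PL order is 0/0, 0/1, 1/1):

```python
# FORMAT GT:AD:DP:GQ:PL -> 1/1:0,1:1:3:36,3,0
pls = [36, 3, 0]                       # Phred-scaled, best genotype normalized to 0
genotypes = ["0/0", "0/1", "1/1"]
call = genotypes[pls.index(min(pls))]  # lowest PL = most likely genotype
gq = sorted(pls)[1] - min(pls)         # confidence gap to the runner-up
print(call, gq)  # -> 1/1 3
```

With a single ALT read and no REF reads, hom-alt is simply the least unlikely explanation; the tiny GQ is the signal that the call should not be trusted much.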

Hey there,

I was trying to build an analysis pipeline for paired reads with BWA, duplicate removal, local realignment, and base quality score recalibration, finally using GATK's UnifiedGenotyper for SNP and indel calling. However, for both SNPs and indels I receive no called variants, no matter how low I set the thresholds. The quality values of the reads look OK, and leaving out dbSNP does not change the results. I have used the same reference throughout the whole pipeline. I use GATK 2.7; a switch to GATK 1.6 did not change anything either.

This is my shell command for SNP calling on chromosome X (GATK delivers no results for any chromosome): java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o test.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf

Entries in my BAM file look like this:

SRR389458.1885965 113 X 10092397 37 76M = 10092397 1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U
SRR389458.1885965 177 X 10092397 37 76M = 10092397 -1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U
SRR389458.1888837 113 X 14748343 37 76M = 14748343 1 TCGTGAAAGTCGTTTTAATTTTAGCGGTTATGGGATGGGTCACTGCCTCCAAGTGAAAGATGGAACAGCTGTCAAG 889999:9988;98:9::9;9986::::99:8:::::999988989:8;;9::989:999:9;9:;:99:98:999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:76 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 MQ:i:37 XT:A:U

And this is the output of the UnifiedGenotyper:

INFO 17:57:00,575 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:57:00,578 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
INFO 17:57:00,578 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:57:00,578 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:57:00,582 HelpFormatter - Program Args: -T UnifiedGenotyper -R /hana/exchange/reference_genomes/hg19/Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o testX.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf
INFO 17:57:00,583 HelpFormatter - Date/Time: 2013/12/17 17:57:00
INFO 17:57:00,583 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:57:00,583 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:57:00,943 ArgumentTypeDescriptor - Dynamically determined type of /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf to be VCF
INFO 17:57:01,579 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:57:02,228 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 17:57:02,237 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 17:57:02,364 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.13
INFO 17:57:02,594 RMDTrackBuilder - Loading Tribble index from disk for file /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf
INFO 17:57:02,867 IntervalUtils - Processing 155270560 bp from intervals
INFO 17:57:02,943 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 17:57:03,166 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:57:03,167 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:57:03,167 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 17:57:33,171 ProgressMeter - X:11779845 1.01e+07 30.0 s 2.0 s 7.6% 6.6 m 6.1 m
INFO 17:58:03,173 ProgressMeter - X:24739805 1.89e+07 60.0 s 3.0 s 15.9% 6.3 m 5.3 m
INFO 17:58:33,175 ProgressMeter - X:37330641 3.25e+07 90.0 s 2.0 s 24.0% 6.2 m 4.7 m
INFO 17:59:03,177 ProgressMeter - X:49404077 4.94e+07 120.0 s 2.0 s 31.8% 6.3 m 4.3 m
INFO 17:59:33,178 ProgressMeter - X:64377965 5.36e+07 2.5 m 2.0 s 41.5% 6.0 m 3.5 m
INFO 18:00:03,180 ProgressMeter - X:75606869 7.54e+07 3.0 m 2.0 s 48.7% 6.2 m 3.2 m
INFO 18:00:33,189 ProgressMeter - X:88250233 7.74e+07 3.5 m 2.0 s 56.8% 6.2 m 2.7 m
INFO 18:01:03,190 ProgressMeter - X:100393213 9.94e+07 4.0 m 2.0 s 64.7% 6.2 m 2.2 m
INFO 18:01:33,192 ProgressMeter - X:110535705 1.09e+08 4.5 m 2.0 s 71.2% 6.3 m 109.0 s
INFO 18:02:03,193 ProgressMeter - X:121257489 1.20e+08 5.0 m 2.0 s 78.1% 6.4 m 84.0 s
INFO 18:02:33,195 ProgressMeter - X:132533757 1.32e+08 5.5 m 2.0 s 85.4% 6.4 m 56.0 s
INFO 18:03:03,197 ProgressMeter - X:144498909 1.41e+08 6.0 m 2.0 s 93.1% 6.4 m 26.0 s
INFO 18:03:30,079 ProgressMeter - done 1.55e+08 6.4 m 2.0 s 100.0% 6.4 m 0.0 s
INFO 18:03:30,079 ProgressMeter - Total runtime 386.91 secs, 6.45 min, 0.11 hours
INFO 18:03:30,080 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 150 total reads (0.00%)
INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing BadMateFilter
INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO 18:03:32,167 GATKRunReport - Uploaded run statistics report to AWS S3

Do I miss anything here?

Best,

Cindy

How can an individual in a vcf file have one called and one missing allele? I generated a dataset in GATK and did not encounter this in my GATK-generated vcfs; however, I saw it in comparative data vcfs, and I don't understand how such a call is generated. I thought it happened when a position had very low coverage, maybe only one read, so that only one allele could be called, but some of the half-missing calls had high read depth. Some more info: it is a single-individual vcf, and a low percentage of sites (around 0.1%) had one called allele and one missing allele, for instance 0/. or 1/. . Could somebody explain why such calls exist, and is it advisable to filter them out of the data?
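For anyone wanting to count or flag these before deciding what to filter, a genotype is a half-call when exactly one of its two alleles is '.'. A small detector in plain Python (my own helper, not a GATK tool):

```python
def is_half_call(gt):
    """True when exactly one allele of a diploid GT ('/' or '|' separated) is missing."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and alleles.count(".") == 1

print([gt for gt in ["0/1", "./.", "0/.", "1/.", "1|."] if is_half_call(gt)])
# -> ['0/.', '1/.', '1|.']
```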

I downloaded my intervals file from the UCSC human exome capture data (hg19_exome_sorted.bed), but I keep getting lots of "./." in the final VCF files. I asked before in this forum, and I remember Geraldine said this may be because of a sub-optimal interval file. This made me wonder: what do you guys use for human exome variant calling?

Hi,

I was wondering if you could use the toolkit to generate a separate VCF file containing only SNPs found at predetermined chromosome and base-pair positions. I have a plink file which I want to convert back to VCF format, and that seems unbelievably hard to do, so I thought this might be a good way around the problem.

I am aware that vcftools offers this function with the "--positions" option; however, for some reason I am getting far more variants than I listed, and there is nothing obviously wrong with my listed positions or vcf file.

My exome-seq read quality is pretty good (120M reads) and the bwa alignment rate is more than 95%. I used GATK to do the variant calling; however, the vcf output from VQSR contains many './.' fields, which I think means "not detected"?

I am wondering what the possible causes of ./. are. Are my conditions too stringent during the GATK run?

Hi Team, I have a VCF which I'd like to filter by variant frequency. The problem is, my frequencies are percentages rather than decimals. Is there a workaround in JEXL which allows it to parse the '%' character as a percentage (or ignore it entirely), rather than treating the field as a string upon seeing the modulo operator? The VCF also has two sample columns in the genotype fields (a normal and a tumor). Is it possible to drill down into these using just the genotypeFilterExpression/genotypeFilterName flags, or must I do something else?

Thanks, Eric T Dawson

Is there any way to combine a SNP VCF and an indel VCF (both generated with the UnifiedGenotyper) afterwards, so that there is only one row per locus?

Regardless of how I combine them (I mainly tried CombineVariants), if the two VCF files contain different calls at one locus, the combined file has two rows for it; I would like these written as alternatives on a single row for that locus.

So I have used the latest GATK best practices pipeline for variant detection on non-human organisms, but now I am trying to do it for human data. I downloaded the Broad bundle and I was able to run all of the steps up to and including ApplyRecalibration. However, now I am not exactly sure what to do. The VCF file that is generated contains these FILTER values:

.

PASS

VQSRTrancheSNP99.90to100.00

I am not sure what these mean. Does the "VQSRTrancheSNP99.90to100.00" filter mean that that SNP falls below the specified truth sensitivity level? Does "PASS" mean that it is above that level? Or is it vice versa? And what does "." mean? Which ones should I keep as "good" SNPs?

I'm also having some difficulty fully understanding how the VQSLOD is used.... and what does the "culprit" mean when the filter is "PASS"?

A final question.... I've been using this command to actually create a file with only SNPs that PASSed the filter:

java -Xmx2g -jar /share/apps/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T SelectVariants -R ~/broad_bundle/ucsc.hg19.fasta --variant Pt1.40300.output.recal_and_filtered.snps.chr1.vcf -o Pt1.40300.output.recal_and_filtered.passed.snps.chr1.vcf -select 'vc.isNotFiltered()'

Is this the correct way to get PASSed SNPs? Is there a better way? Any help you can give me would be highly appreciated. Thanks!
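The -select 'vc.isNotFiltered()' expression above keeps any record whose FILTER column is PASS or unset. As a rough illustration (not a GATK tool), the same logic in Python:

```python
# Rough pure-Python equivalent of -select 'vc.isNotFiltered()':
# keep records whose FILTER column (index 6) is "PASS" or "." (unfiltered).
def keep_unfiltered(vcf_lines):
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        filter_field = line.split("\t")[6]
        if filter_field in ("PASS", "."):
            kept.append(line)
    return kept

rows = [
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t10\tVQSRTrancheSNP99.90to100.00\t.",
]
print(keep_unfiltered(rows))
```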

• Nikhil Joshi

Hi,

I have a VCF file of interest and would like to append QUAL scores, only from the corresponding genotypes in other VCF files, into a single VCF output, so that the corresponding QUAL scores are displayed as separate QUAL columns, one per sample, next to each other.

Is there a way to accomplish such a task via GATK?

Thanks!

Sagi

Hello

I am working with canine genomes and downloaded the reference file (ftp://ftp.ensembl.org/pub/release-73/fasta/canis_familiaris/dna/Canis_familiaris.CanFam3.1.73.dna.toplevel.fa.gz) and variation VCF file (ftp://ftp.ensembl.org/pub/release-73/variation/vcf/canis_familiaris/Canis_familiaris.vcf.gz) from Ensembl. I was able to index the reference files and perform the alignment and MarkDuplicates steps of the pipeline for the canine genomes.

I extracted variants that were marked as either deletions or insertions from the variation VCF file, using grep -E "^#|deletion|insertion" Canis_familiaris.vcf > Canis_familiaris.indels.vcf, to use in the Indel Realigner steps. The header and first records of the file are as follows:

##fileformat=VCFv4.1
##fileDate=20130826
##source=ensembl;version=73;url=http://e73.ensembl.org/canis_familiaris
##reference=ftp://ftp.ensembl.org/pub/release-73/fasta/canis_familiaris/dna/
##INFO=<ID=TSA,Number=0,Type=String,Description="Type of sequence alteration. Child of term sequence_alteration as defined by the sequence ontology project.">
##INFO=<ID=E_MO,Number=0,Type=Flag,Description="Multiple_observations.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_ESP,Number=0,Type=Flag,Description="Exome_Sequencing_Project.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_1000G,Number=0,Type=Flag,Description="1000Genomes.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_HM,Number=0,Type=Flag,Description="HapMap.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_Freq,Number=0,Type=Flag,Description="Frequency.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_C,Number=0,Type=Flag,Description="Cited.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=dbSNP_131,Number=0,Type=Flag,Description="Variants (including SNPs and indels) imported from dbSNP [remapped from build CanFam2.0]">
1 8252 rs8962840 C CT . . dbSNP_131;TSA=insertion
1 9702 rs8457244 CA C . . dbSNP_131;TSA=deletion
1 18289 rs8444620 GT G . . dbSNP_131;TSA=deletion
1 36381 rs8471229 GT G . . dbSNP_131;TSA=deletion
1 52452 rs8469855 AT A . . dbSNP_131;TSA=deletion
1 55767 rs8285977 G GA . . dbSNP_131;TSA=insertion
1 60065 rs8723538 TC T . . dbSNP_131;TSA=deletion
1 62067 rs8395051 TA T . . dbSNP_131;TSA=deletion

However, when I run the first step of Indel Realignment: -T RealignerTargetCreator -nt 16 -R canFam3.1_bwa075/Canis_familiaris.CanFam3.1.73.dna.toplevel.fa -I AF23.rg.prmdup.bam -l INFO -S SILENT -o AF23.indel.intervals --known canFam3.1_bwa075/Canis_familiaris.indels.vcf

I get the error message:

##### ERROR A USER ERROR has occurred (version 2.7-2-g6bda569):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file

I guess this is because the VCF file is missing the standard column header line (#CHROM POS ID ...). I was wondering if there is any way I could add the header line to the VCF file so that it can be used in GATK? Would it also be possible to use the original VCF with all variations in the Indel Realignment steps, or would you recommend I subset it to insertions/deletions as I did in this case?
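On the first question, adding the missing column header line is mechanical; a minimal sketch (the helper name is made up, and it assumes a sites-only VCF with at least one ## meta line):

```python
# Sketch: insert the mandatory #CHROM column header after the last
# "##" meta line of a sites-only VCF, if it is absent.
def add_column_header(vcf_lines, samples=()):
    if any(l.startswith("#CHROM") for l in vcf_lines):
        return list(vcf_lines)       # header already present
    cols = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
    if samples:
        cols += ["FORMAT"] + list(samples)
    last_meta = max(i for i, l in enumerate(vcf_lines) if l.startswith("##"))
    return (vcf_lines[:last_meta + 1]
            + ["\t".join(cols)]
            + vcf_lines[last_meta + 1:])

lines = ["##fileformat=VCFv4.1",
         "1\t8252\trs8962840\tC\tCT\t.\t.\tdbSNP_131;TSA=insertion"]
fixed = add_column_header(lines)
print(fixed[1])
```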

Thank you very much for your help and suggestions!!!

Yours sincerely Shanker Swaminathan

Hey guys,

I'm struggling with some statistics given in the VCF file: the rank sum tests. I started googling around, but that turned out not to be helpful for understanding them (in my case). I really have no idea how to interpret the VCF statistic values coming from a rank sum test. I have no clue whether a negative, positive, or near-zero value is good or bad, so I'm asking for some help here. Maybe someone knows a good tutorial page or can give me a hint to better understand the values of MQRankSum, ReadPosRankSum and BaseQRankSum. I have the same problem with the FisherStrand statistic. Many, many thanks in advance.

Hi. I have run UnifiedGenotyper and the output VCF file contains the following:

When the genotype is called you get: GT:AD:DP:GQ:PL 1/1:0,1:1:3:37,3,0 (i.e. five colon-separated fields as expected)

When the genotype is not called you get: GT:AD:DP:GQ:PL ./.:.:1 (i.e. only three colon-separated fields)

Version: (version 2.7-1-g42d771f)

Is this a bug or is this expected?
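As far as I can tell from the VCF specification, trailing genotype fields may be dropped for a sample (GT must come first), so ./.:.:1 carries just GT, AD and DP and is legal. A zip-based parse handles both forms; this is an illustrative sketch, not GATK code:

```python
# Pairing FORMAT keys with sample values; zip() stops at the shorter
# list, so records with dropped trailing fields parse cleanly.
def sample_fields(format_col, sample_col):
    return dict(zip(format_col.split(":"), sample_col.split(":")))

print(sample_fields("GT:AD:DP:GQ:PL", "1/1:0,1:1:3:37,3,0"))
print(sample_fields("GT:AD:DP:GQ:PL", "./.:.:1"))   # only GT, AD, DP present
```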

I am trying to liftover from NIST b37 to hg19. I have all the files I need and I can kick off the liftover just fine, but I keep running into problems because the NIST vcf has tags in the variant line INFO field that are not in the header. ##### ERROR MESSAGE: Key PLHSWG found in VariantContext field INFO at chr1:52238 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default

I identified about 90 tags that are not properly documented in the header. Is there a way to ignore all of these INFO header lapses?
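One way to enumerate the offending tags before deciding how to patch the header; the helper below is a hypothetical sketch, not a GATK feature:

```python
# Sketch: find INFO keys used in records but never declared in the
# header, so matching ##INFO lines can be added before the liftover.
def missing_info_keys(vcf_lines):
    declared, used = set(), set()
    for line in vcf_lines:
        if line.startswith("##INFO=<ID="):
            declared.add(line[len("##INFO=<ID="):].split(",", 1)[0])
        elif not line.startswith("#"):
            for field in line.split("\t")[7].split(";"):
                used.add(field.split("=", 1)[0])
    return sorted(used - declared)

lines = [
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t52238\t.\tA\tG\t50\tPASS\tDP=10;PLHSWG=1",
]
print(missing_info_keys(lines))   # ['PLHSWG']
```

The resulting keys could then be declared with generic lines such as ##INFO=<ID=PLHSWG,Number=.,Type=String,Description="undocumented">; the Number and Type there are guesses, so check the data source where possible.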

Hi,

I know that this question should not be posted to the GATK forum, because the error message told me "Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself." However, I really can't find at which step I made this error, although I have read much of the GATK forum documentation about it. Could you please give me some suggestions? Many thanks!!

In my VCF file, I find that not all the SNP records have the same set of annotations, and some annotations can't be found in some records, like this:

chr1    5036777 rs898335        T       G       36.74   .       AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=18.37  GT:AD:DP:GQ:PL  1/1:0,2:2:6:64,6,0
chr1    9507566 rs12742542      T       C       33.74   .       AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=16.87   GT:AD:DP:GQ:PL  1/1:0,2:2:6:61,6,0
chr1    9507621 rs12755964      G       A       37.74   .       AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=18.87   GT:AD:DP:GQ:PL  1/1:0,2:2:6:65,6,0
chr1    22376947        rs2473327       A       G       40.74   .       AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=20.37   GT:AD:DP:GQ:PL  1/1:0,2:2:6:68,6,0
chr1    38061706        rs10908362      G       C       32.74   .       AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=16.37   GT:AD:DP:GQ:PL  1/1:0,2:2:6:60,6,0
chr1    78317717        rs10782656      G       C       36.74   .       AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=37.00;MQ0=0;QD=18.37   GT:AD:DP:GQ:PL  1/1:0,2:2:6:64,6,0
chr1    111457142       rs1282019       A       G       35.74   .       AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=30.81;MQ0=0;QD=17.87   GT:AD:DP:GQ:PL  1/1:0,2:2:6:63,6,0
chr1    121484153       rs9701684       C       G       32.74   .       AC=2;AF=1.00;AN=2;DB;DP=5;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=19.48;MQ0=3;QD=6.55    GT:AD:DP:GQ:PL  1/1:2,3:5:6:60,6,0
chr1    121484423       rs7368003       T       C       83.03   .       AC=2;AF=1.00;AN=2;DB;DP=5;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=28.24;MQ0=1;QD=16.61   GT:AD:DP:GQ:PL  1/1:0,5:5:12:111,12,0
chr1    121484503       rs4898086       T       A       311.77  .       AC=2;AF=1.00;AN=2;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=33.63;MQ0=1;QD=25.98  GT:AD:DP:GQ:PL  1/1:0,12:12:33:340,33,0
chr1    121484591       rs4898109       T       A       463.77  .       AC=2;AF=1.00;AN=2;DB;DP=20;Dels=0.00;FS=0.000;HaplotypeScore=0.9947;MLEAC=2;MLEAF=1.00;MQ=30.44;MQ0=1;QD=23.19  GT:AD:DP:GQ:PL  1/1:0,20:20:54:492,54,0


e.g. MQRankSum can be found only in the last 4 records.

and my Command Line:

java -Xmx15g -jar /ifs1/ST_POP/USER/lantianming/HUM/bin/GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar
-glm BOTH
-l INFO
-R /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/ucsc.hg19.fasta
-T UnifiedGenotyper
-I /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/recal_03/test.realn_8.recal.bam
-D /nas/RD_09C/resequencing/soft/pipeline/GATK/bundle/2.5/hg19/dbsnp_137.hg19.vcf
-o /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/callsnp/realn_8.recal.bam.vcf
-metrics /ifs1/ST_POP/USER/lantianming/HUM/align/bwa/callsnp/snpcall.metrics


What can I do to solve this problem?

Thanks a lot !

I don't know if this question has been asked, if so sorry.

When using UnifiedGenotyper, I was wondering if it is possible (via a hidden command option, etc.) to use a VCF file as a prior. Currently I have just been adding additional BAM files, but it would be nice (and quicker) if I could use an indexed file.

Shawn.

Hello dear GATK Team,

it seems that the ignoreFilter argument in VariantRecalibrator does not work. I want to include variants with the LowQual filter in the calculation, but can't find the right way to do it. I tried all of these:

-ignoreFilter LowQual
-ignoreFilter [LowQual]
-ignoreFilter "LowQual"
-ignoreFilter "Low Quality"
-ignoreFilter Low Quality
-ignoreFilter [Low Quality]


and also with --ignore_filter instead of -ignoreFilter.

#

Found 2 solutions:

1) Remove the LowQual filter with VariantFiltration and "--invalidatePreviousFilters"

2) "-ignoreFilter LowQual" has to be applied to ApplyRecalibration also...

Strelka produces VCF files that GATK has issues with. The files pass vcftools validation, which according to the docs is the official validation, but they do not pass ValidateVariants. VariantEval can't read them either. I'm unsure where the bug lives.

vcf file looks like this:

##fileformat=VCFv4.1
##fileDate=20130801
##source=strelka
##source_version=2.0.8
##startTime=Thu Aug  1 15:23:54 2013
##reference=file:///xchip/cga_home/louisb/reference/human_g1k_v37_decoy.fasta
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##contig=<ID=NC_007605,length=171823>
##contig=<ID=hs37d5,length=35477943>
##content=strelka somatic indel calls
##germlineIndelTheta=0.0001
##priorSomaticIndelRate=1e-06
##INFO=<ID=QSI,Number=1,Type=Integer,Description="Quality score for any somatic variant, ie. for the ALT haplotype to be present at a significantly different frequency in the tumor and normal">
##INFO=<ID=TQSI,Number=1,Type=Integer,Description="Data tier used to compute QSI">
##INFO=<ID=NT,Number=1,Type=String,Description="Genotype of the normal in all data tiers, as used to classify somatic variants. One of {ref,het,hom,conflict}.">
##INFO=<ID=QSI_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
##INFO=<ID=TQSI_NT,Number=1,Type=Integer,Description="Data tier used to compute QSI_NT">
##INFO=<ID=SGT,Number=1,Type=String,Description="Most likely somatic genotype excluding normal noise states">
##INFO=<ID=RU,Number=1,Type=String,Description="Smallest repeating sequence unit in inserted or deleted sequence">
##INFO=<ID=RC,Number=1,Type=Integer,Description="Number of times RU repeats in the reference allele">
##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of times RU repeats in the indel allele">
##INFO=<ID=IHP,Number=1,Type=Integer,Description="Largest reference interupted homopolymer length intersecting with the indel">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
##INFO=<ID=OVERLAP,Number=0,Type=Flag,Description="Somatic indel possibly overlaps a second indel.">
##FORMAT=<ID=TAR,Number=2,Type=Integer,Description="Reads strongly supporting alternate allele for tiers 1,2">
##FORMAT=<ID=TIR,Number=2,Type=Integer,Description="Reads strongly supporting indel allele for tiers 1,2">
##FORMAT=<ID=TOR,Number=2,Type=Integer,Description="Other reads (weak support or insufficient indel breakpoint overlap) for tiers 1,2">
##FORMAT=<ID=DP50,Number=1,Type=Float,Description="Average tier1 read depth within 50 bases">
##FORMAT=<ID=FDP50,Number=1,Type=Float,Description="Average tier1 number of basecalls filtered from original read depth within 50 bases">
##FORMAT=<ID=SUBDP50,Number=1,Type=Float,Description="Average number of reads below tier1 mapping quality threshold aligned across sites within 50 bases">
##FILTER=<ID=Repeat,Description="Sequence repeat of more than 8x in the reference sequence">
##FILTER=<ID=iHpol,Description="Indel overlaps an interupted homopolymer longer than 14x in the reference sequence">
##FILTER=<ID=BCNoise,Description="Average fraction of filtered basecalls within 50 bases of the indel exceeds 0.3">
##FILTER=<ID=QSI_ref,Description="Normal sample is not homozygous ref or sindel Q-score < 30, ie calls with NT!=ref or QSI_NT < 30">
##cmdline=/xchip/cga_home/louisb/Strelka/strelka_workflow_1.0.7/libexec/consolidateResults.pl --config=/xchip/cga/benchmark/testing/full-run/somatic-benchmark/spiked/Strelka_NDEFGHI_T12345678_0.8/config/run.config.ini
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
1   797126  .   GTAAT   G   .   PASS    IC=1;IHP=2;NT=ref;QSI=56;QSI_NT=56;RC=2;RU=TAAT;SGT=ref->het;SOMATIC;TQSI=1;TQSI_NT=1   DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50   47:47:48,49:0,0:3,3:48.72:0.00:0.00 62:62:36,39:17,19:9,9:42.49:0.21:0.00


The output I get from ValidateVariants is:

java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T ValidateVariants --variant strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:19:45,289 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:19:45,291 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15
INFO  17:19:45,291 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  17:19:45,291 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:19:45,295 HelpFormatter - Program Args: -T ValidateVariants --variant strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:19:45,295 HelpFormatter - Date/Time: 2013/08/28 17:19:45
INFO  17:19:45,295 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:19:45,295 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:19:45,300 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF
INFO  17:19:45,412 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:19:45,513 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:19:45,533 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf
INFO  17:19:45,615 GenomeAnalysisEngine - Preparing for traversal
INFO  17:19:45,627 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:19:45,627 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:19:45,627 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  17:19:46,216 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-1-g42d771f):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: File /Users/louisb/Workspace/strelkaVcfDebug/strelka1.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 1:797126 are not observed at all in the sample genotypes
##### ERROR ------------------------------------------------------------------------------------------

output from VariantEval is:

java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T VariantEval --eval strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:15:44,333 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,335 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15
INFO  17:15:44,335 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:15:44,339 HelpFormatter - Program Args: -T VariantEval --eval strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:15:44,339 HelpFormatter - Date/Time: 2013/08/28 17:15:44
INFO  17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,349 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF
INFO  17:15:44,476 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:15:44,603 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:15:44,623 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf
INFO  17:15:44,710 GenomeAnalysisEngine - Preparing for traversal
INFO  17:15:44,722 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:15:44,722 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:15:44,723 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  17:15:44,831 VariantEval - Creating 3 combinatorial stratification states
INFO  17:15:45,382 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-1-g42d771f):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR
##### ERROR MESSAGE: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
##### ERROR ------------------------------------------------------------------------------------------


As I said in my last post about splitting my 11 samples out of the recalibrated VCF file, I now have a different question, which is how to set up criteria to select variants from this 11-sample combined VCF. My criteria would be DP >= 20 and number of ALT reads >= 10. I know that AD lists the read counts for both the REF and ALT alleles, but I was wondering if there is any way to select on the number of ALT reads together with DP >= 20?

Should I use the "-T SelectVariants" or "-T VariantFiltration"? I am using GATK 2.5 on a remote Mac OS X server by the way.
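The intended criteria, sketched in Python for clarity (a hypothetical helper; in GATK itself this would be expressed as a JEXL filter expression): a sample passes when DP >= 20 and the ALT-supporting read count (the second AD value) is >= 10.

```python
# Sketch of the intended per-sample filter (helper name made up).
def passes_depth_filter(format_col, sample_col, min_dp=20, min_alt=10):
    fields = dict(zip(format_col.split(":"), sample_col.split(":")))
    try:
        dp = int(fields["DP"])
        alt_reads = int(fields["AD"].split(",")[1])
    except (KeyError, ValueError, IndexError):
        return False                  # uncalled or missing annotations
    return dp >= min_dp and alt_reads >= min_alt

print(passes_depth_filter("GT:AD:DP:GQ:PL", "0/1:15,25:40:99:500,0,400"))  # True
print(passes_depth_filter("GT:AD:DP:GQ:PL", "0/1:15,5:40:99:500,0,400"))   # False
```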

I finally got the filtered VCF file from the BWA + Picard + GATK pipeline, and have 11 exome-seq data files which were processed as a list of inputs to GATK. In the process of getting the VCF, I did not see an option for separating the 11 samples. Now I have two VCF files (one for SNPs and the other for indels) that each contain 11 samples. My question is how to proceed from here?

Should I separate the 11 samples before annotation, or annotate first and then split the 11 samples into individual files? The big question here is how to split the samples out of the VCF files. Thanks.
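The splitting step itself is mechanical; a minimal illustration (not a GATK tool) of carving one VCF per sample out of a multi-sample VCF held in memory:

```python
# Sketch: split a multi-sample VCF (as a list of lines) into one VCF
# per sample, keeping the first 9 columns plus that sample's column.
def split_samples(vcf_lines):
    meta, header, records = [], None, []
    for line in vcf_lines:
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#CHROM"):
            header = line.split("\t")
        else:
            records.append(line.split("\t"))
    per_sample = {}
    for i, sample in enumerate(header[9:], start=9):
        lines = meta + ["\t".join(header[:9] + [sample])]
        lines += ["\t".join(rec[:9] + [rec[i]]) for rec in records]
        per_sample[sample] = lines
    return per_sample

vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1",
]
parts = split_samples(vcf)
print(parts["S2"][2])
```

Note that a naive column split like this keeps site-level INFO values (AC, AF, AN, ...) computed across all samples; tools that re-derive those per subset are preferable when that matters.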

Hi,

Could you tell me how to encourage GATK to annotate my genotype columns (i.e. add annotations to the FORMAT and PANC_R columns in the following file):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PANC_R
chrX 259221 . GA G 136.74 . AC=2;AF=1.00;AN=2;DP=15;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=8.82;MQ0=1;QD=3.04 GT:AD:GQ:PL 1/1:0,2:6:164,6,0


The file was generated with HaplotypeCaller. I used a command line similar to this one to no effect:

java -jar $GATKROOT/GenomeAnalysisTK.jar -T VariantAnnotator -R hg19_random.fa -I chr7_recalibrated.bam -V chr7.vcf --dbsnp dbSNP135_chr.vcf -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -o chr7_annotated-again.vcf


Does anyone have any suggestions? Thanks in advance!

I've noticed on some occasions that the .vcf.idx file created alongside the VCF is older than the VCF itself (not by much, a second or so). I've seen this happening in small (highly scattered) jobs where there are many (~13K) samples. Perhaps the last VCF line is so long that it takes longer to flush the VCF buffer than it takes to write the index... I don't know. At any rate, the result is a stale VCF index, which slows subsequent operations down (as the index needs to be rebuilt), and since further operations may be performed by a less-powerful user, the GATK may be forced to use an in-memory index.

an example (for broadies) can be seen here: /seq/dax/t2d_genes/v3/scatter/temp_0001_of_2000/t2d_genes.unfiltered.vcf* (not for long!)

Perhaps a small delay or check could be introduced to verify that the VCF file is really closed before closing the index file.

Thanks.

Hi team,

This is two separate questions:

1. Starting with a VCF file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverage should be excluded, how does one decide what counts as very high and very low? e.g. 5% on either side?

2. I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script, but I'm trying to do it earlier on in the GATK process. I've tried -L 2L -L 2R -L 3L ...etc., -L 2L 2R 3L ...etc. and -L 2L, 2R, 3R ...etc., but the result is either an input error message or chromosome 2L only.
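On question 1, one common approach is to derive the cutoffs empirically from the observed DP distribution, e.g. percentile-based thresholds. A sketch (the 5%/95% defaults are purely illustrative, not a recommendation):

```python
# Illustrative only: nearest-rank percentile cutoffs for site depth.
def dp_cutoffs(depths, lower_pct=5, upper_pct=95):
    xs = sorted(depths)
    def pct(p):
        k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
        return xs[k]
    return pct(lower_pct), pct(upper_pct)

print(dp_cutoffs([12, 30, 31, 33, 35, 40, 95]))
```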

Many thanks and apologies if I've missed anything in the instructions.

Cheers,

Blue

I have twice run UnifiedGenotyper and the resulting .vcf file contains only part of chromosome 20. I do not see what I am doing wrong. Neither do the other two people in the lab who have extensive experience with GATK.

Hi all,

I have been looking for documentation of the INFO column in the VCF of the Mills indels in the GATK resource bundle (Mills_and_1000G_gold_standard.indels.b37.sites.vcf.gz), but to no avail. I made a unique list of all the possible entries for the INFO field, but I have no idea what they mean. Does anybody know? Thank you!!

set=Intersect1000GAll

set=Intersect1000GAll-MillsDoubleCenter

set=Intersect1000GAll-MillsDoubleCenter-MillsTracesUnknown

set=Intersect1000GAll-MillsTracesUnknown

set=Intersect1000GMinusBI

set=Intersect1000GMinusDI

set=Intersect1000GMinusDI-MillsDoubleCenter

set=Intersect1000GMinusOX

set=Intersect1000GMinusOX-MillsCenterUnknown-MillsTracesUnknown

set=Intersect1000GMinusOX-MillsDoubleCenter

set=Intersect1000GMinusOX-MillsDoubleCenter-MillsTracesUnknown

set=Intersect1000GMinusOX-MillsTracesUnknown

set=Intersect1000GMinusSI

set=Intersect1000GMinusSI-MillsDoubleCenter

set=Intersect1000GMinusSI-MillsDoubleCenter-MillsTracesUnknown

set=Intersect1000GMinusSI-MillsTracesUnknown

set=MillsAlleleMatch1000G

set=MillsAlleleMatch1000G-MillsDoubleCenter

set=MillsAlleleMatch1000G-MillsDoubleCenter-MillsTracesUnknown

set=MillsAlleleMatch1000G-MillsTracesUnknown

set=MillsCenterUnknown

set=MillsCenterUnknown-Intersect1000GAll-MillsTracesUnknown

set=MillsCenterUnknown-Intersect1000GMinusBI-MillsTracesUnknown

set=MillsCenterUnknown-Intersect1000GMinusDI-MillsTracesUnknown

set=MillsCenterUnknown-Intersect1000GMinusSI

set=MillsCenterUnknown-Intersect1000GMinusSI-MillsTracesUnknown

set=MillsCenterUnknown-MillsAlleleMatch1000G

set=MillsCenterUnknown-MillsAlleleMatch1000G-MillsTracesUnknown

set=MillsCenterUnknown-MillsTraces3Plus

set=MillsCenterUnknown-MillsTraces3Plus-Intersect1000GMinusSI

set=MillsCenterUnknown-MillsTraces3Plus-MillsAlleleMatch1000G

set=MillsCenterUnknown-MillsTracesUnknown

set=MillsDoubleCenter

set=MillsDoubleCenter-MillsTracesUnknown

set=MillsTraces3Plus

set=MillsTraces3Plus-Intersect1000GMinusSI

set=MillsTraces3Plus-MillsAlleleMatch1000G

set=MillsTracesUnknown

Hello all,

We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.

We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:

scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.

In this case, individual 3 has a homozygous reference genotype despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous. Some of the bases used to call the G allele are of low quality (when looking more closely in IGV), but I would still expect the genotype to be heterozygous.

A reverse example:

scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.

Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.
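For context on how these genotypes arise: the genotype reported in each sample column is the one with the smallest (best) PL value, and GQ is the gap between the best and second-best PL, capped at 99. The calls in both records above are consistent with that rule, as this minimal sketch (not GATK source code) shows:

```python
def call_genotype(pl):
    """Pick the genotype with the smallest phred-scaled likelihood.

    pl: PL values for 0/0, 0/1, 1/1 at a biallelic site.
    Returns (genotype, GQ), where GQ is the gap to the second-best PL,
    capped at 99 as in GATK output.
    """
    genotypes = ["0/0", "0/1", "1/1"]
    order = sorted(range(3), key=lambda i: pl[i])
    gt = genotypes[order[0]]
    gq = min(pl[order[1]] - pl[order[0]], 99)
    return gt, gq

# Individual 1 in the first record: PL = 79,0,588 -> het with GQ 79
print(call_genotype([79, 0, 588]))   # ('0/1', 79)
# Individual 2 in the second record: PL = 13,0,152 -> weakly supported het
print(call_genotype([13, 0, 152]))   # ('0/1', 13)
```

Note that in the second record the het call wins by only 13 phred units, which matches the reported GQ of 13: the call is made, but with low confidence.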

I've also uploaded screenshots from IGV, which I hope will help clarify the problems we're having. We have used default GATK parameters in almost all cases, and we did not use VQSR, as we did not have a list of high-confidence SNPs at the time.

We are running tests trying to get UG to produce one VCF per sample when inputting BAMs from multiple subjects. Our situation is complicated slightly by the fact that each sample has 3 BAMs. When we input all 6 BAMs into UG, hoping to output 2 VCFs (one per sample), we instead get a single VCF. We found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.

Details include: 1) we are inputting 6 BAMs for our test, 3 per sample for 2 samples; 2) the BAMs were generated using Bioscope from targeted-capture reads sequenced on a SOLiD 4; 3) as recommended in the post above, we checked the @RG lines in the BAM headers using Samtools -- the lines for the 6 BAMs are as follows:

sample 1:

@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1

@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1

@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1

sample 2:

@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1

@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1

@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1

Based on the former post, I would have expected each of these BAMs to generate a separate VCF, as it appears the IDs are all different (which would not have been desirable either, as we are hoping to generate 2 VCFs in this test). Thus, it is not clear whether or how we should use the Picard tool AddOrReplaceReadGroups to modify the @RG headers.

Does that make sense? Any advice?
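For reference, GATK groups reads into samples by the SM tag of the @RG line (not by the ID field), and Picard's AddOrReplaceReadGroups rewrites the whole @RG line at once. A hedged sketch of such an invocation, where the file names and tag values are placeholders rather than taken from the post:

```
java -jar AddOrReplaceReadGroups.jar \
    INPUT=sample2_run1.bam \
    OUTPUT=sample2_run1.rg.bam \
    RGID=20130611021848236 RGPL=SOLiD RGPU=bioscope-pairing \
    RGLB=75x35RR RGSM=S2
```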

I am just starting with GATK, but even though I have looked and looked, I cannot find a simple walkthrough for taking many VCFs and running a range of contrasts based on sample data. I would guess this must be one of the most commonly used workflows. I have 200 VCFs that I want to contrast in many different ways depending on sample phenotype. Can anyone direct me to a good guide on how to do that? Thanks

Hello, here is part of a VCF generated by the GATK UnifiedGenotyper:

The first variant is filtered by the "LowQD" filter (QD = 1.22 < 1.5), whereas it seems to be a good call. Indeed, QD = QUAL/DP = 32729.73/6710 = 4.87.

When I calculate QD for other variants in the same file, the result is correct.

So why does GATK report a wrong QD value? (I ran the UnifiedGenotyper twice, and the results were the same.)

Hi,

I have a VCF containing multiple samples. I would like to provide the BAM files as input to the VariantAnnotator as well, but how does the VariantAnnotator know which BAM corresponds to which sample column in the VCF? Does the order of the BAM file arguments need to correspond to the order of the sample columns in the VCF?

Thanks,

Robin

Hi,

I'm having problems understanding a GATK output VCF. I have read the VCF standard, but I'm obviously missing something.

I /think/ I understand how SNPs and short indels are represented, but clearly I do not. Below is an excerpt that illustrates sites which I do not understand. I suspect it may be something to do with GATK quality filters that I'm not understanding, or something about using EMIT_ALL_SITES...

The excerpt below was generated using

GATK -l INFO -I my.bam -R my.fa -T UnifiedGenotyper -S LENIENT -nt 8 --heterozygosity 0.1 -o test.vcf --genotype_likelihoods_model BOTH --min_base_quality_score 10 --output_mode EMIT_ALL_SITES -ploidy 2

Thanks!

Darren

    CH1 225 .   T   G   12.71   LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.978;DP=59;Dels=0.03;FS=0.000;HaplotypeScore=10.2840;MLEAC=1;MLEAF=0.500;MQ=70.25;MQ0=8;MQRankSum=-5.349;QD=0.22;ReadPosRankSum=-3.188 GT:AD:DP:GQ:PL  0/1:41,16:55:20:20,0,1435
CH1 226 .   T   .   121.53  .   AN=2;DP=59;MQ=70.25;MQ0=8   GT:DP   0/0:43
CH1 227 .   A   .   121.53  .   AN=2;DP=59;MQ=70.25;MQ0=8   GT:DP   0/0:43
CH1 228 .   T   .   121.53  .   AN=2;DP=59;MQ=70.25;MQ0=8   GT:DP   0/0:43
CH1 229 .   A   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:38
CH1 230 .   C   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:38
CH1 231 .   T   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:36
CH1 232 .   G   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:36
CH1 233 .   C   .   115.53  .   AN=2;DP=57;MQ=69.66;MQ0=8   GT:DP   0/0:37
CH1 234 .   A   .   139.53  .   AN=2;DP=70;MQ=59.20;MQ0=14  GT:DP   0/0:63
CH1 235 .   A   .   175.53  .   AN=2;DP=84;MQ=51.67;MQ0=15  GT:DP   0/0:79
CH1 236 .   A   .   175.53  .   AN=2;DP=84;MQ=51.67;MQ0=15  GT:DP   0/0:79
CH1 237 .   T   .   175.53  .   AN=2;DP=85;MQ=51.37;MQ0=16  GT:DP   0/0:80
CH1 238 .   A   .   175.53  .   AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP   0/0:97
CH1 239 .   A   .   175.53  .   AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP   0/0:101
CH1 240 .   T   .   175.53  .   AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP   0/0:98
CH1 241 .   A   .   169.53  .   AN=2;DP=108;MQ=44.14;MQ0=29 GT:DP   0/0:107
*    CH1    242 .   T   .   169.53  .   AN=2;DP=109;MQ=43.94;MQ0=29 GT:DP   0/0:103
*    CH1    242 .   T   .   118.27  .   AN=2;DP=109;MQ=43.94;MQ0=29 GT:AD:DP    0/0:27:55
*    CH1    243 .   C   .   172.53  .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:108
*    CH1    243 .   CTTTT   .   118.27  .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:AD:DP    0/0:27:56
CH1 244 .   T   .   91.53   .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:61
CH1 245 .   T   .   91.53   .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:53
CH1 246 .   T   .   73.53   .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:41
CH1 247 .   T   .   91.53   .   AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP   0/0:46
CH1 248 .   A   .   172.53  .   AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP   0/0:100
CH1 249 .   A   .   172.53  .   AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP   0/0:100
CH1 250 .   T   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:101
CH1 251 .   T   .   169.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:96
CH1 251 .   T   .   118.27  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:AD:DP    0/0:27:56
CH1 252 .   C   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:113
CH1 253 .   C   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:110
CH1 254 .   T   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:111
CH1 255 .   T   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:111
CH1 256 .   T   .   172.53  .   AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP   0/0:111


Line 1 is a SNP
Lines 14 and 15 are an indel that I do understand
Lines 19 and 20 I do /not/ understand
Lines 21 and 22 I do /not/ understand

Hi all,

I'm currently trying to extract de novo mutations from my multi-sample VCF files (trios). I've already read the VCF file specification but wanted to check whether I got this right. I would call a de novo mutation candidate in the following cases:

1.Child has the genotype 0|1 , 1|0 or 1|1 and both parents have 0|0

2.Child has the genotype 1|0 or 1|1 and the mother 0|0

3.Child: 0|1 or 1|1 and the father 0|0

Is this correct ? And are there any other cases which indicate a de novo mutation which I missed so far ?
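Ignoring phase, the cases above all reduce to one rule: the child carries an ALT allele found in neither parent. A minimal sketch of that rule (a deliberate simplification: it discards phasing, so it cannot express the phase-specific cases 2 and 3, and it does not model genotyping error):

```python
def is_denovo_candidate(child_gt, mother_gt, father_gt):
    """Flag a trio genotype combination as a possible de novo mutation.

    Genotypes are VCF-style strings, e.g. "0|1" or "0/1".
    Rule (unphased): the child has a non-reference allele that
    appears in neither parent's genotype.
    """
    alleles = lambda gt: set(gt.replace("|", "/").split("/"))
    child = alleles(child_gt)
    inherited = alleles(mother_gt) | alleles(father_gt)
    return any(a != "0" and a not in inherited for a in child)

print(is_denovo_candidate("0|1", "0|0", "0|0"))  # True  (case 1)
print(is_denovo_candidate("1|1", "0|1", "0|1"))  # False (allele present in both parents)
```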

Thanks !

Hi,

The quality score (phred score) is defined as below (i.e. a 1% error rate is equal to a phred score of 20: -10 x log10(0.01)).

QUAL phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong). If ALT is "." (no variant) then this is -10log_10 p(variant), and if ALT is not "." this is -10log_10 p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric)
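The definition quoted above can be sketched numerically (variable names here are illustrative, not from any GATK API):

```python
import math

def phred_to_prob(q):
    """Convert a phred-scaled score back to an error probability."""
    return 10 ** (-q / 10.0)

def prob_to_phred(p):
    """Convert an error probability to a phred-scaled score."""
    return -10 * math.log10(p)

print(prob_to_phred(0.01))  # 20.0  (1% error rate)
print(phred_to_prob(50))    # 1e-05
```

On this scale a QUAL of 441,453 simply corresponds to an astronomically small error probability; since the spec allows floating-point values and the score aggregates evidence across many reads and samples, QUAL has no fixed upper bound.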

Using GATK to generate VCF files and looking through the quality column of those files, I found that the maximum quality score is 441,453, which is an extremely large number.

I wonder whether the quality scores from the GATK tools follow the phred score system; if not, how do you calculate the quality score, and what do the numbers represent?

Look forward to hearing back from you soon and thank you very much.

There used to be a webpage describing how to convert PLINK PED format to VCF format, but it seems that link has disappeared.

Thank you very much in advance.
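In case it helps while that page is missing: recent versions of PLINK can perform this conversion directly. A sketch assuming PLINK 1.9 or later, with placeholder file names:

```
plink --file mydata --recode vcf --out mydata
```

This reads mydata.ped/mydata.map and writes mydata.vcf.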

(There was another question about a similar symptom, but the answer does not appear to apply to what I'm seeing.)

I get an empty VCF file that contains only the header lines. The input BAM file is 1.8 GB, and as far as I can tell the content is OK - it has MAPQ scores, the flags seem reasonable, etc. I've attached a copy of the console output and the beginning of the input file in SAM format. Let me know if you have any suggestions. Thanks,

Ravi

This is not a bug per se, in that it does not cause incorrect output, but I think it would accurately be described as an "unintended consequence": very poorly compressed VCF output files.

GATK allows output VCF files to be written using Picard's BlockCompressedOutputStream when the output file is specified with the extension .vcf.gz, which I consider to be very good behavior. However, I noticed after doing some minor external manipulation that the files produced this way are "suboptimally" compressed. By suboptimal, I mean that sometimes the files are even larger than the uncompressed VCF files.

Since the problem occurs in GATK-Lite, I was able to look through the source code to see what is going on. From what I can tell, the issue is that VCFWriter calls mWriter.flush() at the end of VCFWriter.add() for each variant. Per the documentation for BlockCompressedOutputStream.flush():

WARNING: flush() affects the output format, because it causes the current contents of uncompressedBuffer to be compressed and written, even if it isn't full.

As a result, instead of the default blocks of about 64k, the bgzf-formatted .vcf.gz files produced by GATK have a block for each line. That reduces the amount of repetition for gzip to take advantage of. Not being sure what issues led to requiring a call to flush() after every variant, I'm not sure how best to address this, but it may be necessary to wrap BlockCompressedOutputStream when used by VCFWriter to intercept this flush in order to get effective compression.

Of course, it is possible to simply write the file and then compress it in a separate step, but this leads to disk IO that should be preventable.
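The effect described above can be reproduced in miniature with zlib: forcing a full flush after every record (the analogue of VCFWriter flushing the BGZF stream per variant) discards the compression history that the heavily repeated VCF fields would otherwise exploit. A minimal sketch, not GATK or htsjdk code:

```python
import zlib

# Synthetic VCF-like records with heavy field repetition.
records = [
    ("CH1\t%d\t.\tT\t.\t121.53\t.\tAN=2;DP=59;MQ=70.25;MQ0=8\tGT:DP\t0/0:43\n" % pos).encode()
    for pos in range(1000)
]

# One-shot compression: analogous to filling ~64k BGZF blocks before writing.
one_block = zlib.compress(b"".join(records))

# Per-record full flush: analogous to flushing after every variant, which
# emits an undersized block and resets the compressor's match history.
comp = zlib.compressobj()
per_record = b"".join(
    comp.compress(r) + comp.flush(zlib.Z_FULL_FLUSH) for r in records
) + comp.flush(zlib.Z_FINISH)

# Per-record flushing compresses far worse, despite identical content.
print(len(one_block) < len(per_record))
```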

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of QUAL scores that are very high, not a single call has a PASS in the FILTER field - all are ".". If I use SelectVariants to filter the VCF based on high QUAL or DP values, or a combination, the FILTER field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.
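For reference, when VQSR cannot be used, the documented alternative is hard filtering with the VariantFiltration walker, which is the tool that actually writes filter names (or PASS) into the FILTER column; SelectVariants only subsets records and does not annotate FILTER. A sketch using GATK 3-era syntax, where the thresholds are illustrative examples and not recommendations for this particular dataset:

```
java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R reference.fasta -V raw_variants.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    --filterName "hard_filter" \
    -o filtered_variants.vcf
```

Records failing the expression are marked "hard_filter" in the FILTER field; the rest are marked PASS.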

Dear team, I am new to GATK and I am having a hard time simply trying to merge VCF files. I have tried to solve the problem by referring to the guide and to previous posts, but nothing worked. I actually found several discussions about the very same error message I receive, but it seems that no clear answer was provided. Here is the message:

ERROR ------------------------------------------------------------------------------------------

I have tried three different MS-DOS commands to do the task (see below), but the message didn't change:

1. java -jar GenomeAnalysisTK.jar -T CombineVariants -R E:\RessourcesGATK\ucsc.hg19.fasta -V:sample1 E:\TestGATK\sample1.vcf -V:sample2 E:\TestGATK\sample2.vcf -o combined.vcf

2. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions UNIQUIFY

3. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta  -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions PRIORITIZE  -priority foo,bar

I have also tried to use the reference human_g1k_v37.fasta as a resource, but the result was the same. I have removed the # before CHROM in the header line and tested VCFs generated by Samtools and by GATK, but nothing changed. Is this an environment problem? I haven't read anything saying that GATK cannot work under MS-DOS.

Thank you very much for your help. S.

Hello, I may be missing something obvious, but it seems a GATK VCF file does not state whether a given variant is a SNP, an insertion, or a deletion. Did I miss some command when I called variants? I can easily classify the variants by eye or with a script from a given VCF entry, but the entry does not explicitly state the variant type.

Here are deletions:

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0
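The VCF format indeed carries the variant type only implicitly, through the REF/ALT alleles: indels include the preceding reference base, so comparing allele lengths recovers the type. A minimal sketch for biallelic records (the deletion example below is illustrative; only the insertion REF/ALT pair comes from the record above):

```python
def classify(ref, alt):
    """Infer the variant type of a biallelic VCF record from REF and ALT."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNP/complex"

print(classify("G", "GAAAT"))  # insertion (the Ld04 23671 record above)
print(classify("T", "C"))      # SNP
print(classify("CTTT", "C"))   # deletion
```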