Understanding the Unified Genotyper's VCF files
Contents |
Introduction
The GATK's primary (and only well-supported) file format for representing SNP, indel, and structural variation calls is the Variant Call Format (VCF). VCF is a text format that can be a bit verbose but is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.
Unfortunately, understanding the detailed information provided by the GATK, in particular by tools that infer variation from NGS data, such as the Unified genotyper, can be quite challenging. This document should provide users of the GATK with an fairly detailed understanding of the organization and meaning of VCF files generated by the GATK and the Unified genotyper in particular.
An example VCF file
The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from NA12878:
##fileformat=VCFv4.0 ##FILTER=<ID=LowQual,Description="QUAL < 50.0"> ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)"> ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"> ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> ##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions"> ##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"> ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes"> ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads"> ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth"> ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias"> ##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model"> ##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]" #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 chr1 877664 rs3828047 A G 3931.66 PASS AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 chr1 899282 rs28548431 C T 71.77 PASS AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26 chr1 974165 rs9442391 T C 29.84 LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
It seems a bit complex, but the structure of the file is actually quite simple:
[HEADER LINES] #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26 chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
How variation is represented in a VCF
Each line represents one variant (here everything is a SNP, but some could be indels or CNVs) as well as the genotype of our sample, NA12878, at that variant. I've chosen these four variants because they each represent an important aspect in interpreting a VCF file:
- chr1:873762 is a novel T/G polymorphism, found with very high confidence (QUAL = 5231.78).
- chr1:877664 is a known A/G SNP (rs3828047), found with very high confidence (QUAL = 3931.66)
- chr1:899282 is a known C/T SNP (rs28548431), but has a relative low confidence (QUAL = 71.77)
- chr1:974165 is a known T/C SNP but we have so little evidence for this variant in our data that although we write out a record for it (book keeping, really) our statistical evidence is so low that we filter the record out as a bad site "LowQual".
Basically, the first 6 columns of the VCF are very easy to understand, because they have a single, well-understand meaning.
| Field in VCF | Meaning |
|---|---|
| CHROM and POS | The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to the representation of indels in a VCF. |
| ID | The dbSNP rs identifier of the SNP, based on the contig : position of the call and whether a record exists at this site in dbSNP. |
| REF and ALT | The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event). For a truly detailed discussion of how to interpret VCF files, see [1]. |
| QUAL | The Phred scaled probability of Probability that REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. The GATK values can grow very large when lots of NGS data is used to call. |
| FILTER | In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records should be used for analysis but without explicitly saying that any PASS (it's a somewhat meaningless, philosophical difference). |
How genotypes are represented in a VCF
The genotype fields of the VCF are far easier to understand than the variation fields, largely because they are more concrete and we use fewer tags. Let's take a look at the three records, simplified down to just show the key genotype fields:
chr1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 chr1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
| Field | Meaning |
|---|---|
| GT | The genotype of this sample. For a diploid, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (the by far more common case), GT will be either:
In the three examples above, NA12878 is T/G, G/G, and C/T. |
| GQ | The Genotype Quality, as a Phred-scaled confidence at the true genotype is the one provided in GT. In diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood of the NGS sequencing data under the model of that the sample is 0/0, 0/1/, or 1/1. |
| AD and DP | See the online documentation for AD and DP. |
| PL | We provide the AD and DP fields since this is usually what downstream users want. However, the truly sophisticated users will want to directly use the likelihoods of the three genotypes 0/0, 0/1, and 1/1 provide in the PL field. These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the het case, this is L(data given that the true genotype is 0/1). The most likely genotype (the one in GT) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype. Currently only provided when the site is biallelic. |
With that said, let's interpret the genotypes for NA12878 at
chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
At this site, the called genotype is GT = 0/1, which is C/T. The confidence isn't so good (GQ=25.92), largely because there were only a total of 4 reads at this site (DP=4), 1 of which was ref and 3 of which were alt (AD=1,3). The lack of certainty is evident in the PL field, where 0/1 = 0 (the normalized value), whereas there's a serious chance that she's hom-var 1/1 = 26 = 10^(-2.6) = 0.25%. Either way, though, it's clear that she's definitely not homozygous reference here 0/0 = 103 = 10^(-10.3) which is a very small number.
Understanding annotations
Example VCF record annotations
chr1 873762 [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 chr1 877664 [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 chr1 899282 [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148
| Field | Meaning |
|---|---|
| AC,AF,AN | Online documentation |
| DB | If present, then the variant is in dbSNP. |
| DP | Online documentation |
| DS | Were any of the samples downsampled because of too much coverage? |
| Dels | Online documentation |
| MQ and MQ0 | See the online documentation for RMS Mapping Quality and Mapping Quality Zero. |
| BaseQualityRankSumTest and MappingQualityRankSumTest and ReadPosRankSumTest | See the online documentation for the Base Quality, Mapping Quality, and Read Position tests. |
| HRun | Online documentation |
| HaplotypeScore | Online documentation |
| QD | Online documentation |
| VQSLOD | Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model. |
| FS | Online documentation |
| SB | How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls). The exact calculation used is described in the figure below. |