Understanding the Unified Genotyper's VCF files

From GSA
Jump to: navigation, search

Contents

Introduction

The GATK's primary (and only well-supported) file format for representing SNP, indel, and structural variation calls is the Variant Call Format (VCF). VCF is a text format that can be a bit verbose but is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

Unfortunately, understanding the detailed information provided by the GATK, in particular by tools that infer variation from NGS data, such as the Unified genotyper, can be quite challenging. This document should provide users of the GATK with an fairly detailed understanding of the organization and meaning of VCF files generated by the GATK and the Unified genotyper in particular.


An example VCF file

The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from NA12878:

##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM	POS	ID		REF	ALT	QUAL	FILTER	INFO	FORMAT	NA12878
chr1	873762	.		T	G	5231.78	PASS	AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL	0/1:173,141:282:99:255,0,255
chr1	877664	rs3828047	A	G	3931.66	PASS	AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185	GT:AD:DP:GQ:PL	1/1:0,105:94:99:255,255,0
chr1	899282	rs28548431	C	T	71.77	PASS	AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL	0/1:1,3:4:25.92:103,0,26
chr1	974165	rs9442391	T	C	29.84	LowQual	AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL	0/1:14,4:14:60.91:61,0,255

It seems a bit complex, but the structure of the file is actually quite simple:

[HEADER LINES]
#CHROM	POS	ID		REF	ALT	QUAL	FILTER	INFO	      FORMAT	      NA12878
chr1	873762	.		T	G	5231.78	PASS	[ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1	877664	rs3828047	A	G	3931.66	PASS	[ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1	899282	rs28548431	C	T	71.77	PASS	[ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1	974165	rs9442391	T	C	29.84	LowQual	[ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255


How variation is represented in a VCF

Each line represents one variant (here everything is a SNP, but some could be indels or CNVs) as well as the genotype of our sample, NA12878, at that variant. I've chosen these four variants because they each represent an important aspect in interpreting a VCF file:

  • chr1:873762 is a novel T/G polymorphism, found with very high confidence (QUAL = 5231.78).
  • chr1:877664 is a known A/G SNP (rs3828047), found with very high confidence (QUAL = 3931.66)
  • chr1:899282 is a known C/T SNP (rs28548431), but has a relative low confidence (QUAL = 71.77)
  • chr1:974165 is a known T/C SNP but we have so little evidence for this variant in our data that although we write out a record for it (book keeping, really) our statistical evidence is so low that we filter the record out as a bad site "LowQual".

Basically, the first 6 columns of the VCF are very easy to understand, because they have a single, well-understand meaning.

The six basic VCF fields
Field in VCF Meaning
CHROM and POS The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to the representation of indels in a VCF.
ID The dbSNP rs identifier of the SNP, based on the contig : position of the call and whether a record exists at this site in dbSNP.
REF and ALT The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event). For a truly detailed discussion of how to interpret VCF files, see [1].
QUAL The Phred scaled probability of Probability that REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. The GATK values can grow very large when lots of NGS data is used to call.
FILTER In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records should be used for analysis but without explicitly saying that any PASS (it's a somewhat meaningless, philosophical difference).


How genotypes are represented in a VCF

The genotype fields of the VCF are far easier to understand than the variation fields, largely because they are more concrete and we use fewer tags. Let's take a look at the three records, simplified down to just show the key genotype fields:

chr1	873762	.		T	G	[CLIPPED] GT:AD:DP:GQ:PL	0/1:173,141:282:99:255,0,255
chr1	877664	rs3828047	A	G	[CLIPPED] GT:AD:DP:GQ:PL	1/1:0,105:94:99:255,255,0
chr1	899282	rs28548431	C	T	[CLIPPED] GT:AD:DP:GQ:PL	0/1:1,3:4:25.92:103,0,26


The standard GATK VCF genotype fields
Field Meaning
GT The genotype of this sample. For a diploid, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (the by far more common case), GT will be either:
  • 0/0 - the sample is homozygous reference
  • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
  • 1/1 - the sample is homozygous alternate

In the three examples above, NA12878 is T/G, G/G, and C/T.

GQ The Genotype Quality, as a Phred-scaled confidence at the true genotype is the one provided in GT. In diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood of the NGS sequencing data under the model of that the sample is 0/0, 0/1/, or 1/1.
AD and DP See the online documentation for AD and DP.
PL We provide the AD and DP fields since this is usually what downstream users want. However, the truly sophisticated users will want to directly use the likelihoods of the three genotypes 0/0, 0/1, and 1/1 provide in the PL field. These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the het case, this is L(data given that the true genotype is 0/1). The most likely genotype (the one in GT) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype. Currently only provided when the site is biallelic.

With that said, let's interpret the genotypes for NA12878 at

chr1	899282	rs28548431	C	T	[CLIPPED] GT:AD:DP:GQ:PL	0/1:1,3:4:25.92:103,0,26

At this site, the called genotype is GT = 0/1, which is C/T. The confidence isn't so good (GQ=25.92), largely because there were only a total of 4 reads at this site (DP=4), 1 of which was ref and 3 of which were alt (AD=1,3). The lack of certainty is evident in the PL field, where 0/1 = 0 (the normalized value), whereas there's a serious chance that she's hom-var 1/1 = 26 = 10^(-2.6) = 0.25%. Either way, though, it's clear that she's definitely not homozygous reference here 0/0 = 103 = 10^(-10.3) which is a very small number.

Understanding annotations

Example VCF record annotations

chr1	873762	[CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
chr1	877664	[CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
chr1	899282	[CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148
The standard GATK VCF INFO field annotations
Field Meaning
AC,AF,AN Online documentation
DB If present, then the variant is in dbSNP.
DP Online documentation
DS Were any of the samples downsampled because of too much coverage?
Dels Online documentation
MQ and MQ0 See the online documentation for RMS Mapping Quality and Mapping Quality Zero.
BaseQualityRankSumTest and MappingQualityRankSumTest and ReadPosRankSumTest See the online documentation for the Base Quality, Mapping Quality, and Read Position tests.
HRun Online documentation
HaplotypeScore Online documentation
QD Online documentation
VQSLOD Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model.
FS Online documentation
SB How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls). The exact calculation used is described in the figure below.


Strand Bias calculation

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox