GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.
This document explains what that extra information is and how you can use it to empower your variants analyses.
What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with
--output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.
The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.
Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the
BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.
As you can see in the figure above, there are two options you can use with
BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With
GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the
##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the
This is a banded gVCF produced by HaplotypeCaller with the
As you can see in the first line, the basic file format is a valid version 4.1 VCF. See also the
##GVCFBlock lines (after the
##FORMAT lines) which indicate the GQ ranges used for banding, and the definition of the
MIN_DP annotation in the
##fileformat=VCFv4.1 ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> ##FILTER=<ID=LowQual,Description="Low quality"> ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."> ##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive) ##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive) ##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive) ##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive) ##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities"> ##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval"> ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes"> ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"> ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed"> ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed"> ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads"> ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"> ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias"> ##contig=<ID=20,length=63025520,assembly=b37> ##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta
The first thing you'll notice, hopefully, is the
<NON_REF> symbolic allele listed in every record's
ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.
The second thing to look for is the
END tag in the
INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 20 10000000 . T <NON_REF> . . END=10000116 GT:DP:GQ:MIN_DP:PL 0/0:44:99:38:0,89,1385 20 10000117 . C T,<NON_REF> 612.77 . BaseQRankSum=0.000;ClippingRankSum=-0.411;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.39;MQ0=0;MQRankSum=-2.172;ReadPosRankSum=-0.235 GT:AD:DP:GQ:PL:SB 0/1:17,21,0:38:99:641,0,456,691,519,1210:6,11,11,10 20 10000118 . T <NON_REF> . . END=10000210 GT:DP:GQ:MIN_DP:PL 0/0:42:99:38:0,80,1314 20 10000211 . C T,<NON_REF> 638.77 . BaseQRankSum=0.894;ClippingRankSum=-1.927;DP=42;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.89;MQ0=0;MQRankSum=-1.750;ReadPosRankSum=1.549 GT:AD:DP:GQ:PL:SB 0/1:20,22,0:42:99:667,0,566,728,632,1360:9,11,12,10 20 10000212 . A <NON_REF> . . END=10000438 GT:DP:GQ:MIN_DP:PL 0/0:52:99:42:0,99,1403 20 10000439 . T G,<NON_REF> 1737.77 . DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0 20 10000440 . T <NON_REF> . . END=10000597 GT:DP:GQ:MIN_DP:PL 0/0:56:99:49:0,120,1800 20 10000598 . T A,<NON_REF> 1754.77 . DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0 20 10000599 . T <NON_REF> . . END=10000693 GT:DP:GQ:MIN_DP:PL 0/0:51:99:47:0,120,1800 20 10000694 . G A,<NON_REF> 961.77 . BaseQRankSum=0.736;ClippingRankSum=-0.009;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=106.92;MQ0=0;MQRankSum=0.482;ReadPosRankSum=1.537 GT:AD:DP:GQ:PL:SB 0/1:21,32,0:53:99:990,0,579,1053,675,1728:9,12,10,22 20 10000695 . G <NON_REF> . . END=10000757 GT:DP:GQ:MIN_DP:PL 0/0:48:99:45:0,120,1800 20 10000758 . T A,<NON_REF> 1663.77 . DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0 20 10000759 . A <NON_REF> . . END=10001018 GT:DP:GQ:MIN_DP:PL 0/0:40:99:28:0,65,1080 20 10001019 . T G,<NON_REF> 93.77 . BaseQRankSum=0.058;ClippingRankSum=-0.347;DP=26;MLEAC=1,0;MLEAF=0.500,0.00;MQ=29.65;MQ0=0;MQRankSum=-0.925;ReadPosRankSum=0.000 GT:AD:DP:GQ:PL:SB 0/1:19,7,0:26:99:122,0,494,179,515,694:12,7,4,3 20 10001020 . C <NON_REF> . . END=10001020 GT:DP:GQ:MIN_DP:PL 0/0:26:72:26:0,72,1080 20 10001021 . T <NON_REF> . . END=10001021 GT:DP:GQ:MIN_DP:PL 0/0:25:37:25:0,37,909 20 10001022 . C <NON_REF> . . END=10001297 GT:DP:GQ:MIN_DP:PL 0/0:30:87:25:0,72,831 20 10001298 . T A,<NON_REF> 1404.77 . DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0 20 10001299 . C <NON_REF> . . END=10001386 GT:DP:GQ:MIN_DP:PL 0/0:43:99:39:0,95,1226 20 10001387 . C <NON_REF> . . END=10001418 GT:DP:GQ:MIN_DP:PL 0/0:41:42:39:0,21,315 20 10001419 . T <NON_REF> . . END=10001425 GT:DP:GQ:MIN_DP:PL 0/0:45:12:42:0,9,135 20 10001426 . A <NON_REF> . . END=10001427 GT:DP:GQ:MIN_DP:PL 0/0:49:0:48:0,0,1282 20 10001428 . T <NON_REF> . . END=10001428 GT:DP:GQ:MIN_DP:PL 0/0:49:21:49:0,21,315 20 10001429 . G <NON_REF> . . END=10001429 GT:DP:GQ:MIN_DP:PL 0/0:47:18:47:0,18,270 20 10001430 . G <NON_REF> . . END=10001431 GT:DP:GQ:MIN_DP:PL 0/0:45:0:44:0,0,1121 20 10001432 . A <NON_REF> . . END=10001432 GT:DP:GQ:MIN_DP:PL 0/0:43:18:43:0,18,270 20 10001433 . T <NON_REF> . . END=10001433 GT:DP:GQ:MIN_DP:PL 0/0:44:0:44:0,0,1201 20 10001434 . G <NON_REF> . . END=10001434 GT:DP:GQ:MIN_DP:PL 0/0:44:18:44:0,18,270 20 10001435 . A <NON_REF> . . END=10001435 GT:DP:GQ:MIN_DP:PL 0/0:44:0:44:0,0,1130 20 10001436 . A AAGGCT,<NON_REF> 1845.73 . DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0 20 10001437 . A <NON_REF> . . END=10001437 GT:DP:GQ:MIN_DP:PL 0/0:44:0:44:0,0,0
Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
Better late than never, here is the now-traditional "Highlights" document for GATK version 3.0, which was released two weeks ago. It will be a very short one since we've already gone over the new features in detail in separate articles --but it's worth having a recap of everything in one place. So here goes.
We are delighted to present our new Best Practices workflow for variant calling in which multisample calling is replaced by a winning combination of single-sample calling in gVCF mode and joint genotyping analysis. This allows us to both bypass performance issues and solve the so-called "N+1 problem" in one fell swoop. For full details of why and how this works, please see this document. In the near future, we will update our Best Practices page to make it clear that the new workflow is now the recommended way to go for calling variants on cohorts of samples. We've already received some pretty glowing feedback from early adopters, so be sure to try it out for yourself!
All the cool kids were doing it, so we had to join the party. It took a few months of experimentation, a couple of new tools and some tweaks to the HaplotypeCaller, but you can now call variants on RNAseq with GATK! This document details our Best Practices recommendations for doing so, along with a non-trivial number of caveats that you should keep in mind as you go.
Nice try, but no. This tool is obsolete now that we have the gVCF/reference model pipeline (see above). Note that this means that GATK 3.0 will not support BAM files that were processed using ReduceReads!
We've switched the build system from Ant to Maven, which should make it much easier to use GATK as a library against which you can develop your own tools. And on a related note, we're also making significant changes to the internal structure of the GATK codebase. Hopefully this will not have too much impact on external projects, but there will be a doc very shortly describing how the new build system works and how the codebase is structured.
For reasons that will be made clear in the near future, we decided to hold the previously announced hardware optimizations until version 3.1, which will be released very soon. Stay tuned!