As of GATK 2.7, the HaplotypeCaller is now capable of emitting a per-bp or summarized confidence estimate for a site being strictly homozygous-reference.
The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either a non-reference call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:
Based on this, we emit the genotype likelihoods (
PL) and compute the
GQ (from the
PLs) for the least confidence of these two models.
We use a symbolic allele,
<NON_REF>, to indicate that the site is homozygous reference, and because we have an ALT allele we can provide allele-specific
PL field values.
This mode can be enabled with the command line argument
--emitRefConfidence (-ERC), which has 3 options:
NONE: don't emit anything (default)
BP_RESOLUTION: emit detailed information for each BP
GVCF: emit a block summarized version of the
This mode emits a very detailed model for the reference confidence at each reference base within the requested calling intervals. The output looks like:
20 10000435 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:54,0:53:54:99:0,160,2385 20 10000436 . G <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:53,0:53:53:99:0,159,2011 20 10000437 . C <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:53,0:53:53:99:0,159,1971 20 10000438 . C <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:53,0:53:53:99:0,159,1816 20 10000439 . T G 1553.77 . AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75 GT:AD:DP:GQ:PL 1/1:0,55:55:99:1582,164,0 20 10000440 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:55,0:55:55:99:0,166,2063 20 10000441 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:55,0:55:55:99:0,165,2000 20 10000442 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,168,2095 20 10000443 . A <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,169,2089 20 10000444 . A <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,168,2093 20 10000445 . C <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,169,2137
This shows each reference base in a 11 bp interval. Only one site (10000439) contains a SNP call. The other 10 bases are reference. For each reference base not part of a variant call we emit a VCF record with have the special symbolic allele
<NON_REF> indicating the call is between the reference base and any possible non-reference allele that might be segregating at this site. The genotype fields include the standard GATK
PL fields, all calculated as ref vs. any non-reference base. There's a new special
CD field as well, that gives the number of indel informative reads at this location, given the maximum size of the indel to consider.
Note that there's no site-level
QUAL field value. We discussed this internally and since the
QUAL is the probability that the site is polymorphic, all of the
QUAL field values should be 0 here, so we decided to drop it.
GVCF mode summarizes the output of
BP_RESOLUTION by grouping sequential reference calls with similar
GQ values into blocks, and emitting summarized blocks instead of the values at each base pair. The blocking is discrete, deterministic, and can be controlled by the command line parameter
GVCFGQBands. The algorithm creates bands of
GQ values and merges sequential blocks with
GQ values within the same band into a single synthetic block.
For example, if the
GVCFBQBands argument is
-GQB 10 -GQB 20 then we'll create blocks in the GVCF where sequential
<NON_REF> calls have
GQ values either < 10, between 10 and 19, or are 20 or greater.
FORMAT fields have a few new annotations. First,
BLOCK_SIZE indicate the extent of the reference block and its size (redundant yes but easy to read). For each
GVCF sample field there's now only
GQ from the standard annotations.
GT is always 0/0 while
GQ are representative values (currently median) from the bp resolution
GQ values within the block. The
MIN_GQ are the minimum values seen within the block for their respective fields.
The output for the above example in
GVCF mode looks like:
20 10000435 . T <NON_REF> . . BLOCK_SIZE=4;END=10000438 GT:DP:GQ:MIN_DP:MIN_GQ 0/0:53:99:53:159 20 10000439 . T G 1553.77 . AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75 GT:AD:DP:GQ:PL 1/1:0,55:55:99:1582,164,0 20 10000440 . T <NON_REF> . . BLOCK_SIZE=6;END=10000445 GT:DP:GQ:MIN_DP:MIN_GQ 0/0:56:99:55:165
Here's a GVCF output for 100 bp around the same variant:
20 10000400 . T <NON_REF> . . BLOCK_SIZE=39;END=10000438 GT:DP:GQ:MIN_DP:MIN_GQ 0/0:57:99:53:128 20 10000439 . T G 1553.77 . AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75 GT:AD:DP:GQ:PL 1/1:0,55:55:99:1582,164,0 20 10000440 . T <NON_REF> . . BLOCK_SIZE=61;END=10000500 GT:DP:GQ:MIN_DP:MIN_GQ 0/0:54:99:49:136
GVCFmode is likely to work with downstream GATK tools right now, as the special
<NON_REF>allele may be treated as a polymorphic site. If you want to use the BP resolution mode you will likely need to split the VCF into the ref sites and polymorphic sites, perform standard filtering etc to the polymorphic VCF, and then merge the ref and non-ref VCFs back together
We are interested in community feedback on the output format and additional information we should provide to make the reference model as useful as possible. Please post suggestions as comments here.
The Unified Genotyper calls SNPs relative to the specified, publicly-available reference assembly.
How can I call SNPs (in many samples, with UG) relative to an in-house individual, which I have sequenced at high-coverage?
My current solution is to perform a de novo assembly on the in-house reference individual using e.g. Velvet, and then simply use the fasta as the reference for UG.
Can the publicly-available reference assembly still be useful here for speeding up the mapping and filling-in missing parts ?
My organism is Drosophila melanogaster.