Tagged with #haplotypecaller
6 documentation articles | 3 announcements | 89 forum discussions

Created 2016-04-01 19:25:14 | Updated | Tags: haplotypecaller rnaseq joint-discovery gvcf joint-calling

Comments (0)

We have not yet validated the joint genotyping methods (HaplotypeCaller in -ERC GVCF mode per-sample then GenotypeGVCFs per-cohort) on RNAseq data. Our standard recommendation is to process RNAseq samples individually as laid out in the RNAseq-specific documentation.

However, we know that a lot of people have been trying out the joint genotyping workflow on RNAseq data, and there do not seem to be any major technical problems. You are welcome to try it on your own data, with the caveat that we cannot guarantee correctness of results, and may not be able to help you if something goes wrong. Please be sure to examine your results carefully and critically.

If you do pursue this, you will need to pre-process your samples according to our RNA-specific documentation, then switch to the GVCF workflow at the HaplotypeCaller stage. For filtering, it will be up to you to determine whether the hard filtering or VQSR filtering method produce best results. We have not tested any of this so we cannot provide a recommendation. Be prepared to do a lot of analysis to validate the quality of your results.

Good luck!

Created 2014-05-08 22:09:16 | Updated 2015-05-16 04:53:11 | Tags: haplotypecaller haplotypescore assembly haplotype

Comments (9)

This document details the procedure used by HaplotypeCaller to re-assemble read data and determine candidate haplotypes as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

Note that we are still working on producing figures to complement the text. We will update this document as soon as the figures are ready. Note also that this is a provisional document and some final corrections may be made for accuracy and/or completeness. Feedback is most welcome!


The previous step produced a list of ActiveRegions that showed some evidence of possible variation (see step 1 documentation). Now, we need to process each Active Region in order to generate a list of possible haplotypes based on the sequence data we have for that region.

To do so, the program first builds an assembly graph for each active region (determined in the previous step) using the reference sequence as a template. Then, it takes each read in turn and attempts to match it to a segment of the graph. Whenever portions of a read do not match the local graph, the program adds new nodes to the graph to account for the mismatches. After this process has been repeated with many reads, it typically yields a complex graph with many possible paths. However, because the program keeps track of how many reads support each path segment, we can select only the most likely (well-supported) paths. These likely paths are then used to build the haplotype sequences which will be used to call variants and assign per-sample genotypes in the next steps.

1. Reference graph assembly

First, we construct the reference assembly graph, which starts out as a simple directed DeBruijn graph. This involves decomposing the reference sequence into a succession of kmers (pronounced "kay-mers"), which are small sequence subunits that are k bases long. Each kmer sequence overlaps the previous kmer by k-1 bases. The resulting graph can be represented as a series of nodes and connecting edges indicating the sequential relationship between the adjacent bases. At this point, all the connecting edges have a weight of 0.

In addition to the graph, we also build a hash table of unique kmers, which we use to keep track of the position of nodes in the graph. At the beginning, the hash table only contains unique kmers found in the reference sequence, but we will add to it in the next step.

A note about kmer size: by default, the program will attempt to build two separate graphs, using kmers of 10 and 25 bases in size, respectively, but other kmer sizes can be specified from the command line with the -kmerSize argument. The final set of haplotypes will be selected from the union of the graphs obtained using each k.

2. Threading reads through the graph

This is where our simple reference graph turns into a read-threading graph, so-called because we're going to take each read in turn and try to match it to a path in the graph.

We start with the first read and compare its first kmer to the hash table to find if it has a match. If there is a match, we look up its position in the reference graph and record that position. If there is no match, we consider that it is a new unique kmer, so we add that unique kmer to the hash table and add a new node to the graph. In both cases, we then move on and repeat the process with the next kmer in the read until we reach the end of the read.

When two consecutive kmers in a read belong to two nodes that were already connected by an edge in the graph, we increase the weight of that edge by 1. If the two nodes were not connected yet, we add a new edge to the graph with a starting weight of 1. As we repeat the process on each read in turn, edge weights will accumulate along the paths that are best supported by the read data, which will help us select the most likely paths later on.

Note on graph complexity, cycles and non-unique kmers

For this process to work properly, we need the graph to be sufficiently complex (where the number of non-unique k-mers is less that 4-fold the number of unique kmers found in the data) and without cycles. In certain genomic regions where there are a lot of repeated sequences, these conditions may not be met, because repeats cause cycles and diminish the number of available unique kmers. If none of the kmer sizes provided results in a viable graph (complex enough and without cycles) the program will automatically try the operation again with larger kmer sizes. Specifically, we take the largest k provided by the user (or by the default settings) and increase it by 10 bases. If no viable graph can be obtained after iterating over increased kmer sizes 6 times, we give up and skip the active region entirely.

3. Graph refinement

Once all the reads have been threaded through the graph, we need to clean it up a little. The main cleaning-up operation is called pruning (like the gardening technique). The goal of the pruning operation is to remove noise due to errors. The basic idea is that sections of the graph that are supported by very few reads are most probably the result of stochastic errors, so we are going to remove any sections that are supported by fewer than a certain threshold number of reads. By default the threshold value is 2, but this can be controlled from the command line using the -minPruning argument. In practice, this means that linear chains in the graph (linear sequence of vertices and edges without any branching) where all edges have fewer than 2 supporting reads will be removed. Increasing the threshold value will lead to faster processing and higher specificity, but will decrease sensitivity. Decreasing this value will do the opposite, decreasing specificity but increasing sensitivity.

At this stage, the program also performs graph refinement operations, such as recovering dangling heads and tails from the splice junctions to compensate for issues that are related to limitations in graph assembly.

Note that if you are calling multiple samples together, the program also looks at how many of the samples support each segment, and only prunes segments for which fewer than a certain number of samples have the minimum required number of supporting reads. By default this sample number is 1, so as long as one sample in the cohort passes the pruning threshold, the segment will NOT be pruned. This is designed to avoid losing singletons (variants that are unique to a single sample in a cohort). This parameter can also be controlled from the command line using the -minPruningSamples argument, but keep in mind that increasing the default value may lead to decreased sensitivity.

4. Select best haplotypes

Now that the graph is all cleaned up, the program builds haplotype sequences by traversing all possible paths in the graph and calculates a likelihood score for each one. This score is calculated as the product of transition probabilities of the path edges, where the transition probability of an edge is computed as the number of reads supporting that edge divided by the sum of the support of all edges that share that same source vertex.

In order to limit the amount of computation needed for the next step, we limit the number of haplotypes that will be considered for each value of k (remember that the program builds graphs for multiple kmer sizes). This is easy to do since we conveniently have scores for each haplotype; all we need to do is select the N haplotypes with the best scores. By default that number is very generously set to 128 (so the program would proceed to the next step with up to 128 haplotypes per value of k) but this can be adjusted from the command line using the -maxNumHaplotypesInPopulation argument. You would mainly want to decrease this number in order to improve speed; increasing that number would rarely be reasonable, if ever.

5. Identify potential variation sites

Once we have a list of plausible haplotypes, we perform a Smith-Waterman alignment (SWA) of each haplotype to the original reference sequence across the active region in order to reconstruct a CIGAR string for the haplotype. Note that indels will be left-aligned; that is, their start position will be set as the leftmost position possible.

This finally yields the potential variation sites that will be put through the variant modeling step next, bringing us back to the "classic" variant calling methods (as used by GATK's UnifiedGenotyper and Samtools' mpileup). Note that this list of candidate sites is essentially a super-set of what will eventually be the final set of called variants. Every site that will be called variant is in the super-set, but not every site that is in the super-set will be called variant.

Created 2014-04-10 21:57:10 | Updated 2015-02-20 18:10:09 | Tags: haplotypecaller reference-model gvcf

Comments (12)

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFS), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

  • Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
  • Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the least confidence of these two models.

We use a symbolic allele pair, <NON_REF>, to indicate that the site is not homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

For details of the gVCF format, please see the document that explains what is a gVCF.

Created 2014-04-03 20:20:08 | Updated 2014-10-22 19:22:34 | Tags: haplotypecaller genotypegvcfs combinegvcfs gvcf joint-analysis

Comments (56)


GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.


As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">

Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):


which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.


The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000117    .   C   T,<NON_REF> 612.77  .   BaseQRankSum=0.000;ClippingRankSum=-0.411;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.39;MQ0=0;MQRankSum=-2.172;ReadPosRankSum=-0.235   GT:AD:DP:GQ:PL:SB   0/1:17,21,0:38:99:641,0,456,691,519,1210:6,11,11,10
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000211    .   C   T,<NON_REF> 638.77  .   BaseQRankSum=0.894;ClippingRankSum=-1.927;DP=42;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.89;MQ0=0;MQRankSum=-1.750;ReadPosRankSum=1.549    GT:AD:DP:GQ:PL:SB   0/1:20,22,0:42:99:667,0,566,728,632,1360:9,11,12,10
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000694    .   G   A,<NON_REF> 961.77  .   BaseQRankSum=0.736;ClippingRankSum=-0.009;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=106.92;MQ0=0;MQRankSum=0.482;ReadPosRankSum=1.537 GT:AD:DP:GQ:PL:SB   0/1:21,32,0:53:99:990,0,579,1053,675,1728:9,12,10,22
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001019    .   T   G,<NON_REF> 93.77   .   BaseQRankSum=0.058;ClippingRankSum=-0.347;DP=26;MLEAC=1,0;MLEAF=0.500,0.00;MQ=29.65;MQ0=0;MQRankSum=-0.925;ReadPosRankSum=0.000 GT:AD:DP:GQ:PL:SB   0/1:19,7,0:26:99:122,0,494,179,515,694:12,7,4,3
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Created 2013-08-23 21:34:04 | Updated 2014-10-24 16:21:11 | Tags: unifiedgenotyper haplotypecaller ploidy

Comments (7)

Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.

As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.

Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

  • If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
  • If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
  • If you want to analyze more than 100 samples at a time (for performance reasons) (versions 2.x)

Created 2013-06-17 21:31:21 | Updated 2016-02-11 10:57:49 | Tags: haplotypecaller variant-discovery

Comments (21)


Call variants on a single genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.


This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.


  • TBD


  1. Determine the basic parameters of the analysis
  2. Call variants in your sequence data

1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

  • Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

  • Emission confidence threshold (-stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

  • Calling confidence threshold (-stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms "called" and "filtered" are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

2. Call variants in your sequence data


Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T HaplotypeCaller \ 
    -R reference.fa \ 
    -I preprocessed_reads.bam \  
    -L 20 \ 
    --genotyping_mode DISCOVERY \ 
    -stand_emit_conf 10 \ 
    -stand_call_conf 30 \ 
    -o raw_variants.vcf 

Note that -L specifies that we only want to run the command on a subset of the data (here, chromosome 20). This is useful for testing as well as other purposes, as documented here. For example, when running on exome data, we use -L to specify a file containing the list of exome targets corresponding to the capture kit that was used to generate the exome libraries.

Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Created 2015-11-25 07:37:00 | Updated 2015-11-25 14:21:18 | Tags: haplotypecaller release mutect version-highlights topstory mutect2

Comments (21)

The last GATK 3.x release of the year 2015 has arrived!

The major feature in GATK 3.5 is the eagerly awaited MuTect2 (beta version), which brings somatic SNP and Indel calling to GATK. This is just the beginning of GATK’s scope expansion into the somatic variant domain, so expect some exciting news about copy number variation in the next few weeks! Meanwhile, more on MuTect2 awesomeness below.

In addition, we’ve got all sorts of variant context annotation-related treats for you in the 3.5 goodie bag -- both new annotations and new capabilities for existing annotations, listed below.

In the variant manipulation space, we enhanced or fixed functionality in several tools including LeftAlignAndTrimVariants, FastaAlternateReferenceMaker and VariantEval modules. And in the variant calling/genotyping space, we’ve made some performance improvements across the board to HaplotypeCaller and GenotypeGVCFs (mostly by cutting out crud and making the code more efficient) including a few improvements specifically for haploids. Read the detailed release notes for more on these changes. Note that GenotypeGVCFs will now emit no-calls at sites where RGQ=0 in acknowledgment of the fact that those sites are essentially uncallable.

We’ve got good news for you if you’re the type who worries about disk space (whether by temperament or by necessity): we finally have CRAM support -- and some recommendations for keeping the output of BQSR down to reasonable file sizes, detailed below.

Finally, be sure to check out the detailed release notes for the usual variety show of minor features (including a new Queue job runner that enables local parallelism), bug fixes and deprecation notices (a few tools have been removed from the codebase, in the spirit of slimming down ahead of the holiday season).

Introducing MuTect2 (beta): calling somatic SNPs and Indels natively in GATK

MuTect2 is the next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller.

The original MuTect (Cibulskis et al., 2013) was built on top of the GATK engine by the Cancer Genome Analysis group at the Broad Institute, and was distributed as a separate package. By all accounts it did a great job calling somatic SNPs, and was part of the winning entries for multiple DREAM challenges (including some submitted by groups outside the Broad). However it was not able to call indels; and the less said about the indel caller that accompanied it (first named SomaticIndelDetector then Indelocator) the better.

This new incarnation of MuTect leverages much of the HaplotypeCaller’s internal machinery (including the all-important graph assembly bit) to call both SNPs and indels together. Yet it retains key parts of the original MuTect’s internal genotyping engine that allow it to model somatic variation appropriately. This is a major differentiation point compared to HaplotypeCaller, which has expectations about ploidy and allele frequencies that make it unsuitable for calling somatic variants.

As a convenience add-on to MuTect2, we also integrated the cross-sample contamination estimation tool ContEst into GATK 3.5. Note that while the previous public version of this tool relied on genotyping chip data for its operation, this version of the tool has been upgraded to enable on-the-fly genotyping for the case where genotyping data is not available. Documentation of this feature will be provided in the near future. Both MuTect2 and ContEst are now featured in the Tool Documentation section of the Guide. Stay tuned for pipeline-level documentation on performing somatic variant discovery, to be added to the Best Practices docs in the near future.

Please note that this release of MuTect2 is a beta version intended for research purposes only and should not be applied in production/clinical work. MuTect2 has not yet undergone the same degree of scrutiny and validation as the original MuTect since it is so new. Early validation results suggest that MuTect2 has a tendency to generate more false positives as compared to the original MuTect; for example, it seems to overcall somatic mutations at low allele frequencies, so for now we recommend applying post-processing filters, e.g. by hard-filtering calls with low minor allele frequencies. Rest assured that data is being generated and the tools are being improved as we speak. We’re also looking forward to feedback from you, the user community, to help us make it better faster.

Finally, note also that MuTect2 is distributed under the same restricted license as the original MuTect; for-profit users are required to seek a license to use it (please email softwarelicensing@broadinstitute.org). To be clear, while MuTect2 is released as part of GATK, the commercial licensing has not been consolidated under a single license. Therefore, current holders of a GATK license will still need to contact our licensing office if they wish to use MuTect2.

Annotate this: new and improved variant context annotations

Whew that was a long wall of text on MuTect2, wasn’t it. Let’s talk about something else now. Annotations! Not functional annotations, mind you -- we’re not talking about e.g. predicting synonymous vs. non-synonymous mutations here. I mean variant context annotations, i.e. all those statistics calculated during the variant calling process which we mostly use to estimate how confident we are that the variants are real vs. artifacts (for filtering and related purposes).

So we have two new annotations, BaseCountsBySample (what it says on the can) and ExcessHet (for excess heterozygosity, i.e. the number of heterozygote calls made in excess of the Hardy-Weinberg expectations), as well as a set of new annotations that are allele-specific versions of existing annotations (with AS_ prefix standing for Allele-Specific) which you can browse here. Right now we’re simply experimenting with these allele-specific annotations to determine what would be the best way to make use of them to improve variant filtering. In the meantime, feel free to play around with them (via e.g. VariantsToTable) and let us know if you come up with any interesting observations. Crowdsourcing is all the rage, let’s see if it gets us anywhere on this one!

We also made some improvements to the StrandAlleleCountsBySample annotation, to how VQSR handles MQ, and to how VariantAnnotator makes use of external resources -- and we fixed that annoying bug where default annotations were getting dropped. All of which you can read about in the detailed release notes.

These Three Awesome File Hacks Will Restore Your Faith In Humanity’s Ability To Free Up Some Disk Space

CRAM support! Long-awaited by many, lovingly implemented by Vadim Zalunin at EBI and colleagues at the Sanger Institute. We haven’t done extensive testing, and there are a few tickets for improvements that are planned at the htsjdk level -- but it works well enough that we’re comfortable releasing it under a beta designation. Meaning have fun with it, but do your own thorough testing before putting it into production or throwing out your old BAMs!

Static binning of base quality scores. In a nutshell, binning (or quantizing) the base qualities in a BAM file means that instead of recording all possible quality values separately, we group them into bins represented by a single value (by default, 10, 20, 30 or 40). By doing this we end up having to record fewer separate numbers, which through the magic of BAM compression yields substantially smaller files. The idea is that we don’t actually need to be able to differentiate between quality scores at a very high resolution -- if the binning scheme is set up appropriately, it doesn’t make any difference to the variant discovery process downstream. This is not a new concept, but now the GATK engine has an argument to enable binning quality scores during the base recalibration (BQSR) process using a static binning scheme that we have determined produces optimal results in our hands. The level of compression is of course adjustable if you’d like to set your own tradeoff between compression and base quality resolution. We have validated that this type of binning (with our chosen default parameters) does not have any noticeable adverse effect on germline variant discovery. However we are still looking into some possible effects on somatic variant discovery, so we can’t yet recommend binning for that application.

Disable indel quality scores. The Base Recalibration process produces indel quality scores in addition to the regular base qualities. They are stored in the BI and BD tags of the read records, taking up a substantial amount of space in the resulting BAM files. There has been a lot of discussion about whether these indel quals are worth the file size inflation. Well, we’ve done a lot of testing and we’ve now decided that no, for most use cases the indel quals don’t make enough of a difference to justify the extra file size. The one exception to this is when processing PacBio data, it seems that indel quals may help model the indel-related errors of that technology. But for the rest, we’re now comfortable recommending the use of the --disable_indel_quals argument when writing out the recalibrated BAM file with PrintReads.

Created 2015-11-25 07:10:45 | Updated 2016-02-17 06:37:17 | Tags: Promote haplotypecaller release-notes mutect gatk3 mutect2

Comments (7)

GATK 3.5 was released on November 25, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.

New tools

  • MuTect2: somatic SNP and indel caller based on HaplotypeCaller and the original MuTect.
  • ContEst: estimation of cross-sample contamination (primarily for use in somatic variant discovery).
  • GatherBqsrReports: utility to gather recalibration tables from scatter-parallelized BaseRecalibrator runs.

Variant Context Annotations

  • Added allele-specific version of existing annotations: AS_BaseQualityRankSumTest, AS_FisherStrand, AS_MappingQualityRankSumTest, AS_RMSMappingQuality, AS_RankSumTest, AS_ReadPosRankSumTest, AS_StrandOddsRatio, AS_QualByDepth and AS_InbreedingCoeff.

  • Added BaseCountsBySample annotation. Intended to provide insight into the pileup of bases used by HaplotypeCaller in the calling process, which may differ from the pileup observed in the original bam file because of the local realignment and additional filtering performed internally by HaplotypeCaller. Can only be requested from HaplotypeCaller, not VariantAnnotator.

  • Added ExcessHet annotation. Estimates excess heterozygosity in a population of samples. Related to but distinct from InbreedingCoeff, which estimates evidence for inbreeding in a population. ExcessHet scales more reliably to large cohort sizes.

  • Added FractionInformativeReads annotation. Reports the number of reads that were considered informative by HaplotypeCaller (over all samples).

  • Enforced calculating GenotypeAnnotations before InfoFieldAnnotations. This ensures that the AD value is available to use in the QD calculation.

  • Reorganized standard annotation groups processing to ensure that all default annotations always get annotated regardless of what is specified on the command line. This fixes a bug where default annotations were getting dropped when the command line included annotation requests.

  • Made GenotypeGVCFs subset StrandAlleleCounts intelligently, i.e. subset the SAC values to the called alleles. Previously, when the StrandAlleleCountsBySample (SAC) annotation was present in GVCFs, GenotypeGVCFs carried it over to the final VCF essentially unchanged. This was problematic because SAC includes the counts for all alleles originally present (including NON-REF) even when some are not called in the final VCF. When the full list of original alleles is no longer available, parsing SAC could become difficult if not impossible.

  • Added new MQ jittering functionality to improve how VQSR handles MQ. Note that HaplotypeCaller now calculates a new annotation called RAW_MQ per-sample, which is then integrated per-cohort by GenotypeGVCFs to produce the MQ annotation.

  • VariantAnnotator can now annotate FILTER field from an external resource. Usage: --resource:foo resource.vcf --expression foo.FILTER

  • VariantAnnotator can now check allele concordance when annotating with an external resource. Usage: --resourceAlleleConcordance

  • Bug fix: The annotation framework was improved to allow for the collection of sufficient statistics during GVCF creation which are then used to compute the final annotation during the genotyping. This avoids the use of median as the representative annotation from the collection of values (one from each sample). TL;DR annotations will be more accurate when using the GVCF workflow for joint discovery.

Variant manipulation tools

  • Allowed overriding hard-coded cutoff for allele length in ValidateVariants and in LeftAlignAndTrimVariants. Usage: --reference_window_stop N where N is the desired cutoff.

  • Also in LeftAlignAndTrimVariants, trimming multiallelic alleles is now the default behavior.

  • Fixed ability to mask out snps with --snpmask in FastaAlternateReferenceMaker.

  • Also in FastaAlternateReferenceMaker, fixed merging of contiguous intervals properly, and made the tool produce more informative contig names.

  • Fixed a bug in CombineVariants that occurred when one record has a spanning deletion and needs a padded reference allele.

  • Added a new VariantEval evaluation module, MetricsCollection, that summarizes metrics from several EV modules.

  • Enabled family-level stratification in MendelianViolationEvaluator of VariantEval (if a ped file is provided), making it possible to count Mendelian violations for each family in a callset with multiple families.

  • Added the ability to SelectVariants to enforce 4.2 version output of the VCF spec when processing older files. Use case: the 4.2 spec specifies that GQ must be an integer; by default we don’t enforce it (so if reading an older file that used decimals, we don’t change it) but the new argument --forceValidOutput converts the values on request. Not made default because of some performance slowdown -- so writing VCFs is now fast by default, compliant by choice.

  • Improved VCF sequence dictionary validation. Note that as a side effect of the additional checks, some users have experienced an error that starts with "ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant." that is due to unintentional activation of a check that is not necessary. This will be fixed in the next release; in the meantime -U ALLOW_SEQ_DICT_INCOMPATIBILITY can be used (with caution) to override the check.

GVCF tools

  • Various improvements to the tools’ performance, especially HaplotypeCaller, by making the code more efficient and cutting out crud.

  • GenotypeGVCFs now emits a no-call (./.) when the evidence is too ambiguous to make a call at all (e.g. all the PLs are zero). Previously this would have led to a hom-ref call with RGQ=0.

  • Fixed a bug in GenotypeGVCFs that sometimes generated invalid VCFs for haploid callsets. The tool was carrying over the AD from alleles that had been trimmed out, causing field length mismatches.

  • Changed the genotyping implementation for haploid organisms to address performance problems reported when running GenotypeGVCFs on haploid callsets. Note that this change may lead to a slight loss of sensitivity at low-coverage sites -- let us know if you observe anything dramatic.

Genotyping engine tweaks

  • Ensured inputPriors get used if they are specified to the genotyper (previously they were ignored). Also improved docs on --heterozygosity and --indel_ heterozygosity priors.

  • Fixed bug that affected the --ignoreInputSamples behavior of CalculateGenotypePosteriors.

  • Limited emission of the scary warning message about max number of alleles (“this tool is set to genotype at most x alleles but we found more; only x will be used”) to a single occurrence unless DEBUG logging mode is activated. Otherwise it fills up our output logs.

Miscellaneous tool fixes

  • Added option to OverclippedReadFilter to not require soft-clips on both ends. Contributed by Jacob Silterra.

  • Fixed a bug in IndelRealigner where the tool was incorrectly "fixing" mates when supplementary alignments are present. The patch involves ignoring supplementary alignments.

  • Fixed a bug in CatVariants. Previously, VCF files were being sorted solely on the base pair position of the first record, ignoring the chromosome. This can become problematic when merging files from different chromosomes, especially if you have multiple VCFs per chromosome. Contributed by John Wallace.

Engine-level behaviors and capabilities

  • Support for reading and writing CRAM files. Some improvements are still expected in htsjdk. Contributed by Vadim Zalunin at EBI and collaborators at the Sanger Institute.

  • Made interval-list output format dependent on the file extension (for RealignerTargetCreator). If the extension is .interval_list, output will be formatted as a proper Picard interval list (with sequence dictionary). Otherwise it will be a basic GATK interval list as previously.

  • Adding static binning capability for base recalibration (BQSR).


  • Added a new JobRunner called ParallelShell that will run jobs locally on one node concurrently as specified by the DAG, with the option to limit the maximum number of concurrently running jobs using the flag maximumNumberOfJobsToRunConcurrently. Contributed by Johan Dahlberg.

  • Updated extension for Picard CalculateHsMetrics to include PER_TARGET_COVERAGE argument and added extension for Picard CollectWgsMetrics.

Deprecation notice


  • BeagleOutputToVCF, VariantsToBeagleUnphased, ProduceBeagleInput. These are tools for handling Beagle data. The latest versions of Beagle support VCF input and output, so there is no longer any reason for us to provide converters.
  • ReadAdaptorTrimmer and VariantValidationAssessor. These were experimental tools which we think are not useful and not operating on a sufficiently sound basis.
  • BaseCoverageDistribution and CoveredByNSamplesSites. These tools were redundant with DiagnoseTargets and/or DepthOfCoverage.
  • LiftOverVariants, FilterLiftedVariants and liftOverVCF.pl. The Picard liftover tool LiftoverVCF works better and is easier to operate.
  • sortByRef.pl. Use Picard SortVCF instead.
  • ListAnnotations. This was intended as a utility for listing annotations easily from command line, but it has not proved useful.


  • Made various documentation improvements.
  • Updated date and street address in license text.
  • Moved htsjdk & picard to version 1.141

Created 2014-08-27 18:39:39 | Updated 2014-12-16 03:19:38 | Tags: haplotypecaller ploidy haploid polyploid

Comments (27)

Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.


We have added omniploidy support to the GVCF handling tools, with the following limitations:

  • When running, you need to indicate the sample ploidy that was used to generate the GVCFs with -ploidy. As usual 2 is the default ploidy.

  • The system does not support mixed ploidy across samples nor positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have some genotype whose ploidy does not match the -ploidy argument.


As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.

Created 2016-05-26 11:54:51 | Updated | Tags: haplotypecaller erc 0-0

Comments (1)

Hi I'm trying to use Haplotype Caller for some WES and I used the --emitRefConfidence GVCF option in order to see only variants in my VCF output (no 0/0 homo wild type). Unexpectedly I obtained something like this: [...] chrX 155254672 . C . . END=155254791 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 chrX 155254853 . T . . END=155254972 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0

What am I getting wrong?

Created 2016-05-24 21:40:42 | Updated | Tags: unifiedgenotyper haplotypecaller gatk snps

Comments (1)


I am currently working on two different projects and interested in finding common and rare variants.

Project1 - Organism : Influenza virus Number of samples : 18 Sequencing type: Exome sequencing Alignment tool : BWA Analysis ready BAM files :

  • BAM files are generated using BWA,
  • then sorted the bam
  • deduplicated bam file using picard (markduplicates) GATK variant caller : Unifiedgenotyper or HaplotypeCaller (Which one to be used?)

Project2 :
Organism : Human Number of samples : 24 Sequencing type: DNAseq (paired-end) Analysis ready BAM files :

  • BAM files are generated using BWA,
  • then sorted the bam
  • and finally deduplicated and recalibrated_reads.bam GATK variant caller : Unifiedgenotyper or HaplotypeCaller (Which one to be used?)

What is the criteria to select the variant caller?

Created 2016-05-20 15:02:21 | Updated | Tags: haplotypecaller bug

Comments (6)


I am running GATK 3.5.-0 and encountered the following error: ERROR MESSAGE: Graph must have ref source and sink vertices

The error occurs towards the end of the file processing and the resulting gvcf file seems rather complete to me, but I'd like to make sure.

Any help is greatly appreciated, Susanne

INFO 15:31:52,917 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalStateException: Graph must have ref source and sink vertices at org.broadinstitute.gatk.tools.walkers.haplotypecaller.graphs.BaseGraph.removePathsNotConnectedToRef(BaseGraph.java:576) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.createGraph(ReadThreadingAssembler.java:211) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.assemble(ReadThreadingAssembler.java:127) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.LocalAssemblyEngine.runLocalAssembly(LocalAssemblyEngine.java:169) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.assembleReads(HaplotypeCaller.java:1029) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:865) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:228) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Graph must have ref source and sink vertices
ERROR ------------------------------------------------------------------------------------------

Created 2016-05-19 08:43:40 | Updated 2016-05-19 08:44:42 | Tags: haplotypecaller vcf genotype reference-confidence-model

Comments (3)

Hi, Just wondering what the possible reasons could be for Haplotype Caller (version: 3.5-0-g36282e4) to declare a reference genotype quality of 0 for positions where the read depth is relatively high, such as the following region:

SC1 1628 . T . . PASS DP=50;GC=33.33 GT:AD:DP:RGQ 0/0:50:50:99 SC1 1629 . T . . PASS DP=50;GC=38.1 GT:AD:DP:RGQ 0/0:50:50:99 SC1 1630 . T . . FAIL_RGQ DP=50;GC=38.1 GT:AD:DP:RGQ 0/0:45:50:5 SC1 1631 . C . . PASS DP=51;GC=33.33 GT:AD:DP:RGQ 0/0:51:51:99 SC1 1632 . A . . PASS DP=51;GC=33.33 GT:AD:DP:RGQ 0/0:51:51:96

In this instance the RGQ of the failed position above (at SC1:1630) was actually 5 (which I set as the threshold for filtering in this example), but I have plenty of instances where the read depth and resultant RGQ are like:

SC1 1630 . T . . FAIL_RGQ DP=50;GC=38.1 GT:AD:DP:RGQ ./.:45:50:5 SC1 1640 . T . . FAIL_RGQ DP=48;GC=38.1 GT:AD:DP:RGQ ./.:34:48:0 SC1 1805 . T . . FAIL_RGQ DP=36;GC=33.33 GT:AD:DP:RGQ ./.:32:36:0 SC1 2046 . A . . FAIL_RGQ DP=37;GC=19.05 GT:AD:DP:RGQ ./.:33:37:2 SC1 2345 . A . . FAIL_RGQ DP=105;GC=23.81 GT:AD:DP:RGQ ./.:90:105:0 SC1 2352 . A . . FAIL_RGQ DP=116;GC=19.05 GT:AD:DP:RGQ ./.:103:116:0 SC1 2356 . C . . FAIL_RGQ DP=112;GC=23.81 GT:AD:DP:RGQ ./.:100:112:0 SC1 2359 . G . . FAIL_RGQ DP=111;GC=28.57 GT:AD:DP:RGQ ./.:99:111:0

It feels like something funny is going on. Should it be possible for RGQ to be so low with such high depth? Also, I thought the AD format tag gave the count of unfiltered reads, whilst the format DP tag gave the filtered read depth (i.e. reads HC finds informative). Therefore shouldn't the AD count always be at least as high as the DP count?

Created 2016-05-17 05:28:21 | Updated | Tags: haplotypecaller rnaseq

Comments (3)

Hi, I am following the BP guidelines for RNA Seq variant calling and am trying to run the Haplotype Caller on the output .bam files from the Split & Trim step. All the MAPQ scores of 255 have been replaced with 60 as suggested. For all samples, not a singe variant was called, yet I can see them clearly when viewing my .bam files in IGV. We are using GATK version 3.5-0-g36282e4.

I am a bit confused as more than 60% of the reads passed the filters and were processed (see data below).

What is going wrong here, and what should be done to obtain the variants that are clearly present?

Here is the command line (as given in the documentation):

java-1.8/jdk1.8.0_92/bin/java -jar -Xms16000m -Xmx16000m -Djava.io.tmpdir=/tmp/2171CB /gatk-1.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R Peaxi162_genome.fa -I 2171CB_dedup_split_realigned.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o 2171CB.vcf

The job output included:

Successfully completed.

Resource usage summary:

CPU time   :   3492.42 sec.
Max Memory :      6118 MB
Max Swap   :     20473 MB

Max Processes  :         3

The job error included: ... Using SSE4.1 accelerated implementation of PairHMM INFO 18:40:12,995 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file INFO 18:40:12,995 VectorLoglessPairHMM - Using vectorized implementation of PairHMM INFO 18:40:12,997 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0 INFO 18:40:12,997 PairHMM - Total compute time in PairHMM computeLikelihoods() : 0.0 INFO 18:40:12,997 HaplotypeCaller - Ran local assembly on 0 active regions INFO 18:40:13,214 ProgressMeter - done 1.259220201E9 62.6 m 2.0 s 100.0% 62.6 m 0.0 s INFO 18:40:13,214 ProgressMeter - Total runtime 3756.49 secs, 62.61 min, 1.04 hours INFO 18:40:13,215 MicroScheduler - 48797065 reads were filtered out during the traversal out of approximately 127666865 total reads (38.22%) INFO 18:40:13,215 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter INFO 18:40:13,215 MicroScheduler - -> 39887566 reads (31.24% of total) failing DuplicateReadFilter INFO 18:40:13,215 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 18:40:13,216 MicroScheduler - -> 8909499 reads (6.98% of total) failing HCMappingQualityFilter INFO 18:40:13,216 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 18:40:13,216 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 18:40:13,216 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 18:40:13,217 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter INFO 18:40:16,079 GATKRunReport - Uploaded run statistics report to AWS S3

Thanks for your guidance,


Created 2016-05-16 13:15:15 | Updated | Tags: unifiedgenotyper haplotypecaller gwas

Comments (6)

Hi GATK Team,

I'm performing a Genome Wide Association Study and have just used HaplotypeCaller to call SNPs. Having read previous threads discussing the reference bias inherent in HC (at least relative to UnifiedGenotyper), I included the recommended arguments so my command line was as follows:

java -Xmx16000m -XX:+UseSerialGC -jar /GenomeAnalysisTK.jar -T HaplotypeCaller -L /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/New_Intervals_May16/UMD_1_2.intervals -R /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/UMD1_mzebra_nuclear_and_mtDNA.fa --emitRefConfidence GVCF --variant_index_type LINEAR --minPruning 1 --minDanglingBranchLength 1 --variant_index_parameter 128000 -nct 4 -I {} -o GATKgVCFs/{}_UMD_1_2.g.vcf.gz

I understand that the minPruning and minDanglingBranchLength arguments are expected to mitigate the reference bias? I would like to know how it does so - does it just refuse to make calls at those sites rather than guessing or does it take another approach? As I'm sure you appreciate, in a GWAS study it could be very misleading to have certain sites incorrectly recorded as reference bases simply as a result of lack of sequencing depth. Is HC's appropriateness for a GWAS study comparable to that of UG or would UG be recommended for this particular application?

Kind regards,


Created 2016-05-07 09:12:27 | Updated | Tags: haplotypecaller bam

Comments (1)

Hi professor, when I use HC call mutations ,the genotype and mutation as follows: HC_Gene.refGene Chr Start End Ref Alt III17 HAVCR1 chr5 156479568 156479568 C CGTT,* 0/1

but when I chek the bam file of the reads approve the mutation,I see that almost all reads approve the second type,why HC believe it 0/1 , not 2/2? see the attachment?

Created 2016-05-02 16:43:01 | Updated | Tags: haplotypecaller

Comments (1)


I'm rather new to variant calling and had a question regarding this process on NGS data performed on a specific locus. I sequenced the PCR amplicon of a bunch of samples in a pool together. Let's say I have 20 samples which were all diploid. I expect some SNPs in some of the alleles but it's also possible that there is more than 1 SNP in 1 allele. Is it possible to do a variant calling process that will 'group' the two SNPs in 1 allele together? I looked into it a bit and I maybe found HaplotypeCaller, but I'm unsure if this will do the trick.

My apologies if this is a stupid question.

Kind regards

Created 2016-04-27 14:30:11 | Updated | Tags: haplotypecaller igv

Comments (1)

In your documentation for this you say "You can see that the bamout file, on top, contains data only for the ActiveRegion that was within the analysis interval specified by -L. The two blue reads represent the artificial haplotypes constructed by HaplotypeCaller (you may need to adjust your IGV settings to see the same thing on your machine). "

What are the IGV settings to see the artificial haplotypes?

Created 2016-04-24 00:40:28 | Updated | Tags: unifiedgenotyper haplotypecaller

Comments (3)

Hello , I recently saw a webinar by NCBI "Advanced Workshop on SRA and dbGaP Data Analysis" (ftp://ftp.ncbi.nlm.nih.gov/pub/education/public_webinars/2016/03Mar23_Advanced_Workshop/). They mentioned that they were able to run GATK directly on SRA files.

I downloaded GenomeAnalysisTK-3.5 jar file to my computer. I tried both these commands:

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R SRRFileName -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T SRRFileName -R SRR1718738 -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

For both these commands, I got this error: ERROR MESSAGE: Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension and lists of BAM/CRAM files with the .list extension, but the file SRR1718738 has neither extension. Please ensure that your BAM/CRAM file or list of BAM/CRAM files is in the correct format, update the extension, and try again.

I don't see any documentation here about this, so wanted to check with you or anyone else has had any experience with this.

Thanks K

Created 2016-04-22 14:11:11 | Updated 2016-04-22 14:17:51 | Tags: haplotypecaller downsampling memory

Comments (2)

Hi there,

I am working on mitochondrial genomes (250 samples) with a coverage of ~7000x. The HC (v 3.5) is running perfectly for the firsts 248 samples but reaching the last ones, the following Java error appears :

ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.

All the other samples work perfectly with ~64G of mem and v_mem so i tried 128G but it is still not sufficient for my last samples to be processed. As you probably understand, i can't extend the memory indefinitely and I would want to know if it exists a way to force the down-sampling or at least to greatly ameliorate the memory used by the HaplotypeCaller.

Thanks for your time,


Alex H

Created 2016-04-20 14:28:03 | Updated | Tags: haplotypecaller memory

Comments (4)

Hi, When I run the pipeline according to best practices, HC on a fastq of 60MB (for a targeted panel) takes about 10 minutes, but then for the same pipeline/targeted region, on a fastq of 150MB, HC takes 6 hrs. Any idea what would explain such a stark jump in runtime? Also, is there any way to reduce it? Can I use -Xmx -Xms arguments to increase the speed if memory is not an issue? Any help will be appreciated. Thanks!

Created 2016-04-20 13:52:37 | Updated | Tags: haplotypecaller

Comments (3)

Dears, I found many calling errors like this, all of the phred quality >30, but HaplotypeCaller miss them, report as Homozygote. How to deal with it? Here is the code. Many thanks. java -Xmx200g -jar /home/share/bin/GenomeAnalysisTK-3.5.jar \ -R /home/share/index/Prunus_persica.fa \ -T HaplotypeCaller -nct 8 \ -I $d'.sorted.uniqe.rg.dedup.realn.bam' \ -o $d'.gvcf' \ --genotyping_mode DISCOVERY \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -ERC GVCF \ -variant_index_type LINEAR \ -variant_index_parameter 128000

Created 2016-04-17 06:38:08 | Updated | Tags: haplotypecaller nct pairhmm

Comments (3)

Hi, I'm using multi-threading for HaplotypeCaller by setting the nct option. But actually, I found that the speedup it gains isn't in proportional to the increase of the number of data threads. I tried nct as 8,12,16,24 on my machine, and gained a speedup of 4.1x, 4.2x, 4.2x, 4.2x. Seems that there is an upper bound of performance gains when enabling mult-threading for HaplotypeCaller.

I'm wondering what each data thread stands for in HaplotypeCaller. We need to use PairHMM to calculate the likelihood array in each active region. Are we distributing each read-haplotype pair in the region as one data thread and map it to a CPU thread? Or are we distributing the calculation in each region as one data thread?


Created 2016-04-15 22:01:44 | Updated | Tags: commandlinegatk haplotypecaller genotypegvcfs joint-calling

Comments (5)

I have been following the best practices outline for calling SNPs on our samples, but I'm a little confused as to what to do with the VCF file produced following the joint genotyping/genotypeGVCFs step.

I understand the principle of gVCF calling for the most part, but my confusion is what are we to do with the VCF file once we do the joint genotyping step? We are looking at a F1 mapping population of a non-model organism, so does this VCF file have individual progeny (bam file names) indicated within it? I think not since I can't find any of the sample names while scrolling through it.

Can this VCF file be used to construct a pedigree file to use during genotype refinement? Should it be somehow fed back into Haplotypecaller to inform on likely calls during a second round of variant calling? Do you use it to go back to the individual gVCF files to extract the high confidence variants?

There seems to be a good amount of literature on the Broad websites about what a gVCF file is and how to perform joint genotyping, but not much direction about what to do with the joint genotyped VCF file once it is produced.

Any advice or referral to other walkthroughs/guides would be very appreciated.


[extra project information: My project involves calling SNPs across a mapping population for a non-model organism with the intent of mapping a trait. The goal is to produce robust SNP calls for each individual progeny (of which we have 30 currently, and >60 in the near future) and the two parents. We only have halfway-decent sequencing coverage of ~10-20x for each sample, which is thus why doing gVCF calling and joint genotyping sounds attractive to us. Since we work on a non-model, we also lack previously produced "gold standard" SNP sets or other resources allowing us to refine genotypes.]

Created 2016-04-15 15:32:24 | Updated 2016-04-15 15:39:55 | Tags: haplotypecaller annotation dp missing

Comments (3)

I try to filter both variant non-variant sites together. I see the only reasonable way to do it is to filter by the per-sample DP. However, I noticed that substantial fraction of called sites (~10%) have missing value in DP field ( . instead of 0 or other value). Although many of sites with missing DP values indeed have low coverage with bad mapping and called ./., some sites have in my view good coverage. When I check the bam file (-bamout result) in IGV, I see many reads mapped to those sites with good quality (MQ=60). The genotype is usually called correctly, but for some unclear for me reason GATK doesn't report DP values. The AD values are usually 0,0 in such sites. Interestingly, the nearby sites have DP values reported. When I check gVCF files, these sites usually have DP=0. Assuming coverage of 0 I could discard these sites, but I see in a bam file that it is not 0. So, I do not want to throw away 10% of my data. I notice a tendency that such site is usually homozygot for ALT allele. I provide an example of such a site below. See the 1st sample in the position 13388742. (1/1:0,0:.:3:1|1:13388738_T_C:45,3,0) I generated the data using HaplotypeCaller in GVCF mode and then GenotypeGVCFs. Could you please tell me the reason why this is so?

VCF: scaffold_1 13388742 . A G 8529.41 . AC=35;AF=0.833;AN=42;BaseQRankSum=-1.730e-01;ClippingRankSum=-6.940e-01;DP=247;ExcessHet=0.4083;FS=9.162;InbreedingCoeff=0.1415;MLEAC=37;MLEAF=0.881;MQ=38.83;MQRankSum=2.51;QD=32.26;ReadPosRankSum=1.56;SOR=0.043 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,0:.:3:1|1:13388738_T_C:45,3,0 ./.:0,0:0 ./.:2,0:2 ./.:4,0:4 ./.:0,0:0 ./.:3,0:3 0/0:20,0:20:0:.:.:0,0,577 0/1:0,1:1:11:0|1:13388692_G_T:81,0,11 1/1:0,10:10:33:1|1:13388724_G_A:495,33,0 ./.:0,0:0 1/1:1,19:20:27:1|1:13388742_A_G:979,27,0 1/1:0,20:20:35:1|1:13388724_G_A:944,35,0 1/1:0,1:1:6:.:.:90,6,0 1/1:0,7:7:24:1|1:13388724_G_A:360,24,0 0/1:0,2:2:30:0|1:13388692_G_T:165,0,30 1/1:0,5:5:15:1|1:13388742_A_G:225,15,0 1/1:0,20:20:60:1|1:13388724_G_A:900,60,0 1/1:0,8:8:30:1|1:13388724_G_A:450,30,0 1/1:0,15:15:48:.:.:720,48,0 0/1:3,28:31:42:0|1:13388742_A_G:1167,0,42 ./.:3,0:3 1/1:0,1:1:3:1|1:13388687_C_T:45,3,0 1/1:0,2:2:12:1|1:13388692_G_T:180,12,0 ./.:4,0:4 ./.:6,0:6 1/1:0,6:6:18:1|1:13388724_G_A:270,18,0 ./.:4,0:4 1/1:0,14:14:45:1|1:13388724_G_A:675,45,0 1/1:0,3:3:9:1|1:13388692_G_T:135,9,0 0/0:21,0:21:0:.:.:0,0,533 1/1:0,14:14:45:1|1:13388724_G_A:655,45,0

gVCF: scaffold_1 13388740 . C <NON_REF> . . END=13388741 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,162 scaffold_1 13388742 . A G,<NON_REF> 31.82 . DP=0;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;RAW_MQ=0.00 GT:GQ:PGT:PID:PL:SB 1/1:3:0|1:13388738_T_C:45,3,0,45,3,45:0,0,0,0 scaffold_1 13388743 . A <NON_REF> . . END=13388743 GT:DP:GQ:MIN_DP:PL 0/0:5:15:5:0,15,214

Screenshot of a BAM:

Created 2016-04-15 02:50:00 | Updated | Tags: haplotypecaller

Comments (3)

I got this error when running HaplotypeCaller:

ERROR MESSAGE: The requested extended must fully contain the requested span

My command is: java -Xmx4g -jar ~/GATK_3.2/GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I Sample_55348_recal.bam --dbsnp dbsnp_138.hg19.vcf -L capture_targets_buffered10bases.bed --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o Sample_55348_rawcall_HCgvcf.vcf

What does the error mean?

Created 2016-04-11 09:55:10 | Updated | Tags: haplotypecaller createsequencedictionary

Comments (3)

Hello, I am using HaplotypeCaller (GATK v3.5) with an input BAM file which has a header line like this (just a fake example):

@SQ SN:chr1 LN:100000 SP:Arabis thal AS:2 M5:8668a646eada2f4 UR:file:refgenome_Atha_v2.fa

But the output VCF only has a subset of this information:

##contig=<ID=chr1,length=100000> ##reference=file:///home/me/tmp/refgenome_Atha_v2.fa

Is there a way to obtain something like this instead? (i.e. also indicate species, assembly and MD5 sum)

##contig=<ID=chr1,length=100000,assembly=2,md5=8668a646eada2f4,species="Arabis thal">

The information in the BAM file initially comes from a "dict" file generated by Picard CreateSequenceDictionary. So I tried to feed this "dict" file with the VCF file to Picard UpdateVcfSequenceDictionary, but it didn't give me species nor mD5 sum:


Thank you in advance, Tim

Created 2016-04-07 16:31:11 | Updated | Tags: haplotypecaller dp vcf-format genotypegvcfs

Comments (3)


I've run into the (already reported http://gatkforums.broadinstitute.org/dsde/discussion/5598/missing-depth-dp-after-haplotypecaller ) bug of the missing DP format field in my callings.

I've run the following (relevant) commands:

Haplotype Caller -> Generate GVCF:

    java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
       -T HaplotypeCaller \
       -R ${ref} \
       -I ${NEWTMPDIR}/${prefix}.realigned.fixed.recal.bam \
       -L ${reg} \
       -ERC GVCF \
       -nct ${nct} \
       --genotyping_mode DISCOVERY \
       -stand_emit_conf 10 \
       -stand_call_conf 30  \
       -o ${prefix}.raw_variants.annotated.g.vcf \
       -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage

That generates GVCF files that DO HAVE the DP field for all reference positions, but DO NOT HAVE the DP format field for any called variant (but still keep the DP in the INFO field):

18      11255   .       T       <NON_REF>       .       .       END=11256       GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
18      11257   .       C       G,<NON_REF>     229.77  .       BaseQRankSum=1.999;DP=20;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-1.377;ReadPosRankSum=0.489      GT:AD:GQ:PL:SB  0/1:10,8,0:99:258,0,308,288
18      11258   .       G       <NON_REF>       .       .       END=11260       GT:DP:GQ:MIN_DP:PL      0/0:17:48:16:0,48,530

Later, I ran Genotype GVCF joining all the samples with the following command:

java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R ${ref} \
   -L ${pos} \
   -o ${prefix}.raw_variants.annotated.vcf \
   --variant ${variant} [...]

This generated vcf files where the DP field is present in the format description, it IS present in the Homozygous REF samples, but IS MISSING in any Heterozygous or HomoALT samples.

22  17280388    .   T   C   18459.8 PASS    AC=34;AF=0.340;AN=100;BaseQRankSum=-2.179e+00;DP=1593;FS=2.526;InbreedingCoeff=0.0196;MLEAC=34;MLEAF=0.340;MQ=60.00;MQRankSum=0.196;QD=19.76;ReadPosRankSum=-9.400e-02;SOR=0.523    GT:AD:DP:GQ:PL  0/0:29,0:29:81:0,81,1118    0/1:20,22:.:99:688,0,682    1/1:0,27:.:81:1018,81,0 0/0:22,0:22:60:0,60,869 0/1:20,10:.:99:286,0,664    0/1:11,17:.:99:532,0,330    0/1:14,14:.:99:431,0,458    0/0:28,0:28:81:0,81,1092    0/0:35,0:35:99:0,99,1326    0/1:14,20:.:99:631,0,453    0/1:13,16:.:99:511,0,423    0/1:38,29:.:99:845,0,1231   0/1:20,10:.:99:282,0,671    0/0:22,0:22:63:0,63,837 0/1:8,15:.:99:497,0,248 0/0:32,0:32:90:0,90,1350    0/1:12,12:.:99:378,0,391    0/1:14,26:.:99:865,0,433    0/0:37,0:37:99:0,105,1406   0/0:44,0:44:99:0,120,1800   0/0:24,0:24:72:0,72,877 0/0:30,0:30:84:0,84,1250    0/0:31,0:31:90:0,90,1350    0/1:15,25:.:99:827,0,462    0/0:35,0:35:99:0,99,1445    0/0:29,0:29:72:0,72,1089    1/1:0,32:.:96:1164,96,0 0/0:21,0:21:63:0,63,809 0/1:21,15:.:99:450,0,718    1/1:0,40:.:99:1539,120,0    0/0:20,0:20:60:0,60,765 0/1:11,9:.:99:293,0,381 1/1:0,35:.:99:1306,105,0    0/1:18,14:.:99:428,0,606    0/0:32,0:32:90:0,90,1158    0/1:24,22:.:99:652,0,816    0/0:20,0:20:60:0,60,740 1/1:0,30:.:90:1120,90,0 0/1:15,13:.:99:415,0,501    0/0:31,0:31:90:0,90,1350    0/1:15,18:.:99:570,0,480    0/1:22,13:.:99:384,0,742    0/1:19,11:.:99:318,0,632    0/0:28,0:28:75:0,75,1125    0/0:20,0:20:60:0,60,785 1/1:0,27:.:81:1030,81,0 0/0:30,0:30:90:0,90,1108    0/1:16,16:.:99:479,0,493    0/1:14,22:.:99:745,0,439    0/0:31,0:31:90:0,90,1252
22  17280822    .   G   A   5491.56 PASS    AC=8;AF=0.080;AN=100;BaseQRankSum=1.21;DP=1651;FS=0.000;InbreedingCoeff=-0.0870;MLEAC=8;MLEAF=0.080;MQ=60.00;MQRankSum=0.453;QD=17.89;ReadPosRankSum=-1.380e-01;SOR=0.695   GT:AD:DP:GQ:PL  0/0:27,0:27:72:0,72,1080    0/0:34,0:34:90:0,90,1350    0/1:15,16:.:99:528,0,491    0/0:27,0:27:60:0,60,900 0/1:15,22:.:99:699,0,453    0/0:32,0:32:90:0,90,1350    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:87:0,87,1305    0/0:40,0:40:99:0,108,1620   0/1:20,9:.:99:258,0,652 0/0:26,0:26:72:0,72,954 0/1:16,29:.:99:943,0,476    0/0:27,0:27:69:0,69,1035    0/0:19,0:19:48:0,48,720 0/0:32,0:32:81:0,81,1215    0/0:36,0:36:99:0,99,1435    0/0:34,0:34:99:0,99,1299    0/0:35,0:35:99:0,102,1339   0/0:38,0:38:99:0,102,1520   0/0:36,0:36:99:0,99,1476    0/0:31,0:31:81:0,81,1215    0/0:31,0:31:75:0,75,1125    0/0:35,0:35:99:0,99,1485    0/0:37,0:37:99:0,99,1485    0/0:35,0:35:90:0,90,1350    0/0:20,0:20:28:0,28,708 0/1:16,22:.:99:733,0,474    0/0:32,0:32:90:0,90,1350    0/0:35,0:35:99:0,99,1467    0/1:27,36:.:99:1169,0,831   0/0:28,0:28:75:0,75,1125    0/0:36,0:36:81:0,81,1215    0/0:35,0:35:90:0,90,1350    0/0:28,0:28:72:0,72,1080    0/0:31,0:31:81:0,81,1215    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:84:0,84,1260    0/0:39,0:39:99:0,101,1575   0/0:37,0:37:96:0,96,1440    0/0:34,0:34:99:0,99,1269    0/0:30,0:30:81:0,81,1215    0/0:36,0:36:99:0,99,1485    0/1:17,17:.:99:567,0,530    0/0:26,0:26:72:0,72,1008    0/0:18,0:18:45:0,45,675 0/0:33,0:33:84:0,84,1260    0/0:25,0:25:61:0,61,877 0/1:9,21:.:99:706,0,243 0/0:35,0:35:81:0,81,1215    0/0:35,0:35:99:0,99,1485

I've just discovered this issue, and I need to run an analysis trying on the differential depth of coverage in different regions, and if there is a DP bias between called/not-called samples.

I have thousands of files and I've spent almost 1 year generating all these callings, so redoing the callings is not an option.

What would be the best/fastest strategy to either fix my final vcfs with the DP data present in all intermediate gvcf files (preferably) or, at least, extracting this data for all snps and samples?

Thanks in advance,


PS: Recalling the individual samples from bamfiles is not an option. Fixing the individual gvcfs and redoing the joint GenotypeGVCFs could be.

Created 2016-04-07 01:28:02 | Updated | Tags: haplotypecaller haploid bug runtime-error pooled-sequencing-haplotypecaller

Comments (9)

I was just running haplotype caller on a pool, and it gave an error message which (if I understand the options I gave it correctly) should not occur. My command was:

nice -n 5 java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -o Vars.vcf --forceActive -ploidy 18 -I input.bam --max_alternate_alleles 2 --genotyping_mode DISCOVERY

and the error stack trace I got was:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: the combination of ploidy (18) and number of alleles (17) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypeLikelihoodCalculator.(GenotypeLikelihoodCalculator.java:214) at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypeLikelihoodCalculators.getInstance(GenotypeLikelihoodCalculators.java:327) at org.broadinstitute.gatk.tools.walkers.genotyper.InfiniteRandomMatingPopulationModel.getLikelihoodsCalculator(InfiniteRandomMatingPopulationModel.java:145) at org.broadinstitute.gatk.tools.walkers.genotyper.InfiniteRandomMatingPopulationModel.singleSampleLikelihoods(InfiniteRandomMatingPopulationModel.java:137) at org.broadinstitute.gatk.tools.walkers.genotyper.InfiniteRandomMatingPopulationModel.calculateLikelihoods(InfiniteRandomMatingPopulationModel.java:115) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.calculateGLsForThisEvent(HaplotypeCallerGenotypingEngine.java:695) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:269) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:924) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:228) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: the combination of ploidy (18) and number of alleles (17) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus
ERROR ------------------------------------------------------------------------------------------

If I understand the options correctly, I told GATK to only use two possible alternates, so I'm not sure why this error message is showing up.

Incidentally, earlier I got an error message saying GATK had run out of memory with a very similar command. If necessary, I can send you my input file and reference genome. (I'm working with pooled chloroplast data for this file).

Created 2016-04-06 21:06:40 | Updated | Tags: unifiedgenotyper haplotypecaller rnaseq splitncigarreads

Comments (1)

Hi, I work on plant species. I am using GATK on variant discovery in RNAseq data. I am not able to decide whether I should use option --filter_reads_with_N_cigar or
-U ALLOW_N_CIGAR_READS. What do you suggest?

I Would like to count the number of reads affected by N's in the CIGAR? Could you please suggest any tool?

Secondly, I am using Haplotype Caller for variant discovery. It is running very slow (on 12 CPU).
Is it okay to use UnifiedGenotyper instead of Haplotype Caller on RNAseq data?


Created 2016-04-06 19:37:38 | Updated | Tags: haplotypecaller indels snps

Comments (4)


I typed the following command for finding snps and indels: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R chr21/chr21.fa -I chr21/alignments/human38chr21.sorted.bam -o chr21/humanoutput.raw.snps.indels.vcf

I even got the vcf file but it contains only the header and does not contain the data lines.What maybe wrong? I hae attached the file containing the stack trace.

Created 2016-04-06 18:29:56 | Updated | Tags: haplotypecaller pairhmm avx

Comments (1)


I want to switch back to Java LOGLESS_CACHING implementation for PairHMM instead of AVX, how can I make this? I think that I may need to change some argument but I don't know where to start.

Thanks, Jay

Created 2016-04-06 09:42:21 | Updated | Tags: haplotypecaller queue

Comments (3)

Hi!!! I'm attempting to use Queue on PBSPro HPC cluster. I have tested the functionality of a custom scala script for Haplotype Caller and it is runnable. However, following the discussion on GATK forum, I should need a job scheduler to dispatch queue output on several nodes..could you give me some advice or examples of the type of scheduler I need in a PBSpro system? I tried to run Queue on a single node and it seems working faster..the question is: when I run Queue on a single node I actually 'multithreading' HC or I'm wrong? Thanks a lot Best Regards Marco

Created 2016-04-05 19:39:33 | Updated | Tags: haplotypecaller createsequencedictionary

Comments (3)

Hi All,

I am running HaplotypeCaller and getting the error:

ERROR MESSAGE: Fasta dict file /net/rcnfs02/srv/export/duraisingh_lab/share_root/data/Plasmodium_knowlesi/jva/PlasmoDB-26_PknowlesiH_Genome_02.dict for reference /net/rcnfs02/srv/export/duraisingh_lab/share_root/data/Plasmodium_knowlesi/jva/PlasmoDB-26_PknowlesiH_Genome_02.fasta does not exist

BUT... the dictionary DOES exist! I made it with CreateSequenceDictionary.jar and it looks OK.

The reference dict and fasta are symbolically linked to the working directory. I did some googling on this but no luck.



Created 2016-03-20 21:47:36 | Updated | Tags: haplotypecaller

Comments (2)

Hello, we have a GATK automatic pipeline all set up. We noticed a run that looked finished but then noticed that some SNP/InDel calls were missing. We were able to see that Haplotype Caller died unexpectedly. It's progress meter only said it had finished a little more than 16%. The pipeline continued to process on this incomplete vcf file.

Am I correct in assuming that the only way we'd ever be able automatically note this error is if we monitor the progress meter of Haplotype caller because the vcf is being written as we go? Is there any way to write the vcf all at once at the end? This doesn't seem like a great idea anyway because of resource issues, but I am just curious. We cannot depend solely on the existence of an output vcf file unless this is the last step in the process. Is there any phrase we can monitor for at the end of the Haplotype caller that we can be sure won't appear elsewhere in the file? Like, "Done", "Finished", etc? Thank you!

Created 2016-03-20 19:55:44 | Updated | Tags: unifiedgenotyper haplotypecaller algorithm

Comments (1)

Under every poster of GATK asking which is better, HC or UG, Geraldine always said HC.

So Is there any documents talking about the detail algorithms HC and UG are using, so that I can get a clear idea why HC is better?


Created 2016-03-19 17:20:42 | Updated | Tags: haplotypecaller dp m gvcf mq genotypegvcf

Comments (27)

Hello! I had a question about the difference between using HaplotypeCaller's --emitRefConfidence GVCF vs BP_RESOLUTION. Maybe the answer is obvious or in the forum somewhere already but I couldn't spot it...

First, some context: I'm working with GATK v. 3.5.0 in a haploid organism. I have 34 samples, from which 5 are very similar to the reference (they are backcrosses) while the rest are strains from a wild population. Originally I used --emitRefConfidence GVCF followed by GenotypeGVCF. While checking the output VCF file, I realized that the five backcrosses had a much lower DP in average than the other samples (but this doesn't make sense due to difference in reads numbers or anything like that, since they were run in the same lane, etc). I assume this happened because there are long tracks without any variant compare to the reference in those samples, and the GVCF blocks end up assigning a lower depth for a great amount of sites in those samples compare to the much more polymorphic ones. In any case, I figured I could just get all sites using BP_RESOLUTION so to obtain the "true" DP values per site. However, when I tried to do that, the resulting VCF file had very low MQ values! Can you explain why this happened?

This is the original file with --emitRefConfidence GVCF:

$ bcftools view -H 34snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   309.4   .   AC=4;AF=0.235;AN=17;DP=582;FS=0;MLEAC=4;MLEAF=0.235;MQ=40;QD=34.24;SOR=2.303
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=603;FS=0;MLEAC=2;MLEAF=0.065;MQ=44.44;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-0.762;ClippingRankSum=0.762;DP=660;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=29.53;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

And this is with --emitRefConfidence BP_RESOLUTION:

$ bcftools view -H 34allgenome_snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   307.28  .   AC=4;AF=0.211;AN=19;DP=602;FS=0;MLEAC=4;MLEAF=0.211;MQ=8.23;QD=34.24;SOR=2.204
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=750;FS=0;MLEAC=2;MLEAF=0.065;MQ=5.53;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-1.179;ClippingRankSum=0.762;DP=796;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=4.8;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

I find it particularly strange since the mapping quality of the backcrosses should in fact be slightly better in average (around 59 for the original BAM file) than the other more polymorphic samples (around 58)...

Thank you very much!

Created 2016-03-17 14:42:53 | Updated | Tags: unifiedgenotyper haplotypecaller multi-sample gvcf

Comments (6)

Hi, I'm using GATK ver 3.4 for SNP calling and I have some question about it. My data set has 500 samples, and I used genome data as reference for bowtie/GATK

1) I called SNP by sample (gvcf) with haplotype and then combined gvcf, however, the combination takes a long time, the GATK wants to recreate gvcf.idx files (4 of my gatk mission stuck at this step), one gatk combination finished after about 20 days calculation. I also try to use '-nct' to improve this, but it still stuck at preparing idx files.

2) For that finished gatk combination data set, I also used Unifiedgenotype with Gr.sorted.bam as input to call SNPs. The result is output with Gr.sorted.bam has 5 times more SNPs number than gvcf combination, and most missing SNPs could be found in individual gvcf files but missing in final result.

Could you help me with these? Thank you!

Created 2016-03-16 15:36:32 | Updated | Tags: haplotypecaller multi-mapper

Comments (2)

Dear GATK team,

I've been trying to use GATK to call SNPs from RNA-Seq data mapped to a transcriptome assembly. I used Bowtie2 for the read mapping. I apologize if the information is already posted, but it seemed hard to find out about this information, so I hoped to get some advice or pointed to the right place - How does the HaplotypeCaller handle reads mapped to multiple places? I used paired-end reads for read mapping.

Thank you very much for any feedback you might have.



Created 2016-03-16 00:20:27 | Updated | Tags: haplotypecaller snp haploid chloroplast pooled-sequencing-haplotypecaller

Comments (1)

I'm trying to call whole-chloroplast genome haplotypes for a pooled chloroplast-captured DNA sample from a non-model organism (no well-established variants). The reads are Illumina 100 bp PE reads, and have already undergone some clipping (adapter-trimming and quality control) and have been aligned to a reference genome. The pool represents 20 individuals. I want to know if there is a way in GATK to call the frequency of whole-genome haplotypes (or else, is there a way currently in existence elsewhere? ) If necessary, I can generate a panel of known haplotypes.

Currently, I have been using HaplotypeCaller to call SNPs and then filtering those by hand in Excel. I have already tried increasing the maximum active region size to larger than the whole reference genome (~150,000 bp), with a corresponding increase in the max reads per sample value, but this doesn't seem to have come up with whole-genome haplotypes.

Created 2016-03-15 13:20:23 | Updated | Tags: haplotypecaller

Comments (4)

I am using HaplotypeCaller for calling the variants for 120 gene based target sequence. For gene PMS2, there is a variant with coverage 21 (in that position) Allele fraction of the alternate allele is 5 reads ( 24%). The mapping quality of the reads are mostly 0 , I geuss because of the very similar pseudogene PMS2CL (mapping done with hg19 using BWA-MEM). This variant is not getting called by haplotypecaller, but is actually a true variant (found with sanger sequencing). I compared the BAM file with bamout file, both are similar. I also tried mapping the sequence with custom reference sequence (target region based). when I used that bam file for variant calling, It called that specific variant (though it also increased the coverage depth and increased the number of variants many fold, which are false positives).

I wonder what can be the possible explanation to this. what is the cutoff criteria, which haplotypecaller is using in this case? why the variant is not getting called at first place?

Here is the link to screenshot of a PMS2 variant with coverage 21 (also atached as file) https://drive.google.com/file/d/0Bwibh75M75p_bGJrNlpyRTVSNHVZRDMzUFB0UDFOV2gyM2Rj/view?usp=sharing

Created 2016-03-15 09:50:46 | Updated 2016-03-15 09:56:32 | Tags: haplotypecaller genotype likelihoods phred

Comments (5)

Dear all,

I have a question regarding the use of the SHAPEIT genotype calling by consecutive use of GATK, BEAGLE, and SHAPEIT. I have used the GATK haplotype caller, giving rise to an output of .vcf files with a ‘PL’ (normalised phred-scaled likelihood) column, and no ‘GL’ (genotype log10 likelihood) column. BEAGLE can handle PL as input, and the initial genotype calling works fine.

The next step would be to use SHAPEIT for phasing, as described here: [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gcall]

There is a C++ script described on this page, called prepareGenFromBeagle4, which will convert the BEAGLE .vcf files to the correct SHAPEIT input. However, this script looks for a ‘GL’ column in the .vcf file, which is unavailable since GATK only outputs ‘PL’. The script therefore crashes and I cannot proceed to the phasing step.

I have considered and tested a simple conversion script from PL to GL, where GL = PL/-10 However, since the PL is normalised (genotype with highest likelihood is set to 0), there is some loss of information here.

For example for a given genotype AA/AB/BB, if the original GLs (these are not in the GATK output but say they were) were -0.1 / -1 / -10 The corresponding PLs should be 1 / 10 / 100 And the normalised PLs (GATK output) would be 0 / 10 / 100 Giving rise to these converted GLs after the simple conversion 0 / -1 / -10

The converted GLs are used to then calculate genotype probabilities required for the SHAPEIT input. The issue that I have, is that all three genotype probabilities in the SHAPEIT input need to add up to 1.00 in total. The GL values are somehow scaled to calculate the genotype probabilities. Therefore, the final genotype probabilities in the SHAPEIT input file would turn out differently if I used the 'original' GL values (which I do not have) in comparison to the converted GL values. I am afraid this will introduce bias into the genotype probabilities used by SHAPEIT.

Apologies for the long post, I would be grateful hearing your thoughts on this issue. Have you used GATK and SHAPEIT consecutively before and run into this problem? Is there a reason to be weary of the potential bias here?

Many thanks in advance, Annique Claringbould

Created 2016-03-11 09:57:18 | Updated | Tags: haplotypecaller gvcf

Comments (1)

Hello, It seems that running HaplotypeCaller with -ERC GVCF and then running GenotypeGVCFs -stand_call_conf 30 -stand_emit_conf 30 gives a different vcf than HaplotypeCaller -stand_call_conf 30 -stand_emit_conf 30 (on a single sample). Is that expected? I tried versions 3.2 and 3.5.

Tag along question: I've tried to be a good citizen and post questions on existing topics in this forum (instead of starting yet a duplicate thread) but these never get answered. Is it always better to post a new question?

Thanks for your help.

Created 2016-03-07 12:18:15 | Updated 2016-03-07 12:20:27 | Tags: haplotypecaller baserecalibration joint-discovery cohort-analysis non-model-organism

Comments (3)


In case we do not have a database of known sites for a non model organism, what is the best strategy to do the base recalibration and variant calling with joint genotyping when we have several samples ?

According to the GATK best practices, when we do not have a data base of known sites, it is recommended to run HaplotypeCaller and hard filter the SNPs then use the resultings SNPs for the base recalibration until convergence. As I have many samples (31), I have to do this on all the samples. However, as, after the base recalibration, I want to do the final snp calling with joint genotyping, I think this approach is going to be really time consuming...

So, I'm wondering whether I can execute an initial snp calling (with HaplotypeCaller) on each sample, then joint genotyping them to obtain a global snp data that I will manually filter. This joint and filtered snp database could then be used for the base recalibration and a second joint genotyping. I think that it will be faster this way. However, I'm not sure whether there are caveats in this approach and I would like to have your advices.

Thank you very much in advance.

Regards, Noppol.

Created 2016-02-29 22:12:25 | Updated | Tags: haplotypecaller mutecht2

Comments (3)

Hi All.

I have been using #Haplotyplecaller for SNP calling for my organism of study. Recently, I came across #MUTECHT2 tool by GATK which states that it works better for calling spmatic mutations. But my doubt is does it work for Non-tumour samples, If YES then would it be beneficial to switch to MUTECHT2 from Haplotypecaller? @Geraldine_VdAuwera @Sheila

Best, Jigar

Created 2016-02-27 23:55:23 | Updated | Tags: haplotypecaller gatkdocs variantfiltration vcf developer gatk variant-calling

Comments (2)

In a discussion about using ERC, you provide some example VCF output lines like:

20  10000442   .   T  <NON_REF>  .   .   .   GT:AD:CD:DP:GQ:PL  0/0:56,0:56:56:99:0,168,2095
20  10000443   .   A  <NON_REF>  .   .   .   GT:AD:CD:DP:GQ:PL  0/0:56,0:56:56:99:0,169,2089
20  10000444   .   A  <NON_REF>  .   .   .   GT:AD:CD:DP:GQ:PL  0/0:56,0:56:56:99:0,168,2093

and say:

  1. "For each reference base not part of a variant call we emit a VCF record with the special symbolic allele <NON_REF> indicating the call is between the reference base and any possible non-reference allele that might be segregating at this site,"
  2. and
  3. "Note that there's no site-level QUAL field value. We discussed this internally and since the QUAL is the probability that the site is polymorphic, all of the QUAL field values should be 0 here, so we decided to drop it."

While the formal Variant Call Format 4.2 Specification says:

  1. "ALT - alternate base(s): Comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on breakends,"
  2. and
  3. "QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. −10log10 prob(call in ALT is wrong). If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10 prob(no variant). If unknown, the missing value should be specified."

The issue is subtle, but introduces problems with downstream processing of HaplotypeCaller generated VCF files containing reference calls. The use of the symbol "<NON_REF>" instead of "." for reference calls is a little confusing, but I also see the logic of that. More seriously: According to the VCFv4.2 specs, QUAL is NOT always a measure of "the probability that the site is polymorphic". Perhaps when a variant is called, but not when a site is called as non-variant. All those QUAL field values should NOT be 0 there. It is about the quality and correctness of the call itself. It is defined as the PHRED score of the probability that the assertion made in ALT is wrong. So if ALT asserts "<NON_REF>" or "." CONFIDENTLY (meaning a low probability that the assertion is wrong), then the QUAL PHRED score should be HIGH, not ZERO. A confident call should have a high QUAL score, whether variant or monomorphic. That is the intention of the specification. If otherwise: filtering out low quality records from a non-compliant VCF output file by removing those with low QUAL scores, for instance, would also filter out all the high quality reference calls.

I have previously been in touch with the writers/keepers of the VCF Specs (now over at samtools) and they confirm this interpretation of QUAL scores as the correct one for VCFv4.2 (and other versions) compliant output. It's unfortunately couched in mathematical double negatives, but look at it carefully and you will see the correctness of this reading. As one who uses tools from many sources, I hope you will address this discrepancy. It complicates interoperability.

-- Fred P.

Created 2016-02-26 19:15:45 | Updated | Tags: haplotypecaller heterozygosity

Comments (3)

Hi, I was wondering what is the min percent of variant reads GATK uses by default to make a heterozygous call and how to change this value if need be. Thanks!

Created 2016-02-25 20:42:25 | Updated | Tags: haplotypecaller developer variant-calling

Comments (3)

Regarding the observation elsewhere: "HaplotypeCaller is slower when restricted to intervals": My test-bed is a 48 core, 100GB RAM CentOS 7 system with GenomeAnalysisTK-3.5 on it. If i submit the command:

java -Xmx16g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -I chr12.bam -R hg19.fasta -o out.vcf -L chr12:40600000-40800000 -nct 36

I can see that the system is indeed using all 36 cores through completion.

But if I choose a small subset of ranges such as

, put them in the file test.list, then the command

java -Xmx16g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -I chr12.bam -R hg19.fasta -o out.vcf -L test.list -nct 36

executes, but quickly drops to one core usage for almost the entire run, taking an unexpectedly long time to complete. This does not seem like the efficiency gain anticipated. I have tested this with other interval file formats, but the problem persists and is easily reproducible. This is indeed unfortunate behavior. Examining just the exons of a region should take less time than examining the whole region, not more.

Is there any hope that this will ever operate as expected, or is the problem too deeply embedded in the threading algorithm? Some explanation would be useful. I am doing everything I can to speed up the GATK best practices methods to use on whole exome datasets, but things like this are frustrating.

-- Fred P.

Created 2016-02-25 16:23:22 | Updated | Tags: haplotypecaller

Comments (1)

Analyzing the same sample with and without queue, I noticed a variant being filtered out in one of the runs with VQSRTrancheSNP99.00to99.90 in the filter column.

In my debugging of the problem, I noticed that the size of the region in HaplotypeCaller can influence both BaseQRankSum and ReadPosRankSum greatly in the g.vcf file.

commands: 1) java -Xmx8g -Djava.io.tmpdir=tmp -jar /com/extra/GATK/3.5/jar-bin/GenomeAnalysisTK.jar -T HaplotypeCaller -I BDD.sorted.markdup.realigned.recal.bam -R ucsc.hg19_chrY_PAR1_PAR2_masked.fasta -L chr5:171333106-177333146 --genotyping_mode DISCOVERY --dbsnp dbsnp_138.hg19.vcf -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o BDD.sorted.markdup.realigned.recal.HaplotypeCaller_gVCF_chr5.vcf.gz

2) java -Xmx8g -Djava.io.tmpdir=tmp -jar /com/extra/GATK/3.5/jar-bin/GenomeAnalysisTK.jar -T HaplotypeCaller -I BDD.sorted.markdup.realigned.recal.bam -R ucsc.hg19_chrY_PAR1_PAR2_masked.fasta -L chr5:175333106-177333146 --genotyping_mode DISCOVERY --dbsnp dbsnp_138.hg19.vcf -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o BDD.sorted.markdup.realigned.recal.HaplotypeCaller_gVCF_chr5.vcf.gz

The results for the SNP in question in the g.vcf file: 1) chr5 176333126 rs2292256 C T, 5817.77 . BaseQRankSum=0.389;ClippingRankSum=2.280;DB;DP=314;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-1.360;RAW_MQ=1130400.00;ReadPosRankSum=-1.733 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:154,160,0:314:99:0|1:176333126_C_T:5846,0,7455,6310,7937,14246:53,101,56,104

2) chr5 176333126 rs2292256 C T, 5817.77 . BaseQRankSum=-0.254;ClippingRankSum=0.132;DB;DP=314;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-1.278;RAW_MQ=1130400.00;ReadPosRankSum=-1.679 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:154,160,0:314:99:0|1:176333126_C_T:5846,0,7455,6310,7937,14246:53,101,56,104

This is probably the cause of the SNP being filtered in one run (no-queue) and not the other (queue). This leaves me with the question of which is most correct.

But why are these values different?

Created 2016-02-25 15:49:57 | Updated | Tags: haplotypecaller intervals wgs

Comments (2)

I'm currently running my first real use of GATK. I was worried about running HaplotypeCaller on whole geneomes given some of the reports I've seen on these forums about how long it can take to run. In contrast, I was pleasantly surprised with the current GATK it is proceeding well (~7 day estimate on dog wgs). But it seems it could be much faster if I divided it up by chromosome with the -L flag.

I see that the advice is to not use the -L flag for whole genome analysis [1]. But the wording in that link seems open: it is not necessary, but if it would help efficiency it might be worthwhile.

I've found a related question on the forums here [2], but it seems the descrepancy discussed in that thread is suspected to be due to downsampling and not actually the result of a chromosome-by-chromosome use of HaplotypeCaller.

Again, I'm content with a ~7 day run time in order to take proper care of our data. I wouldn't want to sacrifice power or accuracy for a shorter runtime, but if there is really no trade-off, a chromosomal approach would be even better. So I'm curious if there is a downside to partitioning the HaplotypeCaller step by chromosome?

[1] http://gatkforums.broadinstitute.org/gatk/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals [2] http://gatkforums.broadinstitute.org/gatk/discussion/5008/haplotypecaller-on-whole-genome-or-chromosome-by-chromosome-different-results

Created 2016-02-22 01:40:44 | Updated | Tags: haplotypecaller index

Comments (2)

Hi everyone,

I got the following warning when running the HaplotypeCaller.

BAM index file M11.bai is older than BAM M11.bam

Although the warning presents in the log, it also gives me a gvcf. Does this warning effect the accuracy of the result? I hope this warning will be fine for me.

I declare those bam files were produced on another server.

The command java -Xmx50g -Djava.io.tmpdir=/data/tmp -jar /bin/GATK3.5/GenomeAnalysisTK.jar \ -R ref.fa \ -T HaplotypeCaller -nct 8 \ -I ./$d'.bam' \ -o $d'.g.vcf' \ --genotyping_mode DISCOVERY \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -ERC GVCF done

What is more, I have another unsure on the new version highlights.

When I run above command with parameters -variant_index_type LINEAR -variant_index_parameter 128000, Do they will come to the same results?

Many thanks!

Created 2016-02-21 21:29:00 | Updated | Tags: haplotypecaller whole-exome

Comments (2)

Hi, This may not directly relate to how GATK works, but I am still asking to get your expert input (I do use GATK for this pipeline).

I am trying to test my variant calling pipeline that I have prepared for my target region of 5Mb for which I need a test dataset where the fastq files and the VCF files are provided so that I can run my pipeline on those fastqs and then compare my VCF to those VCFs. So, I used the two whole exome datasets from Genome In a Bottle data with the following steps:

  1. In order to obtain the fastqs that originated from my target region, I used the bam files published by GIAB and extracted the alignments falling in my target region using samtools

  2. Converted the extracted bams in those regions to paired-end fastqs

  3. Aligned those fastqs to the entire genome using BWA

  4. Called the variants in my region of interest using GATK (with -L option, restricting my analysis to my target region for RealignerTargetCreator and HaplotypeCaller steps only). I don't do BQSR and VQSR.

  5. Compared my variants to the variants from GIAB in the target region.

But I only get 50% variant calls correctly. I call only 50% of the total variants in that region. What could be the reason for this low agreement rate?

Interestingly, when I call the variants on the whole exome for which the fastqs were originally created, I can obtain 96% of variants in the whole-exome region published by GIAB. Moreover, from those variants, when I extract the variants that fall in my target regions and compare it to the corresponding GIAB variants, I can go upto 97%. In other words, when I use the entire whole-exome data, I can call 97% of the variants in my target region, but when I start with the alignments falling in my target region, convert to fastq and then call variants in the target region, I only get 50%. I use the exact same settings of GATK in both cases (except the -L region which is different).

Can you please help me figure out what could be going wrong?



Created 2016-02-18 16:26:07 | Updated 2016-02-18 16:26:53 | Tags: duplicatereadfilter haplotypecaller

Comments (3)

Hi, I am using GATK HC to identify variants in a target region of about 23kb with very deep sequencing. I get this message during HC that 99.76% of reads failing DuplicateReadFilter. This means that a lot of my reads are being thrown out and hence I am not getting correct variant calls.

First, is GATK HC an appropriate tool to call variants in such a small region with deep sequencing (more than 100X)? Second, how can I rectify this error of 99% reads failing DuplicateReadFilter?

Thanks, Nitin

Created 2016-02-11 22:33:35 | Updated | Tags: haplotypecaller bug gatk heterozygous

Comments (4)

Hello! I'm pretty new to this so pardon me if this is formatted incorrectly or I'm missing something obvious. It looks like I have the same problem as: http://gatkforums.broadinstitute.org/gatk/discussion/2319/haplotypecaller-incorrectly-making-heterozygous-calls-again but their problems seemed to be fixed simply by updating, and I'm using the current version of GATK (3.5.0). Any idea what I can do to fix this?

When processing this, I mostly used the GATK Best Practices. However, I did use joint calling instead of the mixed method, since I figured with only 93 samples, for a handful of genes, it wouldn't benefit too heavily from more efficient computation, and plugging the whole BAM in at once seemed easier.

Here's a shot from IGV of a portion of one of my samples. The circled SNPs are the ones in question. For all 3, all of the reads in the BAM are for the alternate, but the VCF produced by HaplotypeCaller has them as heterozygous (and passing filters). It seems to be pretty consistent in which SNPs it messes up. On the other hand, some samples are fine, and some are not, but I don't see a pattern as to why some work and some don't. I'd be happy to provide any other information that would help resolve this.

Thank you for any help you can provide, Jessica Patnode

Created 2016-02-11 09:04:07 | Updated | Tags: haplotypecaller pooled-calls

Comments (1)

Dear GATK team,

We want to run RNAseq SNP analysis on Arabidopsis. Can we build each pool from 6 different plants ( genetically similar individuals ), only 3 will be taken from leaves and 3 will be taken from whole sprout.

Thanks Ronit

Created 2016-02-09 13:06:54 | Updated | Tags: haplotypecaller best-practices rna-editing mutec2

Comments (2)

Dear GATK staff

I would like to use the GATK tool for the detection of possible RNA editing events. I followed the RNA-seq best practice up to the variant calling step itself. There I hesitate to use the haplotype caller because I would not assume that the editing sites follow any kind of allelic ratio. Therefore I wanted to ask if it might be better to use MuTec2 at this stage? I would call it like ...

java -jar GenomeAnalysisTK.jar -T MuTect2 -R reference.fasta -I:tumor normal1.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 --dbsnp dbSNP.vcf --artifact_detection_mode -o output.normal1.vcf
java -jar GenomeAnalysisTK.jar -T CombineVariants -R reference.fasta -V output.normal1.vcf -V output.normal2.vcf -minN 2 --setKey "null" --filteredAreUncalled --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED -o MuTect2_PON.vcf

Can you comment if this is a suitable modification of the best practice in the case of RNA editing calls?

bw, Fabian

Created 2016-02-08 15:12:24 | Updated | Tags: haplotypecaller multisample gvcf runtime-error erc

Comments (4)

I am trying to run the HaplotypeCaller on a bam file with multiple samples. It runs successfully without the ERC GVCF option, e.g.

java -jar /home/unix/csmillie/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I test.bam

But when I try running it with the ERC GVCF option, I get an error:

java -jar /home/unix/csmillie/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I test.bam --emitRefConfidence GVCF --sample_name TCGGCTGAGAAC

I am using Java 1.7. I have validated the bam file with Picard. The bam file has the appropriate header, with tab-separated read groups that look like this: @RG ID:3 SM:TCGGCTGAGAAC PL:Illumina

The stack trace is below. If anyone can help I would really appreciate it! I am running this on an interactive node on the Broad cluster, in case it helps with debugging. Thanks!

hw-uger-1001:~/data/csmillie/test $ java -jar /home/unix/csmillie/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I test.bam --emitRefConfidence GVCF --sample_name TCGGCTGAGAAC INFO 09:56:10,853 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:56:10,855 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 INFO 09:56:10,855 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 09:56:10,855 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 09:56:10,859 HelpFormatter - Program Args: -T HaplotypeCaller -R ref.fasta -I test.bam --emitRefConfidence GVCF --sample_name TCGGCTGAGAAC INFO 09:56:10,877 HelpFormatter - Executing as csmillie@hw-uger-1001.broadinstitute.org on Linux 2.6.32-573.12.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_71-b14. INFO 09:56:10,877 HelpFormatter - Date/Time: 2016/02/08 09:56:10 INFO 09:56:10,878 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:56:10,878 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:56:11,500 GenomeAnalysisEngine - Strictness is SILENT INFO 09:56:12,598 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 09:56:12,606 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 09:56:12,760 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.15 INFO 09:56:12,972 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 09:56:13,128 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 09:56:13,732 GenomeAnalysisEngine - Done preparing for traversal INFO 09:56:13,733 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 09:56:13,733 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 09:56:13,734 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime INFO 09:56:13,734 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output INFO 09:56:13,735 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output INFO 09:56:14,806 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isGVCF(HaplotypeCaller.java:1251) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.initializeReferenceConfidenceModel(HaplotypeCaller.java:728) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.initialize(HaplotypeCaller.java:659) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Created 2016-02-08 13:59:01 | Updated | Tags: haplotypecaller

Comments (16)

We performed variant calling on a whole-exome sequenced human internal control sample. We have a set of 60 confirmed Sanger sequencing SNVs for this sample, which are all detected when we use HaplotypeCaller 2.7 with the following parameters:

-T HaplotypeCaller -nct 8 -R hg19_chr1-y.fasta -I input.bam --genotyping_mode DISCOVERY --max_alternate_alleles 12 --dbsnp dbsnp_137.hg19.vcf --disable_auto_index_creation_and_locking_when_reading_rods -et NO_ET -stand_call_conf 30.0 -stand_emit_conf 15.0 -o output.vcf -dcov 900

When we start from the same .bam file but using HaplotypeCaller 3.3 with the same parameters:

-T HaplotypeCaller -nct 8 -R hg19_chr1-y.fasta -I input.bam --genotyping_mode DISCOVERY --max_alternate_alleles 12 --dbsnp dbsnp_137.hg19.vcf --disable_auto_index_creation_and_locking_when_reading_rods -et NO_ET -stand_call_conf 30.0 -stand_emit_conf 15.0 -o output.vcf -dcov 900

one SNV is not detected (chr19:49671151). I've attached a printscreen of IGV (first track: original .bam file, second track --bamout for hc2.7, third track is --bamout hc3.3). All quality scores look normal. I can provide you with the .bam files if needed.

Could you recommend some things that I can check to find out why this confirmed variant is not detected with hc3.3?

Created 2016-02-05 15:25:48 | Updated 2016-02-05 15:28:04 | Tags: haplotypecaller gvcf no-calls

Comments (2)


I have generated a gVCF for an exome (with non-variant block records) from a BAM file belonging to the 1000Genomes data. I am using GATK tools version 3.5-0-g36282e4 and I have run the HaplotypeCaller as follows:

time java -jar $gatk_dir/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R $reference \ -I $bamfile \ -ploidy 2 \ -stand_call_conf 20 \ -stand_emit_conf 10 \ -ERC GVCF \ -o output.g.vcf.gz

Within the purpose of the analysis I am performing, from this gVCF I need to be able to know whether the positions are no-called, homozygous reference, variant sites or if the positions were not targeted in the exome sequencing.

However, with the gVCF file I obtained I am not able to do it because there are only variant site records or non-variant block records where the GT tag is always "0/0".

So I have few questions regarding the non-variant block records:

  1. Why the output file does not contain any no-call ("./.") record?

  2. Shouldn't regions where there are no reads have the tag GT equal to "./." instead of "0/0"?

  3. How can regions without reads (not targeted) be distinguished from regions with reads that were not called?

  4. When looking at the bam file with IGV, non-variant blocks displayed in gVCF contain regions with reads. What is the explanation for such behaviour?

Thank you for your attention,


Created 2016-02-05 09:25:43 | Updated | Tags: haplotypecaller best-practices merge rna-seq

Comments (11)

Hello, and thanks for making all the GATK tools! I have recently started to try my hand at variant calling of my RNA-seq data, following the GATK Best Practices more or less verbatim, only excluding indel alignment (because I am only interested in SNPs at this point) and the BQSR (partly because I have very high quality data, but mostly because I couldn't get it to work in the workflow).

I have three replicates for each of my samples, and my question is where, if at all, I should merge the data from them. I am not sure if I can (or even should!) merge the FASTQ files before the alignment step, or merge the aligned BAM files, or something else. I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow using STAR... What would be the "correct" way to do it, if at all? How would merging (at some level) affect the speed of the workflow, and can I optimise that somehow?

If it's a bad idea to do merging, how would I determine the "true" variant from my three resulting VCF-files at the end, for cases where they differ?

Created 2016-02-03 20:16:19 | Updated | Tags: haplotypecaller ploidy

Comments (1)

I have NGS data from an organism that is sometimes found to be aneuploid (but usually diploid.) I am trying to determine the ploidy of the individual I have sequenced from my population of interest. I have ~100X coverage, but the best reference is so distantly-related that the normal SNP coverage plots that are used to estimate ploidy are too messy to interpret. I was hoping that I could use HaplotypeCaller here to help determine the number of haplotypes per chromosome (where 2 haplotypes/alleles per position indicates diploidy). I can't figure out how to interpret the resulting VCF file well enough to address this question.

If HaplotypeCaller will not be sufficient, are there any other frequently used options to determine ploidy (or number of alleles/haplotypes per position)?

Thanks for any assistance.


Created 2016-02-03 15:40:20 | Updated | Tags: commandlinegatk haplotypecaller pcrmodel

Comments (6)

Dear GATK team,

I'm a bit confused by --pcr_indel_model argument in HaplotypeCaller. As a can see from the docs, this argument is not required, but in its description I still read the following: "VERY IMPORTANT: when using PCR-free sequencing data we definitely recommend setting this argument to NONE". Does some PCR-bias-oriented filtration is performed by default (so, for PCR-free datasets I should set it to NONE), or actually I don't need to set this argument to NONE (even processing PCR-free datasets) if I simply don't use it?

Best regards, Svyatoslav

Created 2016-01-28 15:08:47 | Updated | Tags: indelrealigner variantrecalibrator vqsr haplotypecaller best-practices

Comments (3)

The release notes for 3.5 state "Added new MQ jittering functionality to improve how VQSR handles MQ". My understanding is that in order to use this, we will need to use the --MQCapForLogitJitterTransform argument in VariantRecalibrator. I have a few questions on this: 1) Is the above correct, i.e. the new MQ jittering functionality is only used if --MQCapForLogitJitterTransform is set to something other than the default value of zero? 2) Is the use of MQCapForLogitJitterTransform recommended? 3) If we do use MQCapForLogitJitterTransform, the tool documentation states "We recommend to either use --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or use a MQCap=max+10 to take that into account". Is one of these to be preferred over the other? Given that it seems that sites that have been realigned can have values up to 70, but sites that have not can have values no higher than 60, it seems to me that the ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller option might be preferred, but I would like to check before running.

Created 2016-01-28 13:26:14 | Updated | Tags: haplotypecaller optimization faster

Comments (1)

Hello everyone,

I am using GATK in a clinical context for NGS diagnosis. The issue is that the HaplotypeCaller take some time, too much time actually (2h per patient). I tried this things :

  • reduce the bam file size by keeping only the genomic regions of my diagnosis genes but it looks like it still run all the hg19 genome.
  • ask "only variants" with the output_mode option but the output file is exactly the same than the default one.
  • use several CPU thread, but 1 CPU = 147 min, 2 CPU = 89 min, 3 CPU = 80 min. And I don't have this much CPU available so it is not interesting above 2 CPU , and still not fast enough.

I can't use the data thread option right now, would it allow me to gain more time than the CPU option ? There is the interval option but I don't think it would allow me to gain enough time since I have gene of interest on almost all chromosomes.

I would appreciate to have your guidance regarding this problem. How would you do to make this HaplotypeCaller step faster ?

Many thanks in advance.


Created 2016-01-27 18:04:37 | Updated 2016-01-27 18:05:57 | Tags: haplotypecaller

Comments (3)

I was comparing the concordance of genotype calls as a function of Genotype Quality (GQ) for the GATK 3.5 Haplotyper. (using genotype given allele and emiting all sites, even those with score of 0).

My "truth" are genotypes from Illumina OMNI 2.5M arrays .. and I am comparing the genotype calls from Illumina exome arrays.

I notice that the Genotype Quality (GQ) scores of the HaplotypeCaller are focussed at intervals of "3". e.g. many many more scores at (orders of magnitude) GQ=0,3,6,9,12,15, ... than at GQ=1,2, 4,5, 7,8, 10,11 , 13,14

This unequal distribution in itself is surprising, but not an issue.

I am noticing that the rate of concordance is NOT monotonically proportional to GQ. Those genotypes with non-multiple of 3 scores have much worst concordance.

If I limit to multiple of 3, the GQ are monotonic. Concordance(GQ=12)>Concordance(GQ=9)>Concordance(GQ=6)

but systematically, the non-multiple of 3 have worst concordance by orders of magnitude. e.g. Concordance(10 or 11) <Concordance(3)

Why is that? So .. trying to look at the source code, for PairHMM (where the likelyhoods are computed), I am identifying this value "TRISTATE" correction (value 3.0) that is differentially applied to reads with "N". .. but I don't see 'N" in alignments for those non-multiple of 3 alignment.

I looked at the "raw" bam file (not the output of haplotyper) for number of examples for scores from 5 to 22 .. coverage 2..14

None has pathologic features.

  • no N in alignment
  • only example one overlapped soft-clipped bases
  • usually only 1 read supporting the alternate allele (or one read supporting the reference)
  • usually all the reads have 100M cigars.
  • Variant is not next to the end of a soft-clipped reads.

Is that a bug in the HMMPair scoring?

p.s. The BAM has 32-offset qualities.

Created 2016-01-19 16:27:08 | Updated | Tags: gccontent haplotypecaller homopolymerrun tandemrepeatannotator hrun

Comments (9)

I am using Genotype Given Allele with Haplotype Caller I am trying to explicitely request all annotations that the documentation says are compatible with the Haplotype caller (and that make sense for a single sample .. e.g. no hardy weinberg ..)

the following annotations all have "NA" GCContent(GC) HomopolymerRun(Hrun) TandemRepeatAnnotator (STR RU RPA) .. but are valid requests because I get no errors from GATK.

This is the command I ran (all on one line)

java -Xmx40g -jar /data5/bsi/bictools/alignment/gatk/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller --input_file /data2/external_data/Weinshilboum_Richard_weinsh/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff

Log file is below( Notice "weird" WARNings about) "StrandBiasBySample annotation exists in input VCF header".. which make no sense because the header is empty other than the barebone fields.

This is the barebone VCF head /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf



chr1 723918 rs144434834 G A 30 PASS . chr1 729632 rs116720794 C T 30 PASS . chr1 752566 rs3094315 G A 30 PASS . chr1 752721 rs3131972 A G 30 PASS . chr1 754063 rs12184312 G T 30 PASS . chr1 757691 rs74045212 T C 30 PASS . chr1 759036 rs114525117 G A 30 PASS . chr1 761764 rs144708130 G A 30 PASS .

This is the output

INFO 10:03:06,080 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:03:06,082 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 INFO 10:03:06,083 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 10:03:06,083 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 10:03:06,086 HelpFormatter - Program Args: -T HaplotypeCaller --input_file /data2/external_data/Weinshilboum_Richard_weinsh/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff INFO 10:03:06,093 HelpFormatter - Executing as m037385@franklin04-213 on Linux 2.6.32-573.8.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26. INFO 10:03:06,094 HelpFormatter - Date/Time: 2016/01/19 10:03:06 INFO 10:03:06,094 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:03:06,094 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:03:06,545 GenomeAnalysisEngine - Strictness is SILENT INFO 10:03:06,657 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Fraction: 0.04 INFO 10:03:06,666 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 10:03:07,012 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.35 INFO 10:03:07,031 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 10:03:07,170 IntervalUtils - Processing 51304566 bp from intervals INFO 10:03:07,256 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 10:03:07,595 GenomeAnalysisEngine - Done preparing for traversal INFO 10:03:07,595 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:03:07,595 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:03:07,596 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime INFO 10:03:07,596 HaplotypeCaller - Disabling physical phasing, which is supported only for reference-model confidence output WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. INFO 10:03:07,719 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units INFO 10:03:37,599 ProgressMeter - chr22:5344011 0.0 30.0 s 49.6 w 10.4% 4.8 m 4.3 m INFO 10:04:07,600 ProgressMeter - chr22:11875880 0.0 60.0 s 99.2 w 23.1% 4.3 m 3.3 m Using AVX accelerated implementation of PairHMM INFO 10:04:29,924 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file INFO 10:04:29,925 VectorLoglessPairHMM - Using vectorized implementation of PairHMM WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called WARN 10:04:29,939 AnnotationUtils - Annotation will not be calculated, genotype is not called INFO 10:04:37,601 ProgressMeter - chr22:17412465 0.0 90.0 s 148.8 w 33.9% 4.4 m 2.9 m INFO 10:05:07,602 ProgressMeter - chr22:18643131 0.0 120.0 s 198.4 w 36.3% 5.5 m 3.5 m INFO 10:05:37,603 ProgressMeter - chr22:20133744 0.0 2.5 m 248.0 w 39.2% 6.4 m 3.9 m INFO 10:06:07,604 ProgressMeter - chr22:22062452 0.0 3.0 m 297.6 w 43.0% 7.0 m 4.0 m INFO 10:06:37,605 ProgressMeter - chr22:23818297 0.0 3.5 m 347.2 w 46.4% 7.5 m 4.0 m INFO 10:07:07,606 ProgressMeter - chr22:25491290 0.0 4.0 m 396.8 w 49.7% 8.1 m 4.1 m INFO 10:07:37,607 ProgressMeter - chr22:27044271 0.0 4.5 m 446.4 w 52.7% 8.5 m 4.0 m INFO 10:08:07,608 ProgressMeter - chr22:28494980 0.0 5.0 m 496.1 w 55.5% 9.0 m 4.0 m INFO 10:08:47,609 ProgressMeter - chr22:30866786 0.0 5.7 m 562.2 w 60.2% 9.4 m 3.8 m INFO 10:09:27,610 ProgressMeter - chr22:32908950 0.0 6.3 m 628.3 w 64.1% 9.9 m 3.5 m INFO 10:09:57,610 ProgressMeter - chr22:34451306 0.0 6.8 m 677.9 w 67.2% 10.2 m 3.3 m INFO 10:10:27,611 ProgressMeter - chr22:36013343 0.0 7.3 m 727.5 w 70.2% 10.4 m 3.1 m INFO 10:10:57,613 ProgressMeter - chr22:37387478 0.0 7.8 m 777.1 w 72.9% 10.7 m 2.9 m INFO 10:11:27,614 ProgressMeter - chr22:38534891 0.0 8.3 m 826.8 w 75.1% 11.1 m 2.8 m INFO 10:11:57,615 ProgressMeter - chr22:39910054 0.0 8.8 m 876.4 w 77.8% 11.4 m 2.5 m INFO 10:12:27,616 ProgressMeter - chr22:41738463 0.0 9.3 m 926.0 w 81.4% 11.5 m 2.1 m INFO 10:12:57,617 ProgressMeter - chr22:43113306 0.0 9.8 m 975.6 w 84.0% 11.7 m 112.0 s INFO 10:13:27,618 ProgressMeter - chr22:44456937 0.0 10.3 m 1025.2 w 86.7% 11.9 m 95.0 s INFO 10:13:57,619 ProgressMeter - chr22:45448656 0.0 10.8 m 1074.8 w 88.6% 12.2 m 83.0 s INFO 10:14:27,620 ProgressMeter - chr22:46689073 0.0 11.3 m 1124.4 w 91.0% 12.5 m 67.0 s INFO 10:14:57,621 ProgressMeter - chr22:48062438 0.0 11.8 m 1174.0 w 93.7% 12.6 m 47.0 s INFO 10:15:27,622 ProgressMeter - chr22:49363910 0.0 12.3 m 1223.6 w 96.2% 12.8 m 29.0 s INFO 10:15:57,623 ProgressMeter - chr22:50688233 0.0 12.8 m 1273.2 w 98.8% 13.0 m 9.0 s INFO 10:16:12,379 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.061128124000000006 INFO 10:16:12,379 PairHMM - Total compute time in PairHMM computeLikelihoods() : 22.846350295 INFO 10:16:12,380 HaplotypeCaller - Ran local assembly on 25679 active regions INFO 10:16:12,434 ProgressMeter - done 5.1304566E7 13.1 m 15.0 s 100.0% 13.1 m 0.0 s INFO 10:16:12,435 ProgressMeter - Total runtime 784.84 secs, 13.08 min, 0.22 hours INFO 10:16:12,435 MicroScheduler - 727347 reads were filtered out during the traversal out of approximately 4410423 total reads (16.49%) INFO 10:16:12,435 MicroScheduler - -> 2 reads (0.00% of total) failing BadCigarFilter INFO 10:16:12,436 MicroScheduler - -> 669763 reads (15.19% of total) failing DuplicateReadFilter INFO 10:16:12,436 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 10:16:12,436 MicroScheduler - -> 57582 reads (1.31% of total) failing HCMappingQualityFilter INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 10:16:12,438 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

Created 2016-01-19 08:46:46 | Updated | Tags: haplotypecaller

Comments (3)

Hello, I am using GATK pipeline for variant calling in Targeted Exome samples. While trying to view the variants in IGV at different levels of variant calling steps (aligning .bam files generated at BaseRecalibration and HaplotypeCaller steps against reference.fasta), i find that the total number of reads aligning at a particular variant location varies. ie. after BaseRecalibration and before HaplotypeCalling total read count at a variant location (as per IGV- BR.bam alignment) is 274. It shows A count is 1 and C count is 273. But after HaplotypeCaller step, The total read count is 460 (as per IGV-bamout alignment). It also shows that A count is 195 and C count is 265. At the same time the variant record from the corresponding g.vcf file shows something like this:

17 48253171 . C A,<NON_REF> 1802.77 . BaseQRankSum=4.848;ClippingRankSum=-1.291;DP=388;MLEAC=1,0;MLEAF=0.500,0.00;MQ=36.72;MQ0=0;MQRankSum=-14.293;ReadPosRankSum=-14.235 GT:AD:DP:GQ:PL:SB 0/1:217,104,0:321:99:1831,0,6990,2478,7307,9785:217,0,104,0

I understand that HC reassembles the active regions. But then also I see a drastic different in the read/Allele counts. My questions are:

  1. Why the total read count(even Alternate Allele count) is increased suddenly in HC step compared to the previous step(s)?
  2. Why the Allele counts and Read depth in the g.vcf file and that in the corresponding bamout files (IGV view) are different?
  3. In the above situation, which allele count should be considered bamout(ie IGV) or g.vcf counts?

Thanking you in advance, Aswathy

Created 2016-01-14 19:02:12 | Updated | Tags: commandlinegatk haplotypecaller multi-sample gatk error

Comments (8)


I would like to force call a list of variants across my cohort using HaplotypeCaller to get more accurate QC metrics for each variant. I am using the following command:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -et NO_ET -K my.key -I my.cohort.list --alleles my.vcf -L my.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -stand_call_conf 30.0 -stand_emit_conf 0.0 -dt NONE -o final_my.vcf

Here is a link to the input VCF file: VCF File

Unfortunately, I keep running into the following error (I've tried GATK ver3.3 and ver3.5):

INFO 18:49:21,288 ProgressMeter - chr1:11177077 21138.0 49.5 m 39.0 h 69.4% 71.3 m 21.8 m ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IndexOutOfBoundsException: Index: 3, Size: 3 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3.0-mssm-0-gaa95802): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Index: 3, Size: 3 ##### ERROR ------------------------------------------------------------------------------------------

Would appreciate your help in solving this issue.

Created 2016-01-13 13:11:53 | Updated 2016-01-13 13:22:28 | Tags: haplotypecaller

Comments (3)

Hey there,

I am using the haplotypecaller on our cluster and the stdout is written to a logfile. Unfortunately HC is writing the whole gvcf to the standard out although I specified an output file. This means really huge log files that I can not handle. Is there a way to prevent this behaviour and only get errors/ warnings from the haplotypecaller?

All the best


Created 2016-01-13 12:45:53 | Updated | Tags: haplotypecaller pooled-calls memory

Comments (1)


I am using haplotypecaller version 3.4-46 with pooled data (ploidy 44) and I am running in to the error 'There was a failure because you did not provide enough memory to run this program'. I have a very high coverage (~10000x at each position). Searching in the forums, I have found suggestions of down sampling, (which haplotypecaller does itself in the version I am using) and changing the minPruning parameter (which I am afraid may further compromise the sensitivity of the variant calling given the samples are pooled). Previously I used unifiedgenotyper on the same data and I had no problems. Is there something else I could try to get rid of the memory issues or should I stick with unifiedgenotyper in my case? Thank you for your support.

Created 2016-01-11 20:09:44 | Updated | Tags: haplotypecaller igv bamout

Comments (4)

Hi Everyone,

I am using GATK version 3.3 and I noticed that the numbers in the DP and AD fields actually match the original bam file before realignment and does not match the bamout file produced by haplotype caller. For example, at chromosome 1 position 6253228 the DP is 27 and the AD is 27,0. The number of reads (counted by IGV) in the original bam file at that position is also 27. However, in the "bamout" bam file the number of reads is 56. Below is the line from the VCF file.

chr1 6253228 . C . . . GT:AD:DP:GQ:PL 0/0:27,0:27:75:0,75,1125

Is the DP and AD fields supposed to match the original bamfile or the bamout bam file numbers? When I produce the "bamout" bamfile I use parameters -forceActive and -disableOptimizations.

Thank you.

Best, Sam

Created 2016-01-11 15:50:06 | Updated | Tags: vqsr haplotypecaller bug error

Comments (3)

We are having problems running GATK VQSR on recent VCF files generated using GATK HaplotypeCaller in the N+1 mode.

When running VariantRecalibrator we get the following error (with both 3.4 and 3.5), the error occurs.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4): 
##### ERROR MESSAGE: The allele with index 5 is not defined in the REF/ALT columns in the record
##### ERROR ------------------------------------------------------------------------------------------


$JAVA -jar $GATK -T VariantRecalibrator -R $REF -input GATK-HC-3.4-DAMONA_10_Jan_2016.vcf.gz \
        -resource:ill770k,known=false,training=true,truth=true,prior=15.0 /home/aeonsim/refs/LIC-seqdHDs.AC1plus.DP10x.vcf.gz \
        -resource:VQSRMend,known=false,training=true,truth=false,prior=10.0  /home/projects/bos_taurus/damona/vcfs/recombination/GATK-HC-95.0tr-GQ40-10Dec-2015run.mend.Con.vcf.gz \
        -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 \
        -recalFile GATK-HC-DAMONA-Jan-2016.recal -tranchesFile GATK-HC-DAMONA-Jan-2016.tranches \
        -rscriptFile GATK-HC-DAMONA-Jan-2016.R

The VCF files were created using GenotypeGVCFs (v3.4-46-gbc02625) on GVCF files created with GATK HaplotypeCaller (N+1 mode, same version) which had been combined using GATK CombineGVCFs to merge ~750 whole bovine genome GVCFs to of 20-90 individuals. All stages occurred to finish successfully with out error until the VQSR stage. All processing of the GATK files was done using GATK tools (v3.4-46) and running the ValidateVariants tool on the files only gives warnings about “Reference allele is too long (116)” nothing about alleles not being defined in the Ref/Alt column.

java -jar GenomeAnalysisTK.jar -T ValidateVariants -R bosTau6.fasta  -V GATK.chr29.vcf.gz --dbsnp BosTau6_
dbSNP140_NCBI.vcf.gz -L chr29
ValidateVariants - Reference allele is too long (116) at position chr29:2513143; skipping that record. Set --referenceWindowStop >= 116
Successfully validated the input file.  Checked 531640 records with no failures.

Older versions of GATK worked successfully on part of this dataset last year, but now we've completed the dataset and rerun everything with the latest version of GATK (at the time 3.4-46) using only the GATK tools, GATK VQSR is unable to process the files that were created by it's own HaplotypeCaller (and from the same version), while validation tools supplied with GATK claim there is no problem and programs like bcftools are happily processing the files.

It appears to me that the problem may be around the introduction of the * in the Alt column for VCF4.2 (?) and a failure of VQSR to fully support this?

Created 2016-01-08 18:46:14 | Updated 2016-01-08 19:08:57 | Tags: haplotypecaller missing-genotype

Comments (27)

Hello GATK Team,

I have 21 bam files that I ran through HaplotypeCaller in GVCF mode followed by GenotypeGVCF, using Version=3.4-0-g7e26428. I found a few entries that I am having a difficult time understanding.

Here is one of the entries in question (quick note: each sample is from a pool of 3-5 animals, explaining the high ploidy):

chr1   88988835    rs396411987 C   G   226099.87   .   AC=61;AF=0.455;AN=134;BaseQRankSum=-1.176e+00;ClippingRankSum=0.054;DB;DP=29307;FS=0.000;MLEAC=61;MLEAF=0.455;MQ=59.98;MQRankSum=0.603;QD=11.32;ReadPosRankSum=1.30;SOR=0.696   GT:AD:DP:GQ:PL  0/0/0/0/0/1/1/1/1/1:396,449,0,0:845:28:9249,2087,965,413,120,0,28,219,649,1588,7879 0/1/1/1/1/1/1/1/1/1:275,2150,0,0:2425:99:52058,17674,11483,7903,5424,3571,2143,1051,287,0,3240  0/0/0/1/1/1/1/1:626,1134,0,0:1760:99:24237,5532,2600,1118,316,0,197,1292,11237  0/0/1/1/1/1/1/1:225,824,0,0:1049:99:18893,5121,2835,1577,772,257,0,116,3439 ./././././././././.:0,0 ./././././././././.:0,0 ./././././././.:0,0 0/0/0/0/0/0/0/1:824,105,0,0:929:99:1309,0,239,705,1369,2290,3644,6013,20091 0/0/0/0/0/1/1/1:786,499,0,0:1285:99:9169,1197,249,0,139,633,1609,3600,16598 0/0/1/1/1/1:557,1132,0,2:1691:99:24696,4545,1719,433,0,563,9784 0/0/0/0/0/0/0/1:922,140,0,0:1062:99:1825,0,202,685,1401,2410,3908,6544,22289    0/0/0/0/1/1/1/1/1/1:650,828,0,0:1478:26:16909,4072,1966,903,311,26,0,254,906,2400,12402 ./././././././.:0,0 0/0/0/0/0/0/0/0/1/1:846,179,0,0:1025:95:2546,95,0,178,520,1015,1689,2617,3986,6389,20368    0/0/0/0/1/1/1/1:875,986,0,0:1861:99:20237,3745,1410,380,0,136,885,2819,17626    0/0/0/0/0/0/0/0:570,0:570:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/1/1/1:798,604,0,0:1402:20:11395,1675,423,0,20,429,1344,3302,16391 0/0/0/0/1/1/1/1:644,730,0,0:1374:96:14884,2778,1049,285,0,96,643,2062,12779 0/0/0/0/0/1/1/1:613,417,0,0:1030:74:7703,1065,243,0,74,433,1174,2710,12915  0/0/0/1/1/1/1/1:239,513,0,0:752:13:11159,2667,1309,604,198,0,13,378,4191    ./././././././.:0,0

So, for the samples with genotypes present, AD is obviously quite high (in the 1000s, generally). It surprised me, then, that some samples don't have information (represented by ./././././././.). I went back to the original sample gvcf files and pulled the entries from a subset that were represented by ./././././././.:

group14.g.vcf:chr1 88988835    rs396411987 C   G,A,   16897.68    .   BaseQRankSum=-0.709;ClippingRankSum=0.843;DB;DP=1729;MLEAC=5,0,0;MLEAF=0.500,0.00,0.00;MQ=59.96;MQRankSum=0.621;ReadPosRankSum=1.987    GT:AD:DP:GQ:SB  0/0/0/0/0/1/1/1/1/1:876,844,2,0:1722:99:443,433,424,422
group15.g.vcf:chr1 88988835    rs396411987 C   G,T,   690.35  .   BaseQRankSum=-1.311;ClippingRankSum=-0.019;DB;DP=1168;MLEAC=1,0,0;MLEAF=0.125,0.00,0.00;MQ=59.98;MQRankSum=0.603;ReadPosRankSum=3.624   GT:AD:DP:GQ:SB  0/0/0/0/0/0/0/1:1082,78,4,0:1164:99:542,540,44,38

A consistent difference between samples that are included and those that are not is the presence of multiple alternative alleles in the excluded samples (G,A and G,T in the above example). Is this the source of my troubles? Is there a way of forcing GATK to include those samples in the VCF, especially given that the support for the third allele seems pretty weak (2 reads out of ~1722 in sample 14 above)? It seems to me that these samples should reasonably be included in the output of GenotypeGVCF.

Apologies if the answer is somewhere in the forums/guide - I did check, but didn't find anything.

Thanks, Russ

Created 2016-01-08 12:37:01 | Updated 2016-01-08 12:37:38 | Tags: haplotypecaller contigs

Comments (1)

Hey there,

I cannot figure out why I did not get the HaplotypeCaller to work properly. Maybe someone can please help me. I wanted to use only Chromsome20 for variant calling. I used a BAM and BAM.BAI file from 1000 genomes (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/alignment/). For the reference genome I used only Chromsome20 from this site (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/). I created a dict and index file with picard tools.

If I run the HaplotypeCaller like this: java -jar GenomeAnalysisTK.jar -R chr20.fa -T HaplotypeCaller -I HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam I get the following error: ##### ERROR MESSAGE: Badly formed genome loc: Contig 1 given as location, but this contig isn't present in the Fasta sequence dictionary

If I run the same with the -L argument (-L 20) I get: #### ERROR MESSAGE: Input files reads and reference have incompatible contigs: The following contigs included in the intervals to process have different indices in the sequence dictionaries for the reads vs. the reference: [20]. As a result, the GATK engine will not correctly process reads from these contigs. You should either fix the sequence dictionaries for your reads so that these contigs have the same indices as in the sequence dictionary for your reference, or exclude these contigs from your intervals. This error can be disabled via -U ALLOW_SEQ_DICT_INCOMPATIBILITY, however this is not recommended as the GATK engine will not behave correctly.. ##### ERROR reads contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605, hs37d5] ##### ERROR reference contigs = [20]

Additionally, I adapted the fasta, fasta index and dictionary file from calling the chromsome "chr20" to "20" because previously it said: ##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: No overlapping contigs found.

I really don't know what I am doing wrong! Any help is appreciated! Regards, Kristina

Created 2016-01-07 21:50:30 | Updated | Tags: haplotypecaller

Comments (8)

Hi Everyone,

I am using GATK version 3.3 and ran haplotype caller with -mmq 30 and -mbq 20. The output of one of the positions in the VCF is below.

chr13 58264978 . C . . . GT:AD:DP:GQ:PL 0/0:22,1:23:54:0,54,655

I used IGV to validate this position. IGV reports, like GATK does that there are 23 total reads covering that position. Although, there are also 4 reads that have a phred score of less than 20 at the position above. The DP statistic seems to be recording the number of unfiltered reads, but not the filtered ones. Do you know why this is?

Thank you.

Best, Sam

Created 2016-01-06 13:01:18 | Updated | Tags: combinevariants haplotypecaller best-practices dbsnp gatk combinegvcfs gvcf

Comments (6)

Hi guys, I have recently jointly called 27 full genome data using GenotypeGVCFs approach. While i was trying to extract some chromosomes from the final file, i got this error The provided VCF file is malformed at approximately line number 16076: Unparsable vcf record with allele *.

I look into the file and I found some of the multi-allellic sites having * as seen in the attached picture.

I feel the problem could be that the program realised that more than one allele is present at that position but could not ascertain which allele. I may be wrong but please what do you think I can do to solve this problem. LAWAL

Created 2015-12-30 23:50:57 | Updated 2015-12-30 23:51:22 | Tags: haplotypecaller snp multi-sample

Comments (2)

I am trying to use GATK HP ( v3.5 ) to call SNPs in amplicon seq data of a small genome of around 600bp. As shown in the attachment, the variants are not called between location 50 and 60, despite high coverage across many samples. ( There are total 96 samples ).


The base qualities are above 30 and mapping quality is also 60 ( bwa mem ). I also did not remove duplicates ( not marked as well ) as its amplicon seq data.

The command I used was ( multisample SNP calling ) :

java -Xmx50g -jar GenomeAnalysisTK.jar -nct 2 -R -T HaplotypeCaller -I merged_samples.bam -o gatk_out_raw_snps_indels.vcf --min_base_quality_score 30

Its the same case even with base quality 20.

Thanks in advance.

Created 2015-12-29 20:00:32 | Updated | Tags: haplotypecaller rmsmappingquality annotations mq gatk3-5

Comments (2)

I upgraded to GATK 3.5 to use MuTect2 and it works great! However, I'm now using the updated .jar in my germline variant calling pipeline and there's some issues. I only noticed it when I went to run VQSR, and the INDELs went through but not the SNPs. I got an error in the SNP log about the MQ annotation. It seems to be completely absent in the haplotype caller output (the .g.vcf's). I tried forcing it with -A RMSMappingQuality but it did not affect the output. However it works fine if I go back to 3.4

Command Line for 3.4: java -Xmx4g -jar /REDACTED/tools/gatk_3.4.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /REDACTED/resources/b37/gatkBundle/human_g1k_v37_decoy.fasta -I /REDACTED/melanoma/bam/recal/PFM-32.ra.rc.bam -L /REDACTED/resources/b37/targetRegions/nexteraRapidCapture_expandedExome/1.bed -ERC GVCF -o /REDACTED/melanoma/variantCall/gatk3.4/gvcf/PFM-32/PFM-32_chr1.g.vcf -A RMSMappingQuality

Command Line for 3.5: java -Xmx4g -jar /REDACTED/tools/gatk_3.5.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /REDACTED/resources/b37/gatkBundle/human_g1k_v37_decoy.fasta -I /REDACTED/melanoma/bam/recal/PFM-32.ra.rc.bam -L /REDACTED/resources/b37/targetRegions/nexteraRapidCapture_expandedExome/1.bed -ERC GVCF -o /REDACTED/melanoma/variantCall/gvcf/PFM-32/PFM-32_chr1.g.vcf -A RMSMappingQuality

And here's a single line from the output of each at the same position

3.4: 1 1431520 . C A,<NON_REF> 118.77 . BaseQRankSum=0.452;DP=14;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-1.485;ReadPosRankSum=-0.452 GT:AD:GQ:PL:SB 0/1:8,6,0:99:147,0,187,170,205,375:4,4,2,4

3.5: 1 1431520 . C A,<NON_REF> 118.77 . BaseQRankSum=0.710;ClippingRankSum=0.452;DP=14;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.452;RAW_MQ=50400.00;ReadPosRankSum=-1.614 GT:AD:DP:GQ:PL:SB 0/1:8,6,0:14:99:147,0,187,170,205,375:4,4,2,4

I didn't notice anything relevant in the 3.5 release notes except for something about VQSR and MQ jittering which leads me to believe that MQ is indeed a valid annotation still, but how can I get it?

Thanks in advance

Created 2015-12-23 18:15:35 | Updated 2015-12-23 18:21:19 | Tags: haplotypecaller amplicon

Comments (11)

Hi all, I'm relatively new to GATK. At first I was intimidated by the amount of documentation, but lately I've really come to appreciate how there are answers to nearly all questions somewhere on this website.

After asking around to see which tool people recommend for tumour amplicon variant calling the consensus choice was obviously GATK. In line with that I've been trying to run a variant of the best practices to identify some variants in my amplicon samples. The intended deviation is of course to tell GATK to ignore the duplicate status of the reads as nearly all reads are duplicates.

Here are my commands:

# find the indels to realign
GenomeAnalysisTK.jar -T RealignerTargetCreator -R GRCh37-lite.fa -I $1 -o $1.realignment_targets.list --disable_read_filter DuplicateRead

# fix up the indels
GenomeAnalysisTK.jar -T IndelRealigner -R GRCh37-lite.fa -I $1 -o $1.realigned.bam   --disable_read_filter DuplicateRead -targetIntervals $1.realignment_targets.list

# base recalibration
GenomeAnalysisTK.jar -T BaseRecalibrator -R GRCh37-lite.fa -I $1.realigned.bam -o  $1.recal_data.table --disable_read_filter DuplicateRead  -knownSites dbSNP/common_all.vcf

GenomeAnalysisTK.jar -T PrintReads -R GRCh37-lite.fa -I $1.realigned.bam -BQSR $1.recal_data.table --disable_read_filter DuplicateRead -o $1.recal_reads.bam

# Indels are realigned, bases are re-calibrated, now do the variant calling
GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37-lite.fa --disable_read_filter DuplicateRead -I $1.recal_reads.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o $1.raw_variants.vcf -bamout $1.raw_variants.bamout.bam

I'm finding that I'm still missing a small handful of variants that I expected to be able to call. I've followed the instructions here: http://gatkforums.broadinstitute.org/gatk/discussion/1235/i-expect-to-see-a-variant-at-a-specific-site-but-its-not-getting-called with hope to get to the bottom of it.

The bamout is really helpful, but as far as i can tell, my variant (shown under the cursor in the attached image ) is in the original, realigned-recalibrated, and bamout version of my bam. Keep in mind that these are amplicon data, so the depths extend wayy beyond what is shown in the image.

The mapping qualities of the reads are all roughly 50-60 and the base qualities (post recalibration are in the 40-50 range).

The variant allele fraction is about 25% at all points in the analysis.

I'm sure there is something simple that I'm missing. Can anyone suggest what I might think of changing so I could call the variant in the image?

Created 2015-12-22 14:29:56 | Updated | Tags: unifiedgenotyper haplotypecaller trio pedigree

Comments (10)


I've been exploring de novo mutation identification in the context of a pedigree of trios. I've run the UnfiedGenotyper (UG) given all the bam files for ~25 sets of trios and it appears to identify a set of de novo mutations. When I run the HaplotypeCaller (HC) pipeline, first generating gVCF files for each individual, and then using the merged gVCF files along with the pedigree for genotype refinement and de novo mutation calling, it also finds a number of de novo mutations annotated as hi-confidence de novo mutations. When I compare the UG de novo mutations to the high confidence HC list, there's very little overlap. Many of the UG hi-confidence de novo variants are called by HC, but listed as low-confidence de novo variants, and from looking at a few examples, it would appear that the HC calls have assigned lower genotype confidence levels for the parental (non-mutated, reference) genotypes. Could it be that because the gVCF files aren't storing position-specific information for the reference (non-mutated) positions in the genome, the pedigree-type de novo mutation calling is not as accurate as it could be? Should I be generating gVCFs that include position-specific information?

Many thanks for any insights. If it would help, I could post some examples.

Created 2015-12-21 19:30:11 | Updated | Tags: haplotypecaller gvcf

Comments (5)


I have three general questions about using HaplotypeCaller (I know I could have tested by myself, but I figured it might be reliable to get some answer from people who are developing the tool):

  1. For single sample analysis, is the vcf generated directly from HC the same as the vcf generated using GenotypeGVCFs on the gvcf generated from HC?
  2. For multi-sample analysis, in terms of speed, how is the performance of running GenotypeGVCFs on each gvcf, compared with combining all gvcfs to run joint-calling, assuming we can get all gvcfs in parallel (say for 500 samples)?
  3. It seems the gvcf can be generated in two modes, -ERC GVCF or -ERC BP_RESOLUTION. How different is the one generated using -ERC BP_RESOLUTION different from a vcf with all variant calls, reference calls and missing calls? And considering the size of the file, say for NA12878 whole genome, how different it is comparing the gvcf from -ERC GVCF and the one from -ERC BP_RESOLUTION?

Thank you very much for you attention and any information from you will be highly appreciated.

Created 2015-12-17 06:12:19 | Updated | Tags: haplotypecaller lowqual genotype-given-alleles

Comments (8)

Hi GATK team, I am working on a pipeline for exome sequencing variant calling. And I am only interested in the genotype for some specific positions so I used GENOTYPE_GIVEN_ALLELES mode with given vcf file. I only have "chromosome position ID REF ALT" info so I keeped other columns as . . I found that this mode is super slow, compared with discovery mode. And when it finished I found most of my records are marked with LowQual, but they have good sequencing depth and quality score. Can you tell me why they are marked with LowQual flag. Here are some output:

chr10 101421279 rs35229854 G A 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=80;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.407 GT:AD:DP:GQ:PL 0/0:80,0:80:99:0,241,2781 chr10 101421288 rs147747082 C A 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=81;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.332 GT:AD:DP:GQ:PL 0/0:81,0:81:99:0,244,2812 chr10 101421324 rs150267092 A G 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=64;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.173 GT:AD:DP:GQ:PL 0/0:64,0:64:99:0,193,2224 chr10 101421366 rs370286436 C T 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=59;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.101 GT:AD:DP:GQ:PL 0/0:59,0:59:99:0,178,2052 chr10 101421367 rs61729539 G A 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=58;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.105 GT:AD:DP:GQ:PL 0/0:58,0:58:99:0,175,2014 chr10 101451259 rs11190245 T C 605.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.278;ClippingRankSum=2.143;DB;DP=32;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.119;QD=18.93;ReadPosRankSum=1.904;SOR=0.507 GT:AD:DP:GQ:PL 0/1:11,21:32:99:634,0,286

I used default setting for other step. references are from gatk-bundle-2.8-hg19

Thanks for your help : )

Created 2015-11-19 18:36:31 | Updated | Tags: haplotypecaller multiallelic strandallelecountsbysample

Comments (6)

I'm running haplotype caller (latest nightly build) with -A StrandAlleleCountsBySample parameter to get strand specific read counts (SAC). For variants with more than the default 6 maximal alt alleles there is a problem with the SAC field:

2 47641559 . TAAAAAAAAAAA T,TA,TAA,TAAA,TAAAA,TAAAAAA,<NON_REF> 1308.73 . BaseQRankSum=0.434;ClippingRankSum=0.768;DP=105;ExcessHet=3.0103;MLEAC=0,0,0,0,0,1,1;MLEAF=0.00,0.00,0.00,0.00,0.00,0.500,0.500;MQRankSum=-1.704;RAW_MQ=378000.00;ReadPosRankSum=1.971 GT:AD:DP:GQ:PL:SAC:SB 6/7:3,0,0,3,4,5,16,9:40:99:1346,1509,3479,1488,3459,3455,1204,2706,2706,2585,989,2303,2303,2215,2132,692,1723,1720,1604,1576,1507,277,1002,983,781,714,657,745,268,447,447,355,313,232,0,147:3,0,0,0,0,0,0,3,1,3,0,5,0,16,0,0:3,0,1,27

So there are 9 reads originating from another than one of the given alt alleles (=NON_REF), but the SAC field is missing these reads. This gets especially annoying if one of the NON_REF alleles is selected as most likely when combining the sample with others in GenotypeGVCFs.

Another example: 11 108141955 . CTTTT C,CT,CTT,CTTT,ATTTT,TTTTT,<NON_REF> 1552.73 . BaseQRankSum=-0.227;DP=704;ExcessHet=3.0103;MLEAC=0,0,0,1,0,0,0;MLEAF=0.00,0.00,0.00,0.500,0.00,0.00,0.00;MQ=60.02;MQRankSum=-0.254;ReadPosRankSum=1.249 GT:AD:DP:GQ:PL:SAC:SB 0/4:431,5,4,27,127,4,3,3:604:99:1590,3247,26394,3043,24416,23841,2156,18595,18550,17063,0,11232,11190,10498,9517,3572,20205,18965,15617,10454,20237,3558,20037,18797,15420,10362,19931,19926,2484,13421,13344,12357,9074,12834,12837,11563:213,218,2,3,2,2,14,13,54,73,0,4,0,3,0,0:213,218,72,98

Created 2015-11-17 15:29:39 | Updated | Tags: haplotypecaller genotypegvcfs

Comments (11)

Is there a way to use gzipped gVCFs files when using HaplotypeCaller and GenotypeGVCFs.

If so how do you make the index files? I can't seem to get it to work.

(I have searched the forum and can't seem to find a definitive answer. Sorry)

Created 2015-09-11 12:29:24 | Updated | Tags: haplotypecaller genotypegvcf rgq

Comments (11)

I work with non-human genomes and commonly need the confidence of the reference sites, so I was happy to see the inclusion of the RGQ score in the format field of GenotypeGVCFs. However, I am a little confused as to what this score means (how it is calculated). Out of curiosity I plotted the distribution of RGQ and GQ scores over ~1Mbp. A few things jumped out that I was hoping you could explain:

(1) There are two peaks of GQ and RGQ scores, one at 99 - which is obviously just the highest confidence score and another at exactly GQ/RGQ=45. You can see this in the GQ/RGQ distribution below. I've excluded the sites where RGQ/GQ = 0 or 99 (RGQ = blue, GQ=red) is there some reason why so many GT calls == 45?

(2) There are very few GQ = 0 calls and ~96% are GQ=99 - but in the RGQ ~42% == 0 and 54%=99. Is there any explanation why so many RGQ scores == 0? I fear that filtering on RGQ will bias the data against reference calls and include a disproportionate number of variant calls.

Created 2015-07-07 15:31:49 | Updated | Tags: haplotypecaller code-exception

Comments (8)

Dear GATK Team, I'm trying to run HaplotypeCaller, on a Bowtie2 bam file, but I'm having a Code exception. Any idea how to fix it? Best regards, Miguel Machado

Command: /scratch/mpmachado/NGStools/java_oracle/jre1.8.0_45/bin/java -jar /scratch/mpmachado/NGStools/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller -R /scratch/mpmachado/trabalho1diversidade/genomaReferencia/dadosGenoma/NC_004368.fna -I /scratch/mpmachado/trabalho1diversidade/genomaReferencia/outputs/bowtie/GBS_01.sorted_position.bam --out output.raw.snps.indels.vcf --genotyping_mode DISCOVERY --output_mode EMIT_ALL_SITES --sample_ploidy 1

INFO 16:21:59,507 HelpFormatter - -------------------------------------------------------------------------------- INFO 16:21:59,511 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 INFO 16:21:59,511 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 16:21:59,511 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 16:21:59,516 HelpFormatter - Program Args: -T HaplotypeCaller -R /scratch/mpmachado/trabalho1diversidade/genomaReferencia/dadosGenoma/NC_004368.fna -I /scratch/mpmachado/trabalho1diversidade/genomaReferencia/outputs/bowtie/GBS_01.sorted_position.bam --out output.raw.snps.indels.vcf --genotyping_mode DISCOVERY --output_mode EMIT_ALL_SITES --sample_ploidy 1 INFO 16:21:59,521 HelpFormatter - Executing as mpmachado@dawkins on Linux 2.6.32-5-amd64 i386; Java HotSpot(TM) Server VM 1.8.0_45-b14. INFO 16:21:59,521 HelpFormatter - Date/Time: 2015/07/07 16:21:59 INFO 16:21:59,521 HelpFormatter - -------------------------------------------------------------------------------- INFO 16:21:59,521 HelpFormatter - -------------------------------------------------------------------------------- INFO 16:22:00,257 GenomeAnalysisEngine - Strictness is SILENT INFO 16:22:00,360 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 16:22:00,370 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 16:22:00,401 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 INFO 16:22:00,410 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 16:22:01,869 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException at java.util.TreeMap.compare(TreeMap.java:1290) at java.util.TreeMap.put(TreeMap.java:538) at java.util.TreeSet.add(TreeSet.java:255) at org.broadinstitute.gatk.utils.sam.ReadUtils.getSAMFileSamples(ReadUtils.java:70) at org.broadinstitute.gatk.engine.samples.SampleDBBuilder.addSamplesFromSAMHeader(SampleDBBuilder.java:66) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeSampleDB(GenomeAnalysisEngine.java:846) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:296) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-0-g7e26428):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Created 2015-06-02 09:25:02 | Updated 2015-06-02 09:26:43 | Tags: haplotypecaller missing-genotype

Comments (16)

Dear GATK community,

I am using HaplotypeCaller for variant discovery and I have found some strange results in my VCF. It appears that this walker is making no calls in positions with reads of support for them in the original BAM.

For instance this position:

chr7 302528 rs28436118 A G 31807.9 PASS . GT:AD:DP:GQ:PL 1/1:0,256:256:99:8228,767,0 ./.:98,0:98:.:. ./.:81,0:81:.:. 1/1:0,134:134:99:4287,401,0

was not called for 2 of the 4 samples available. However, in both samples where the genotype is missing there are many reads supporting an homozygous reference call (0/0).

Do you have any idea of why this is happening? Cheers JMFA> ****

Created 2015-05-29 16:00:04 | Updated | Tags: haplotypecaller variant-calling

Comments (12)

Hello, I am trying to run HaplotypeCaller on my processed bam file but it keeps running out of memory. I have tried increasing the heap size to 90G but it still crash. This might have to do with the type of analysis I am doing... The sample are pool of pigs (6 individual so a ploidy of 12 for this particular sample) that have been sequenced on targeted regions, I use the bed file that has been given with the kit to narrow done the calling to the targeted regions. I have also reduce the number of alternative allele from 6 to 3. But it still crash after a while. Is there any other parameters I should try to modify to reduce the memory usage? I have attached the log file if you want to have a look at all the parameters. Cheers,


Created 2015-01-28 14:33:40 | Updated | Tags: haplotypecaller ad combinegvcfs

Comments (15)


I am using GATK v3.2.2 following the recommended practices (...HC -> CombineGVCFs -> GenotypeGVCFs ...) and while looking through suspicious variants I came across a few hetz with AD=X,0. Tracing them back I found two inconsistencies (bugs?);

1) Reordering of genotypes when combining gvcfs while the AD values are kept intact, which leads to an erronous AD for a heterozygous call. Also, I find it hard to understand why the 1bp insertion is emitted in the gvcf - there is no reads supporting it:

  • single sample gvcf 1 26707944 . A AG,G,<NON_REF> 903.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9

  • combined gvcf 1 26707944 . A G,AG,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:66,0,36,0:102:.:1057,0,1856,1039,2052,4115,941,1925,3051,2847:51,15,27,9 [other_samples]

  • vcf
    1 26707944 . A G 3169.63 . [INFO] [other_samples] 0/1:66,0:102:99:1057,0,1856 [other_samples]

2) Incorrect AD is taken while genotyping gvcf files:

  • single sample gvcf: 1 1247185 rs142783360 AG A,<NON_REF> 577.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/1:13,20,0:33:99:615,0,361,654,421,1075:7,6,17,3
  • combined gvcf 1 1247185 rs142783360 AG A,<NON_REF> . . [INFO] [other_samples] ./.:13,20,0:33:.:615,0,361,654,421,1075:7,6,17,3 [other_samples]

  • vcf
    1 1247185 . AG A 569.95 . [INFO] [other_samples] 0/1:13,0:33:99:615,0,361 [other_samples]

I have found multiple such cases here, and no errors nor warnings in the logs. I checked also with calls that I had done before on these samples, but in a smaller batch. There the AD values were correct, but there were plenty of other hetz with AD=X,0... I haven't looked closer into those.

Are these bugs that have been fixed in 3.3? Or maybe my brain is not working properly today and I miss sth obvious?

Best regards, Paweł

Created 2015-01-09 14:07:52 | Updated | Tags: haplotypecaller

Comments (18)


I'm working on targeted resequencing data and I'm doing a multi-sample variant calling with the HaplotypeCaller. First, I tried to call the variants in all the targeted regions by doing the calling at one time on a cluster. I thus specified all the targeted regions with the -L option.

Then, as it was taking too long, I decided to cut my interval list, chromosome by chromosome and to do the calling on each chromosome. At the end, I merged the VCFs files that I had obtained for the callings on the different chromosomes.

Then, I compared this merged VCF file with the vcf file that I obtained by doing the calling on all the targeted regions at one time. I noticed 1% of variation between the two variants lists. And I can't explain this stochasticity. Any suggestion?



Created 2014-12-04 19:17:24 | Updated | Tags: baserecalibrator haplotypecaller vcf convergence bootstrap

Comments (20)

I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecallibrator, it is recommended that when lacking a reliable set of sequence variants: "You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecallibrator in a loop should be fairly strait forward. What is a good strategy for comparing VCF files and assessing convergence?

Created 2014-11-18 02:40:18 | Updated | Tags: haplotypecaller

Comments (8)


I'd like to ask you if there is a paper describing the Haplotype Caller algorithm, if you could please send me the reference. I have tried to find it, but I only found the paper on GATK which is great, but it doesn't describe in detail the Haplotype Caller algorithm.

thank you,

Created 2014-09-19 14:34:45 | Updated | Tags: haplotypecaller malformedvcf genotypegvcfs

Comments (9)


I've tried my best to find the problem but without luck, so my first post...

First the background:

I have used HaplotypeCaller to make GVCFs per individual for 2 cohorts of 270 and 300 individuals each.
I then used CombineGVCFs to merged all the individuals in each cohort across 10Mb chunks across the genome.

These steps appear to have worked OK! Then I have been attempting to call genotypes from pairs GVCFs (one from each cohort) per region using GenotypeGVCFs with a command such as:

java -jar -Xmx6G /gpfs0/apps/well/gatk/3.2-2/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ../hs37d5/hs37d5.fa \ --variant /well/hill/kate/exomes//haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf \ --variant /well/hill/kate/exomes//haploc/sepsis300/sepsis300_1:230000000-240000000_merged_gvcf.vcf \ -o /well/hill/kate/exomes//haploc/cap270-sepsis300/cap270-sepsis300_1:230000000-240000000_raw.vcf

This seems to start fine (and works perfectly to completion for some regions), but then throws a error for some of the files. For this example the error is:

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 519534: ./.:29:75:25:0,63,945 is not a valid start position in the VCF format, for input source: /well/hill/kate/exomes/haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf

So to try and solve this I have grepped "./.:29:75:25:0,63,945 " from the VCF file (expecting to find a truncated line), but could find nothing wrong with the lines containing this expression. I looked carefully at the first line containing this after the last position that was written in the output - nothing out of the ordinary that I can see.

Any help/advice gratefully received! Kate

Created 2014-09-11 15:52:18 | Updated | Tags: vqsr haplotypecaller qualbydepth genotypegvcfs

Comments (25)

Hey there,

How can it be possible that some of my snps or indels calls miss the QD tag? I'm doing the recommended workflow and I've tested if for both RNAseq (hard filters complains, that's how I saw those tags were missing) and Exome sequencing (VQSR). How can a hard filter applied on QD on a line without actually that tag can be considered to pass that threshold too? I'm seeing a lot more INDELs in RNAseq where this kind of scenario is happening as well.

Here's the command lines that I used :

Forming gvcf files like this (for RNAseq in that case):

java -Djava.io.tmpdir=$LSCRATCH -Xmx46g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -pairHMM VECTOR_LOGLESS_CACHING -T HaplotypeCaller -I sample.bam -R ~/References/hg19/hg19.fasta --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o sample.g.vcf --dbsnp ~/References/hg19/dbSNP/dbsnp_138.b37.excluding_sites_after_129.CEU_converted.vcf -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 0.0 -mmq 1 -U ALLOW_N_CIGAR_READS

GenotypeGVCF after :

java -Djava.io.tmpdir=$LSCRATCH -Xmx22g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T GenotypeGVCFs --dbsnp ~/References/hg19/dbSNP/dbsnp_138.b37.excluding_sites_after_129.CEU_converted.vcf -A QualByDepth -A FisherStrand -o Joint_file/96RNAseq.10092014.STAR.q1.1kb.vcf -R ~/References/hg19/hg19.fasta

VQSR (separated for indels and snvs) or Hard filter at the end :


java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T VariantRecalibrator -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode SNP -resource:1000G,known=false,training=true,truth=false,prior=10.0 ~/References/hg19/VQSR/1000G_phase1.snps.high_confidence.b37.noGL.converted.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~/References/hg19/VQSR/hapmap_3.3.b37.noGL.nochrM.converted.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 ~/References/hg19/VQSR/1000G_omni2.5.b37.noGL.nochrM.converted.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 ~/References/hg19/VQSR/dbsnp_138.b37.excluding_sites_after_129.noGL.nochrM.converted.vcf -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -rscriptFile 96Exomes.HC.TruSeq.snv.plots.R -R ~/References/hg19/hg19.fasta --maxGaussians 4

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T ApplyRecalibration -ts_filter_level 99.0 -mode SNP -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -o 96Exomes.HC.TruSeq.snv.recal.vcf -R ~/References/hg19/hg19.fasta


java -Djava.io.tmpdir=$LSCRATCH -Xmx2g -jar /home/apps/Logiciels/GATK/3.1-1/GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R ~/References/hg19/hg19.fasta -V 96RNAseq.STAR.q1.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o 96RNAseq.STAR.q1.FS30.QD2.vcf

Here are some examples for RNAseq :

chr1 6711349 . G A 79.10 PASS BaseQRankSum=-1.369e+00;ClippingRankSum=1.00;DP=635;FS=1.871;MLEAC=1;MLEAF=5.495e-03;MQ=60.00;MQ0=0;MQRankSum=-4.560e-01;ReadPosRankSum=-1.187e+00 GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,280 ./.:0,0:0 0/0:9,0:9:21:0,21,248 0/0:7,0:7:21:0,21,196 0/0:7,0:7:21:0,21,226 0/0:8,0:8:21:0,21,227 0/0:8,0:8:21:0,21,253 0/0:7,0:7:21:0,21,218 0/0:9,0:9:27:0,27,282 1/1:0,0:5:15:137,15,0 0/0:2,0:2:6:0,6,47 0/0:28,0:28:78:0,78,860 0/0:7,0:7:21:0,21,252 0/0:2,0:2:6:0,6,49 0/0:5,0:5:12:0,12,152 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,126 0/0:9,0:9:21:0,21,315 0/0:7,0:7:21:0,21,256 0/0:7,0:7:21:0,21,160 0/0:8,0:8:21:0,21,298 0/0:20,0:20:60:0,60,605 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,67 0/0:2,0:2:6:0,6,71 0/0:14,0:14:20:0,20,390 0/0:7,0:7:21:0,21,223 0/0:7,0:7:21:0,21,221 0/0:4,0:4:12:0,12,134 0/0:2,0:2:6:0,6,54 ./.:0,0:0 0/0:4,0:4:9:0,9,118 0/0:8,0:8:21:0,21,243 0/0:6,0:6:15:0,15,143 0/0:8,0:8:21:0,21,244 0/0:7,0:7:21:0,21,192 0/0:2,0:2:6:0,6,54 0/0:13,0:13:27:0,27,359 0/0:8,0:8:21:0,21,245 0/0:7,0:7:21:0,21,218 0/0:12,0:12:36:0,36,354 0/0:8,0:8:21:0,21,315 0/0:7,0:7:21:0,21,215 0/0:2,0:2:6:0,6,49 0/0:10,0:10:24:0,24,301 0/0:7,0:7:21:0,21,208 0/0:7,0:7:21:0,21,199 0/0:2,0:2:6:0,6,47 0/0:3,0:3:9:0,9,87 0/0:2,0:2:6:0,6,73 0/0:7,0:7:21:0,21,210 0/0:8,0:8:22:0,22,268 0/0:7,0:7:21:0,21,184 0/0:7,0:7:21:0,21,213 0/0:5,0:5:9:0,9,135 0/0:7,0:7:21:0,21,200 0/0:4,0:4:12:0,12,118 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,217 0/0:8,0:8:21:0,21,255 0/0:9,0:9:24:0,24,314 0/0:8,0:8:21:0,21,221 0/0:9,0:9:24:0,24,276 0/0:9,0:9:21:0,21,285 0/0:3,0:3:6:0,6,90 0/0:2,0:2:6:0,6,57 0/0:13,0:13:20:0,20,385 0/0:2,0:2:6:0,6,48 0/0:11,0:11:27:0,27,317 0/0:8,0:8:21:0,21,315 0/0:9,0:9:24:0,24,284 0/0:7,0:7:21:0,21,228 0/0:14,0:14:33:0,33,446 0/0:2,0:2:6:0,6,64 0/0:2,0:2:6:0,6,72 0/0:7,0:7:21:0,21,258 0/0:10,0:10:27:0,27,348 0/0:7,0:7:21:0,21,219 0/0:9,0:9:21:0,21,289 0/0:20,0:20:57:0,57,855 0/0:4,0:4:12:0,12,146 0/0:7,0:7:21:0,21,205 0/0:12,0:14:36:0,36,1030 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,60 0/0:7,0:7:21:0,21,226 0/0:7,0:7:21:0,21,229 0/0:8,0:8:21:0,21,265 0/0:4,0:4:6:0,6,90 ./.:0,0:0 0/0:7,0:7:21:0,21,229 0/0:2,0:2:6:0,6,59 0/0:2,0:2:6:0,6,56 chr1 7992047 . T C 45.83 SnpCluster BaseQRankSum=1.03;ClippingRankSum=0.00;DP=98;FS=0.000;MLEAC=1;MLEAF=0.014;MQ=60.00;MQ0=0;MQRankSum=-1.026e+00;ReadPosRankSum=-1.026e+00 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,70 0/0:2,0:2:6:0,6,45 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 ./.:1,0:1 ./.:0,0:0 0/0:2,0:2:6:0,6,55 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,61 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,69 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,37 ./.:0,0:0 0/0:2,0:2:6:0,6,67 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:11,0:11:24:0,24,360 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 0/0:2,0:2:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,50 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:4:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:7,0:7:21:0,21,231 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,70 ./.:0,0:0 0/0:6,0:6:15:0,15,148 ./.:0,0:0 ./.:0,0:0 1/1:0,0:2:6:90,6,0 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,74 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,58 0/0:2,0:2:6:0,6,71 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49

For Exome Seq now : chr2 111878571 . C T 93.21 PASS DP=634;FS=0.000;MLEAC=1;MLEAF=5.319e-03;MQ=60.00;MQ0=0;VQSLOD=14.19;culprit=MQ GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,243 0/0:4,0:4:9:0,9,135 0/0:7,0:7:18:0,18,270 0/0:7,0:7:21:0,21,230 0/0:16,0:16:48:0,48,542 0/0:8,0:8:21:0,21,315 0/0:6,0:6:18:0,18,186 0/0:5,0:5:15:0,15,168 0/0:6,0:6:15:0,15,225 0/0:10,0:10:30:0,30,333 0/0:7,0:7:21:0,21,239 0/0:6,0:6:18:0,18,202 0/0:6,0:6:15:0,15,225 0/0:7,0:7:21:0,21,225 0/0:8,0:8:24:0,24,272 0/0:5,0:5:15:0,15,168 1/1:0,0:13:13:147,13,0 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,256 0/0:14,0:14:4:0,4,437 0/0:3,0:3:9:0,9,85 0/0:4,0:4:12:0,12,159 0/0:7,0:7:21:0,21,238 0/0:5,0:5:15:0,15,195 0/0:7,0:7:15:0,15,225 0/0:12,0:12:36:0,36,414 0/0:4,0:4:12:0,12,156 0/0:7,0:7:0:0,0,190 0/0:2,0:2:6:0,6,64 0/0:7,0:7:21:0,21,242 0/0:7,0:7:21:0,21,234 0/0:8,0:8:24:0,24,267 0/0:7,0:7:21:0,21,245 0/0:7,0:7:21:0,21,261 0/0:6,0:6:18:0,18,204 0/0:8,0:8:24:0,24,302 0/0:5,0:5:15:0,15,172 0/0:9,0:9:24:0,24,360 0/0:18,0:18:51:0,51,649 0/0:5,0:5:15:0,15,176 0/0:2,0:2:6:0,6,70 0/0:14,0:14:33:0,33,495 0/0:4,0:4:9:0,9,135 0/0:8,0:8:21:0,21,315 0/0:4,0:4:12:0,12,149 0/0:4,0:4:6:0,6,90 0/0:10,0:10:27:0,27,405 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,133 0/0:14,0:14:6:0,6,431 0/0:4,0:4:12:0,12,151 0/0:5,0:5:15:0,15,163 0/0:3,0:3:9:0,9,106 0/0:7,0:7:21:0,21,237 0/0:7,0:7:21:0,21,268 0/0:8,0:8:21:0,21,315 0/0:2,0:2:6:0,6,68 ./.:0,0:0 0/0:3,0:3:9:0,9,103 0/0:7,0:7:21:0,21,230 0/0:3,0:3:6:0,6,90 0/0:9,0:9:26:0,26,277 0/0:7,0:7:21:0,21,236 0/0:5,0:5:15:0,15,170 ./.:1,0:1 0/0:15,0:15:45:0,45,653 0/0:8,0:8:24:0,24,304 0/0:6,0:6:15:0,15,225 0/0:3,0:3:9:0,9,103 0/0:2,0:2:6:0,6,79 0/0:7,0:7:21:0,21,241 0/0:4,0:4:12:0,12,134 0/0:3,0:3:6:0,6,90 0/0:5,0:5:15:0,15,159 0/0:4,0:4:12:0,12,136 0/0:5,0:5:12:0,12,180 0/0:11,0:11:21:0,21,315 0/0:13,0:13:39:0,39,501 0/0:3,0:3:9:0,9,103 0/0:8,0:8:24:0,24,257 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,280 0/0:4,0:4:12:0,12,144 0/0:4,0:4:9:0,9,135 0/0:8,0:8:24:0,24,298 0/0:4,0:4:12:0,12,129 0/0:5,0:5:15:0,15,184 0/0:2,0:2:6:0,6,62 0/0:2,0:2:6:0,6,65 0/0:9,0:9:27:0,27,337 0/0:7,0:7:21:0,21,230 0/0:7,0:7:21:0,21,239 0/0:5,0:5:0:0,0,113 0/0:11,0:11:33:0,33,369 0/0:7,0:7:21:0,21,248 0/0:10,0:10:30:0,30,395

Thanks for your help.

Created 2014-04-28 15:11:17 | Updated | Tags: allelebalance allelebalancebysample haplotypecaller variantannotator

Comments (16)

Version 3.1.1. Human normal samples.

I couldnt find AlleleBalance and AlleleBalanceBySample tags in my vcf outputs. Tags are not found even for single variant I tried HaplotypeCaller with -all or directly with -A AlleleBalance or -A AlleleBalanceBySample. Also I tried Variantannotator with -all or -A AlleleBalance or -A AlleleBalanceBySample.

Any help will be apreciated

Created 2014-04-16 14:11:37 | Updated | Tags: unifiedgenotyper haplotypecaller bug gatk variant-calling

Comments (18)

HI, I'd like to report a weird result from HaplotypeCaller.

We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.

However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected)

Please see the VCF table:

There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.

Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype.

Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter.

Many Thanks


Created 2013-10-16 12:50:25 | Updated | Tags: unifiedgenotyper haplotypecaller dp

Comments (37)

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,


Created 2013-08-16 21:12:22 | Updated | Tags: haplotypecaller

Comments (48)

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.