# Tagged with #haplotypecaller 6 documentation articles | 4 announcements | 88 forum discussions

Created 2014-05-08 22:09:16 | Updated 2015-05-16 04:53:11 | Tags: haplotypecaller haplotypescore assembly haplotype

This document details the procedure used by HaplotypeCaller to re-assemble read data and determine candidate haplotypes as a prelude to variant calling. For more context on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

Note that we are still working on producing figures to complement the text. We will update this document as soon as the figures are ready. Note also that this is a provisional document and some final corrections may be made for accuracy and/or completeness. Feedback is most welcome!

### Overview

The previous step produced a list of ActiveRegions that showed some evidence of possible variation (see step 1 documentation). Now, we need to process each Active Region in order to generate a list of possible haplotypes based on the sequence data we have for that region.

To do so, the program first builds an assembly graph for each active region (determined in the previous step) using the reference sequence as a template. Then, it takes each read in turn and attempts to match it to a segment of the graph. Whenever portions of a read do not match the local graph, the program adds new nodes to the graph to account for the mismatches. After this process has been repeated with many reads, it typically yields a complex graph with many possible paths. However, because the program keeps track of how many reads support each path segment, we can select only the most likely (well-supported) paths. These likely paths are then used to build the haplotype sequences which will be used to call variants and assign per-sample genotypes in the next steps.

### 1. Reference graph assembly

First, we construct the reference assembly graph, which starts out as a simple directed de Bruijn graph. This involves decomposing the reference sequence into a succession of kmers (pronounced "kay-mers"), which are small sequence subunits that are k bases long. Each kmer sequence overlaps the previous kmer by k-1 bases. The resulting graph can be represented as a series of nodes and connecting edges indicating the sequential relationship between the adjacent bases. At this point, all the connecting edges have a weight of 0.

In addition to the graph, we also build a hash table of unique kmers, which we use to keep track of the position of nodes in the graph. At the beginning, the hash table only contains unique kmers found in the reference sequence, but we will add to it in the next step.
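As a rough sketch (illustrative Python, not GATK's actual implementation; all names here are hypothetical), the decomposition into overlapping kmers, the zero-weight reference edges, and the hash table of unique kmers might look like this:

```python
# Sketch (not GATK code): decompose a reference sequence into successive
# kmers, each overlapping the previous one by k-1 bases, and build the
# initial graph edges (weight 0) plus a hash table of unique kmers.
def reference_kmers(ref, k):
    """Yield successive kmers of length k from the reference sequence."""
    for i in range(len(ref) - k + 1):
        yield ref[i:i + k]

ref = "ACGTACGGA"
k = 4
kmer_positions = {}   # hash table: kmer -> position of its node in the graph
edges = {}            # (kmer_a, kmer_b) -> weight; reference edges start at 0
prev = None
for pos, kmer in enumerate(reference_kmers(ref, k)):
    kmer_positions.setdefault(kmer, pos)
    if prev is not None:
        edges[(prev, kmer)] = 0
    prev = kmer

print(list(reference_kmers("ACGTAC", 4)))  # ['ACGT', 'CGTA', 'GTAC']
```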

A note about kmer size: by default, the program will attempt to build two separate graphs, using kmer sizes of 10 and 25 bases, but other kmer sizes can be specified from the command line with the -kmerSize argument. The final set of haplotypes will be selected from the union of the graphs obtained using each k.

### 2. Threading reads through the graph

This is where our simple reference graph turns into a read-threading graph, so-called because we're going to take each read in turn and try to match it to a path in the graph.

We start with the first read and compare its first kmer to the hash table to find if it has a match. If there is a match, we look up its position in the reference graph and record that position. If there is no match, we consider that it is a new unique kmer, so we add that unique kmer to the hash table and add a new node to the graph. In both cases, we then move on and repeat the process with the next kmer in the read until we reach the end of the read.

When two consecutive kmers in a read belong to two nodes that were already connected by an edge in the graph, we increase the weight of that edge by 1. If the two nodes were not connected yet, we add a new edge to the graph with a starting weight of 1. As we repeat the process on each read in turn, edge weights will accumulate along the paths that are best supported by the read data, which will help us select the most likely paths later on.
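The threading and edge-weighting steps above can be sketched as follows (illustrative Python, not GATK's implementation; the function and variable names are invented for this example):

```python
# Sketch (not GATK code): thread a read through the graph, adding new nodes
# for kmers not yet in the hash table and incrementing edge weights for each
# pair of consecutive kmers in the read.
def thread_read(read, k, kmer_table, edges):
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    for kmer in kmers:
        if kmer not in kmer_table:      # novel kmer: add a node to the graph
            kmer_table[kmer] = len(kmer_table)
    for a, b in zip(kmers, kmers[1:]):  # bump support for each traversed edge
        edges[(a, b)] = edges.get((a, b), 0) + 1

table, edges = {}, {}
for read in ["ACGTA", "CGTAC", "CGTAC"]:
    thread_read(read, 4, table, edges)
print(edges[("CGTA", "GTAC")])  # 2: two reads support this edge
```

Edge weights accumulate along well-supported paths exactly as described above: here the edge CGTA→GTAC ends up with weight 2 because two reads traverse it.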

Note on graph complexity, cycles and non-unique kmers

For this process to work properly, we need the graph to be sufficiently complex (where the number of non-unique kmers is less than 4-fold the number of unique kmers found in the data) and without cycles. In certain genomic regions where there are a lot of repeated sequences, these conditions may not be met, because repeats cause cycles and diminish the number of available unique kmers. If none of the kmer sizes provided results in a viable graph (complex enough and without cycles), the program will automatically try the operation again with larger kmer sizes. Specifically, we take the largest k provided by the user (or by the default settings) and increase it by 10 bases. If no viable graph can be obtained after iterating over increased kmer sizes 6 times, we give up and skip the active region entirely.
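The retry policy described above can be enumerated explicitly (this is a sketch of the policy as stated, not GATK code; the function name is invented):

```python
# Sketch: given the user-supplied (or default) kmer sizes, list every kmer
# size the program may try. The largest supplied k is grown by 10 bases at a
# time, for at most 6 additional attempts, before the region is skipped.
def candidate_kmer_sizes(user_kmer_sizes=(10, 25), step=10, max_attempts=6):
    sizes = list(user_kmer_sizes)
    k = max(user_kmer_sizes)
    for _ in range(max_attempts):
        k += step
        sizes.append(k)
    return sizes

print(candidate_kmer_sizes())  # [10, 25, 35, 45, 55, 65, 75, 85]
```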

### 3. Graph refinement

Once all the reads have been threaded through the graph, we need to clean it up a little. The main cleaning-up operation is called pruning (like the gardening technique). The goal of the pruning operation is to remove noise due to errors. The basic idea is that sections of the graph that are supported by very few reads are most probably the result of stochastic errors, so we are going to remove any sections that are supported by fewer than a certain threshold number of reads. By default the threshold value is 2, but this can be controlled from the command line using the -minPruning argument. In practice, this means that linear chains in the graph (linear sequence of vertices and edges without any branching) where all edges have fewer than 2 supporting reads will be removed. Increasing the threshold value will lead to faster processing and higher specificity, but will decrease sensitivity. Decreasing this value will do the opposite, decreasing specificity but increasing sensitivity.
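Simplified to the level of individual edges, the pruning idea looks like this (illustrative Python only; the real operation removes whole weakly supported linear chains, and the names here are invented):

```python
# Sketch (not GATK code): drop edges supported by fewer reads than the
# -minPruning threshold (default 2), mimicking removal of noise due to
# stochastic sequencing errors.
def prune(edges, min_pruning=2):
    return {e: w for e, w in edges.items() if w >= min_pruning}

edges = {("A", "B"): 5, ("B", "C"): 1, ("B", "D"): 4}
print(prune(edges))  # {('A', 'B'): 5, ('B', 'D'): 4}
```

Raising min_pruning discards more of the graph (faster, more specific, less sensitive); lowering it keeps more (slower, more sensitive, less specific), as described above.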

At this stage, the program also performs graph refinement operations, such as recovering dangling heads and tails from the splice junctions, to compensate for issues related to limitations in graph assembly.

Note that if you are calling multiple samples together, the program also looks at how many of the samples support each segment, and only prunes segments for which fewer than a certain number of samples have the minimum required number of supporting reads. By default this sample number is 1, so as long as one sample in the cohort passes the pruning threshold, the segment will NOT be pruned. This is designed to avoid losing singletons (variants that are unique to a single sample in a cohort). This parameter can also be controlled from the command line using the -minPruningSamples argument, but keep in mind that increasing the default value may lead to decreased sensitivity.

### 4. Select best haplotypes

Now that the graph is all cleaned up, the program builds haplotype sequences by traversing all possible paths in the graph and calculates a likelihood score for each one. This score is calculated as the product of transition probabilities of the path edges, where the transition probability of an edge is computed as the number of reads supporting that edge divided by the sum of the support of all edges that share that same source vertex.

In order to limit the amount of computation needed for the next step, we limit the number of haplotypes that will be considered for each value of k (remember that the program builds graphs for multiple kmer sizes). This is easy to do since we conveniently have scores for each haplotype; all we need to do is select the N haplotypes with the best scores. By default that number is very generously set to 128 (so the program would proceed to the next step with up to 128 haplotypes per value of k) but this can be adjusted from the command line using the -maxNumHaplotypesInPopulation argument. You would mainly want to decrease this number in order to improve speed; increasing that number would rarely be reasonable, if ever.
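The scoring rule described above (product of transition probabilities, where each transition probability is the edge's support divided by the total support of edges leaving the same source vertex) and the top-N selection can be sketched like this (illustrative Python, hypothetical names, not GATK code):

```python
# Sketch (not GATK code): score a path through the graph as the product of
# its edge transition probabilities, then keep the N best-scoring paths.
def path_score(path, edges):
    score = 1.0
    for a, b in zip(path, path[1:]):
        # total support of all edges sharing source vertex a
        out_total = sum(w for (src, _), w in edges.items() if src == a)
        score *= edges[(a, b)] / out_total
    return score

edges = {("A", "B"): 8, ("A", "C"): 2, ("B", "D"): 10}
paths = [("A", "B", "D"), ("A", "C")]
# keep the best 128 (the -maxNumHaplotypesInPopulation default)
best = sorted(paths, key=lambda p: path_score(p, edges), reverse=True)[:128]
print(path_score(("A", "B", "D"), edges))  # 0.8 (= 8/10 * 10/10)
```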

### 5. Identify potential variation sites

Once we have a list of plausible haplotypes, we perform a Smith-Waterman alignment (SWA) of each haplotype to the original reference sequence across the active region in order to reconstruct a CIGAR string for the haplotype. Note that indels will be left-aligned; that is, their start position will be set as the leftmost position possible.
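To make the left-alignment convention concrete, here is a minimal sketch for the simplest case, a deletion inside a repeated run (illustrative Python only; GATK's actual normalization handles general indels via the CIGAR):

```python
# Sketch (not GATK code): left-align a deletion of `length` bases starting at
# `start` by shifting it left while the base preceding the deletion equals
# the last deleted base (the resulting sequence is unchanged by the shift).
def left_align_deletion(ref, start, length):
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start

# Deleting one "A" from "CAAAT": positions 1, 2 and 3 are all equivalent;
# left-alignment reports the leftmost representation, position 1.
print(left_align_deletion("CAAAT", 3, 1))  # 1
```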

This finally yields the potential variation sites that will be put through the variant modeling step next, bringing us back to the "classic" variant calling methods (as used by GATK's UnifiedGenotyper and Samtools' mpileup). Note that this list of candidate sites is essentially a super-set of what will eventually be the final set of called variants. Every site that will be called variant is in the super-set, but not every site that is in the super-set will be called variant.

Created 2014-04-10 21:57:10 | Updated 2015-02-20 18:10:09 | Tags: haplotypecaller reference-model gvcf

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFs), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

### How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

• Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
• Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for whichever of these two models yields the lower confidence.

We use a symbolic allele, <NON_REF>, to represent the possibility that the site is not homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

For details of the gVCF format, please see the document that explains what a gVCF is.

Created 2014-04-03 20:20:08 | Updated 2014-10-22 19:22:34 | Tags: haplotypecaller genotypegvcfs combinegvcfs gvcf joint-analysis

### Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

### Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

### General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

### The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock lines of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we recommend using the GVCF mode (-ERC GVCF).
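The banding logic can be sketched as follows (illustrative Python only, with invented names; the band boundaries [5, 20, 60] match the ##GVCFBlock header lines shown in the example below):

```python
# Sketch (not GATK code): group consecutive sites into non-variant blocks by
# the GQ band each site falls into; a new block starts whenever the band
# changes, as with the ##GVCFBlock ranges in the gVCF header.
import bisect

BOUNDS = [5, 20, 60]  # bands: [0,5), [5,20), [20,60), [60,inf)

def band(gq):
    return bisect.bisect_right(BOUNDS, gq)

def make_blocks(gqs, start_pos=1):
    blocks, block_start = [], 0
    for i in range(1, len(gqs) + 1):
        if i == len(gqs) or band(gqs[i]) != band(gqs[block_start]):
            blocks.append((start_pos + block_start, start_pos + i - 1))
            block_start = i
    return blocks

print(make_blocks([99, 99, 45, 12, 99]))  # [(1, 2), (3, 3), (4, 4), (5, 5)]
```

Sites with GQ 99 and 99 share a band and merge into one block; the GQ 45 and GQ 12 sites each fall into different bands and so start new blocks.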

### Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -ERC GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta

Toward the middle, after the ##FORMAT lines, you see the ##GVCFBlock lines (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)

which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

#### Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
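For reference, recovering the span of a non-variant block from a gVCF record is a matter of reading POS and the END tag in the INFO field. A minimal sketch (illustrative Python, assuming standard tab-separated VCF columns; not a GATK tool):

```python
# Sketch: parse the start (POS) and end (INFO END tag) of a non-variant
# block record from a tab-separated gVCF line.
def block_span(vcf_line):
    fields = vcf_line.split("\t")
    pos = int(fields[1])
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    end = int(info.get("END", pos))  # single-site records have no END tag
    return pos, end

line = ("20\t10000000\t.\tT\t<NON_REF>\t.\t.\tEND=10000116\t"
        "GT:DP:GQ:MIN_DP:PL\t0/0:44:99:38:0,89,1385")
print(block_span(line))  # (10000000, 10000116)
```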

Created 2013-08-23 21:34:04 | Updated 2014-10-24 16:21:11 | Tags: unifiedgenotyper haplotypecaller ploidy

Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also offers several unique capabilities, such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.

As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.

Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

• If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
• If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
• If you want to analyze more than 100 samples at a time (for performance reasons) (versions 2.x)

Created 2013-06-17 21:31:21 | Updated 2015-05-16 07:04:42 | Tags: haplotypecaller variant-discovery

#### Objective

Call variants on a single genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

#### Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

• TBD

#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

• Emission confidence threshold (--stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (--stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that the site passed the confidence threshold test. The term filtered is even more confusing: in ordinary language, saying that sites were filtered usually means that those sites successfully passed a filtering test, whereas in the GATK's technical language it means that those sites failed the filtering test. In effect, it means that those sites would be filtered out if the filter were actually used to remove low-confidence calls from the callset, instead of just tagging them. Both usages are valid depending on the point of view of the person reporting the results, so it is always important to check the context when interpreting results that include these terms.

### 2. Call variants in your sequence data

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-L 20 \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o raw_variants.vcf

Note: This is an example command. Please look up what the arguments do and see if they fit your analysis before copying this. To see how the -L argument works, you can refer here: http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals#latest

#### Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinction made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Created 2012-07-30 17:37:12 | Updated 2015-10-30 19:34:24 | Tags: unifiedgenotyper haplotypecaller snp bamout

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

### 1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on top, and the bam file after reassembly on the bottom. In this case, there seem to be many SNPs present; however, after reassembly, we find there is really a large deletion!

### 2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.

### 3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

### 4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.

### 5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

### 6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site, you are probably seeing false positives.

Created 2015-11-25 07:37:00 | Updated 2015-11-25 14:21:18 | Tags: haplotypecaller release mutect version-highlights topstory mutect2

The last GATK 3.x release of the year 2015 has arrived!

The major feature in GATK 3.5 is the eagerly awaited MuTect2 (beta version), which brings somatic SNP and Indel calling to GATK. This is just the beginning of GATK’s scope expansion into the somatic variant domain, so expect some exciting news about copy number variation in the next few weeks! Meanwhile, more on MuTect2 awesomeness below.

In addition, we’ve got all sorts of variant context annotation-related treats for you in the 3.5 goodie bag -- both new annotations and new capabilities for existing annotations, listed below.

In the variant manipulation space, we enhanced or fixed functionality in several tools including LeftAlignAndTrimVariants, FastaAlternateReferenceMaker and VariantEval modules. And in the variant calling/genotyping space, we’ve made some performance improvements across the board to HaplotypeCaller and GenotypeGVCFs (mostly by cutting out crud and making the code more efficient) including a few improvements specifically for haploids. Read the detailed release notes for more on these changes. Note that GenotypeGVCFs will now emit no-calls at sites where RGQ=0 in acknowledgment of the fact that those sites are essentially uncallable.

We’ve got good news for you if you’re the type who worries about disk space (whether by temperament or by necessity): we finally have CRAM support -- and some recommendations for keeping the output of BQSR down to reasonable file sizes, detailed below.

Finally, be sure to check out the detailed release notes for the usual variety show of minor features (including a new Queue job runner that enables local parallelism), bug fixes and deprecation notices (a few tools have been removed from the codebase, in the spirit of slimming down ahead of the holiday season).

### Introducing MuTect2 (beta): calling somatic SNPs and Indels natively in GATK

MuTect2 is the next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller.

The original MuTect (Cibulskis et al., 2013) was built on top of the GATK engine by the Cancer Genome Analysis group at the Broad Institute, and was distributed as a separate package. By all accounts it did a great job calling somatic SNPs, and was part of the winning entries for multiple DREAM challenges (including some submitted by groups outside the Broad). However it was not able to call indels; and the less said about the indel caller that accompanied it (first named SomaticIndelDetector then Indelocator) the better.

This new incarnation of MuTect leverages much of the HaplotypeCaller’s internal machinery (including the all-important graph assembly bit) to call both SNPs and indels together. Yet it retains key parts of the original MuTect’s internal genotyping engine that allow it to model somatic variation appropriately. This is a major differentiation point compared to HaplotypeCaller, which has expectations about ploidy and allele frequencies that make it unsuitable for calling somatic variants.

As a convenience add-on to MuTect2, we also integrated the cross-sample contamination estimation tool ContEst into GATK 3.5. Note that while the previous public version of this tool relied on genotyping chip data for its operation, this version of the tool has been upgraded to enable on-the-fly genotyping for the case where genotyping data is not available. Documentation of this feature will be provided in the near future. Both MuTect2 and ContEst are now featured in the Tool Documentation section of the Guide. Stay tuned for pipeline-level documentation on performing somatic variant discovery, to be added to the Best Practices docs in the near future.

Please note that this release of MuTect2 is a beta version intended for research purposes only and should not be applied in production/clinical work. MuTect2 has not yet undergone the same degree of scrutiny and validation as the original MuTect since it is so new. Early validation results suggest that MuTect2 has a tendency to generate more false positives as compared to the original MuTect; for example, it seems to overcall somatic mutations at low allele frequencies, so for now we recommend applying post-processing filters, e.g. by hard-filtering calls with low minor allele frequencies. Rest assured that data is being generated and the tools are being improved as we speak. We’re also looking forward to feedback from you, the user community, to help us make it better faster.

Finally, note also that MuTect2 is distributed under the same restricted license as the original MuTect; for-profit users are required to seek a license to use it (please email softwarelicensing@broadinstitute.org). To be clear, while MuTect2 is released as part of GATK, the commercial licensing has not been consolidated under a single license. Therefore, current holders of a GATK license will still need to contact our licensing office if they wish to use MuTect2.

### Annotate this: new and improved variant context annotations

Whew that was a long wall of text on MuTect2, wasn’t it. Let’s talk about something else now. Annotations! Not functional annotations, mind you -- we’re not talking about e.g. predicting synonymous vs. non-synonymous mutations here. I mean variant context annotations, i.e. all those statistics calculated during the variant calling process which we mostly use to estimate how confident we are that the variants are real vs. artifacts (for filtering and related purposes).

So we have two new annotations, BaseCountsBySample (what it says on the can) and ExcessHet (for excess heterozygosity, i.e. the number of heterozygote calls made in excess of the Hardy-Weinberg expectations), as well as a set of new annotations that are allele-specific versions of existing annotations (with AS_ prefix standing for Allele-Specific) which you can browse here. Right now we’re simply experimenting with these allele-specific annotations to determine what would be the best way to make use of them to improve variant filtering. In the meantime, feel free to play around with them (via e.g. VariantsToTable) and let us know if you come up with any interesting observations. Crowdsourcing is all the rage, let’s see if it gets us anywhere on this one!
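
To make the ExcessHet idea concrete, here is a toy Python sketch of comparing an observed heterozygote count against the Hardy-Weinberg expectation for a biallelic site. This is purely illustrative (the helper name and the simple two-allele model are assumptions), not the actual GATK implementation:

```python
# Illustrative sketch, NOT GATK code: compare the observed heterozygote
# count in a cohort against the Hardy-Weinberg expectation at a biallelic site.
def excess_het_sketch(n_hom_ref, n_het, n_hom_alt):
    """Return (observed, expected) heterozygote counts under HWE."""
    n = n_hom_ref + n_het + n_hom_alt            # number of genotyped samples
    if n == 0:
        return 0.0, 0.0
    p = (2 * n_hom_ref + n_het) / (2 * n)        # reference allele frequency
    q = 1.0 - p                                  # alternate allele frequency
    expected_het = 2 * p * q * n                 # HWE expectation
    return float(n_het), expected_het

# An observed count well above the expectation hints at excess heterozygosity
obs, exp = excess_het_sketch(40, 55, 5)
```

The real annotation turns this comparison into a phred-scaled statistic, but the underlying question is the same: are there more hets than HWE predicts?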

We also made some improvements to the StrandAlleleCountsBySample annotation, to how VQSR handles MQ, and to how VariantAnnotator makes use of external resources -- and we fixed that annoying bug where default annotations were getting dropped. All of which you can read about in the detailed release notes.

### These Three Awesome File Hacks Will Restore Your Faith In Humanity’s Ability To Free Up Some Disk Space

CRAM support! Long-awaited by many, lovingly implemented by Vadim Zalunin at EBI and colleagues at the Sanger Institute. We haven’t done extensive testing, and there are a few tickets for improvements that are planned at the htsjdk level -- but it works well enough that we’re comfortable releasing it under a beta designation. Meaning have fun with it, but do your own thorough testing before putting it into production or throwing out your old BAMs!

Static binning of base quality scores. In a nutshell, binning (or quantizing) the base qualities in a BAM file means that instead of recording all possible quality values separately, we group them into bins represented by a single value (by default, 10, 20, 30 or 40). By doing this we end up having to record fewer separate numbers, which through the magic of BAM compression yields substantially smaller files. The idea is that we don’t actually need to be able to differentiate between quality scores at a very high resolution -- if the binning scheme is set up appropriately, it doesn’t make any difference to the variant discovery process downstream. This is not a new concept, but now the GATK engine has an argument to enable binning quality scores during the base recalibration (BQSR) process using a static binning scheme that we have determined produces optimal results in our hands. The level of compression is of course adjustable if you’d like to set your own tradeoff between compression and base quality resolution. We have validated that this type of binning (with our chosen default parameters) does not have any noticeable adverse effect on germline variant discovery. However we are still looking into some possible effects on somatic variant discovery, so we can’t yet recommend binning for that application.
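
To illustrate the binning concept, here is a minimal Python sketch; the nearest-bin rule and the helper name are assumptions for illustration, not the actual recalibration code:

```python
# Illustrative sketch, NOT the GATK implementation: map each base quality
# to the nearest bin from a fixed (static) set of representative values.
DEFAULT_BINS = (10, 20, 30, 40)  # the default bins mentioned above

def bin_quality(q, bins=DEFAULT_BINS):
    """Map a base quality to the closest static bin."""
    return min(bins, key=lambda b: abs(b - q))

# Fewer distinct values in the file means better BAM compression
binned = [bin_quality(q) for q in (2, 14, 27, 33, 41)]
```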

Disable indel quality scores. The Base Recalibration process produces indel quality scores in addition to the regular base qualities. They are stored in the BI and BD tags of the read records, taking up a substantial amount of space in the resulting BAM files. There has been a lot of discussion about whether these indel quals are worth the file size inflation. Well, we’ve done a lot of testing and we’ve now decided that no, for most use cases the indel quals don’t make enough of a difference to justify the extra file size. The one exception is PacBio data, where it seems that indel quals may help model the indel-related errors of that technology. For everything else, we’re now comfortable recommending the use of the --disable_indel_quals argument when writing out the recalibrated BAM file with PrintReads.

Created 2015-11-25 07:10:45 | Updated 2016-01-27 17:38:33 | Tags: Promote haplotypecaller release-notes mutect gatk3 mutect2

GATK 3.5 was released on November 25, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.

### New tools

• MuTect2: somatic SNP and indel caller based on HaplotypeCaller and the original MuTect.
• ContEst: estimation of cross-sample contamination (primarily for use in somatic variant discovery).
• GatherBqsrReports: utility to gather recalibration tables from scatter-parallelized BaseRecalibrator runs.

### Variant Context Annotations

• Added allele-specific versions of existing annotations: AS_BaseQualityRankSumTest, AS_FisherStrand, AS_MappingQualityRankSumTest, AS_RMSMappingQuality, AS_RankSumTest, AS_ReadPosRankSumTest, AS_StrandOddsRatio, AS_QualByDepth and AS_InbreedingCoeff.

• Added BaseCountsBySample annotation. Intended to provide insight into the pileup of bases used by HaplotypeCaller in the calling process, which may differ from the pileup observed in the original bam file because of the local realignment and additional filtering performed internally by HaplotypeCaller. Can only be requested from HaplotypeCaller, not VariantAnnotator.

• Added ExcessHet annotation. Estimates excess heterozygosity in a population of samples. Related to but distinct from InbreedingCoeff, which estimates evidence for inbreeding in a population. ExcessHet scales more reliably to large cohort sizes.

• Added FractionInformativeReads annotation. Reports the fraction of reads that were considered informative by HaplotypeCaller (over all samples).

• Enforced calculating GenotypeAnnotations before InfoFieldAnnotations. This ensures that the AD value is available to use in the QD calculation.

• Reorganized standard annotation groups processing to ensure that all default annotations always get annotated regardless of what is specified on the command line. This fixes a bug where default annotations were getting dropped when the command line included annotation requests.

• Made GenotypeGVCFs subset StrandAlleleCounts intelligently, i.e. subset the SAC values to the called alleles. Previously, when the StrandAlleleCountsBySample (SAC) annotation was present in GVCFs, GenotypeGVCFs carried it over to the final VCF essentially unchanged. This was problematic because SAC includes the counts for all alleles originally present (including NON-REF) even when some are not called in the final VCF. When the full list of original alleles is no longer available, parsing SAC could become difficult if not impossible.

• Added new MQ jittering functionality to improve how VQSR handles MQ. Note that HaplotypeCaller now calculates a new annotation called RAW_MQ per-sample, which is then integrated per-cohort by GenotypeGVCFs to produce the MQ annotation.

• VariantAnnotator can now annotate FILTER field from an external resource. Usage: --resource:foo resource.vcf --expression foo.FILTER

• VariantAnnotator can now check allele concordance when annotating with an external resource. Usage: --resourceAlleleConcordance

• Bug fix: The annotation framework was improved to allow for the collection of sufficient statistics during GVCF creation which are then used to compute the final annotation during the genotyping. This avoids the use of median as the representative annotation from the collection of values (one from each sample). TL;DR annotations will be more accurate when using the GVCF workflow for joint discovery.
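
The SAC subsetting change above can be sketched as follows. This is a hypothetical illustration (assuming SAC stores a forward/reverse count pair per allele, in the order the alleles appear in the GVCF record), not the GenotypeGVCFs code itself:

```python
# Illustrative sketch, NOT GATK code: keep only the (forward, reverse)
# SAC count pairs for the alleles that survive into the final VCF.
def subset_sac(sac, original_alleles, called_alleles):
    """Return the SAC values restricted to the called alleles."""
    pairs = {a: (sac[2 * i], sac[2 * i + 1])
             for i, a in enumerate(original_alleles)}
    out = []
    for allele in called_alleles:
        out.extend(pairs[allele])
    return out

# e.g. drop the counts for the uncalled "T" and "<NON_REF>" alleles
subset = subset_sac([10, 12, 6, 5, 0, 1, 0, 0],
                    ["A", "G", "T", "<NON_REF>"],
                    ["A", "G"])
```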

### Variant manipulation tools

• Allowed overriding hard-coded cutoff for allele length in ValidateVariants and in LeftAlignAndTrimVariants. Usage: --reference_window_stop N where N is the desired cutoff.

• Also in LeftAlignAndTrimVariants, trimming multiallelic alleles is now the default behavior.

• Also in FastaAlternateReferenceMaker, fixed merging of contiguous intervals properly, and made the tool produce more informative contig names.

• Fixed a bug in CombineVariants that occurred when one record has a spanning deletion and needs a padded reference allele.

• Added a new VariantEval evaluation module, MetricsCollection, that summarizes metrics from several EV modules.

• Enabled family-level stratification in MendelianViolationEvaluator of VariantEval (if a ped file is provided), making it possible to count Mendelian violations for each family in a callset with multiple families.

• Added the ability to SelectVariants to enforce 4.2 version output of the VCF spec when processing older files. Use case: the 4.2 spec specifies that GQ must be an integer; by default we don’t enforce it (so if reading an older file that used decimals, we don’t change it) but the new argument --forceValidOutput converts the values on request. Not made default because of some performance slowdown -- so writing VCFs is now fast by default, compliant by choice.

### GVCF tools

• Various improvements to the tools’ performance, especially HaplotypeCaller, by making the code more efficient and cutting out crud.

• GenotypeGVCFs now emits a no-call (./.) when the evidence is too ambiguous to make a call at all (e.g. all the PLs are zero). Previously this would have led to a hom-ref call with RGQ=0.

• Fixed a bug in GenotypeGVCFs that sometimes generated invalid VCFs for haploid callsets. The tool was carrying over the AD from alleles that had been trimmed out, causing field length mismatches.

• Changed the genotyping implementation for haploid organisms to address performance problems reported when running GenotypeGVCFs on haploid callsets. Note that this change may lead to a slight loss of sensitivity at low-coverage sites -- let us know if you observe anything dramatic.
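
The new no-call behaviour of GenotypeGVCFs described above can be sketched like this. It is a toy model for a biallelic diploid site, not the actual genotyping engine:

```python
# Toy illustration, NOT GATK code: when every PL is zero the data carry no
# information, so emit a no-call ("./.") rather than a hom-ref with RGQ=0.
def genotype_or_nocall(pls):
    """Return './.' for uninformative PLs, else the index of the best PL
    (0 = hom-ref, 1 = het, 2 = hom-alt for a biallelic site)."""
    if all(pl == 0 for pl in pls):
        return "./."
    return pls.index(min(pls))

calls = [genotype_or_nocall(pls) for pls in ([0, 0, 0], [0, 30, 450])]
```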

### Genotyping engine tweaks

• Ensured inputPriors get used if they are specified to the genotyper (previously they were ignored). Also improved the docs on the --heterozygosity and --indel_heterozygosity priors.

• Fixed bug that affected the --ignoreInputSamples behavior of CalculateGenotypePosteriors.

• Limited emission of the scary warning message about max number of alleles (“this tool is set to genotype at most x alleles but we found more; only x will be used”) to a single occurrence unless DEBUG logging mode is activated. Otherwise it fills up our output logs.

### Miscellaneous tool fixes

• Added option to OverclippedReadFilter to not require soft-clips on both ends. Contributed by Jacob Silterra.

• Fixed a bug in IndelRealigner where the tool was incorrectly "fixing" mates when supplementary alignments are present. The patch involves ignoring supplementary alignments.

• Fixed a bug in CatVariants. Previously, VCF files were being sorted solely on the base pair position of the first record, ignoring the chromosome. This can become problematic when merging files from different chromosomes, especially if you have multiple VCFs per chromosome. Contributed by John Wallace.

### Engine-level behaviors and capabilities

• Support for reading and writing CRAM files. Some improvements are still expected in htsjdk. Contributed by Vadim Zalunin at EBI and collaborators at the Sanger Institute.

• Made interval-list output format dependent on the file extension (for RealignerTargetCreator). If the extension is .interval_list, output will be formatted as a proper Picard interval list (with sequence dictionary). Otherwise it will be a basic GATK interval list as previously.

• Added static binning capability for base recalibration (BQSR).

### Queue

• Added a new JobRunner called ParallelShell that will run jobs locally on one node concurrently as specified by the DAG, with the option to limit the maximum number of concurrently running jobs using the flag maximumNumberOfJobsToRunConcurrently. Contributed by Johan Dahlberg.

• Updated extension for Picard CalculateHsMetrics to include PER_TARGET_COVERAGE argument and added extension for Picard CollectWgsMetrics.

### Deprecation notice

Removed:

• BeagleOutputToVCF, VariantsToBeagleUnphased, ProduceBeagleInput. These are tools for handling Beagle data. The latest versions of Beagle support VCF input and output, so there is no longer any reason for us to provide converters.
• ReadAdaptorTrimmer and VariantValidationAssessor. These were experimental tools which we think are not useful and not operating on a sufficiently sound basis.
• BaseCoverageDistribution and CoveredByNSamplesSites. These tools were redundant with DiagnoseTargets and/or DepthOfCoverage.
• LiftOverVariants, FilterLiftedVariants and liftOverVCF.pl. The Picard liftover tool LiftoverVCF works better and is easier to operate.
• sortByRef.pl. Use Picard SortVCF instead.
• ListAnnotations. This was intended as a utility for listing annotations easily from command line, but it has not proved useful.

### Meta

• Moved htsjdk & picard to version 1.141

Created 2015-05-15 04:52:05 | Updated 2015-11-25 07:08:50 | Tags: haplotypecaller release-notes genotypegvcfs gatk3

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.

### New tool

• ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

### HaplotypeCaller & GenotypeGVCFs

• Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
• Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
• Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
• Clamped the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
• Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
• Initialized annotations so that --disableDithering actually works.
• Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
• Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
• Added the RGQ (Reference Genotype Quality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic hom-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
• Removed GenotypeSummaries from default annotations.
• Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
• Disallowed changing the -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
• Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
• Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
• Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
• Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
• Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
• Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).

### CombineGVCFs

• Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
• Fixed a bug that caused the tool to stop processing after the first contig.
• Fixed a bug where the wrong REF allele was output to the combined gVCF.
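
The band-breaking logic of --breakBandsAtMultiplesOf can be sketched as follows; this is an illustrative reimplementation of the idea, not the CombineGVCFs source:

```python
# Illustrative sketch, NOT GATK code: split a 1-based inclusive reference
# block [start, end] so that no sub-block spans across a genomic position
# that is a multiple of n (each new sub-block starts at a multiple of n).
def break_block(start, end, n):
    """Return the list of (start, end) sub-blocks."""
    blocks, s = [], start
    while s <= end:
        # last position before the next multiple of n strictly after s
        next_break = ((s // n) + 1) * n - 1
        e = min(end, next_break)
        blocks.append((s, e))
        s = e + 1
    return blocks

pieces = break_block(95, 130, 100)
```

This is why scatter intervals chosen at multiples of N can no longer start in the middle of a block.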

### VariantRecalibrator

• Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
• VQSR VCF header command line now contains annotations and tranche levels.

### SelectVariants

• Added -trim argument to trim (simplify) alleles to a minimal representation.
• Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
• Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.

### VariantAnnotator

• Improvements to the allele trimming functionalities.
• Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.

### CalculateGenotypePosteriors

• Fixed a user-reported bug involving a "trio" family with two children and one parent.
• Added error handling for genotypes that are called but have no PLs.

### Various tools

• BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
• DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
• SplitSamFile: Fixed a bug that caused a NullPointerException.
• SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
• IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
• CombineVCFs: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
• VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
• GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
• ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
• CatVariants: Now allows different input / output file types, and spaces in directory names.
• VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.

• Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
• Handled X and = CIGAR operators appropriately.
• Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).

### Annotations

• Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
• Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
• Changed GC Content value type from Integer to Float.
• Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
• Made annotators emit a warning if they can't be applied.

### GATK Engine & common features

• Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
• Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
• Handled X and = CIGAR operators appropriately.
• Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
• Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now going to stderr.
• Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
• Fixed BCF writing when FORMAT annotations contain arrays.
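
The naming scheme for unique GATKCommandLine header keys described above can be sketched as follows (an illustrative helper, not the engine code):

```python
# Illustrative sketch, NOT GATK code: append the walker name to the
# "GATKCommandLine" key, and a counter from the second occurrence onward.
def command_line_keys(walkers):
    """Yield a unique GATKCommandLine header key per walker invocation."""
    seen = {}
    keys = []
    for w in walkers:
        seen[w] = seen.get(w, 0) + 1
        if seen[w] == 1:
            keys.append(f"GATKCommandLine.{w}")
        else:
            keys.append(f"GATKCommandLine.{w}.{seen[w]}")
    return keys

keys = command_line_keys(["SelectVariants", "SomeWalker", "SomeWalker"])
```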

### Queue

• Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
• Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
• Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.

### Documentation

• Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

### For developers

• Upgraded SLF4J to allow new convenient logging syntaxes.
• Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
• Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
• Collected VCF IDs and header lines into one place (GATKVCFConstants).

Created 2014-08-27 18:39:39 | Updated 2014-12-16 03:19:38 | Tags: haplotypecaller ploidy haploid polyploid

Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.

#### UPDATE:

We have added omniploidy support to the GVCF handling tools, with the following limitations:

• When running, you need to indicate the sample ploidy that was used to generate the GVCFs with -ploidy. As usual 2 is the default ploidy.

• The system does not support mixed ploidy across samples nor positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have some genotype whose ploidy does not match the -ploidy argument.

#### LATEST UPDATE:

As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.

Created 2016-02-05 15:25:48 | Updated 2016-02-05 15:28:04 | Tags: haplotypecaller gvcf no-calls

Hi,

I have generated a gVCF for an exome (with non-variant block records) from a BAM file belonging to the 1000Genomes data. I am using GATK tools version 3.5-0-g36282e4 and I have run the HaplotypeCaller as follows:

time java -jar $gatk_dir/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R$reference \ -I bamfile \ -ploidy 2 \ -stand_call_conf 20 \ -stand_emit_conf 10 \ -ERC GVCF \ -o output.g.vcf.gz Within the purpose of the analysis I am performing, from this gVCF I need to be able to know whether the positions are no-called, homozygous reference, variant sites or if the positions were not targeted in the exome sequencing. However, with the gVCF file I obtained I am not able to do it because there are only variant site records or non-variant block records where the GT tag is always "0/0". So I have few questions regarding the non-variant block records: 1. Why the output file does not contain any no-call ("./.") record? 2. Shouldn't regions where there are no reads have the tag GT equal to "./." instead of "0/0"? 3. How can regions without reads (not targeted) be distinguished from regions with reads that were not called? 4. When looking at the bam file with IGV, non-variant blocks displayed in gVCF contain regions with reads. What is the explanation for such behaviour? Thank you for your attention, Sofia Created 2016-02-05 09:25:43 | Updated | Tags: haplotypecaller best-practices merge rna-seq Hello, and thanks for making all the GATK tools! I have recently started to try my hand at variant calling of my RNA-seq data, following the GATK Best Practices more or less verbatim, only excluding indel alignment (because I am only interested in SNPs at this point) and the BQSR (partly because I have very high quality data, but mostly because I couldn't get it to work in the workflow). I have three replicates for each of my samples, and my question is where, if at all, I should merge the data from them. I am not sure if I can (or even should!) merge the FASTQ files before the alignment step, or merge the aligned BAM files, or something else. I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow using STAR... 
What would be the "correct" way to do it, if at all? How would merging (at some level) affect the speed of the workflow, and can I optimise that somehow? If it's a bad idea to do merging, how would I determine the "true" variant from my three resulting VCF-files at the end, for cases where they differ? Created 2016-02-03 20:16:19 | Updated | Tags: haplotypecaller ploidy I have NGS data from an organism that is sometimes found to be aneuploid (but usually diploid.) I am trying to determine the ploidy of the individual I have sequenced from my population of interest. I have ~100X coverage, but the best reference is so distantly-related that the normal SNP coverage plots that are used to estimate ploidy are too messy to interpret. I was hoping that I could use HaplotypeCaller here to help determine the number of haplotypes per chromosome (where 2 haplotypes/alleles per position indicates diploidy). I can't figure out how to interpret the resulting VCF file well enough to address this question. If HaplotypeCaller will not be sufficient, are there any other frequently used options to determine ploidy (or number of alleles/haplotypes per position)? Thanks for any assistance. Alex Created 2016-02-03 15:40:20 | Updated | Tags: commandlinegatk haplotypecaller pcrmodel Dear GATK team, I'm a bit confused by --pcr_indel_model argument in HaplotypeCaller. As a can see from the docs, this argument is not required, but in its description I still read the following: "VERY IMPORTANT: when using PCR-free sequencing data we definitely recommend setting this argument to NONE". Does some PCR-bias-oriented filtration is performed by default (so, for PCR-free datasets I should set it to NONE), or actually I don't need to set this argument to NONE (even processing PCR-free datasets) if I simply don't use it? 
Best regards, Svyatoslav

Created 2016-01-28 15:08:47 | Updated | Tags: indelrealigner variantrecalibrator vqsr haplotypecaller best-practices

The release notes for 3.5 state "Added new MQ jittering functionality to improve how VQSR handles MQ". My understanding is that in order to use this, we will need to use the --MQCapForLogitJitterTransform argument in VariantRecalibrator. I have a few questions on this:

1) Is the above correct, i.e. is the new MQ jittering functionality only used if --MQCapForLogitJitterTransform is set to something other than the default value of zero?
2) Is the use of MQCapForLogitJitterTransform recommended?
3) If we do use MQCapForLogitJitterTransform, the tool documentation states "We recommend to either use --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or use a MQCap=max+10 to take that into account". Is one of these to be preferred over the other? Given that sites that have been realigned can have values up to 70, but sites that have not can have values no higher than 60, it seems to me that the ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller option might be preferred, but I would like to check before running.

Created 2016-01-28 13:26:14 | Updated | Tags: haplotypecaller optimization faster

Hello everyone, I am using GATK in a clinical context for NGS diagnosis. The issue is that HaplotypeCaller takes too much time (2h per patient). I have tried these things:

• reducing the BAM file size by keeping only the genomic regions of my diagnostic genes, but it looks like it still runs over the whole hg19 genome;
• asking for "only variants" with the output_mode option, but the output file is exactly the same as the default one;
• using several CPU threads, but 1 CPU = 147 min, 2 CPU = 89 min, 3 CPU = 80 min. I don't have that many CPUs available, so it is not worthwhile above 2 CPUs, and it is still not fast enough.
I can't use the data thread option right now; would it allow me to gain more time than the CPU thread option? There is the interval option, but I don't think it would save enough time since I have genes of interest on almost all chromosomes. I would appreciate your guidance on this problem. How would you make this HaplotypeCaller step faster? Many thanks in advance. Christ

Created 2016-01-27 18:04:37 | Updated 2016-01-27 18:05:57 | Tags: haplotypecaller

I was comparing the concordance of genotype calls as a function of Genotype Quality (GQ) for the GATK 3.5 HaplotypeCaller (using genotype-given-alleles mode and emitting all sites, even those with a score of 0). My "truth" set is genotypes from Illumina OMNI 2.5M arrays, and I am comparing the genotype calls from Illumina exome arrays. I notice that the Genotype Quality (GQ) scores of HaplotypeCaller are concentrated at intervals of 3: there are orders of magnitude more scores at GQ=0,3,6,9,12,15,... than at GQ=1,2, 4,5, 7,8, 10,11, 13,14. This unequal distribution is in itself surprising, but not an issue. What I am noticing is that the rate of concordance is NOT monotonically proportional to GQ. Genotypes with scores that are not multiples of 3 have much worse concordance. If I limit to multiples of 3, the GQ are monotonic: Concordance(GQ=12) > Concordance(GQ=9) > Concordance(GQ=6). But systematically, the non-multiples of 3 have worse concordance by orders of magnitude, e.g. Concordance(10 or 11) < Concordance(3). Why is that? So, trying to look at the source code for PairHMM (where the likelihoods are computed), I identified this "TRISTATE" correction (value 3.0) that is differentially applied to reads with "N", but I don't see "N" in the alignments for those non-multiple-of-3 cases. I looked at the "raw" BAM file (not the output of HaplotypeCaller) for a number of examples with scores from 5 to 22 and coverage 2..14. None has pathologic features:
• no N in the alignment
• only one example overlapped soft-clipped bases
• usually only 1 read supporting the alternate allele (or one read supporting the reference)
• usually all the reads have 100M cigars
• the variant is not next to the end of a soft-clipped read

Is this a bug in the PairHMM scoring? p.s. The BAM has 32-offset qualities.

Created 2016-01-19 16:27:08 | Updated | Tags: gccontent haplotypecaller homopolymerrun tandemrepeatannotator hrun

I am using Genotype Given Alleles with HaplotypeCaller. I am trying to explicitly request all annotations that the documentation says are compatible with HaplotypeCaller (and that make sense for a single sample, e.g. no Hardy-Weinberg). The following annotations all come back "NA": GCContent (GC), HomopolymerRun (HRun), TandemRepeatAnnotator (STR, RU, RPA). Yet they are valid requests, because I get no errors from GATK. This is the command I ran (all on one line):

java -Xmx40g -jar /data5/bsi/bictools/alignment/gatk/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller --input_file /data2/external_data/Weinshilboum_Richard_weinsh/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation
ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff

The log file is below. Notice the "weird" WARNings about "StrandBiasBySample annotation exists in input VCF header", which make no sense because the header is empty other than the barebone fields. This is the barebone VCF head of /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf:

##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 723918 rs144434834 G A 30 PASS .
chr1 729632 rs116720794 C T 30 PASS .
chr1 752566 rs3094315 G A 30 PASS .
chr1 752721 rs3131972 A G 30 PASS .
chr1 754063 rs12184312 G T 30 PASS .
chr1 757691 rs74045212 T C 30 PASS .
chr1 759036 rs114525117 G A 30 PASS .
chr1 761764 rs144708130 G A 30 PASS .
This is the output:

INFO 10:03:06,080 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,082 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12
INFO 10:03:06,083 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 10:03:06,083 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 10:03:06,086 HelpFormatter - Program Args: -T HaplotypeCaller --input_file /data2/external_data/Weinshilboum_Richard_weinsh/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff
INFO 10:03:06,093 HelpFormatter - Executing as m037385@franklin04-213 on Linux 2.6.32-573.8.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26.
INFO 10:03:06,094 HelpFormatter - Date/Time: 2016/01/19 10:03:06
INFO 10:03:06,094 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,094 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,545 GenomeAnalysisEngine - Strictness is SILENT
INFO 10:03:06,657 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Fraction: 0.04
INFO 10:03:06,666 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 10:03:07,012 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.35
INFO 10:03:07,031 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 10:03:07,170 IntervalUtils - Processing 51304566 bp from intervals
INFO 10:03:07,256 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 10:03:07,595 GenomeAnalysisEngine - Done preparing for traversal
INFO 10:03:07,595 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 10:03:07,595 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 10:03:07,596 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 10:03:07,596 HaplotypeCaller - Disabling physical phasing, which is supported only for reference-model confidence output
WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
INFO 10:03:07,719 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 10:03:37,599 ProgressMeter - chr22:5344011 0.0 30.0 s 49.6 w 10.4% 4.8 m 4.3 m
INFO 10:04:07,600 ProgressMeter - chr22:11875880 0.0 60.0 s 99.2 w 23.1% 4.3 m 3.3 m
Using AVX accelerated implementation of PairHMM
INFO 10:04:29,924 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 10:04:29,925 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called
WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called
WARN 10:04:29,939 AnnotationUtils - Annotation will not be calculated, genotype is not called
INFO 10:04:37,601 ProgressMeter - chr22:17412465 0.0 90.0 s 148.8 w 33.9% 4.4 m 2.9 m
INFO 10:05:07,602 ProgressMeter - chr22:18643131 0.0 120.0 s 198.4 w 36.3% 5.5 m 3.5 m
INFO 10:05:37,603 ProgressMeter - chr22:20133744 0.0 2.5 m 248.0 w 39.2% 6.4 m 3.9 m
INFO 10:06:07,604 ProgressMeter - chr22:22062452 0.0 3.0 m 297.6 w 43.0% 7.0 m 4.0 m
INFO 10:06:37,605 ProgressMeter - chr22:23818297 0.0 3.5 m 347.2 w 46.4% 7.5 m 4.0 m
INFO 10:07:07,606 ProgressMeter - chr22:25491290 0.0 4.0 m 396.8 w 49.7% 8.1 m 4.1 m
INFO 10:07:37,607 ProgressMeter - chr22:27044271 0.0 4.5 m 446.4 w 52.7% 8.5 m 4.0 m
INFO 10:08:07,608 ProgressMeter - chr22:28494980 0.0 5.0 m 496.1 w 55.5% 9.0 m 4.0 m
INFO 10:08:47,609 ProgressMeter - chr22:30866786 0.0 5.7 m 562.2 w 60.2% 9.4 m 3.8 m
INFO 10:09:27,610 ProgressMeter - chr22:32908950 0.0 6.3 m 628.3 w 64.1% 9.9 m 3.5 m
INFO 10:09:57,610 ProgressMeter - chr22:34451306 0.0 6.8 m 677.9 w 67.2% 10.2 m 3.3 m
INFO 10:10:27,611 ProgressMeter - chr22:36013343 0.0 7.3 m 727.5 w 70.2% 10.4 m 3.1 m
INFO 10:10:57,613 ProgressMeter - chr22:37387478 0.0 7.8 m 777.1 w 72.9% 10.7 m 2.9 m
INFO 10:11:27,614 ProgressMeter - chr22:38534891 0.0 8.3 m 826.8 w 75.1% 11.1 m 2.8 m
INFO 10:11:57,615 ProgressMeter - chr22:39910054 0.0 8.8 m 876.4 w 77.8% 11.4 m 2.5 m
INFO 10:12:27,616 ProgressMeter - chr22:41738463 0.0 9.3 m 926.0 w 81.4% 11.5 m 2.1 m
INFO 10:12:57,617 ProgressMeter - chr22:43113306 0.0 9.8 m 975.6 w 84.0% 11.7 m 112.0 s
INFO 10:13:27,618 ProgressMeter - chr22:44456937 0.0 10.3 m 1025.2 w 86.7% 11.9 m 95.0 s
INFO 10:13:57,619 ProgressMeter - chr22:45448656 0.0 10.8 m 1074.8 w 88.6% 12.2 m 83.0 s
INFO 10:14:27,620 ProgressMeter - chr22:46689073 0.0 11.3 m 1124.4 w 91.0% 12.5 m 67.0 s
INFO 10:14:57,621 ProgressMeter - chr22:48062438 0.0 11.8 m 1174.0 w 93.7% 12.6 m 47.0 s
INFO 10:15:27,622 ProgressMeter - chr22:49363910 0.0 12.3 m 1223.6 w 96.2% 12.8 m 29.0 s
INFO 10:15:57,623 ProgressMeter - chr22:50688233 0.0 12.8 m 1273.2 w 98.8% 13.0 m 9.0 s
INFO 10:16:12,379 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.061128124000000006
INFO 10:16:12,379 PairHMM - Total compute time in PairHMM computeLikelihoods() : 22.846350295
INFO 10:16:12,380 HaplotypeCaller - Ran local assembly on 25679 active regions
INFO 10:16:12,434 ProgressMeter - done 5.1304566E7 13.1 m 15.0 s 100.0% 13.1 m 0.0 s
INFO 10:16:12,435 ProgressMeter - Total runtime 784.84 secs, 13.08 min, 0.22 hours
INFO 10:16:12,435 MicroScheduler - 727347 reads were filtered out during the traversal out of approximately 4410423 total reads (16.49%)
INFO 10:16:12,435 MicroScheduler - -> 2 reads (0.00% of total) failing BadCigarFilter
INFO 10:16:12,436 MicroScheduler - -> 669763 reads (15.19% of total) failing DuplicateReadFilter
INFO 10:16:12,436 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 10:16:12,436 MicroScheduler - -> 57582 reads (1.31% of total) failing HCMappingQualityFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 10:16:12,438 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

Created 2016-01-19 08:46:46 | Updated | Tags: haplotypecaller

Hello, I am using the GATK pipeline for variant calling in targeted exome samples. While viewing the variants in IGV at different steps of variant calling (aligning the .bam files generated at the BaseRecalibration and HaplotypeCaller steps against reference.fasta), I find that the total number of reads aligning at a particular variant location varies. That is, after BaseRecalibration and before HaplotypeCaller, the total read count at a variant location (as per the IGV view of the BR.bam alignment) is 274: the A count is 1 and the C count is 273. But after the HaplotypeCaller step, the total read count is 460 (as per the IGV view of the bamout alignment), with an A count of 195 and a C count of 265. At the same time, the variant record from the corresponding g.vcf file shows something like this:

17 48253171 . C A,<NON_REF> 1802.77 . BaseQRankSum=4.848;ClippingRankSum=-1.291;DP=388;MLEAC=1,0;MLEAF=0.500,0.00;MQ=36.72;MQ0=0;MQRankSum=-14.293;ReadPosRankSum=-14.235 GT:AD:DP:GQ:PL:SB 0/1:217,104,0:321:99:1831,0,6990,2478,7307,9785:217,0,104,0

I understand that HC reassembles the active regions, but even so I see a drastic difference in the read/allele counts. My questions are:

1. Why is the total read count (and even the alternate allele count) suddenly increased in the HC step compared to the previous step(s)?
2. Why do the allele counts and read depth in the g.vcf file differ from those in the corresponding bamout file (IGV view)?
3. In the above situation, which allele counts should be considered: the bamout (i.e. IGV) counts or the g.vcf counts?
Thanking you in advance, Aswathy

Created 2016-01-14 19:02:12 | Updated | Tags: commandlinegatk haplotypecaller multi-sample gatk error

Hi, I would like to force-call a list of variants across my cohort using HaplotypeCaller to get more accurate QC metrics for each variant. I am using the following command:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -et NO_ET -K my.key -I my.cohort.list --alleles my.vcf -L my.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -stand_call_conf 30.0 -stand_emit_conf 0.0 -dt NONE -o final_my.vcf

Here is a link to the input VCF file: VCF File

Unfortunately, I keep running into the following error (I've tried GATK v3.3 and v3.5):

INFO 18:49:21,288 ProgressMeter - chr1:11177077 21138.0 49.5 m 39.0 h 69.4% 71.3 m 21.8 m
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3.0-mssm-0-gaa95802):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Index: 3, Size: 3
##### ERROR ------------------------------------------------------------------------------------------

Would appreciate your help in solving this issue.

Created 2016-01-13 13:11:53 | Updated 2016-01-13 13:22:28 | Tags: haplotypecaller

Hey there, I am using HaplotypeCaller on our cluster and the stdout is written to a logfile. Unfortunately, HC writes the whole gVCF to standard out even though I specified an output file. This means really huge log files that I cannot handle. Is there a way to prevent this behaviour and only get errors/warnings from HaplotypeCaller?
All the best, IIEBR

Created 2016-01-13 12:45:53 | Updated | Tags: haplotypecaller pooled-calls memory

Hello, I am using HaplotypeCaller version 3.4-46 with pooled data (ploidy 44) and I am running into the error 'There was a failure because you did not provide enough memory to run this program'. I have very high coverage (~10000x at each position). Searching the forums, I have found suggestions of downsampling (which HaplotypeCaller does itself in the version I am using) and changing the minPruning parameter (which I am afraid may further compromise the sensitivity of the variant calling, given that the samples are pooled). Previously I used UnifiedGenotyper on the same data and had no problems. Is there something else I could try to get rid of the memory issues, or should I stick with UnifiedGenotyper in my case? Thank you for your support.

Created 2016-01-11 20:09:44 | Updated | Tags: haplotypecaller igv bamout

Hi Everyone, I am using GATK version 3.3 and I noticed that the numbers in the DP and AD fields actually match the original BAM file before realignment and do not match the bamout file produced by HaplotypeCaller. For example, at chromosome 1 position 6253228 the DP is 27 and the AD is 27,0. The number of reads (counted by IGV) in the original BAM file at that position is also 27. However, in the bamout file the number of reads is 56. Below is the line from the VCF file:

chr1 6253228 . C . . . GT:AD:DP:GQ:PL 0/0:27,0:27:75:0,75,1125

Are the DP and AD fields supposed to match the original BAM file or the bamout file numbers? When I produce the bamout file I use the parameters -forceActive and -disableOptimizations. Thank you. Best, Sam

Created 2016-01-11 15:50:06 | Updated | Tags: vqsr haplotypecaller bug error

We are having problems running GATK VQSR on recent VCF files generated using GATK HaplotypeCaller in the N+1 mode. When running VariantRecalibrator (with both 3.4 and 3.5), the following error occurs:
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
…
##### ERROR MESSAGE: The allele with index 5 is not defined in the REF/ALT columns in the record
##### ERROR ------------------------------------------------------------------------------------------

JAVA=~/tools/jre1.8.0_66/bin/java
REF=~/refs/bosTau6.fasta
GATK=~/tools/GenomeAnalysisTK.jar

$JAVA -jar $GATK -T VariantRecalibrator -R $REF -input GATK-HC-3.4-DAMONA_10_Jan_2016.vcf.gz \
-resource:ill770k,known=false,training=true,truth=true,prior=15.0 /home/aeonsim/refs/LIC-seqdHDs.AC1plus.DP10x.vcf.gz \
-resource:VQSRMend,known=false,training=true,truth=false,prior=10.0  /home/projects/bos_taurus/damona/vcfs/recombination/GATK-HC-95.0tr-GQ40-10Dec-2015run.mend.Con.vcf.gz \
-an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 \
-recalFile GATK-HC-DAMONA-Jan-2016.recal -tranchesFile GATK-HC-DAMONA-Jan-2016.tranches \
-rscriptFile GATK-HC-DAMONA-Jan-2016.R

The VCF files were created using GenotypeGVCFs (v3.4-46-gbc02625) on GVCF files created with GATK HaplotypeCaller (N+1 mode, same version), which had been combined using GATK CombineGVCFs to merge ~750 whole bovine genome GVCFs into batches of 20-90 individuals. All stages appeared to finish successfully without error until the VQSR stage. All processing of the GATK files was done using GATK tools (v3.4-46), and running the ValidateVariants tool on the files only gives warnings about "Reference allele is too long (116)"; nothing about alleles not being defined in the REF/ALT column.

java -jar GenomeAnalysisTK.jar -T ValidateVariants -R bosTau6.fasta  -V GATK.chr29.vcf.gz --dbsnp BosTau6_
dbSNP140_NCBI.vcf.gz -L chr29
ValidateVariants - Reference allele is too long (116) at position chr29:2513143; skipping that record. Set --referenceWindowStop >= 116
…
Successfully validated the input file.  Checked 531640 records with no failures.

Older versions of GATK worked successfully on part of this dataset last year, but now we've completed the dataset and rerun everything with the latest version of GATK (at the time 3.4-46) using only the GATK tools. GATK VQSR is unable to process files that were created by its own HaplotypeCaller (from the same version), while the validation tools supplied with GATK claim there is no problem, and programs like bcftools happily process the files.

It appears to me that the problem may be related to the introduction of the * (spanning deletion) allele in the ALT column in VCF 4.2, and a failure of VQSR to fully support this?

Created 2016-01-08 18:46:14 | Updated 2016-01-08 19:08:57 | Tags: haplotypecaller missing-genotype

Hello GATK Team,

I have 21 BAM files that I ran through HaplotypeCaller in GVCF mode followed by GenotypeGVCFs, using Version=3.4-0-g7e26428. I found a few entries that I am having a difficult time understanding.

Here is one of the entries in question (quick note: each sample is from a pool of 3-5 animals, explaining the high ploidy):

chr1   88988835    rs396411987 C   G   226099.87   .   AC=61;AF=0.455;AN=134;BaseQRankSum=-1.176e+00;ClippingRankSum=0.054;DB;DP=29307;FS=0.000;MLEAC=61;MLEAF=0.455;MQ=59.98;MQRankSum=0.603;QD=11.32;ReadPosRankSum=1.30;SOR=0.696   GT:AD:DP:GQ:PL  0/0/0/0/0/1/1/1/1/1:396,449,0,0:845:28:9249,2087,965,413,120,0,28,219,649,1588,7879 0/1/1/1/1/1/1/1/1/1:275,2150,0,0:2425:99:52058,17674,11483,7903,5424,3571,2143,1051,287,0,3240  0/0/0/1/1/1/1/1:626,1134,0,0:1760:99:24237,5532,2600,1118,316,0,197,1292,11237  0/0/1/1/1/1/1/1:225,824,0,0:1049:99:18893,5121,2835,1577,772,257,0,116,3439 ./././././././././.:0,0 ./././././././././.:0,0 ./././././././.:0,0 0/0/0/0/0/0/0/1:824,105,0,0:929:99:1309,0,239,705,1369,2290,3644,6013,20091 0/0/0/0/0/1/1/1:786,499,0,0:1285:99:9169,1197,249,0,139,633,1609,3600,16598 0/0/1/1/1/1:557,1132,0,2:1691:99:24696,4545,1719,433,0,563,9784 0/0/0/0/0/0/0/1:922,140,0,0:1062:99:1825,0,202,685,1401,2410,3908,6544,22289    0/0/0/0/1/1/1/1/1/1:650,828,0,0:1478:26:16909,4072,1966,903,311,26,0,254,906,2400,12402 ./././././././.:0,0 0/0/0/0/0/0/0/0/1/1:846,179,0,0:1025:95:2546,95,0,178,520,1015,1689,2617,3986,6389,20368    0/0/0/0/1/1/1/1:875,986,0,0:1861:99:20237,3745,1410,380,0,136,885,2819,17626    0/0/0/0/0/0/0/0:570,0:570:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/1/1/1:798,604,0,0:1402:20:11395,1675,423,0,20,429,1344,3302,16391 0/0/0/0/1/1/1/1:644,730,0,0:1374:96:14884,2778,1049,285,0,96,643,2062,12779 0/0/0/0/0/1/1/1:613,417,0,0:1030:74:7703,1065,243,0,74,433,1174,2710,12915  0/0/0/1/1/1/1/1:239,513,0,0:752:13:11159,2667,1309,604,198,0,13,378,4191    ./././././././.:0,0

So, for the samples with genotypes present, AD is obviously quite high (in the 1000s, generally). It surprised me, then, that some samples don't have information (represented by ./././././././.). I went back to the original sample gvcf files and pulled the entries from a subset that were represented by ./././././././.:

A consistent difference between samples that are included and those that are not is the presence of multiple alternative alleles in the excluded samples (G,A and G,T in the above example). Is this the source of my troubles? Is there a way of forcing GATK to include those samples in the VCF, especially given that the support for the third allele seems pretty weak (2 reads out of ~1722 in sample 14 above)? It seems to me that these samples should reasonably be included in the output of GenotypeGVCFs.
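For anyone triaging records like the one above, the high-ploidy GT strings can be inspected mechanically. Here is a minimal sketch (plain Python, no VCF library; the sample-column layout is assumed to follow the record shown above, with GT as the first colon-separated sub-field) that flags fully missing calls and counts the alt-allele dosage for called samples:

```python
def summarize_sample(sample_field):
    """Parse one VCF sample column, e.g. '0/0/0/1/1/1/1/1:626,1134,0,0:1760:99:...'.

    Returns (is_missing, alt_dosage, ploidy), where alt_dosage counts
    non-reference alleles in the GT and is None for fully missing calls.
    """
    gt = sample_field.split(":")[0]      # GT is the first sub-field
    alleles = gt.split("/")              # unphased, arbitrary ploidy
    if all(a == "." for a in alleles):   # e.g. './././././././.'
        return True, None, len(alleles)
    dosage = sum(1 for a in alleles if a not in ("0", "."))
    return False, dosage, len(alleles)

# Sample columns taken from (or shaped like) the record in the post above.
samples = [
    "0/0/0/1/1/1/1/1:626,1134,0,0:1760:99:24237,5532,2600,1118,316,0,197,1292,11237",
    "./././././././.:0,0",
    "0/0/0/0/0/0/0/1:824,105,0,0:929:99:1309,0,239,705,1369,2290,3644,6013,20091",
]
for s in samples:
    print(summarize_sample(s))
# -> (False, 5, 8), (True, None, 8), (False, 1, 8)
```

This only summarizes what is already in the record; it does not make GATK emit genotypes for the excluded samples, but it makes it easy to tally which samples were dropped and at what ploidy.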

Apologies if the answer is somewhere in the forums/guide - I did check, but didn't find anything.

Thanks, Russ

Created 2016-01-08 12:37:01 | Updated 2016-01-08 12:37:38 | Tags: haplotypecaller contigs

Hey there,

I cannot figure out why I could not get HaplotypeCaller to work properly. Maybe someone can please help me. I wanted to use only chromosome 20 for variant calling. I used a BAM and BAM.BAI file from 1000 Genomes (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/alignment/). For the reference genome I used only chromosome 20 from this site (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/). I created a dict and index file with Picard tools.

If I run HaplotypeCaller like this:

java -jar GenomeAnalysisTK.jar -R chr20.fa -T HaplotypeCaller -I HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

I get the following error:

##### ERROR MESSAGE: Badly formed genome loc: Contig 1 given as location, but this contig isn't present in the Fasta sequence dictionary

If I run the same with the -L argument (-L 20) I get:

##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: The following contigs included in the intervals to process have different indices in the sequence dictionaries for the reads vs. the reference: [20]. As a result, the GATK engine will not correctly process reads from these contigs. You should either fix the sequence dictionaries for your reads so that these contigs have the same indices as in the sequence dictionary for your reference, or exclude these contigs from your intervals. This error can be disabled via -U ALLOW_SEQ_DICT_INCOMPATIBILITY, however this is not recommended as the GATK engine will not behave correctly.
##### ERROR reads contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605, hs37d5]
##### ERROR reference contigs = [20]

Additionally, I adapted the FASTA, FASTA index and dictionary file from calling the chromosome "chr20" to "20", because previously it said: ##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: No overlapping contigs found.
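Mismatches like the one in the error above can be diagnosed before running GATK by comparing the contig lists of the BAM header and the reference dictionary. A minimal sketch (plain Python, standard library only; the inline header strings are invented for illustration — in practice you would dump the real BAM header with `samtools view -H` and use the Picard-generated .dict file):

```python
def contig_names(header_text):
    """Extract contig names (SN tags) from @SQ lines of a SAM-style
    header or a Picard .dict file, preserving their order."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t"):
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names

# Hypothetical headers mirroring the situation in the post: the BAM was
# aligned to a full reference, while the FASTA dictionary holds only chr20.
bam_header = "@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373\n@SQ\tSN:20\tLN:63025520"
ref_dict = "@SQ\tSN:20\tLN:63025520"

bam_contigs = contig_names(bam_header)
ref_contigs = contig_names(ref_dict)

# GATK requires shared contigs to have matching names AND matching
# indices in both dictionaries, so report both kinds of discrepancy.
print("only in BAM:", [c for c in bam_contigs if c not in ref_contigs])
print("index of '20':", bam_contigs.index("20"), "in BAM vs",
      ref_contigs.index("20"), "in reference")
```

The second error message in the post is exactly the index problem: contig "20" exists in both dictionaries, but at index 19 in the BAM and index 0 in the single-chromosome reference, which is why calling against a full-genome reference (matching the one the BAM was aligned to) is the usual fix.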

I really don't know what I am doing wrong! Any help is appreciated! Regards, Kristina

Created 2016-01-07 21:50:30 | Updated | Tags: haplotypecaller

Hi Everyone,

I am using GATK version 3.3 and ran HaplotypeCaller with -mmq 30 and -mbq 20. The output for one of the positions in the VCF is below.

chr13 58264978 . C . . . GT:AD:DP:GQ:PL 0/0:22,1:23:54:0,54,655

I used IGV to validate this position. IGV reports, as GATK does, that there are 23 total reads covering the position. However, 4 of those reads have a Phred base quality below 20 at this position. The DP statistic therefore seems to be reporting the number of unfiltered reads, not just the reads that pass the filters. Do you know why this is?
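The distinction the question points at can be illustrated with a toy pileup. In this sketch (plain Python; the quality values are invented to mirror the counts in the post, not taken from the actual BAM), raw depth is what a DP-style field reports, while the filtered depth applies a minimum-base-quality cutoff like -mbq:

```python
# Hypothetical per-read base qualities at one position: 23 reads total,
# 4 of them with Phred base quality below 20 (mirroring the post).
base_quals = [35, 33, 31, 30, 30, 29, 28, 28, 27, 26, 25, 25,
              24, 23, 22, 21, 20, 19, 18, 15, 12, 30, 31]

MIN_BQ = 20  # the -mbq threshold from the command line

raw_depth = len(base_quals)                            # all reads covering the site
filtered_depth = sum(q >= MIN_BQ for q in base_quals)  # reads usable under the cutoff

print(raw_depth, filtered_depth)  # -> 23 19
```

If a depth annotation matches `raw_depth` rather than `filtered_depth`, it is counting reads before the base-quality filter is applied, which is consistent with what the poster observes.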

Thank you.

Best, Sam

Created 2016-01-06 13:01:18 | Updated | Tags: combinevariants haplotypecaller best-practices dbsnp gatk combinegvcfs gvcf

Hi guys, I have recently joint-called 27 whole genomes using the GenotypeGVCFs approach. While I was trying to extract some chromosomes from the final file, I got this error: The provided VCF file is malformed at approximately line number 16076: Unparsable vcf record with allele *.

I looked into the file and found some of the multi-allelic sites having * alleles, as seen in the attached picture.

I feel the problem could be that the program realised that more than one allele is present at that position but could not ascertain which allele. I may be wrong, but what do you think I can do to solve this problem? LAWAL
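One stop-gap, if the downstream tool simply cannot parse the spanning-deletion allele, is to drop the records whose ALT column contains `*` before handing the file over. A rough sketch (the `keep_record` helper is hypothetical, and note this discards real sites, so it is a workaround rather than a fix):

```python
def keep_record(vcf_line):
    """Return False for variant records whose ALT column contains the
    spanning-deletion allele '*'; header lines are always kept."""
    if vcf_line.startswith("#"):
        return True
    alt = vcf_line.split("\t")[4]
    return "*" not in alt.split(",")

lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT,*\t60\tPASS\t.",   # spanning-deletion record, dropped
]
kept = [l for l in lines if keep_record(l)]
print(len(kept))  # 2: the header and the plain SNP survive
```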

Created 2015-12-30 23:50:57 | Updated 2015-12-30 23:51:22 | Tags: haplotypecaller snp multi-sample

I am trying to use GATK HaplotypeCaller (v3.5) to call SNPs in amplicon-seq data from a small genome of around 600bp. As shown in the attachment, no variants are called between positions 50 and 60, despite high coverage across many samples (there are 96 samples in total).

https://www.dropbox.com/s/a4311lbti2sn9ze/seq.png?dl=0

The base qualities are above 30 and the mapping quality is 60 (bwa mem). I also did not remove duplicates (they are not marked either), since this is amplicon-seq data.

The command I used was (multi-sample SNP calling):

java -Xmx50g -jar GenomeAnalysisTK.jar -nct 2 -R -T HaplotypeCaller -I merged_samples.bam -o gatk_out_raw_snps_indels.vcf --min_base_quality_score 30

It's the same even with base quality 20.

Created 2015-12-29 20:00:32 | Updated | Tags: haplotypecaller rmsmappingquality annotations mq gatk3-5

I upgraded to GATK 3.5 to use MuTect2 and it works great! However, I'm now using the updated .jar in my germline variant calling pipeline and there are some issues. I only noticed when I went to run VQSR: the INDELs went through but not the SNPs. I got an error in the SNP log about the MQ annotation. It seems to be completely absent from the HaplotypeCaller output (the .g.vcf files). I tried forcing it with -A RMSMappingQuality, but that did not affect the output. However, it works fine if I go back to 3.4.

Command Line for 3.4: java -Xmx4g -jar /REDACTED/tools/gatk_3.4.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /REDACTED/resources/b37/gatkBundle/human_g1k_v37_decoy.fasta -I /REDACTED/melanoma/bam/recal/PFM-32.ra.rc.bam -L /REDACTED/resources/b37/targetRegions/nexteraRapidCapture_expandedExome/1.bed -ERC GVCF -o /REDACTED/melanoma/variantCall/gatk3.4/gvcf/PFM-32/PFM-32_chr1.g.vcf -A RMSMappingQuality

Command Line for 3.5: java -Xmx4g -jar /REDACTED/tools/gatk_3.5.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /REDACTED/resources/b37/gatkBundle/human_g1k_v37_decoy.fasta -I /REDACTED/melanoma/bam/recal/PFM-32.ra.rc.bam -L /REDACTED/resources/b37/targetRegions/nexteraRapidCapture_expandedExome/1.bed -ERC GVCF -o /REDACTED/melanoma/variantCall/gvcf/PFM-32/PFM-32_chr1.g.vcf -A RMSMappingQuality

And here's a single line from the output of each at the same position

I didn't notice anything relevant in the 3.5 release notes except for something about VQSR and MQ jittering, which leads me to believe that MQ is indeed still a valid annotation. But how can I get it?

Created 2015-12-23 18:15:35 | Updated 2015-12-23 18:21:19 | Tags: haplotypecaller amplicon

Hi all, I'm relatively new to GATK. At first I was intimidated by the amount of documentation, but lately I've really come to appreciate how there are answers to nearly all questions somewhere on this website.

After asking around to see which tool people recommend for tumour amplicon variant calling the consensus choice was obviously GATK. In line with that I've been trying to run a variant of the best practices to identify some variants in my amplicon samples. The intended deviation is of course to tell GATK to ignore the duplicate status of the reads as nearly all reads are duplicates.

Here are my commands:

# find the indels to realign
GenomeAnalysisTK.jar -T RealignerTargetCreator -R GRCh37-lite.fa -I $1 -o $1.realignment_targets.list --disable_read_filter DuplicateRead

# fix up the indels
GenomeAnalysisTK.jar -T IndelRealigner -R GRCh37-lite.fa -I $1 -o $1.realigned.bam --disable_read_filter DuplicateRead -targetIntervals $1.realignment_targets.list

# base recalibration
GenomeAnalysisTK.jar -T BaseRecalibrator -R GRCh37-lite.fa -I $1.realigned.bam -o $1.recal_data.table --disable_read_filter DuplicateRead -knownSites dbSNP/common_all.vcf
GenomeAnalysisTK.jar -T PrintReads -R GRCh37-lite.fa -I $1.realigned.bam -BQSR $1.recal_data.table --disable_read_filter DuplicateRead -o $1.recal_reads.bam

# Indels are realigned, bases are re-calibrated, now do the variant calling
GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37-lite.fa --disable_read_filter DuplicateRead -I $1.recal_reads.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o $1.raw_variants.vcf -bamout $1.raw_variants.bamout.bam

I'm finding that I'm still missing a small handful of variants that I expected to be able to call. I've followed the instructions here: http://gatkforums.broadinstitute.org/gatk/discussion/1235/i-expect-to-see-a-variant-at-a-specific-site-but-its-not-getting-called in the hope of getting to the bottom of it. The bamout is really helpful, but as far as I can tell, my variant (shown under the cursor in the attached image) is in the original, realigned-recalibrated, and bamout versions of my bam. Keep in mind that these are amplicon data, so the depths extend way beyond what is shown in the image. The mapping qualities of the reads are all roughly 50-60 and the base qualities (post recalibration) are in the 40-50 range. The variant allele fraction is about 25% at all points in the analysis. I'm sure there is something simple that I'm missing. Can anyone suggest what I might change so I could call the variant in the image?

Created 2015-12-22 14:29:56 | Updated | Tags: unifiedgenotyper haplotypecaller trio pedigree

Greetings. I've been exploring de novo mutation identification in the context of a pedigree of trios. I've run the UnifiedGenotyper (UG) given all the bam files for ~25 sets of trios and it appears to identify a set of de novo mutations. When I run the HaplotypeCaller (HC) pipeline, first generating gVCF files for each individual, and then using the merged gVCF files along with the pedigree for genotype refinement and de novo mutation calling, it also finds a number of de novo mutations annotated as high-confidence de novo mutations. When I compare the UG de novo mutations to the high-confidence HC list, there's very little overlap.

Many of the UG high-confidence de novo variants are called by HC, but listed as low-confidence de novo variants, and from looking at a few examples, it would appear that the HC calls have assigned lower genotype confidence levels for the parental (non-mutated, reference) genotypes. Could it be that because the gVCF files aren't storing position-specific information for the reference (non-mutated) positions in the genome, the pedigree-based de novo mutation calling is not as accurate as it could be? Should I be generating gVCFs that include position-specific information? Many thanks for any insights. If it would help, I could post some examples.

Created 2015-12-21 19:30:11 | Updated | Tags: haplotypecaller gvcf

Hi, I have three general questions about using HaplotypeCaller (I know I could have tested by myself, but I figured it might be more reliable to get an answer from the people who are developing the tool):

1. For single-sample analysis, is the vcf generated directly from HC the same as the vcf generated by running GenotypeGVCFs on the gvcf from HC?
2. For multi-sample analysis, in terms of speed, how does running GenotypeGVCFs on each gvcf separately compare with combining all gvcfs for joint calling, assuming we can generate all gvcfs in parallel (say for 500 samples)?
3. It seems the gvcf can be generated in two modes, -ERC GVCF or -ERC BP_RESOLUTION. How different is the one generated using -ERC BP_RESOLUTION from a vcf with all variant calls, reference calls and missing calls? And considering file size, say for NA12878 whole genome, how different are the outputs of -ERC GVCF and -ERC BP_RESOLUTION?

Thank you very much for your attention; any information from you will be highly appreciated.

Created 2015-12-21 17:55:25 | Updated | Tags: haplotypecaller genotypegvcfs combinegvcfs

Hi all, I'm very new to GATK. I'm trying to map an EMS mutation in Arabidopsis.

I have fastq files of a wt M3 bulk and a mut M3 bulk (both offspring of the same parent). The strategy is to call SNPs with HC, then GenotypeGVCFs into a single file. That was done successfully (I think). The next step is to look for SNPs that are homozygous (1/1) in the mut bulk and het (1/0) or ref (0/0) in the wt bulk; I used this command:

grep -v '^##' line.genotype10.vcf | awk 'BEGIN{FS=" "; OFS=" "} $10~/^1\/1/ && ($11~/^1\/0/ || $11~/^0\/0/) {$3=$7=""; print $0}' | sed 's/ */ /g' > file.taxt
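The same selection can be sketched in Python (a rough equivalent of the awk one-liner above, under the same assumption that column 10 holds the mut bulk and column 11 the wt bulk; `passes_filter` is a made-up helper name):

```python
def passes_filter(vcf_line):
    """Keep records where the mut bulk (column 10) is hom-alt (1/1) and
    the wt bulk (column 11) is 1/0 or 0/0 -- mirroring the awk condition,
    which matched the genotype at the start of each sample column."""
    cols = vcf_line.rstrip("\n").split("\t")
    mut_gt = cols[9].split(":")[0]   # awk's $10 (1-based)
    wt_gt = cols[10].split(":")[0]   # awk's $11
    return mut_gt == "1/1" and wt_gt in ("1/0", "0/0")

rec = "Chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t1/1:12\t0/0:15"
print(passes_filter(rec))  # True: mut is 1/1, wt is 0/0
```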

That also worked pretty well. I noticed that I have ~150,000 records (SNPs or indels) from the HC, but after merging the files using GenotypeGVCFs I'm left with only a few thousand records. The same happens if I use CombineGVCFs (which keeps the ~150,000 records) and then run GenotypeGVCFs.

The problem is that with such a low number of records it doesn't identify a genomic region that fulfils the hom requirement for the mut bulk and het/ref for the wt one. My questions are:

1. Why does GenotypeGVCFs reduce the number of records?
2. If anyone has other suggestions, that would be great.

Thanks a lot, Guy

Created 2015-12-17 19:38:46 | Updated | Tags: indelrealigner unifiedgenotyper haplotypecaller gatk

Does anyone know of an effective way to determine haplotypes or phasing data for SNPs and STRs? I understand that STRs are inherently difficult for aligners; however, I'm trying to determine haplotypes for a large number of STRs (including the flanking region information...SNPs) on a large number of samples. So, manual verification is not really an option. We've developed an in-house perl script that calls STRs accurately; however, it currently does not include flanking region information.

Any help is greatly appreciated.

Created 2015-12-17 06:12:19 | Updated | Tags: haplotypecaller lowqual genotype-given-alleles

Hi GATK team, I am working on a pipeline for exome sequencing variant calling, and I am only interested in the genotypes at some specific positions, so I used GENOTYPE_GIVEN_ALLELES mode with a given vcf file. I only have "chromosome position ID REF ALT" info, so I kept the other columns as '.'. I found that this mode is super slow compared with discovery mode. And when it finished, I found most of my records are marked LowQual, even though they have good sequencing depth and quality scores. Can you tell me why they are marked with the LowQual flag? Here is some output:

chr10 101421279 rs35229854 G A 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=80;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.407 GT:AD:DP:GQ:PL 0/0:80,0:80:99:0,241,2781
chr10 101421288 rs147747082 C A 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=81;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.332 GT:AD:DP:GQ:PL 0/0:81,0:81:99:0,244,2812
chr10 101421324 rs150267092 A G 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=64;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.173 GT:AD:DP:GQ:PL 0/0:64,0:64:99:0,193,2224
chr10 101421366 rs370286436 C T 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=59;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.101 GT:AD:DP:GQ:PL 0/0:59,0:59:99:0,178,2052
chr10 101421367 rs61729539 G A 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=58;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.105 GT:AD:DP:GQ:PL 0/0:58,0:58:99:0,175,2014
chr10 101451259 rs11190245 T C 605.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.278;ClippingRankSum=2.143;DB;DP=32;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.119;QD=18.93;ReadPosRankSum=1.904;SOR=0.507 GT:AD:DP:GQ:PL 0/1:11,21:32:99:634,0,286

I used default settings for the other steps. References are from gatk-bundle-2.8-hg19.

Thanks for your help : )

Created 2015-12-15 21:30:02 | Updated | Tags: realignertargetcreator haplotypecaller

Hi GATK,

I'm running the GATK germline variant calling pipeline. I have some questions about RealignerTargetCreator's and HaplotypeCaller's parameters. Please correct me at any point if I go wrong.

1. For germline variant calling, if a patient has different types of normal (normal blood, solid tissue normal, etc.), should I run HaplotypeCaller on just one of them, or on every type?

2. If I decide to run all the normal types of a patient, should I process them together with RealignerTargetCreator to create one general interval list?

3. In HaplotypeCaller, the intervals restrict the active regions. However, there's a --forceActive flag in HaplotypeCaller. I'm just wondering what the trade-off is for using --forceActive instead of intervals. Are intervals kind of like exome capture, so that they would work better with WXS?

Created 2015-12-15 16:52:07 | Updated | Tags: haplotypecaller output

Hello! On certain runs I get only a .vcf file as output, and sometimes I see both .vcf and .vcf.idx output files. Is there an issue with the runs that yield only the .vcf file and not the .vcf.idx?

Created 2015-12-10 18:43:04 | Updated | Tags: haplotypecaller heterozygosity genotypegvcfs

Hi GATK team,

I was hoping I could get some insight on determining rate of heterozygosity from a gvcf file. We have three diploid lizard samples. Each was run through our GATK pipeline using HC in GVCF mode with -ERC GVCF followed by joint genotyping using all three samples.

I want to determine the rate of heterozygosity in each lizard by counting the number of heterozygous sites and dividing by the number of callable sites (i.e. not './.') for every position in the genome (whether or not it is a variant).

However, I read a response to a question on the forum (http://gatkforums.broadinstitute.org/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf): "Short answer is that you shouldn't be looking at the genotype calls emitted by HC in GVCF mode. Longer answer, the gVCF is meant to be only an intermediate and the genotype calls are not final"

so I am not sure this is the correct way to count heterozygous sites.

From the HC gvcf file, could I extract the number of callable sites from the GVCFBlocks and variant entries (excluding './.' entries) to get the total number of callable sites? Then count the number of heterozygous genotypes from the joint-genotyping output?
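The proposed computation can be sketched as follows (assuming a joint-genotyped VCF with one record per callable position, as produced with --includeNonVariantSites; `het_rate` is a hypothetical helper, not a GATK tool, and it reads only the first sample column):

```python
def het_rate(vcf_records):
    """Heterozygous genotypes divided by callable (non-'./.') genotypes,
    reading the GT of the first sample (column 10) from each record."""
    het = callable_sites = 0
    for rec in vcf_records:
        gt = rec.split("\t")[9].split(":")[0]
        if gt in (".", "./."):
            continue                                  # not callable
        callable_sites += 1
        alleles = gt.replace("|", "/").split("/")     # handle phased GTs too
        if len(set(alleles)) > 1:
            het += 1
    return het / callable_sites if callable_sites else 0.0

records = [
    "chr1\t1\t.\tA\tG\t.\t.\t.\tGT:DP\t0/1:20",   # het, callable
    "chr1\t2\t.\tC\t.\t.\t.\t.\tGT:DP\t0/0:18",   # hom-ref, callable
    "chr1\t3\t.\tT\tA\t.\t.\t.\tGT:DP\t./.:0",    # skipped
]
print(het_rate(records))  # 1 het out of 2 callable -> 0.5
```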

HC command:

java -Xmx3g -jar /home/dut/bin/GATK/GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  --variant_index_type LINEAR \
  --variant_index_parameter 128000 \
  -ERC GVCF \
  -R reference.fa \
  -I sample001.bam \
  -stand_call_conf 30 \
  -stand_emit_conf 30 \
  -mbq 17 \
  -o sample001.rawVAR.vcf

JointGenotyping command:

java -Xmx3g -jar /home/dut/bin/GATK/GenomeAnalysisTK.jar \
  -T GenotypeGVCFs \
  --includeNonVariantSites \
  -ploidy 2 \
  -R reference.fa \
  --variant sample8450.rawVAR.vcf \
  --variant sample003.rawVAR.vcf \
  --variant sample001.rawVAR.vcf \
  -o jg_sample001_sample003_sample8450.vcf

We are using GATK version 3.3.0

Any suggestions are appreciated! Thank you for your time.

Best, Morgan

Created 2015-12-10 16:33:06 | Updated | Tags: haplotypecaller alignment swa

Hello community,

I am really new to the field of genome analysis (I have a computer science background). At the moment I am dealing with the topic of variant calling and decided to use and understand the HaplotypeCaller, so please excuse my naive question: I do not understand why the HaplotypeCaller does the whole alignment step again (or did I understand that wrong?). What is the purpose of taking the BAM file and re-aligning it?

Best regards, Kristina

Created 2015-12-09 18:43:45 | Updated | Tags: haplotypecaller

It seems as if the handling of sam/bam/cram files with higher-than-usual base quality scores is inconsistent. In particular, a base quality score of 93 causes HaplotypeCaller to throw an exception if a bed file is supplied, but not if the target region is expressed in chr:start-end format. For instance, for the same bam file, invoking HaplotypeCaller like this:

-T HaplotypeCaller -R my-ref.fa -I alignment.bam -L my-bed-file.bed

returns an error regarding higher-than-expected base quality scores, but

-T HaplotypeCaller -R my-ref.fa -I alignment.bam -L chr1:500-1000

works just fine, even though the region in the bed file is identical to the region given as a command line arg.

Confusing and a little frustrating.

Created 2015-12-01 13:17:38 | Updated | Tags: haplotypecaller queue

I tried HaplotypeCaller on a whole-genome sample in GVCF mode; it takes several days and ends up incomplete. I would like to use a Queue script to parallelize the jobs. I am new to Queue scripts; I put together the script below from posts on the forum, and I am getting errors.

class HaplotypeCaller extends QScript {
// Create an alias 'qscript' to be able to access variables in the VariantCaller.
// 'qscript' is now the same as 'VariantCaller.this'
qscript =>

// Required arguments. All initialized to empty values.
@Input(doc="The reference file for the bam files.", shortName="R", required=true)
var referenceFile: File = _

@Input(doc="One or more bam files.", shortName="I")
var bamFiles: List[File] = Nil

@Input(doc="Output core filename.", shortName="O", required=true)
var out: File = _

@Argument(doc="Maxmem.", shortName="mem", required=true)
var maxMem: Int = _

@Argument(doc="Number of scatters", shortName="nsc", required=true)
var numScatters: Int = _

@Argument(doc="Minimum phred-scaled confidence to call variants", shortName="stand_call_conf", required=true)
var standCallConf: Int = _ //30 //default: best-practices value

@Argument(doc="Minimum phred-scaled confidence to emit variants", shortName="stand_emit_conf", required=true)
var standEmitConf: Int = _ //10 //default: best-practices value

@Argument(doc="Mode for emitting reference confidenc scores", shortName="ERC", required=true)
var EmitRefConfidence: Boolean = true

// The following arguments are all optional.
@Input(doc="An optional file with known SNP sites.", shortName="D", required=false)
var dbsnpFile: File = _

@Input(doc="An optional file with targets intervals.", shortName="L", required=false)
var targetFile: File = _

@Argument(doc="Amount of padding (in bp) to add to each interval", shortName="ip", required=false)

def script() {
val haplotypeCaller = new HaplotypeCaller

// All required input
haplotypeCaller.input_file = bamFiles
haplotypeCaller.reference_sequence = referenceFile
haplotypeCaller.out = qscript.out + ".g.vcf"

haplotypeCaller.scatterCount = numScatters
haplotypeCaller.memoryLimit = maxMem

haplotypeCaller.stand_emit_conf = standEmitConf
haplotypeCaller.stand_call_conf = standCallConf
haplotypeCaller.emitRefConfidence = GVCF
// Optional input
if (dbsnpFile != null) {
haplotypeCaller.D = dbsnpFile
}
if (targetFile != null) {
haplotypeCaller.L :+= targetFile
}

}
}

The script is run from the command line as follows:

java -jar /proj/lohi/Canine_Tools/GATK-Queue-3.5/Queue.jar -S HaplotypeCaller.scala -R canFam3.fa -I BD01_recalibrated.bam -stand_call_conf 30 -stand_emit_conf 10 -nct 4 -ERC GVCF -O testQueue -run -debug

which throws the error:

INFO  15:06:10,940 QScriptManager - Compiling 1 QScript
ERROR 15:06:11,096 QScriptManager - HaplotypeCaller.scala:3: object QScript is not a member of package org.broadinstitute.sting.queue
ERROR 15:06:11,102 QScriptManager - import org.broadinstitute.sting.queue.QScript
ERROR 15:06:11,103 QScriptManager -        ^
ERROR 15:06:11,129 QScriptManager - HaplotypeCaller.scala:5: object extensions is not a member of package org.broadinstitute.sting.queue
ERROR 15:06:11,133 QScriptManager - import org.broadinstitute.sting.queue.extensions.gatk._
ERROR 15:06:11,134 QScriptManager -                                       ^
ERROR 15:06:11,144 QScriptManager - class HaplotypeCaller extends QScript {
ERROR 15:06:11,145 QScriptManager -                               ^
ERROR 15:06:11,335 QScriptManager -     var referenceFile: File = _
ERROR 15:06:11,335 QScriptManager -                        ^
ERROR 15:06:11,338 QScriptManager -     @Input(doc="The reference file for the bam files.", shortName="R", required=true)
ERROR 15:06:11,339 QScriptManager -      ^
ERROR 15:06:11,347 QScriptManager -     var bamFiles: List[File] = Nil
ERROR 15:06:11,347 QScriptManager -                        ^
ERROR 15:06:11,350 QScriptManager -     @Input(doc="One or more bam files.", shortName="I")
ERROR 15:06:11,351 QScriptManager -      ^
ERROR 15:06:11,623 QScriptManager -     var out: File = _
ERROR 15:06:11,623 QScriptManager -              ^
ERROR 15:06:11,625 QScriptManager -     @Input(doc="Output core filename.", shortName="O", required=true)
ERROR 15:06:11,626 QScriptManager -      ^
ERROR 15:06:11,632 QScriptManager -     @Argument(doc="Maxmem.", shortName="mem", required=true)
ERROR 15:06:11,633 QScriptManager -      ^
ERROR 15:06:11,635 QScriptManager -     @Argument(doc="Number of cpu threads per data thread", shortName="nct", required=true)
ERROR 15:06:11,636 QScriptManager -      ^
ERROR 15:06:11,638 QScriptManager -     @Argument(doc="Number of scatters", shortName="nsc", required=true)
ERROR 15:06:11,639 QScriptManager -      ^
ERROR 15:06:11,641 QScriptManager -     @Argument(doc="Minimum phred-scaled confidence to call variants", shortName="stand_call_conf", required=true)
ERROR 15:06:11,642 QScriptManager -      ^
ERROR 15:06:11,644 QScriptManager -     @Argument(doc="Minimum phred-scaled confidence to emit variants", shortName="stand_emit_conf", required=true)
ERROR 15:06:11,644 QScriptManager -      ^
ERROR 15:06:11,651 QScriptManager -     @Argument(doc="Mode for emitting reference confidenc scores", shortName="ERC", required=true)
ERROR 15:06:11,652 QScriptManager -      ^
ERROR 15:06:11,654 QScriptManager -     var dbsnpFile: File = _
ERROR 15:06:11,654 QScriptManager -                    ^
ERROR 15:06:11,656 QScriptManager -     @Input(doc="An optional file with known SNP sites.", shortName="D", required=false)
ERROR 15:06:11,656 QScriptManager -      ^
ERROR 15:06:11,658 QScriptManager -     var targetFile: File = _
ERROR 15:06:11,658 QScriptManager -                     ^
ERROR 15:06:11,660 QScriptManager -     @Input(doc="An optional file with targets intervals.", shortName="L", required=false)
ERROR 15:06:11,660 QScriptManager -      ^
ERROR 15:06:11,662 QScriptManager -     @Argument(doc="Amount of padding (in bp) to add to each interval", shortName="ip", required=false)
ERROR 15:06:11,663 QScriptManager -      ^
ERROR 15:06:11,888 QScriptManager - HaplotypeCaller.scala:55: value input_file is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,888 QScriptManager -     haplotypeCaller.input_file = bamFiles
ERROR 15:06:11,889 QScriptManager -                         ^
ERROR 15:06:11,896 QScriptManager - HaplotypeCaller.scala:56: value reference_sequence is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,897 QScriptManager -     haplotypeCaller.reference_sequence = referenceFile
ERROR 15:06:11,897 QScriptManager -                         ^
ERROR 15:06:11,907 QScriptManager - HaplotypeCaller.scala:59: value scatterCount is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,908 QScriptManager -     haplotypeCaller.scatterCount = numScatters
ERROR 15:06:11,909 QScriptManager -                         ^
ERROR 15:06:11,915 QScriptManager - HaplotypeCaller.scala:60: value memoryLimit is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,916 QScriptManager -     haplotypeCaller.memoryLimit = maxMem
ERROR 15:06:11,916 QScriptManager -                         ^
ERROR 15:06:11,924 QScriptManager -                         ^
ERROR 15:06:11,930 QScriptManager - HaplotypeCaller.scala:63: value stand_emit_conf is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,930 QScriptManager -     haplotypeCaller.stand_emit_conf = standEmitConf
ERROR 15:06:11,931 QScriptManager -                         ^
ERROR 15:06:11,937 QScriptManager - HaplotypeCaller.scala:64: value stand_call_conf is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,938 QScriptManager -     haplotypeCaller.stand_call_conf = standCallConf
ERROR 15:06:11,938 QScriptManager -                         ^
ERROR 15:06:11,944 QScriptManager - HaplotypeCaller.scala:65: value emitRefConfidence is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,945 QScriptManager -         haplotypeCaller.emitRefConfidence = GVCF
ERROR 15:06:11,945 QScriptManager -                         ^
ERROR 15:06:11,952 QScriptManager - HaplotypeCaller.scala:68: value D is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,952 QScriptManager -         haplotypeCaller.D = dbsnpFile
ERROR 15:06:11,953 QScriptManager -                             ^
ERROR 15:06:11,960 QScriptManager - HaplotypeCaller.scala:71: value L is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,961 QScriptManager -         haplotypeCaller.L :+= targetFile
ERROR 15:06:11,961 QScriptManager -                             ^
ERROR 15:06:11,967 QScriptManager - HaplotypeCaller.scala:72: value ip is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,967 QScriptManager -         haplotypeCaller.ip = intervalPadding
ERROR 15:06:11,968 QScriptManager -                             ^
ERROR 15:06:11,971 QScriptManager -         ^
ERROR 15:06:11,981 QScriptManager - 64 errors found
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.gatk.queue.QException: Compile of HaplotypeCaller.scala failed with 64 errors
at org.broadinstitute.gatk.queue.QCommandLine.org$broadinstitute$gatk$queue$QCommandLine$$qScriptPluginManager$lzycompute(QCommandLine.scala:94)
at org.broadinstitute.gatk.queue.QCommandLine.org$broadinstitute$gatk$queue$QCommandLine$$qScriptPluginManager(QCommandLine.scala:92)

##### ERROR ------------------------------------------------------------------------------------------

Created 2015-11-17 15:29:39 | Updated | Tags: haplotypecaller genotypegvcfs

Is there a way to use gzipped gVCF files with HaplotypeCaller and GenotypeGVCFs?

If so how do you make the index files? I can't seem to get it to work.

(I have searched the forum and can't seem to find a definitive answer. Sorry)

Created 2015-11-11 15:45:08 | Updated | Tags: haplotypecaller downsampling ad dp read-counts

What I have learned so far from other discussions about HaplotypeCaller:

• read counts for positions with very high coverage are downsampled
• this does not affect variant calling
• this does affect DP and AD fields in the output (g)vcf file
• don't use -nct parameter with HC
• downsampling is hard-coded and can't be influenced by parameters

Nonetheless, two problems remain. The HC doc says "This tool applies the following downsampling settings by default. To coverage: 500". Why is it then possible to observe much higher coverage (DP, AD) values in the output vcf file?

I observe SNPs where the recalibrated bam file in IGV has a depth of 1385 for the reference and 1233 for the alternate allele, but 839 (reference) and 246 (alt) in the HaplotypeCaller vcf file. Maybe this happens by chance, as reads for downsampling are chosen at random, or it is related to this bug: [gatkforums.broadinstitute.org/discussion/5882/uncorrect-strand-bias-due-to-downsampling-haplotypecaller](http://gatkforums.broadinstitute.org/discussion/5882/uncorrect-strand-bias-due-to-downsampling-haplotypecaller).

Both observations lead to the conclusion that DP and AD values from HC output are of little use for samples with high coverage (and where does high start? 500?).
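As a sanity check on the "chosen at random" idea, here is a toy simulation (not GATK's actual downsampler). Uniform random downsampling should preserve the allele fraction in expectation, which makes the observed 839/246 skew (a fraction of ~0.23 versus ~0.47 in the bam) look more like the linked bug than chance:

```python
import random

def downsample(ref_reads, alt_reads, target, seed=0):
    """Pick `target` reads uniformly at random from a ref/alt pool and
    return the (ref, alt) counts that survive."""
    rng = random.Random(seed)
    pool = ["ref"] * ref_reads + ["alt"] * alt_reads
    kept = rng.sample(pool, min(target, len(pool)))
    return kept.count("ref"), kept.count("alt")

# The IGV depths from the post: 1385 ref / 1233 alt, downsampled to 500.
ref, alt = downsample(1385, 1233, 500)
print(ref, alt, alt / (ref + alt))  # fraction stays near 1233/2618 ~ 0.47
```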

Created 2015-11-06 11:17:59 | Updated | Tags: haplotypecaller memory

Hi, I am new to the HaplotypeCaller and have huge problems getting it to run OK. I have WGS re-sequencing bam files with ~30-60x coverage (the bam files are >3GB in size). I am running these in ERC mode as suggested, but within minutes, 3 out of 4 jobs are killed by the cluster for exceeding memory. I am using the following command:

java -Xmx32g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -I $bamfile -minPruning 4 --min_base_quality_score $min_base_qual --min_mapping_quality_score $min_map_qual -rf DuplicateRead -rf BadMate -rf BadCigar -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -R $ref -o $HCdir/HC.$bamfile.g.vcf -ploidy $cohort1_ploidy -stand_emit_conf $stand_emit -stand_call_conf $stand_call --pcr_indel_model NONE

I have varied the amount of memory I allocate, up to -Xmx256g, with no improvement, which seems a bit odd to me. Even adding minPruning did not seem to improve the situation. I have looked at previous posts and know that HC appears quite memory-greedy, but is it normal to this extent? Many thanks in advance for any pointers.

Created 2015-10-29 14:19:26 | Updated | Tags: haplotypecaller variantfiltration drosophila

Hi team, (this is really two questions)

1. Do you have any recommendations for hard-filtering HaplotypeCaller-generated vcfs? This was my previous filter for the UnifiedGenotyper output:

GenomeAnalysisTK -R ${ref} \
-T VariantFiltration \
-V ${my_vcf} \
-filter "QUAL<1000.0" -filterName "LowQual" \
-filter "MQ0>=4&&((MQ0/(1.0*DP))>0.1)" -filterName "BadVal" \
-filter "MQ<60" -filterName "LowMQ" \
-filter "QD<5.0" -filterName "LowQD" \
-filter "FS>60" -filterName "FishStra" \
-filter "DP<2000" -filterName "lowTotDP" \
-o qual_marked.vcf

Obviously fields such as MQ0 won't work, as they aren't present in the HC-generated vcf, and obviously there are many fields to filter on. (There are 222 samples and 1.9m variants in the vcf.)

2. One filter that I'm really keen to apply, but never got the hang of, is to drop all individual genotype calls where the coverage is less than 10X. (This is because I'm really interested in getting the genotypes correct, rather than actually detecting mutations.)

Sincerely, William Gilks

Created 2015-10-29 03:04:43 | Updated 2015-10-29 03:05:33 | Tags: haplotypecaller nct

Hi GATK team, First I'd like to thank you guys for the tools that you're making available for the community! The problem is that some variants are missing from my sample's output when I run HaplotypeCaller with -nct. How did I figure this out?

1) When I ran with the command below (with nct deactivated):

java -Xmx10g -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -I 165019-0-LAOM-N10_26_001_L001_1.realigned.recal.bam --dbsnp dbsnp_138.hg19.vcf -T HaplotypeCaller -stand_emit_conf 30.0 -stand_call_conf 30.0 -dcov 5000 --genotyping_mode DISCOVERY -A FisherStrand -A AlleleBalance -A BaseCounts -A StrandOddsRatio -A StrandBiasBySample --max_alternate_alleles 3 -o .165019-0-LAOM-N10_26_001_L001_1.realigned.recal.gatk.high.vcf -L NUTRI.list

I got this call:

chr12 48239835 rs1544410 C T 6129.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=11.887;DB;DP=509;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=1.141;QD=12.04;ReadPosRankSum=0.418;SOR=0.640 GT:AD:GQ:PL:SB 0/1:261,247:99:6158,0,6273:261,0,247,0

2) When I ran with the command below (with nct activated):

java -Xmx10g -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -nct 8 -I 165019-0-LAOM-N10_26_001_L001_1.realigned.recal.bam --dbsnp dbsnp_138.hg19.vcf -T HaplotypeCaller -stand_emit_conf 30.0 -stand_call_conf 30.0 -dcov 5000 --genotyping_mode DISCOVERY -A FisherStrand -A AlleleBalance -A BaseCounts -A StrandOddsRatio -A StrandBiasBySample --max_alternate_alleles 3 -o .165019-0-LAOM-N10_26_001_L001_1.realigned.recal.gatk.high.vcf -L NUTRI.list

the variant above is missing from my vcf (it's not called). I checked the flags: DP = 509 (good depth); QD 12.04 > 2.0. I ran with the bamout option to see the variant as HaplotypeCaller sees it, and it shows the variant, as you can see in the figure below. This is the original bam used as input to my variant caller. My HaplotypeCaller is running at version 3.3. PS: I have seen some threads at the forum about missing variants when running with nct; is this correct?

Created 2015-10-28 08:22:47 | Updated | Tags: haplotypecaller gt variant-calling

Hello: I ran into a question when using the GATK pipeline. When I perform single-sample calling for my samples A & B, I get results like:

Sample A: Chr01 2245 . A C,G 171.31 PASS ... GT:AD:DP:GQ:PL 1/2:0,1,6:7:1:221,202,199,19,0,1
Sample B: Chr01 2245 . A G 192.84 PASS ... GT:AD:DP:GQ:PL 1/1:0,8:8:18:221,18,0

These results are different. However, when I call these two samples simultaneously, at that chromosome position I get this result:

Chr01 2245 . A G 387.43 . .... GT:AD:DP:GQ:PL 1/1:0,6:7:18:220,18,0(A) 1/1:0,8:8:18:221,18,0(B)

So the SNP of Sample A is no longer C/G but just G. I don't clearly understand how this works out. Thanks for any help from your team.
Lyc

Created 2015-10-27 12:46:50 | Updated 2015-10-27 12:47:24 | Tags: haplotypecaller

Hello. I am porting my pipeline to Roche data, and I have a list of variants that are validated with Sanger. In my pipeline I am using HaplotypeCaller to call variants, but there are variants that are not called. I have looked at the BAM file in IGV and I have found those variants in my reads. For example:

• I have a SNP at this coordinate of BRCA2: 32906729 (A->C); the reads with this SNP are 100% (150 reads in total)
• I have a SNP at this coordinate of BRCA1: 41215825 (G->A); the reads with this SNP are 100% (260 reads in total)

but I have not found them in my VCF. Why is this happening? Could you help me? Thank you in advance. Best regards

Created 2015-10-21 13:46:12 | Updated 2015-10-21 13:56:18 | Tags: haplotypecaller genotypegvcfs gvcf gatk3-4

Hi all, I'm currently confused about the variant calls shown below. If I am not mistaken, the first row shows GATK called a 34 bp insertion in sample 001 at position 3229753. It didn't call anything for sample 001 at position 3229754, but then at position 3229756 it calls another 15 bp insertion for sample 001, which overlaps completely with the first insertion. I have three questions about this. 1) Is my interpretation of the data shown below correct? 2) If so, is this expected behaviour for GATK? Under what circumstances are such results expected? 3) How can I interpret these conflicting calls? Should I just pick the call with the highest confidence and ignore the other? What about if a lower-confidence call is a substring of a previous call in another sample?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001 002 003 004
gi|ref| 3229753 0 A AACTTGCCTGCCACGCTTTTCTTTATACTTAACCC 9635.2 0 AC=3;AF=1.00;AN=3;DP=304;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=59.86;QD=29.65;SOR=0.779 GT:AD:DP:GQ:PL 1:0,48:48:99:2153,0 1:0,84:84:99:3696,0 .:0,0 1:0,85:85:99:3813,0
gi|ref| 3229754 0 A ACTTGCCTGCCACGCTTTTCTTTATACTTAACCCAGGCGCTAATTCATCTGCAACG 3012.2 0 AC=1;AF=1.00;AN=1;DP=291;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.91;QD=28.35;SOR=0.910 GT:AD:DP:GQ:PL .:0,0 .:0,0 1:0,69:69:99:3039,0 .:0,0
gi|ref| 3229756 0 G GCGCTAATTCATCTGC 3654.2 0 AC=3;AF=1.00;AN=3;DP=74;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=60.00;QD=28.36;SOR=0.747 GT:AD:DP:GQ:PL 1:0,17:17:99:854,0 1:0,25:25:99:1213,0 .:0,0 1:0,32:32:99:1614,0

Created 2015-10-21 10:08:51 | Updated | Tags: haplotypecaller gatk

Dear Sir/Madam, I am running HaplotypeCaller (GATK 3.2.2) and get the error message below for some of the samples. I have seen this error posted on the forum before, and the recommendation has often been to change to a newer version of GATK. The problem is we have been using version 3.2.2 for over 2000 samples in the same project, and those will be analysed together, so I am afraid to switch versions for just a few samples. I would greatly appreciate any help!
Best, Lina

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: 125
at org.broadinstitute.gatk.utils.sam.AlignmentUtils.calcNumHighQualitySoftClips(AlignmentUtils.java:437)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceModel.calcGenotypeLikelihoodsOfRefVsAny(ReferenceConfidenceModel.java:291)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isActive(HaplotypeCaller.java:839)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.addIsActiveResult(TraverseActiveRegions.java:618)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.access$800(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$ActiveRegionIterator.hasNext(TraverseActiveRegions.java:378)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
##### ERROR ------------------------------------------------------------------------------------------
#####
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: 125
##### ERROR ------------------------------------------------------------------------------------------

Created 2015-10-19 13:10:00 | Updated | Tags: haplotypecaller

Hello GATK team! I am facing a problem that I am not sure your caller can address, and I would like your opinion on it. I use the latest GATK version (3.4-46) HaplotypeCaller to call my variants (after all the Best Practices). I am getting the following two variants:

chr17 41222982 . ATTC A 8447.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.398;ClippingRankSum=-0.136;DP=515;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.02;MQRankSum=1.616;QD=16.40;ReadPosRankSum=-19.399;SOR=0.470 GT:AD:DP:GQ:PL 0/1:292,223:515:99:8485,0,17722
chr17 41222986 . T TAAAA 8363.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-12.032;ClippingRankSum=0.954;DP=515;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.02;MQRankSum=-1.440;QD=16.24;ReadPosRankSum=-19.368;SOR=0.453 GT:AD:DP:GQ:PL 0/1:292,223:515:99:8401,0,17731

This is one deletion directly followed by one insertion. As you can see, the read counts for the reference (292) versus the alternate (223) are exactly the same in both records. The problem is that my biologists are looking for one single mutation called a "delins", and because HC calls 2 distinct variants, I get 2 annotations and not 1 (it should be something like c.1234delTTCinsAAAA). Do you have any idea how I could handle that using HC? Or maybe with a post-processing tool after getting the VCF? Thanks a lot.
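HaplotypeCaller itself has no option to emit a combined delins record, but a pair like the one above can be merged in post-processing. A minimal sketch (hypothetical helper, not a GATK tool) that concatenates two records whose REF alleles are directly adjacent on the reference (pos2 == pos1 + len(ref1)), which for this deletion + insertion pair yields a single delins allele:

```python
def merge_adjacent(rec1, rec2):
    """Merge two VCF records whose REF alleles are directly adjacent
    (pos2 == pos1 + len(ref1)) into one combined record.

    Hypothetical post-processing helper, not a GATK tool. Assumes
    single-sample records on the same haplotype; INFO/FORMAT columns
    are dropped for brevity."""
    f1, f2 = rec1.split('\t'), rec2.split('\t')
    chrom, pos1, ref1, alt1 = f1[0], int(f1[1]), f1[3], f1[4]
    pos2, ref2, alt2 = int(f2[1]), f2[3], f2[4]
    assert f2[0] == chrom and pos2 == pos1 + len(ref1), "records not adjacent"
    # Concatenating REF with REF and ALT with ALT preserves the combined
    # haplotype: ATTC>A followed by T>TAAAA becomes ATTCT>ATAAAA.
    return '\t'.join([chrom, str(pos1), '.', ref1 + ref2, alt1 + alt2])

# The two records from the post, trimmed to the first five columns:
deletion = "chr17\t41222982\t.\tATTC\tA"
insertion = "chr17\t41222986\t.\tT\tTAAAA"
delins = merge_adjacent(deletion, insertion)
# delins has REF=ATTCT, ALT=ATAAAA: delete TTC and insert AAAA as one event
```

Annotation tools can then report the merged record as a single delins; whether merging is safe depends on the two calls being on the same haplotype, which the matching AD counts here suggest.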
Manon

Created 2015-10-14 01:38:47 | Updated | Tags: haplotypecaller

Hi GATK team, I'm running HaplotypeCaller using the following command:

-T HaplotypeCaller -R all.chrs.fasta -I filename.bam -o filename.vcf -stand_emit_conf 10 -stand_call_conf 20 -nct 8 -rf BadCigar

on PacBio reads aligned to a reference genome using BWA. I don't know why I'm getting an empty VCF file. Please find attached the log file. I see that ~50% of the reads are filtered out, but the remaining half should produce some variants, shouldn't it? Or is it a coverage issue? I appreciate your help a lot, and thanks in advance for any piece of advice!

Created 2015-10-09 20:43:20 | Updated | Tags: haplotypecaller bug gatk variant-calling

This SNP has 30 reads supporting it, most of them with MAPQ 60 and good Phred scores. I am using version 3.4-46-gbc02625 and my command was:

java -jar ~/GenomeAnalysisTK.jar -R /mnt/opt/refdata/fasta/hg19/hg19.fa -I /mnt/park/gemcode/haynes/rfa/RFA_phaser5/_SNPINDEL_PHASER_BAM/ATTACH_SERAFIM/fork0/files/output.bam -T HaplotypeCaller --genotyping_mode DISCOVERY -L chr3:33100000-33300000 -stand_emit_conf 10 -stand_call_conf 30

Let me know if you want more info or want me to submit a detailed bug report with relevant files. Thanks, Haynes

Created 2015-10-06 13:31:19 | Updated | Tags: haplotypecaller small-sample-size

Hi, I'm new to exome sequencing; sorry if the questions have really obvious answers. My data set contains only 3 samples, from mother, father and daughter. So far I'm doing the standard thing: IndelRealigner -> HaplotypeCaller -> VariantRecalibrator. Question 1: HaplotypeCaller is recommended. I tried UnifiedGenotyper as well, which outputs about 30% more raw variants. Is that expected? Question 2: This thread recommends using public data from 1000genomes if the sample size is smaller than 30. Available data sets from 1000GP don't use the Nextera Illumina technology for capture.
Is that a problem? Should I look for public data that uses the exact same approach as ours? Thanks for your help, I appreciate it! :-)

Created 2015-10-05 16:13:45 | Updated | Tags: haplotypecaller bug

Basically, I'm seeing variants called where I have no reads. Not sure why, but maybe the developers might know? All the data needed to replicate this can be found at http://ac.gt/haplotypecaller.tar.gz I saw this when I created the g.vcf on the full BAM file with the full genome.fa, but I also tried re-making the g.vcf with just the reads around the variant, and a genome.fa of just chromosome 1. It gave the same result, so the above link contains just that data, from which you can re-run:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ./genome.fa -I ./input.bam --emitRefConfidence GVCF -o ./output.g.vcf

Created 2015-10-05 13:56:01 | Updated 2015-10-05 14:02:00 | Tags: haplotypecaller maxreadsinregionpersample minreadsperalignstart

Hello, I have a question about downsample_to_coverage in HaplotypeCaller. I found that -dcov cannot be used with HaplotypeCaller, and I tried changing the values of the parameters maxReadsInRegionPerSample and minReadsPerAlignmentStart to change the coverage level, but the coverage in the result files is still at the default level. So I want to ask: which parameter in HaplotypeCaller changes the level of coverage? If it is the two parameters above, how can I increase the downsampling coverage?

Created 2015-10-03 00:45:25 | Updated | Tags: haplotypecaller rna-seq

Hi, I am trying to call SNPs from RNA-seq data. The data I have is a pooled sample (from shellfish larvae: 1000s of larvae pooled together to get enough RNA), and I have 6 of those samples. Can I use GATK to call SNPs in these pooled samples?
Hamdi

Created 2015-09-30 17:06:21 | Updated | Tags: haplotypecaller gvcf quality-score

After applying the standard RNA-Seq pipeline (with STAR, etc.) I called variants with the command:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R chromosome.fa -I ./final.bam -dontUseSoftClippedBases --variant_index_type LINEAR --variant_index_parameter 128000 --emitRefConfidence GVCF -o ./final.gvcf

On the resultant gVCF file, I ran a little Python script to see the distribution of calling quality across the different called genotypes:

• x-axis is quality score rounded to the nearest integer
• y-axis is the number of variants at that quality score

[histogram figure omitted]

As you can see, it's mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice. What I didn't expect, however, is the periodicity. Is that normal? Now, presumably, I need to filter these variants at some quality-score threshold, and I really don't know where: 0? 50? 75? Code to generate this data:

    #!/usr/bin/env python2.7
    import collections

    with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf', 'rb') as f:
        data = {}
        for line in f:
            if line[0] == '#':
                continue
            fields = line.split('\t')
            if fields[5] == '.':
                continue
            gt = fields[9][:3]
            if gt not in data:
                data[gt] = collections.defaultdict(int)
            data[gt][int(float(fields[5]))] += 1

    for gt, qualities in data.items():
        print '\n', gt
        for qual, count in sorted(qualities.items()):
            print qual, count

Created 2015-09-27 22:16:57 | Updated | Tags: haplotypecaller fasta

Hi, I have tried to solve several issues that came up while trying to run HaplotypeCaller. For this one, I didn't find anything on Google; to be honest, when pasting the error, Google doesn't even find anything similar.

ERROR MESSAGE: Badly formed genome loc: Contig NC_007605 given as location, but this contig isn't present in the Fasta sequence dictionary

Can anyone please tell me what the problem is here?
The fasta file I got was the one downloaded from the bundle: human_g1k_v37.fasta.gz Any help would be really appreciated. Thank you!!

Created 2015-09-22 08:56:30 | Updated | Tags: haplotypecaller strandoddsratio sor

I am using GATK HaplotypeCaller to call variants with the following command:

java -Xmx20g -jar GenomeAnalysisTK.jar -l INFO -R hg19.fa -T HaplotypeCaller -nct 16 -I D-2.realigned.recal.bam -I D-3.realigned.recal.bam -I D-4.realigned.recal.bam --dbsnp hg19_GATK_snp137.vcf -o D-2_D-3_D-4.raw.vcf -A StrandOddsRatio -A AlleleBalance -A BaseCounts -A StrandBiasBySample -A FisherStrand

However, there is a problem in the VCF file generated by this command: the SOR annotation defined in the header does not appear in the variant records.

##INFO=

chr1 13116 rs201725126 T G 94.57 . AC=1;AF=0.250;AN=4;BaseQRankSum=1.754;DB;DP=7;FS=0.000;MLEAC=1;MLEAF=0.250;MQ=29.47;MQ0=0;MQRankSum=-1.754;QD=23.64;ReadPosRankSum=-0.550 GT:AD:GQ:PL:SB 0/0:3,0:9:0,9,191:0,0,0,0 0/1:1,3:44:123,0,44:0,0,0,0 ./.

I have tried VariantAnnotator but still got the same problem. Could you please tell me where the problem is and how to solve it? Thanks!

Created 2015-09-17 00:02:05 | Updated 2015-09-17 00:02:45 | Tags: haplotypecaller rnaseq gatk-walkers

Hi, I have been running HaplotypeCaller on my RNA-seq data:

java -Xmx16g -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ref \
-I INPUT.bam \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0 \
-o output.vcf

My process was killed when it was 82% complete. Is there a way to resume the run without starting from the beginning?

Thanks Best Regards T. Hamdi Kitapci
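HaplotypeCaller has no checkpoint/resume, so the usual workaround for long runs is to scatter the genome into intervals with -L, run each shard as a separate job, and then gather the per-shard outputs (GATK 3 ships CatVariants for the gather step). A minimal sketch of the gather half, assuming non-overlapping, genomically ordered shards from the same sample:

```python
def gather_vcfs(shards):
    """Concatenate per-interval VCF fragments (already in genomic order)
    into one VCF, keeping header lines from the first shard only.

    Sketch of the 'gather' half of a scatter/gather workaround: run
    HaplotypeCaller once per interval with -L, then combine the shards."""
    out = []
    for i, text in enumerate(shards):
        for line in text.splitlines():
            if line.startswith('#'):
                if i == 0:          # headers only from the first shard
                    out.append(line)
            else:
                out.append(line)    # record lines from every shard
    return '\n'.join(out) + '\n'

# Two toy shards, e.g. from -L chr1:1-500 and -L chr1:501-1000:
shard1 = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\n"
shard2 = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t900\t.\tC\tT\n"
combined = gather_vcfs([shard1, shard2])  # one header, both records
```

A killed run's completed shards can then be kept and only the unfinished intervals re-run, which is as close to "resume" as the tool gets.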

Created 2015-09-16 13:37:51 | Updated | Tags: haplotypecaller dbsnp

Hi,

I am having the following problem: I use HaplotypeCaller (GATK 3.3.0) for variant calling. To identify variants that are known according to dbSNP, I use the "--dbsnp" argument and supply a dbSNP file (a VCF file). I thought that everything would work fine, but by coincidence I observed a (in my eyes really serious) problem: the same call is recognized in one sample, but not in another. These are the two important lines of the VCF files that get reported:

As we exclude known variants from our analysis, it is essential that this step works correctly. Yet I am pretty unsure what to do now. The variant seems to be well known (according to information on the NCBI homepage). So why was it not identified in the other sample?

It would be great if anyone could help me. Many thanks in advance!

Sarah

Created 2015-09-16 13:26:09 | Updated | Tags: haplotypecaller bamout

Hi all, I used multi-threading mode on HaplotypeCaller hoping to save some time, but it seems that bamout cannot be emitted in multi-threading mode. I searched for answers, but I am still not sure whether the latest 3.4-46 version supports multi-threading with bamout. BTW, I am still using the old 3.3-0 version. If the 3.4 version does support multi-threaded bamout, I will ask the computing core to update GATK for me. Or maybe I should just drop the bamout option to save some time, but I really prefer not to, because I need to check the depth and coverage of the mapping results actually used for variant calling. My command line:

java -Xmx12g -jar GATK_JARS/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-nct 12 \
-R human_g1k_v37.fasta \
--dbsnp dbsnp_138.b37.vcf \
-I recal_realigned_b37.dedup.sorted.bam \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 20 \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-o raw_var_TKDOME.g.vcf \
-bamout force_bamout_TKDOME_b37.bam -forceActive -disableOptimizations

BTW, is it necessary to add --variant_index_type LINEAR and --variant_index_parameter 128000 in version 3.3? Thank you very much!

Created 2015-09-16 07:43:20 | Updated | Tags: haplotypecaller splitncigarreads-error

First of all, thanks for the tool. I am using GATK variant calling for my RNA-seq data. I have been following the commands given on the site, but it stops at the BAM-splitting step with the following error:

ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Badly formed genome loc: Contig * given as location, but this contig isn't present in the Fasta sequence dictionary

The command I used is:

/opt/husar/bin/java-1.7 -jar /GenomeAnalysisTK-3.2-2.jar -T SplitNCigarReads -R /human_genome37_gatk.fa -I BM_ID_reorder.bam -o BM_ID_split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

I tried to do variant calling on the duplicate-removed BAM file, which also threw an error message:

##### ERROR
##### ERROR MESSAGE: SAM/BAM file BM_ID_reorder.bam is malformed: Reference index 1912602624 not found in sequence dictionary.
##### ERROR

The command line I used for this:

/opt/husar/bin/java-1.7 -jar -Xincgc -Xmx1586M $NGSUTILDIR/java/GenomeAnalysisTK-3.2-2.jar -T HaplotypeCaller -R /human_genome37_gatk.fa -I BM_ID_reorder.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o BM_ID.vcf
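A location of contig `*` usually means unmapped reads (whose RNAME is `*`) are still present in the BAM; removing them before SplitNCigarReads (in practice with `samtools view -b -F 4`) typically clears this class of error. The FLAG test involved can be sketched in plain Python over SAM text (illustration only; real BAMs need samtools or pysam):

```python
UNMAPPED = 0x4  # SAM FLAG bit: 'segment unmapped' (RNAME is then '*')

def drop_unmapped(sam_text):
    """Drop unmapped records from SAM text while keeping header lines.

    Illustrates what `samtools view -h -F 4` does; this only
    demonstrates the FLAG test, it is not a BAM-capable filter."""
    kept = []
    for line in sam_text.splitlines():
        if line.startswith('@'):
            kept.append(line)               # header lines pass through
        elif not int(line.split('\t')[1]) & UNMAPPED:
            kept.append(line)               # keep mapped records only
    return '\n'.join(kept) + '\n'

sam = ("@SQ\tSN:chr1\tLN:1000\n"
       "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\n"   # mapped
       "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\n")         # unmapped
filtered = drop_unmapped(sam)  # r2 is removed
```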

Created 2015-09-15 01:47:02 | Updated | Tags: unifiedgenotyper haplotypecaller

I've read through many of your posts/responses regarding HaplotypeCaller not calling variants, and tried many of the suggestions you've made to others, but I'm still missing variants. My situation is a little different (I'm trying to identify variants from Sanger sequence reads) but I'm hoping you might have additional ideas or can see something I've overlooked. I hope I haven't given you too much information below, but I've seen it mentioned that too much info is better than not enough.

A while back, I generated a variant call set from Illumina Next Gen Sequencing data using UnifiedGenotyper (circa v2.7.4), identifying ~46,000 discordant variants between the genomes of two haploid strains of S. cerevisiae. Our subsequent experiments included Sanger sequencing ~95 kb of DNA across 17 different loci in these two strains. I don't think any of the SNP calls were false positives, and very, very few were false negatives.

Since then, we've constructed many strains by swapping variants at these loci between these two strains of yeast. To check if a strain was constructed successfully, we PCR the loci of interest and Sanger sequence the PCR product. I'm trying to use GATK (version 3.4-46) HaplotypeCaller (preferably, or alternatively UnifiedGenotyper) in a variant detection pipeline to confirm a properly constructed strain. I convert the .ab1 files to fastqs using EMBOSS seqret, map the Sanger reads using bwa mem ($bwa mem -R $RG $refFasta $i > ${outDir}/samFiles/${fileBaseName}.sam), merge the sam files for each individual, and then perform the variant calling separately for each individual. I do not dedup (I actually intentionally leave out the -M flag in bwa), nor do I realign around indels (I plan to in the future, but there aren't any indels in any of the regions we are currently looking at), or do any BQSR in this pipeline. Also, when I do the genotyping after HaplotypeCaller, I don't do joint genotyping; each sample (individual) gets genotyped individually.

In general, this pipeline does identify many variants from the Sanger reads, but I'm still missing many variant calls that I can clearly see in the Sanger reads. Using a test set of 36 individuals, I examined the variant calls made from 364 Sanger reads that cover a total of 63 known variant sites across three ~5kb loci (40 SNPs in locus 08a-s02, 9 SNPs in locus 10a-s01, 14 SNPs in locus 12c-s02). Below are some example calls to HaplotypeCaller and UnifiedGenotyper, as well as a brief summary statement of general performance using the given command. I've also included some screenshots from IGV showing the alignments (original bam files and bamOut files) and SNP calls from the different commands.

Ideally, I'd like to use the HaplotypeCaller since not only can it give me a variant call with a confidence value, but it can also give me a reference call with a confidence value. And furthermore, I'd like to stay in DISCOVERY mode as opposed to Genotype Given Alleles, that way I can also assess whether any experimental manipulations we've performed might have possibly introduced new mutations.

Again, I'm hoping someone can advise me on how to make adjustments to reduce the number of missed calls.

Call 1: The first call to HaplotypeCaller I'm showing produced the fewest variant calls at sites where I've checked the Sanger reads.

java -Xmx4g -jar $gatkJar \
-R $refFasta \
-T HaplotypeCaller \
-I $inBam \
-ploidy 1 \
-nct 1 \
-bamout ${inBam%.bam}_hapcallRealigned.bam \
-forceActive \
-disableOptimizations \
-dontTrimActiveRegions \
--genotyping_mode DISCOVERY \
--emitRefConfidence BP_RESOLUTION \
--intervals $outDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_hc_bp_raw.g.vcf

Call 2: I tried a number of different -kmerSize values [(-kmerSize 10 -kmerSize 25), (-kmerSize 9), (-kmerSize 10), (-kmerSize 12), (-kmerSize 19), (-kmerSize 12 -kmerSize 19), and maybe some others]. I seemed to have the best luck when using -kmerSize 12 only; I picked up a few more SNPs (where I expected them), and only lost one SNP call as compared to Call 1.

java -Xmx4g -jar $gatkJar \
-R $refFasta \ -T HaplotypeCaller \ -I$inBam \
-ploidy 1 \
-nct 1 \
-bamout ${inBam%.bam}_kmer_hapcallRealigned.bam \
-forceActive \
-disableOptimizations \
-dontTrimActiveRegions \
--genotyping_mode DISCOVERY \
--emitRefConfidence BP_RESOLUTION \
--interval_padding 500 \
--intervals $outDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
-kmerSize 12 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_hc_bp_kmer_raw.g.vcf

Call 3: I tried adjusting --minPruning 1 and --minDanglingBranchLength 1, which helped more than playing with kmerSize. I picked up many more SNPs compared to both Call 1 and Call 2 (but not necessarily the same SNPs I gained in Call 2).

java -Xmx4g -jar $gatkJar \ -R$refFasta \
-T HaplotypeCaller \
-I $inBam \
-ploidy 1 \
-nct 1 \
-bamout ${inBam%.bam}_adv_hapcallRealigned.bam \
-forceActive \
-disableOptimizations \
-dontTrimActiveRegions \
--genotyping_mode DISCOVERY \
--emitRefConfidence BP_RESOLUTION \
--intervals $outDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
--minPruning 1 \
--minDanglingBranchLength 1 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_hc_bp_adv_raw.g.vcf

Call 4: I then tried adding both --minPruning 1 --minDanglingBranchLength 1 and -kmerSize 12 all at once, and I threw in a --min_mapping_quality_score as well. I maybe did slightly better... than in Calls 1-3. I did actually lose 1 SNP compared to Calls 1-3, but I got most of the additional SNPs I got from using Call 3, as well as some of the SNPs I got from using Call 2.

java -Xmx4g -jar $gatkJar \
-R $refFasta \
-T HaplotypeCaller \
-I $inBam \
-ploidy 1 \
-nct 1 \
-bamout ${inBam%.bam}_hailMary_raw.bam \
-forceActive \
-disableOptimizations \
-dontTrimActiveRegions \
--genotyping_mode DISCOVERY \
--emitRefConfidence BP_RESOLUTION \
--interval_padding 500 \
--intervals $outDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--min_mapping_quality_score 10 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
--minPruning 1 \
--minDanglingBranchLength 1 \
-kmerSize 12 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_hailMary_raw.g.vcf

Call 5: As I mentioned above, I've experienced better performance (or at least I've done a better job executing) with UnifiedGenotyper. I actually get the most SNPs called at the known SNP sites, in individuals where manual examination confirms a SNP.

java -Xmx4g -jar $gatkJar \ -R$refFasta \
-T UnifiedGenotyper \
-I $inBam \
-ploidy 1 \
--output_mode EMIT_ALL_SITES \
-glm BOTH \
-dt NONE -dcov 0 \
-nt 4 \
-nct 1 \
--intervals $outDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
-minIndelCnt 1 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_ug_emitAll_raw.vcf

I hope you're still with me :)

None of the above commands are calling all of the SNPs that I (maybe naively) would expect them to. "Examples 1-3" in the first attached screenshot are three individuals with reads (two reads each) showing the alternate allele. The map quality scores for each read are 60, and the base quality scores at this position for individual #11 are 36 and 38; for the other individuals, the base quality scores are between 48-61. The reads are very clean upstream of this position, the next upstream SNP is ~80 bp away, and the downstream SNP at the position marked for "Examples 4-6" is ~160 bp away. Commands 1 and 2 do not elicit a SNP call for Examples 1-6. Command 3 gets the calls at both positions for individual 10. Command 4 gets both calls for individual 10 and the upstream SNP for individual 11. Command 5 (UnifiedGenotyper) gets the alt allele called in all 3 individuals at the upstream position, and the alt allele called for individuals 10 and 12 at the downstream position. Note that in individual 11, there is only one read covering the downstream variant position, where UnifiedGenotyper missed the call.

Here is the vcf output for those two positions from each command. Note that there are more samples in the per-sample breakdown for the FORMAT tags. The last three groups of FORMAT tags correspond to the three individuals I've shown in the screenshots.

Command 1 output

Examples 1-3    649036  .   G   .   .   .   AN=11;DP=22;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ    .   .   .   .   .   .   .   .   .   .   .:0:0:0 0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:84    0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:89    0:0:2:0 0:0:2:0 0:0:2:0
Examples 4-6    649160  .   C   .   .   .   AN=11;DP=21;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ    .   .   .   .   .   .   .   .   .   .   .:0:0:0 0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:0 0:0:2:0 0:2:2:71    0:0:2:0 0:2:2:44    0:0:2:0 0:0:1:0 0:0:2:0

Command 2 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hc_bp_kmer_raw.vcf  GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf   GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .       .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    1:0,0,1:.:37:37,0   0:2:2:.:.:44    1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0

Command 3 output

Examples 1-3    649036  .   G   A   36.01   .   ABHom=1.00;AC=3;AF=0.273;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_adv_raw.vcf    GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:66:66,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    0:0:2:.:.:0 0:2:2:.:.:89    0:0:2:.:.:0 0:0:2:.:.:0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   ABHom=1.00;AC=1;AF=0.091;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf    GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    0:0:2:.:.:0 0:2:2:.:.:44    0:0:2:.:.:0 0:0:1:.:.:0 0:0:2:.:.:0

Command 4 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hailMary_raw.vcf  GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    1:0,0,1:.:37:37,0   0:2:2:.:.:44    1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0

Command 5 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=7;AF=0.636;AN=11;DP=22;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=2.303;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. 1:0,2:2:56:56,0 0:.:2   1:0,2:2:99:117,0    0:.:2   1:0,2:2:99:122,0    0:.:2   1:0,2:2:67:67,0 0:.:2   1:0,2:2:99:110,0    1:0,2:2:84:84,0 1:0,2:2:99:127,0
Examples 4-6    649160  .   C   A   46  .   ABHom=1.00;AC=5;AF=0.455;AN=11;DP=21;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. 0:.:2   0:.:2   1:0,2:2:76:76,0 0:.:2   1:0,2:2:70:70,0 0:.:2   1:0,1:2:37:37,0 0:.:2   1:0,2:2:60:60,0 0:.:1   1:0,2:2:75:75,0

There are many more examples of missed SNP calls. When using the HaplotypeCaller, I'm missing ~23% of the SNP calls. So...what can I do to tweak my variant detection pipeline so that I don't miss so many SNP calls?

As I mentioned, I'm currently getting better results with the UnifiedGenotyper walker. I'm only missing about 2% of all Alt SNP calls. Also, about half of that 2% are improperly being genotyped as Ref by Command #5. It appears to me that most of the variant calls I'm missing using the UnifiedGenotyper are at positions where I only have a single Sanger read covering the base, and the base quality score starts to fall below 25 (such as in individual #11 in the first attached screen shot, base quality score was 20). Attached is a second IGV screenshot of a different locus where I've also missed SNP calls using Command 5 (Examples 7-9). I've also included the read details for those positions, as well as the VCF file output from Command 5. I have seen at least one instance where I had two Sanger reads reporting an alternate allele, however, UG did not call the variant. In that case though, the base quality scores in both reads were very low (8); mapping quality was 60 for both reads.

Does anyone have any suggestions as to how I might alter the parameters to reduce (and hopefully eliminate) the missed SNP calls? I think I would accept false positives over false negatives in this case. Or does anyone have any other idea as to what my problem might be?

Thanks so much! Matt Maurer

Command 5 output for second screen shot file:

VCF output The samples shown in the second attached screen shot correspond to the 11th and 12th groupings in the per-sample breakdown of the FORMAT tags.

Examples 7-8    163422  .   G   C   173 .   ABHom=1.00;AC=2;AF=0.182;AN=11;DP=14;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf GT:AD:DP:GQ:PL  0:.:1   1:0,4:4:99:203,0    0:.:1   0:.:1   0:.:1   0:.:1   0:.:1   1:0,1:1:54:54,0 0:.:1   ./. 0:.:1   0:.:1   ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
Example 9   163476  .   A   G   173 .   ABHom=1.00;AC=2;AF=0.167;AN=12;DP=15;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  0:.:1   1:0,4:4:99:203,0    0:.:1   0:.:1   0:.:1   0:.:1   0:.:1   1:0,1:1:57:57,0 0:.:1   0:.:1   0:.:1   0:.:1   .   .   .   .   .   .   .   .   .   .   .

Also, why are the GTs sometimes "./.", as they are for site 163422, and sometimes ".", as they are for site 163476?

Example#7

Sample = qHZT-08a-s02_r2657_p4094_dJ-011
----------------------
Location = 163,422
Alignment start = 163,293 (+)
Cigar = 34S833M1D72M1I50M1I9M
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = C
Base phred quality = 23
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
NM = 20
AS = 858
XS = 0
-------------------

Example 8

Sample = qHZT-08a-s02_r2657_p4094_dJ-011
----------------------
Location = 163,476
Alignment start = 163,293 (+)
Cigar = 34S833M1D72M1I50M1I9M
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = G
Base phred quality = 15
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
NM = 20
AS = 858
XS = 0
-------------------

Example #9

Sample = qHZT-08a-s02_r2657_p4094_dJ-012
----------------------
Location = 163,422
Alignment start = 163,329 (+)
Cigar = 67S16M1D181M1D634M1D9M1I8M1I62M1D17M4S
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = C
Base phred quality = 18
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12
NM = 87
AS = 480
XS = 0
-------------------

Created 2015-09-14 16:27:46 | Updated | Tags: haplotypecaller vcf bam rnaseq variant-calling

Hello,

I'm using GATK to call variants in my RNA-Seq data. I'm noticing something strange, perhaps someone can help? For a number of sites the VCF is reporting things I cannot replicate from BAMs. How can I recover the reads that contribute to a variant call? Here is an example for 1 site in 1 sample, but I've observed this at many sites/samples:

$ grep 235068463 file.vcf
chr1 235068463 . T C 1795.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.530;ClippingRankSum=-0.535;DP=60;FS=7.844;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.401;QD=29.93;ReadPosRankSum=3.557 GT:AD:DP:GQ:PL 0/1:5,55:60:44:1824,0,44

60 reads, 5 T, 55 C. But loading the bam in IGV, I do not see any T reads. Similarly:

$ samtools view -uh file.md.realn.bam chr1:235068463-235068463 | samtools mpileup - | grep 235068463
[mpileup] 1 samples in 1 input files
[mpileup] Set max per-file depth to 8000
chr1    235068463   N   60  cCCccccCCCcccccCcccccccccCCCccCCCCCcCcccccCCCcCcCCccCCCCccCC    >CA@B@>A>BA@BCABACCC:@@ACABBBCAACBBCABCB@CABBAB?>A?CBBAAAABA

There are just 60 C's at that location. How do I decide what the genotype here is? C/C or C/T ?
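On the "how do I decide" part: the GT and GQ fields follow mechanically from the PL values, not from a pileup recount (the pileup of the input BAM need not show the reads HaplotypeCaller used after local reassembly; the -bamout option writes the reassembled reads). A minimal sketch, assuming diploid/biallelic PL ordering; the helper name is mine, not GATK code:

```python
def genotype_from_pl(pl):
    """Pick the genotype with the smallest Phred-scaled likelihood.

    PL lists Phred-scaled genotype likelihoods in the order 0/0, 0/1, 1/1;
    GQ is the gap to the second-best genotype, capped at 99.
    """
    order = ["0/0", "0/1", "1/1"]
    best = min(range(len(pl)), key=lambda i: pl[i])
    second = min(pl[i] for i in range(len(pl)) if i != best)
    return order[best], min(second - pl[best], 99)

# The record above carries PL = 1824,0,44, hence GT 0/1 and GQ 44:
gt, gq = genotype_from_pl([1824, 0, 44])
```

So by the VCF itself the call is heterozygous C/T, with the 0/0 hypothesis only 44 Phred units less likely than 0/1.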

For methodology I'm using gatk/3.2.0. I tried using HC from gatk/3.3.1 and got the same result. The bam and vcf files come from the final two steps of this pipeline: 2-pass STAR -> MarkDuplicates -> SplitNCigarReads -> RealignerTargetCreator -> IndelRealigner -> BaseRecalibrator -> PrintReads -> MergeSamFiles -> MarkDuplicates -> RealignerTargetCreator -> IndelRealigner -> HaplotypeCaller

Thanks, Kipp

Created 2015-09-14 10:00:46 | Updated | Tags: haplotypecaller indels

Hi all,

I have the below INDEL call from GATK-3.3 Haplotype caller.

chr17   39190954    .   G   GCAGCAGCTTGGCTGGCAGCAGCTGGTCTCA 770.52  PASS    AC=1;AF=0.500;AN=2;DP=138;

The command used:

java -Xmx10G -jar GenomeAnalysisTK.jar -R %s -T HaplotypeCaller -I %s -L %s -stand_emit_conf 10 -stand_call_conf 30
--genotyping_mode DISCOVERY -o %s

DP in the INFO field is 138 and AD from the FORMAT field is 0,0. I understand that DP and AD are unfiltered and filtered depths. However, having 0 reads is alarming. Could someone help me understand the differing read depths?

Created 2015-09-11 12:29:24 | Updated | Tags: haplotypecaller genotypegvcf rgq

I work with non-human genomes and commonly need the confidence of the reference sites, so I was happy to see the inclusion of the RGQ score in the format field of GenotypeGVCFs. However, I am a little confused as to what this score means (how it is calculated). Out of curiosity I plotted the distribution of RGQ and GQ scores over ~1Mbp. A few things jumped out that I was hoping you could explain:

(1) There are two peaks of GQ and RGQ scores: one at 99, which is obviously just the highest confidence score, and another at exactly GQ/RGQ=45. You can see this in the GQ/RGQ distribution below, where I've excluded the sites with RGQ/GQ = 0 or 99 (RGQ = blue, GQ = red). Is there some reason why so many GT calls score exactly 45?

(2) There are very few GQ = 0 calls and ~96% are GQ = 99, but for RGQ ~42% are 0 and ~54% are 99. Is there any explanation for why so many RGQ scores are 0? I fear that filtering on RGQ will bias the data against reference calls and include a disproportionate number of variant calls.

Created 2015-09-11 10:31:16 | Updated | Tags: haplotypecaller gvcf

I came across some unusual variants called by HaplotypeCaller running in gvcf mode while working on human WGS data (the example gvcf line can be seen below). The genotype in almost all samples is undefined i.e. "./.", despite the good coverage reported in DP field (only one sample is identified as 0/1). Moreover, in "./." genotyped samples all reads fall into reference allele group of AD field, therefore I would anticipate "0/0" genotype rather than "./.". I have also inspected several bam files visually and did not find any obvious mapping problems. I have attached two IGV snapshots of the variant region: first is from an example "./." genotyped patient and second one is from the only patient with variant. The region seems to have good 25-30x coverage with majority of mapping qualities equal to 60. However, apparently there is some other insertion nearby. The GATK version I am using is 2015.1-3.4.0-1-ga5ca3fc and reference genome is GRCh38.

Could you please explain why the inferred genotype is "./." instead of "0/0" ?

Best,

Ewa

chr1 100474610 rs568102277 T TG 358.91 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.54;ClippingRankSum=0.419;DB;DP=4026;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=1.36;QD=13.29;ReadPosRankSum=0.814;SOR=0.551 GT:AD:DP:GQ:PGT:PID:PL ./.:36,0:36 ./.:37,0:37 ./.:33,0:33 ./.:30,0:30 ./.:36,0:36 ./.:32,0:32 ./.:36,0:36 ./.:32,0:32 ./.:31,0:31 ./.:37,0:37 ./.:27,0:27 ./.:34,0:34 ./.:38,0:38 ./.:28,0:28 ./.:29,0:29 ./.:31,0:31 ./.:25,0:25 ./.:24,0:24 ./.:19,0:19 ./.:41,0:41 ./.:24,0:24 ./.:27,0:27 ./.:26,0:26 ./.:28,0:28 ./.:31,0:31 ./.:38,0:38 ./.:27,0:27 ./.:22,0:22 ./.:31,0:31 ./.:27,0:27 ./.:29,0:29 ./.:28,0:28 ./.:34,0:34 ./.:20,0:20 ./.:26,0:26 ./.:33,0:33 ./.:26,0:26 ./.:26,0:26 ./.:31,0:31 ./.:32,0:32 ./.:34,0:34 ./.:27,0:27 ./.:28,0:28 ./.:37,0:37 ./.:38,0:38 ./.:25,0:25 ./.:31,0:31 ./.:37,0:37 ./.:31,0:31 ./.:32,0:32 ./.:30,0:30 ./.:38,0:38 ./.:36,0:36 ./.:32,0:32 ./.:40,0:40 ./.:32,0:32 ./.:42,0:42 ./.:37,0:37 ./.:29,0:29 ./.:42,0:42 ./.:31,0:31 ./.:36,0:36 ./.:35,0:35 ./.:31,0:31 ./.:35,0:35 ./.:32,0:32 ./.:30,0:30 ./.:30,0:30 ./.:36,0:36 ./.:34,0:34 ./.:28,0:28 ./.:37,0:37 ./.:34,0:34 ./.:24,0:24 ./.:31,0:31 ./.:33,0:33 ./.:36,0:36 ./.:37,0:37 ./.:48,0:48 ./.:25,0:25 ./.:39,0:39 ./.:26,0:26 ./.:23,0:23 ./.:39,0:39 ./.:29,0:29 ./.:33,0:33 ./.:37,0:37 ./.:27,0:27 ./.:29,0:29 ./.:42,0:42 ./.:28,0:28 ./.:29,0:29 ./.:30,0:30 ./.:39,0:39 ./.:39,0:39 ./.:35,0:35 ./.:31,0:31 ./.:29,0:29 ./.:23,0:23 ./.:30,0:30 ./.:24,0:24 ./.:29,0:29 ./.:26,0:26 ./.:19,0:19 ./.:26,0:26 ./.:16,0:16 ./.:27,0:27 ./.:24,0:24 ./.:34,0:34 ./.:28,0:28 ./.:41,0:41 ./.:41,0:41 ./.:39,0:39 ./.:24,0:24 0/1:11,16:27:99:1|0:100474609_G_GT:381,0,245 ./.:36,0:36 ./.:26,0:26 ./.:27,0:27 ./.:29,0:29 ./.:29,0:29 ./.:28,0:28 ./.:24,0:24 ./.:19,0:19 ./.:31,0:31 ./.:33,0:33 ./.:23,0:23 ./.:25,0:25 ./.:31,0:31 ./.:34,0:34 ./.:26,0:26

Created 2015-08-31 03:14:54 | Updated 2015-08-31 03:16:12 | Tags: haplotypecaller

Hi, I'm hoping you can help resolve the behaviour of HaplotypeCaller with respect to a certain position.

Here's the IGV screenshot, with these filters: MQ>30, filter secondaries and dups. The DP is 14-18 across this deletion.

HC called a TATA deletion in this proband, with this gvcf call: 5 67597220 rs71655141 GTATA G,<NON_REF> 95.14 . DB;DP=12;MLEAC=2,0;MLEAF=1.00,0.00;MQ=57.93;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:10:132,10,0,132,10,132:0,0,1,2

It's calling this as GT=1/1 with AD=0,3.

Clearly this is likely all noise, and a tough region of the genome to make a call in, but I'm curious why the depth is 3, and how HC handles the multiple overlapping deletions, i.e. how it ends up making only the delTATA call.

I'm using GATK 3.3, and following best practices.

cheers, Mark

Created 2015-08-30 22:27:03 | Updated 2015-08-30 22:30:48 | Tags: haplotypecaller ploidy pooled-calls

Hello everyone,

I was reading the haplotype caller documentation and noticed the "--sample_ploidy/-ploidy" flag. The description reads "Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy)."

My question is, what exactly is a pooled experiment? Is it when I have multiple samples? I have separate files for each of my 8 samples and the organism only has one chromosome. So would the number I set be 8*1? Or is this pooled number for multiple samples within a file, and in which case, I would specify 1 instead of 8.
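For what it's worth, the arithmetic from the docs can be spelled out as a one-liner ("pooled" means multiple individuals sequenced together as one sample in one file; separate files with one individual each are not pooled). The function name is mine, purely for illustration:

```python
def sample_ploidy_flag(organism_ploidy, individuals_per_file):
    """Value to pass to -ploidy, per the docs: organism ploidy times the
    number of individuals pooled into a single sample/file.

    Separate files with one individual each are NOT pooled, so
    individuals_per_file is 1 and the flag equals the organism's ploidy.
    """
    return organism_ploidy * individuals_per_file

# 8 separate files, one haploid individual per file: each run uses -ploidy 1
per_file = sample_ploidy_flag(1, 1)
# If all 8 haploid individuals were pooled into ONE file: -ploidy 8
pooled = sample_ploidy_flag(1, 8)
```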

Thanks! Raymosrunerx

Created 2015-08-27 18:54:14 | Updated 2015-08-27 18:54:44 | Tags: haplotypecaller bug

Haplotype Caller output this record, how can it have an AD of 0,0? 7 21584892 . T TAA 257.77 . AC=1;AF=0.500;AN=2;DP=118;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=70.00;SOR=4.804 GT:AD:GQ:PL 0/1:0,0:99:125,0,112

GATK version is 3.4-46

Created 2015-08-10 16:15:23 | Updated | Tags: haplotypecaller

Hello,

I am having an issue with haplotypecaller omitting true heterozygotes. Attached is an IGV image of the VCF (top track), de novo reassembled BAM file (middle) and input BAM file (bottom). This appears to be happening all over. I was wondering what I can do to address this issue.

commandline:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I sample.realigned.marked.sorted.bqsr.unique.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o sample.hapcal.raw.vcf -nct 12

p.s. I get the same results when running without multiple threads and when outputting the rearranged BAM file used for variant calling.

Created 2015-07-31 00:11:55 | Updated | Tags: haplotypecaller kmer bamout

Hello, I am using HaplotypeCaller in order to get haplotype sequences from individual samples (several samples per species) for gene tree/species tree analysis. The reads are from an exome capture experiment. Because I am running individual samples I have limited the max # of haplotypes to 2. However, the default behavior of using two kmer sizes (10 and 25) results in up to four haplotypes per exon (interval) in the bamout file. I have found that if I supply a kmerSize parameter I get only 2 haplotypes, but these differ depending on the kmer I supply. The difference is not only a subsetting of the SNPs found with multiple kmer sizes; distinct SNPs are called with different kmer sizes as well. I would like to run the analysis with multiple kmerSizes specified and have the caller output only the two most likely haplotypes. Is this possible and, if so, how can I do it? Or am I misunderstanding how the caller works?

I think I understand why different kmer sizes would result in different SNPs being called, but if anyone could explain it to me I'd love confirmation.
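The intuition can be made concrete: the assembly graph is built from overlapping kmers, so the choice of k changes the graph's topology, and with it the paths (haplotypes) that exist to be scored. A toy sketch of the kmer decomposition (my own illustration, not GATK's implementation):

```python
def debruijn_edges(seq, k):
    """Decompose seq into kmers overlapping by k-1 bases and return the
    edges of the resulting DeBruijn-style graph. Toy illustration only."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return list(zip(kmers, kmers[1:]))

# A read containing a repeat: with a small k the repeated kmer collapses
# two positions into one node (creating ambiguity / extra paths), while a
# larger k keeps every kmer unique and the graph is a single linear path.
read = "ATGATGCA"
small_k = debruijn_edges(read, 3)  # "ATG" occurs twice -> 6 kmers, 5 distinct nodes
large_k = debruijn_edges(read, 5)  # all 5-mers unique -> one unambiguous path
```

This is why distinct (not merely subsetted) variant calls can emerge per kmer size: each k yields a different graph, hence a different candidate haplotype set.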

Here is my original command line before experimenting with kmer sizes: java -jar /opt/local/NGS/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R /Users/bdorsey/Documents/Dioon/Capture_seqs_assembly/captured_seqs_uniq.fa -I /Volumes/HD2/Capture_assembly/Dioon1/contigs/Dioon1_m1n350r.10x.sp5.bam -L /Volumes/HD2/Capture_assembly/Dioon1/exonsCov10sp5.list --activeRegionIn /Volumes/HD2/Capture_assembly/Dioon1/exonsCov10sp5.list --maxNumHaplotypesInPopulation 2 --minReadsPerAlignmentStart 5 -out_mode EMIT_ALL_SITES -ERC BP_RESOLUTION --forceActive --dontTrimActiveRegions --activeRegionMaxSize 10000 -bamWriterType CALLED_HAPLOTYPES --disableOptimizations -bamout /Volumes/HD2/Capture_assembly/Dioon1/haplo/Dioon1.haplos.bam -o /Volumes/HD2/Capture_assembly/Dioon1/haplo/Dioon1.haplos.g.vcf

Thanks very much for any help. Cheers, Brian D

Created 2015-07-23 11:26:42 | Updated | Tags: fisherstrand haplotypecaller downsampling strand-bias

Hi GATK team, Again thanks a lot for the wonderful tools you're offering to the community.

I have recently switched from UnifiedGenotyper to Haplotype Caller (1 sample at a time, DNASeq). I was planning to use the same hard filtering procedure that I was using previously, including the filter of the variants with FS > 60. However I am facing an issue probably due to the downsampling done by HC.

I should have 5000 reads, but DP is around 500/600, which I understand is due to downsampling (even with -dt NONE). I understand that this does not impact the calling itself. However it is a problem for two reasons: 1) calculating the frequency of the variant using the AD field is not correct (it is not based on all reads); 2) I get variants with FS > 60 even though, when you look at the entire set of reads, there is absolutely no strand bias.

Example with this variant chr17 41245466 rs1799949 G A 7441.77 STRAND_BIAS; AC=1;AF=0.500;AN=2;BaseQRankSum=7.576;DB;DP=1042;FS=63.090;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.666;QD=7.14;ReadPosRankSum=-11.896;SOR=5.810 GT:AD:GQ:PL:SB 0/1:575,258:99:7470,0,21182:424,151,254,4

When I observe all reads I have the following counts, well shared on the + and - strands Allele G : 1389 (874+, 515-) Allele A : 1445 (886+, 559-)
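For reference, FS is essentially the Phred-scaled p-value of a two-sided Fisher's exact test on the ref/alt-by-strand contingency table. A self-contained sketch (my own code; GATK additionally normalizes the table before testing, so values will not match its output exactly) shows why the downsampled counts in the SB field look biased while the full counts do not:

```python
from math import comb, log10

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)
    def p(x):  # hypergeometric probability of the table with top-left cell = x
        return comb(row1, x) * comb(row2, col1 - x) / denom
    p_obs = p(a)
    total = 0.0
    for x in range(max(0, col1 - row2), min(col1, row1) + 1):
        pt = p(x)
        if pt <= p_obs * (1 + 1e-9):  # sum tables at least as extreme
            total += pt
    return min(total, 1.0)

def fs(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scale the p-value; this is the idea behind the FS annotation."""
    return -10 * log10(max(fisher_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev), 1e-300))

# SB field of the call above: ref (fwd, rev) = (424, 151), alt = (254, 4)
fs_downsampled = fs(424, 151, 254, 4)   # heavily strand-biased table
# Full, un-downsampled counts: G (874+, 515-) vs A (886+, 559-)
fs_full = fs(874, 515, 886, 559)        # essentially no bias
```

The downsampled table is genuinely skewed (alt reads almost all forward), so the annotation is doing its job on the subset of reads it was given; on the full counts FS would be far below the 60 cutoff.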

Could you please tell me how to avoid such an issue ? (By the way, this variant is a true one and should not be filtered out).

Thanks a lot.

Created 2015-07-22 12:08:20 | Updated 2015-07-22 12:09:10 | Tags: unifiedgenotyper haplotypecaller snp-calling input-prior

I'm using HaplotypeCaller (but it could be also possible to use this option with UnifiedGenotyper) for a very special experimental design in a no-human species, where we have an expectation for the prior probabilities of each genotype. I'm planning to call SNPs for single diploid individuals using HaplotypeCaller and afterwards for the whole dataset with GenotypeGVCFs.

Nevertheless, I'm confused about the structure of the prior probabilities command line. In the documentation, it says: "Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one. f) If user-defined values add to more than one, an error will be produced". So I'll require to provide two prior probabilities out of the 3 for each genotype (0/0, 0/1 and 1/1). My first guess is that the prior that I don't need to provide is for the reference homozygous (0/0) based on the Pr(AC=0) specified in the documentation. I would like to know if this idea is correct.

My second question is whether the two input_prior options are positional parameters. If so, and if my first guess about Pr(AC=0) is correct, do they represent the probabilities of 0/1 and 1/1, that is, Pr(AC=1) and Pr(AC=2)?

More concretely, I'm going to provide an example where you don't expect any heterozygous call. In that case, is it correct that the arguments will be --input_prior 0.5 --input_prior 0?

Thank you very much.

Created 2015-07-07 15:31:49 | Updated | Tags: haplotypecaller code-exception

Dear GATK Team, I'm trying to run HaplotypeCaller, on a Bowtie2 bam file, but I'm having a Code exception. Any idea how to fix it? Best regards, Miguel Machado

INFO 16:21:59,507 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:21:59,511 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41
INFO 16:21:59,511 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:21:59,511 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:21:59,516 HelpFormatter - Program Args: -T HaplotypeCaller -R /scratch/mpmachado/trabalho1diversidade/genomaReferencia/dadosGenoma/NC_004368.fna -I /scratch/mpmachado/trabalho1diversidade/genomaReferencia/outputs/bowtie/GBS_01.sorted_position.bam --out output.raw.snps.indels.vcf --genotyping_mode DISCOVERY --output_mode EMIT_ALL_SITES --sample_ploidy 1
INFO 16:21:59,521 HelpFormatter - Executing as mpmachado@dawkins on Linux 2.6.32-5-amd64 i386; Java HotSpot(TM) Server VM 1.8.0_45-b14.
INFO 16:21:59,521 HelpFormatter - Date/Time: 2015/07/07 16:21:59
INFO 16:21:59,521 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:21:59,521 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:22:00,257 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:22:00,360 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 16:22:00,370 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:22:00,401 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 16:22:00,410 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 16:22:01,869 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

Created 2015-06-02 09:25:02 | Updated 2015-06-02 09:26:43 | Tags: haplotypecaller missing-genotype

Dear GATK community,

I am using HaplotypeCaller for variant discovery and I have found some strange results in my VCF. It appears that this walker is making no calls at positions that have supporting reads in the original BAM.

For instance this position:

chr7 302528 rs28436118 A G 31807.9 PASS . GT:AD:DP:GQ:PL 1/1:0,256:256:99:8228,767,0 ./.:98,0:98:.:. ./.:81,0:81:.:. 1/1:0,134:134:99:4287,401,0

was not called for 2 of the 4 samples available. However, in both samples where the genotype is missing there are many reads supporting a homozygous reference call (0/0).

Do you have any idea why this is happening? Cheers, JMFA

Created 2015-05-20 13:06:46 | Updated | Tags: haplotypecaller java genotypegvcfs

Dear GATK,

I used the HaplotypeCaller with "-dcov 500 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000" to produce 60 gvcf files, that worked fine. However, GenotypeGVCFs gets stuck on a position and runs out of memory after about 24hours, even when I allocate 240Gb. Testing a short region of 60kb does not help. Here was my command line: software/jre1.7.0_25/bin/java -Xmx240g -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Reference.fasta -L chrom14:2240000-2300000 --variant 60samples_gvcf.list -o output.vcf

If I split my list of 60 gvcf files into two lists of 30 samples each, GenotypeGVCFs works fine for both batches within 15 minutes (~10Gb of memory).
I tested with 47 samples, it took 8 hours (31gb of memory) for a 60kb region. Once I use more than ~55 samples, it takes forever and crashes.

Any help will be much appreciated! Thanks,

Antoine

Created 2015-05-07 09:22:47 | Updated 2015-05-07 10:09:41 | Tags: haplotypecaller format genotypegvcfs

I have a potential bug running GATK GenotypeGVCFs. It complains about a DP key in the INFO field, but my HaplotypeCaller-generated -mg.g.vcf.gz files do not have DP in the INFO field. They do have DP in the FORMAT field, and that one is declared in the headers, as shown below the error output.

Any idea what could be the problem?

INFO  18:30:12,694 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:30:12,698 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-geee94ec, Compiled 2015/03/09 14:27:22
INFO  18:30:12,699 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  18:30:12,706 HelpFormatter - Program Args: -l INFO -T GenotypeGVCFs -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -o /dev/stdout -ploidy 2 --num_threads 32 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SeqCap/ex100/110624_MM10_exome_L2R_D02_EZ_HX1-ex100.bed --max_alternate_alleles 20 -V:3428_10_14_SRO_185_TGGCTTCA-mg,VCF 3428_10_14_SRO_185_TGGCTTCA-mg.g.vcf.gz -V:3428_11_14_SRO_186_TGGTGGTA-mg,VCF 3428_11_14_SRO_186_TGGTGGTA-mg.g.vcf.gz -V:3428_12_13_SRO_422_TTCACGCA-mg,VCF 3428_12_13_SRO_422_TTCACGCA-mg.g.vcf.gz -V:3428_13_13_SRO_492_AACTCACC-mg,VCF 3428_13_13_SRO_492_AACTCACC-mg.g.vcf.gz -V:3428_14_13_SRO_493_AAGAGATC-mg,VCF 3428_14_13_SRO_493_AAGAGATC-mg.g.vcf.gz -V:3428_15_14_SRO_209_AAGGACAC-mg,VCF 3428_15_14_SRO_209_AAGGACAC-mg.g.vcf.gz -V:3428_16_14_SRO_218_AATCCGTC-mg,VCF 3428_16_14_SRO_218_AATCCGTC-mg.g.vcf.gz -V:3428_17_14_SRO_201_AATGTTGC-mg,VCF 3428_17_14_SRO_201_AATGTTGC-mg.g.vcf.gz -V:3428_18_13_SRO_416_ACACGACC-mg,VCF 3428_18_13_SRO_416_ACACGACC-mg.g.vcf.gz -V:3428_19_14_SRO_66_ACAGATTC-mg,VCF 3428_19_14_SRO_66_ACAGATTC-mg.g.vcf.gz -V:3428_1_13_SRO_388_GTCGTAGA-mg,VCF 3428_1_13_SRO_388_GTCGTAGA-mg.g.vcf.gz -V:3428_20_14_SRO_68_AGATGTAC-mg,VCF 3428_20_14_SRO_68_AGATGTAC-mg.g.vcf.gz -V:3428_21_14_SRO_210_AGCACCTC-mg,VCF 3428_21_14_SRO_210_AGCACCTC-mg.g.vcf.gz -V:3428_22_14_SRO_256_AGCCATGC-mg,VCF 3428_22_14_SRO_256_AGCCATGC-mg.g.vcf.gz -V:3428_23_14_SRO_270_AGGCTAAC-mg,VCF 3428_23_14_SRO_270_AGGCTAAC-mg.g.vcf.gz -V:3428_24_13_SRO_452_ATAGCGAC-mg,VCF 3428_24_13_SRO_452_ATAGCGAC-mg.g.vcf.gz -V:3428_2_13_SRO_399_GTCTGTCA-mg,VCF 3428_2_13_SRO_399_GTCTGTCA-mg.g.vcf.gz -V:3428_3_13_SRO_461_GTGTTCTA-mg,VCF 3428_3_13_SRO_461_GTGTTCTA-mg.g.vcf.gz -V:3428_4_13_SRO_462_TAGGATGA-mg,VCF 3428_4_13_SRO_462_TAGGATGA-mg.g.vcf.gz -V:3428_5_13_SRO_465_TATCAGCA-mg,VCF 3428_5_13_SRO_465_TATCAGCA-mg.g.vcf.gz -V:3428_6_13_SRO_402_TCCGTCTA-mg,VCF 3428_6_13_SRO_402_TCCGTCTA-mg.g.vcf.gz 
-V:3428_7_13_SRO_474_TCTTCACA-mg,VCF 3428_7_13_SRO_474_TCTTCACA-mg.g.vcf.gz -V:3428_8_13_SRO_531_TGAAGAGA-mg,VCF 3428_8_13_SRO_531_TGAAGAGA-mg.g.vcf.gz -V:3428_9_14_SRO_166_TGGAACAA-mg,VCF 3428_9_14_SRO_166_TGGAACAA-mg.g.vcf.gz
INFO  18:30:12,714 HelpFormatter - Executing as roel@utonium.nki.nl on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13.
INFO  18:30:12,714 HelpFormatter - Date/Time: 2015/05/06 18:30:12
INFO  18:30:12,715 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:30:12,715 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:30:15,963 GenomeAnalysisEngine - Strictness is SILENT
INFO  18:30:16,109 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  18:30:29,705 IntervalUtils - Processing 101539431 bp from intervals
WARN  18:30:29,726 IndexDictionaryUtils - Track 3428_10_14_SRO_185_TGGCTTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_11_14_SRO_186_TGGTGGTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_12_13_SRO_422_TTCACGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_13_13_SRO_492_AACTCACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_14_13_SRO_493_AAGAGATC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_15_14_SRO_209_AAGGACAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_16_14_SRO_218_AATCCGTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_17_14_SRO_201_AATGTTGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_18_13_SRO_416_ACACGACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_19_14_SRO_66_ACAGATTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_1_13_SRO_388_GTCGTAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_20_14_SRO_68_AGATGTAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_21_14_SRO_210_AGCACCTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_22_14_SRO_256_AGCCATGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_23_14_SRO_270_AGGCTAAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_24_13_SRO_452_ATAGCGAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_2_13_SRO_399_GTCTGTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_3_13_SRO_461_GTGTTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_4_13_SRO_462_TAGGATGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_5_13_SRO_465_TATCAGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_6_13_SRO_402_TCCGTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_7_13_SRO_474_TCTTCACA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_8_13_SRO_531_TGAAGAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,735 IndexDictionaryUtils - Track 3428_9_14_SRO_166_TGGAACAA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  18:30:29,749 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine
INFO  18:30:29,878 GenomeAnalysisEngine - Preparing for traversal
INFO  18:30:29,963 GenomeAnalysisEngine - Done preparing for traversal
INFO  18:30:29,964 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  18:30:29,965 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  18:30:29,966 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  18:30:30,562 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
INFO  18:31:00,420 ProgressMeter -       1:4845033         0.0    30.0 s      50.3 w        0.0%    46.7 h      46.7 h
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IllegalStateException: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
##### ERROR ------------------------------------------------------------------------------------------

for f in *.g.vcf.gz; do echo -e "\n--$f --"; zcat "$f" | sed -n -r "/^#.*DP/p;/^1\t4839315\t/{p;q;}"; done

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:22:0:21:0,0,432
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:20:0:20:0,0,410
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,773
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:25:2:25:0,3,790
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:33:0:33:0,0,837
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:23:31:23:0,31,765
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . GA G,<NON_REF> 0 . ClippingRankSum=-0.578;MLEAC=0,0;MLEAF=0.00,0.00 GT:DP:GQ:PL:SB 0/0:21:39:0,39,488,60,491,512:20,0,0,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:18:0:18:0,0,514
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,810
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:33:0:33:0,0,812
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:28:0:25:0,0,624
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . GA G,<NON_REF> 0.08 . ClippingRankSum=-0.189;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:17:20:20,0,311,62,320,382:14,0,3,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . GA G,<NON_REF> 6.76 . ClippingRankSum=-0.374;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:25:43:43,0,401,102,417,519:20,0,3,2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . GA G,<NON_REF> 0 . ClippingRankSum=-1.095;MLEAC=0,0;MLEAF=0.00,0.00 GT:DP:GQ:PL:SB 0/0:23:1:0,1,395,56,406,460:19,0,0,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1 4839315 . G <NON_REF> .
. END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:28:0:28:0,0,626 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 5.99 . ClippingRankSum=-0.584;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:18:42:42,0,293,84,305,388:13,1,3,1 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:22:0:22:0,0,558 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G GA,<NON_REF> 6.76 . ClippingRankSum=0.850;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:19:43:43,0,262,87,274,361:12,3,4,0 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 16.82 . ClippingRankSum=-0.784;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:21:54:54,0,352,102,367,470:16,0,4,1 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:26:0:25:0,0,419 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . 
G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:30:0:30:0,0,771 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:34:77:34:0,78,1136 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:26:0:20:0,0,397 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GAA G,GA,<NON_REF> 22.75 . ClippingRankSum=-2.181;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00 GT:DP:GQ:PL:SB 0/2:11:22:60,22,209,0,87,104,63,153,113,176:4,2,3,0 Created 2015-04-22 21:11:25 | Updated 2015-04-22 21:14:24 | Tags: haplotypecaller Hi GATK team, I am using the GATK 3.3 HaplotypeCaller. I found if i use different -L i will have different genotype on the same location. My para is very simple: -T HaplotypeCaller -R ucsc.hg19.fasta -mbq 20 --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -I test.bam -L chr17 -o output.vcf I check the same site. If i set -L chr17:36070200-36071000, there is a reported SNV. if i set -L chr17:36000000-36140000, there is no SNV report. if i set larger: -L chr17:30000000-40000000, it just show up again. if i use the whole chr17, it gone. This is very confusing me. it is a C->G variant at chr17. The snp is looks like: GT 0/1 AB 0.84 AD 257,49,0 DP 306 GQ 99 PGT 0|1 PID 36070590_GTCACCAT_G PL 2966,0,10470,3755,10800,14555 SB 14,243,49,0 While I checked the bam file with IGV. 
There are two weird things: (1) there is no read supporting the non-reference allele; (2) the depth is about 600, far deeper than HaplotypeCaller reported. Why would this happen?

Created 2015-01-28 14:33:40 | Updated | Tags: haplotypecaller ad combinegvcfs

Hi, I am using GATK v3.2.2 following the recommended practices (...HC -> CombineGVCFs -> GenotypeGVCFs ...) and while looking through suspicious variants I came across a few hets with AD=X,0. Tracing them back I found two inconsistencies (bugs?):

1) Reordering of genotypes when combining gvcfs while the AD values are kept intact, which leads to an erroneous AD for a heterozygous call. Also, I find it hard to understand why the 1bp insertion is emitted in the gvcf - there are no reads supporting it:

• single sample gvcf
1 26707944 . A AG,G,<NON_REF> 903.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9
• combined gvcf
1 26707944 . A G,AG,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:66,0,36,0:102:.:1057,0,1856,1039,2052,4115,941,1925,3051,2847:51,15,27,9 [other_samples]
• vcf
1 26707944 . A G 3169.63 . [INFO] [other_samples] 0/1:66,0:102:99:1057,0,1856 [other_samples]

2) Incorrect AD is taken while genotyping gvcf files:

• single sample gvcf
1 1247185 rs142783360 AG A,<NON_REF> 577.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/1:13,20,0:33:99:615,0,361,654,421,1075:7,6,17,3
• combined gvcf
1 1247185 rs142783360 AG A,<NON_REF> . . [INFO] [other_samples] ./.:13,20,0:33:.:615,0,361,654,421,1075:7,6,17,3 [other_samples]
• vcf
1 1247185 . AG A 569.95 . [INFO] [other_samples] 0/1:13,0:33:99:615,0,361 [other_samples]

I have found multiple such cases here, and no errors or warnings in the logs. I also checked calls that I had done before on these samples, but in a smaller batch. There the AD values were correct, but there were plenty of other hets with AD=X,0... I haven't looked closer into those. Are these bugs that have been fixed in 3.3?
Or maybe my brain is not working properly today and I am missing something obvious?

Best regards, Paweł

I am running HC 3.3-0 with the following options (e.g. GENOTYPE_GIVEN_ALLELES):

$java7 -Djava.io.tmpdir=tmp -Xmx3900m \
-jar $jar \
--analysis_type HaplotypeCaller \
--reference_sequence $ref \
--input_file $BAM \
--intervals $CHROM \
--dbsnp $dbSNP \
--out $out \
-stand_call_conf 0 \
-stand_emit_conf 0 \
-A Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \
-L $allelesVCF \
-L 20:60000-70000 \
--interval_set_rule INTERSECTION \
--genotyping_mode GENOTYPE_GIVEN_ALLELES \
--alleles $allelesVCF \
--emitRefConfidence NONE \
--output_mode EMIT_ALL_SITES

The file $allelesVCF contains these neighbouring SNPs:

20 60807 . C T 118.96 .
20 60808 . G A 46.95 .
20 61270 . A C 2870.18 .
20 61271 . T A 233.60 .

I am unable to call these neighbouring SNPs, despite reads being present in the file $BAM (which shouldn't matter anyway). I also tried adding --interval_merging OVERLAPPING_ONLY to the command line, but that didn't solve the problem. What am I doing wrong? I should probably add GATK breaker/misuser to my CV...

Thank you as always.

P.S. The CommandLineGATK documentation does not say what the default value for --interval_merging is.

P.P.S. Iterative testing is a bit slow, because HC always has to do this step:

HCMappingQualityFilter - Filtering out reads with MAPQ < 20

Created 2014-12-04 19:17:24 | Updated | Tags: baserecalibrator haplotypecaller vcf convergence bootstrap

I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecalibrator, it is recommended that when lacking a reliable set of sequence variants: "You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecalibrator in a loop should be fairly straightforward. What is a good strategy for comparing VCF files and assessing convergence?
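Since the question is a how-to, here is one simple convergence check, sketched in shell. This is not an official GATK recipe: the round1.vcf/round2.vcf names are placeholders for the call sets from two successive bootstrap rounds, and the tiny files generated below merely stand in for real caller output. The idea is to measure the Jaccard overlap of the CHROM/POS/REF/ALT keys of successive call sets and stop iterating when it stops changing.

```shell
# Build two tiny stand-in call sets (in practice these are the VCFs
# produced by successive HaplotypeCaller + BaseRecalibrator rounds).
{ echo '##fileformat=VCFv4.1'
  printf '1\t100\t.\tC\tT\t50\t.\t.\n1\t200\t.\tG\tA\t50\t.\t.\n2\t50\t.\tT\tG\t50\t.\t.\n'; } > round1.vcf
{ echo '##fileformat=VCFv4.1'
  printf '1\t100\t.\tC\tT\t50\t.\t.\n1\t200\t.\tG\tA\t50\t.\t.\n2\t300\t.\tA\tC\t50\t.\t.\n'; } > round2.vcf

# A variant's identity here is its CHROM/POS/REF/ALT tuple.
keys() { grep -v '^#' "$1" | cut -f1,2,4,5 | sort; }

shared=$(comm -12 <(keys round1.vcf) <(keys round2.vcf) | wc -l)
union=$(cat <(keys round1.vcf) <(keys round2.vcf) | sort -u | wc -l)

# Jaccard similarity of the two call sets; repeat the bootstrap until
# this stops increasing between rounds (e.g. changes by < 0.1%).
awk -v s="$shared" -v u="$union" 'BEGIN { printf "concordance = %.3f\n", s/u }'
```

A dedicated comparison tool (e.g. something from the bcftools or vcftools suites) would be more robust, since this key-based sketch ignores genotype and quality changes at sites that are present in both rounds.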

Created 2014-12-04 18:54:17 | Updated | Tags: haplotypecaller vcf rnaseq

Specifically, what does the 'start' component of this flag mean? Do the reads all have to start in exactly the same location? Alternatively, does the flag specify the total number of reads that must overlap a putative variant before that variant will be considered for calling?

Created 2014-12-02 16:32:32 | Updated | Tags: haplotypecaller bam gatk

Hi,

I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (we do not have information about the other parent of each grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other GATK tools that I could use to call complex pedigrees?

The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend?
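For what it's worth, the standard six-column PED format (family ID, individual ID, father ID, mother ID, sex with 1=male/2=female, phenotype; 0 = unknown/missing) can represent a pedigree like this, including a grandchild with one unknown parent. A sketch with entirely hypothetical sample names and sexes:

```text
FAM1  father1      0        0        1  0
FAM1  mother1      0        0        2  0
FAM1  child1       father1  mother1  2  0
FAM1  grandchild1  0        child1   1  0
```

In principle the other families go in the same file under distinct family IDs (FAM2, FAM3, ...); whether -ped handles multiple families in a single run is exactly the question for the GATK team.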

And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)?

Thanks, Alva

Created 2014-11-19 14:29:23 | Updated | Tags: haplotypecaller

Hi, I have been trying HaplotypeCaller to find SNPs and INDELs in viral read data (haploid) but am finding that it throws away around half of my reads, and I don't understand why. A small proportion (8%) are filtered out as duplicates and 0.05% fail on mapping quality, but I can't account for the majority of the lost reads. I appreciate that GATK wasn't built for viral sequences, but would you have an idea of what could be causing this? I use the following command after marking duplicates and realigning around indels:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Ref.fasta -I realigned_reads.bam --genotyping_mode DISCOVERY -ploidy 1 -bamout reassembled.bam -o rawvariants.vcf

I have also tried the same file with UnifiedGenotyper and I get the result I expect, i.e. most of my reads are retained and I have SNP calls that agree with a VCF constructed in a different program, so I assume the reads are lost as part of the local realignment?

Thanks Kirstyn

Created 2014-10-06 16:31:26 | Updated | Tags: haplotypecaller

Hi,

I am using HaplotypeCaller for variant calling with GATK version 3.2.2 on whole-genome Illumina reads. I used the following command as per best practices, with and without the multithreading option (-nct).

java -jar /GenomeAnalysisTK-3-2-2/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 10 -I infile.re.recal.bam -R /genome/human_g1k_v37.fasta -o outfile_raw.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3

Without the -nct option, variants found: 207828. With the -nct option, variants found: 207850.

-SK

Created 2014-08-25 20:03:14 | Updated | Tags: haplotypecaller

Can a GATK tool automatically name detected variants, i.e. assign them a unique identifier within user-specified parameters?
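Setting aside whether a GATK tool does this natively, the ID column of a VCF can be filled with sequential identifiers as a small post-processing step. A sketch in awk, where the in.vcf/named.vcf filenames and the var%05d naming scheme are made up for illustration:

```shell
# Make a tiny example VCF (stand-in for real caller output).
{ echo '##fileformat=VCFv4.1'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tC\tT\t50\t.\t.\n1\t200\t.\tG\tA\t50\t.\t.\n'; } > in.vcf

# Replace the '.' in column 3 with a sequential ID; header lines pass through.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/ { print; next }
     { $3 = sprintf("var%05d", ++n); print }' in.vcf > named.vcf
```

named.vcf then carries var00001, var00002, and so on; a user-specified prefix or zero-padding is just a change to the sprintf format string.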

Created 2014-04-28 15:11:17 | Updated | Tags: allelebalance allelebalancebysample haplotypecaller variantannotator

Version 3.1.1. Human normal samples.

I couldn't find AlleleBalance and AlleleBalanceBySample tags in my vcf outputs. The tags are not present even for a single variant. I tried HaplotypeCaller with -all, or directly with -A AlleleBalance or -A AlleleBalanceBySample. I also tried VariantAnnotator with -all, -A AlleleBalance, or -A AlleleBalanceBySample.

Any help will be appreciated.

Created 2013-11-21 16:25:35 | Updated | Tags: haplotypecaller

Hi,

I was running HaplotypeCaller for many samples, but some variants (validated as true positives by other techniques) within these samples are not called by HaplotypeCaller. I saw in the bam files that most of these variants are located on the outside of duplicated reads (around 200 reads). Most of my data consists of duplicated reads. At first I thought that the duplicated reads were being filtered out by the read filters that are automatically applied (like the DuplicateReadFilter), but when I checked, this was not the case. I was wondering why my true variants are not called by HaplotypeCaller and whether there is an option to resolve this problem?

Thank you!

Created 2013-10-16 12:50:25 | Updated | Tags: unifiedgenotyper haplotypecaller dp

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values compared to the reads visible in the IGV browser and reported by UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100)) print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Sam

Created 2013-05-30 02:04:50 | Updated | Tags: haplotypecaller

Hi, I have been running HaplotypeCaller on >700 monkey alignments and came across this error in some intervals:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IllegalStateException: Mismatch between the reference haplotype and the reference assembly graph path. for graph BaseGraph{kmerSize=10}
graph = GGAATAACTCCAGGCAACCAGTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTTCCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTTCCCACAGGCACAGCCC
haplotype = CCAGGCAACCAGTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTTCCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTTCCCACAGGCACAGCCC
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:665)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:661)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version nightly-2013-05-17-g2c8b717):
##### ERROR
##### ERROR Please check the documentation guide to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum