# Tagged with #haplotypecaller 6 documentation articles | 4 announcements | 89 forum discussions

Created 2015-04-26 02:02:50 | Updated 2015-04-26 02:08:43 | Tags: tutorials haplotypecaller bamout debugging

### 1. Overview

As you may know, HaplotypeCaller performs a local reassembly and realignment of the reads in the region surrounding potential variant sites (see the HaplotypeCaller method docs for more details on why and how this is done). So it often happens that during the calling process, the reads get moved to different mapping positions than what you can observe in the BAM file that you originally provided to HC as input.

These remappings usually explain most discordances between calls that are expected based on the original data and actual calls made by HaplotypeCaller, so it's very useful to be able to visualize what rearrangements the tool has made.

### 2. Generating the bamout for a single site or interval

To generate the bamout file for a specific site or interval, just run HaplotypeCaller on the region around the site or interval of interest using the -L argument to restrict the analysis to that region (adding about 500 bp on either side) and using the -bamout argument to specify the name of the bamout file that will be generated.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -L 20:10255630-10255840 -bamout bamout.bam


If you were using any additional parameters in your original variant calling (including -ERC and related arguments), make sure to include them in this command as well so that you can make an apples-to-apples comparison.

Then you open up both the original bam and the bamout file together in a genome browser such as IGV. On some test data from our favorite sample, NA12878, this is what you would see:

You can see that the bamout file, on top, contains data only for the ActiveRegion that was within the analysis interval specified by -L. The two blue reads represent the artificial haplotypes constructed by HaplotypeCaller (you may need to adjust your IGV settings to see the same thing on your machine).

You can see a whole group of reads neatly aligned, with an insertion in the middle. In comparison, the original data shown in the lower track has fewer reads with insertions, but has several reads with mismapped ends. This is a classic example of a site where realignment through reassembly has provided additional evidence for an indel, allowing HaplotypeCaller to call it confidently. In contrast, UnifiedGenotyper was not able to call this insertion confidently.

### 3. Generating the bamout for multiple intervals or the whole genome

Although we don't recommend doing this by default because it will cause slower performance and take up a lot of storage space, you can generate a bamout that contains many more intervals, or even covers the whole genome. To do so, just run the same command, but this time, pass your list of intervals to -L, or simply omit it if you want the entire genome to be included.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -bamout bamout.bam


This time, if you zoom out a bit in IGV, you will see multiple stacks of reads corresponding to the various ActiveRegions that were identified and processed.

### 4. Forcing an output in a region that is not covered in the bamout

In some cases HaplotypeCaller does not complete processing on an ActiveRegion that it has started. This is typically because there is either almost no evidence of variation once the remapping has been done, or on the contrary, the region is very messy and there is too much complexity. In both cases, the program is designed to give up in order to avoid wasting time. This is a good thing most of the time, but it does mean that sometimes you will have no output in the bamout for the site you are trying to troubleshoot.

The good news is that in most cases it is possible to force HaplotypeCaller to go through with the full processing so that it will produce bamout output for your site of interest. To do so, simply add the flags -forceActive and -disableOptimizations to your command line, in addition to the -bamout argument of course.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -L 20:10371667-10375021 -o hc_forced.vcf -bamout force_bamout.bam -forceActive -disableOptimizations


In this other region, you can see that the original mapping (middle track) was a bit messy with some possible evidence of variation, and in fact UnifiedGenotyper called a SNP in this region (top variant track). But HaplotypeCaller did not call the SNP, and did not output anything in our first bamout file (top read track). When you force an output in that region using the two new flags, you see in the forced bamout (bottom read track) that the remapped data is a lot cleaner and the evidence for variation is essentially gone.

It is also possible to force an ActiveRegion to be triggered at specific intervals; see the HaplotypeCaller tool docs for more details on how this is done.

Created 2014-07-23 17:36:49 | Updated 2014-09-09 21:25:47 | Tags: haplotypecaller likelihoods haplotype hc

This document describes the procedure used by HaplotypeCaller to evaluate the evidence for variant alleles based on candidate haplotypes determined in the previous step for a given ActiveRegion. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

### Overview

The previous step produced a list of candidate haplotypes for each ActiveRegion, as well as a list of candidate variant sites borne by the non-reference haplotypes. Now, we need to evaluate how much evidence there is in the data to support each haplotype. This is done by aligning each sequence read to each haplotype using the PairHMM algorithm, which produces per-read likelihoods for each haplotype. From that, we'll be able to derive how much evidence there is in the data to support each variant allele at the candidate sites, and that produces the actual numbers that will finally be used to assign a genotype to the sample.

### 1. Evaluating the evidence for each candidate haplotype

We originally obtained our list of haplotypes for the ActiveRegion by constructing an assembly graph and selecting the most likely paths in the graph by counting the number of supporting reads for each path. That was a fairly naive evaluation of the evidence, done over all reads in aggregate, and was only meant to serve as a preliminary filter to whittle down the number of possible combinations that we're going to look at in this next step.

Now we want to do a much more thorough evaluation of how much evidence we have for each haplotype. So we're going to take each individual read and align it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm (see Durbin et al., 1998). If you're not familiar with PairHMM, it's a lot like the BLAST algorithm, in that it's a pairwise alignment method that uses a Hidden Markov Model (HMM) and produces a likelihood score. In this use of the PairHMM, the output score expresses the likelihood of observing the read given the haplotype by taking into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores).

This produces a big table of likelihoods where the columns are haplotypes and the rows are individual sequence reads. (example figure TBD)

The table essentially represents how much supporting evidence there is for each haplotype (including the reference), itemized by read.

### 2. Evaluating the evidence for each candidate site and corresponding alleles

Having per-read likelihoods for entire haplotypes is great, but ultimately we want to know how much evidence there is for individual alleles at the candidate sites that we identified in the previous step. To find out, we take the per-read likelihoods of the haplotypes and marginalize them over alleles, which produces per-read likelihoods for each allele at a given site. In practice, this means that for each candidate site, we're going to decide how much support each read contributes for each allele, based on the per-read haplotype likelihoods that were produced by the PairHMM.

This may sound complicated, but the procedure is actually very simple -- there is no real calculation involved, just cherry-picking appropriate values from the table of per-read likelihoods of haplotypes into a new table that will contain per-read likelihoods of alleles. This is how it happens. For a given site, we list all the alleles observed in the data (including the reference allele). Then, for each read, we look at the haplotypes that support each allele; we select the haplotype that has the highest likelihood for that read, and we write that likelihood in the new table. And that's it! For a given allele, the total likelihood will be the product of all the per-read likelihoods. (example fig TBD)

At the end of this step, sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant, and a genotype will be assigned to the sample in the next (final) step.

Created 2014-04-10 21:57:10 | Updated 2015-02-20 18:10:09 | Tags: haplotypecaller reference-model gvcf

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFS), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

### How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

• Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
• Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the least confidence of these two models.

We use a symbolic allele pair, <NON_REF>, to indicate that the site is not homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

For details of the gVCF format, please see the document that explains what is a gVCF.

Created 2014-04-03 20:20:08 | Updated 2014-10-22 19:22:34 | Tags: haplotypecaller genotypegvcfs combinegvcfs gvcf joint-analysis

### Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

### Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

### General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

### The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

### Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta


Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)


which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

#### Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0


Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Created 2013-06-17 21:31:21 | Updated 2015-05-16 07:04:42 | Tags: haplotypecaller variant-discovery

#### Objective

Call variants on a single genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

#### Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

• TBD

#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Genotyping mode (–genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

• Emission confidence threshold (–stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (–stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

### 2. Call variants in your sequence data

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-L 20 \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o raw_variants.vcf


Note: This is an example command. Please look up what the arguments do and see if they fit your analysis before copying this. To see how the -L argument works, you can refer here: http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals#latest

#### Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Created 2012-07-30 17:37:12 | Updated 2015-04-26 02:30:27 | Tags: unifiedgenotyper haplotypecaller snp bamout debugging

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

### 1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

### 2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.

### 3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

### 4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.

### 5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

### 6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site, you are probably seeing false positives.

Created 2015-05-15 04:52:05 | Updated | Tags: haplotypecaller release-notes genotypegvcfs gatk3

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights to be published soon.

Note that the release is in progress at time of posting -- it may take a couple of hours before the new GATK jar file is updated on the downloads page.

### New tool

• ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

### HaplotypeCaller & GenotypeGVCFs

• Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
• Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
• Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
• Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
• Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
• Initialize annotations so that --disableDithering actually works.
• Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
• Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
• Added the RGQ (Reference GenotypeQuality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic ohm-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
• Removed GenotypeSummaries from default annotations.
• Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
• Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
• Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
• Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
• Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
• Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
• Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
• Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).

### CombineGVCFs

• Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
• Fixed a bug that caused the tool to stop processing after the first contig.
• Fixed a bug where the wrong REF allele was output to the combined gVCF.

### VariantRecalibrator

• Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
• VQSR VCF header command line now contains annotations and tranche levels.

### SelectVariants

• Added -trim argument to trim (simplify) alleles to a minimal representation.
• Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
• Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
• Added --keepOriginalDP argument.

### VariantAnnotator

• Improvements to the allele trimming functionalities.
• Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.

### CalculateGenotypePosteriors

• Fixed user-reported bug featuring "trio" family with two children, one parent.
• Added error handling for genotypes that are called but have no PLs.

### Various tools

• BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
• DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
• SplitSamFile: Fixed a bug that caused a NullPointerException.
• SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
• IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
• CombineVCFs: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
• VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
• GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
• ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
• CatVariants: Now allows different input / output file types, and spaces in directory names.
• VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.

• Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
• Handle X and = CIGAR operators appropriately
• Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).

### Annotations

• Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
• Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
• Changed GC Content value type from Integer to Float.
• Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
• Made annotators emit a warning if they can't be applied.

### GATK Engine & common features

• Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
• Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
• Handle X and = CIGAR operators appropriately.
• Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
• Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now going to stderr.
• Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
• Fixed BCF writing when FORMAT annotations contain arrays.

### Queue

• Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
• Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
• Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.

### Documentation

• Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

### For developers

• Upgraded SLF4J to allow new convenient logging syntaxes.
• Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
• Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
• Collected VCF IDs and header lines into one place (GATKVCFConstants).

Created 2014-10-23 18:53:52 | Updated 2015-05-12 17:24:14 | Tags: Troll haplotypecaller ploidy release-notes genotypegvcfs gatk3 genotyperefinement

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

## Haplotype Caller

• Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
• Physical phasing information is output by default in new sample-level PID and PGT tags.
• Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Fixed IndexOutOfBounds error associated with tail merging.

## Variant Recalibrator

• New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.

## GenotypeGVCFs

• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
• Changed the default GVCF GQ Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution.
• Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.

## RNAseq analysis

• Bug fixes for working with unmapped reads.

## CalculateGenotypePosteriors

• New annotation for low- and high-confidence possible de novos (only annotates biallelics).
• FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
• Restricted population priors based on discovered allele count to be valid for 10 or more samples.

## DepthOfCoverage

• Fixed rare bug triggered by hash collision between sample names.

## SelectVariants

• Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.

• Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
• The performance penalty associated with filtering by read group has been essentially eliminated.

## Annotations

• StrandOddsRatio is now a standard annotation that is output by default.
• We used to output zero for FS if there was no data available at a site, now we omit FS.
• Extensive rewrite of the annotation documentation.

## Queue

• Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
• Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton)

## Documentation

• We now generate PHP files instead of HTML.
• We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.

## Miscellaneous

• Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
• htsjdk updated to version 1.120.1620

Created 2014-08-27 18:39:39 | Updated 2014-12-16 03:19:38 | Tags: haplotypecaller ploidy haploid polyploid beta

Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.

#### UPDATE:

We have added omniploidy support to the GVCF handling tools, with the following limitations:

• When running, you need to indicate the sample ploidy that was used to generate the GVCFs with -ploidy. As usual 2 is the default ploidy.

• The system does not support mixed ploidy across samples nor positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have some genotype whose ploidy does not match the -ploidy argument.

#### LATEST UPDATE:

As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.

Created 2013-03-14 12:29:25 | Updated 2014-12-01 17:34:07 | Tags: unifiedgenotyper official haplotypecaller printreads bug

As reported here:

If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.

Thank you for your patience while we work to fix this issue.

### Latest update: we found that the three tools (PrintRead, HaplotypeCaller and UnifiedGenotyper) had different issues that only produced the same symptom (the ArrayIndexOutOfBoundsException error). The underlying issues have all been fixed in version 2.5. If you encounter an ArrayIndexOutOfBoundsException in later versions, it is certainly caused by a different problem, so please open a new thread to report the issue.

Created 2015-06-29 17:57:31 | Updated 2015-06-29 17:58:38 | Tags: depthperallelebysample haplotypecaller asereadcounter

Hi,

I was running ASEReadCounter and came across a "stack trace" error after completion of 14.1%.

I am running the command like this:

java -jar ../GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -R /results2/indexes/genomes/mm9/fasta/mm9.fa -T ASEReadCounter -o filterReadCounts.chr13.58264978.txt -I dedupFolder/D4.indel.recal.dedup.reordered.bam -sites variants/gatk/D4.allBases.chr13.vcf -minDepth 20 --minMappingQuality 30 --minBaseQuality 20

The output is as follows:

INFO 13:27:40,125 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:27:40,130 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 INFO 13:27:40,130 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:27:40,131 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:27:40,136 HelpFormatter - Program Args: -R /results2/indexes/genomes/mm9/fasta/mm9.fa -T ASEReadCounter -o filterReadCounts.chr13.58264978.txt -I dedupFolder/D4.indel.recal.dedup.reordered.bam -sites variants/gatk/D4.allBases.chr13.vcf -minDepth 20 --minMappingQuality 30 --minBaseQuality 20 -U ALLOW_N_CIGAR_READS -rf BadCigar INFO 13:27:40,143 HelpFormatter - Executing as szimmerm@al2.aecom.yu.edu on Linux 2.6.18-371.12.1.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_09-b05. INFO 13:27:40,144 HelpFormatter - Date/Time: 2015/06/29 13:27:40 INFO 13:27:40,145 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:27:40,145 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:27:40,937 GenomeAnalysisEngine - Strictness is SILENT INFO 13:27:41,075 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 10000 INFO 13:27:41,091 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 13:27:41,186 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.09 INFO 13:27:41,474 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 13:27:42,051 GenomeAnalysisEngine - Done preparing for traversal INFO 13:27:42,052 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 13:27:42,053 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 13:27:42,054 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 13:28:12,062 ProgressMeter - chr10:61439995 7224296.0 30.0 s 4.0 s 2.3% 21.6 m 21.1 m INFO 13:28:42,066 ProgressMeter - chr10:107123365 1.4729978E7 60.0 s 4.0 s 4.0% 24.8 m 23.8 m INFO 13:29:12,078 ProgressMeter - chr11:29429929 2.1979505E7 90.0 s 4.0 s 6.0% 25.0 m 23.5 m INFO 13:29:42,085 ProgressMeter - chr11:67787354 2.8914135E7 120.0 s 4.0 s 7.4% 26.8 m 24.8 m INFO 13:30:12,090 ProgressMeter - chr11:88702416 3.4725261E7 2.5 m 4.0 s 8.2% 30.3 m 27.8 m INFO 13:30:42,095 ProgressMeter - chr11:110188026 4.0515208E7 3.0 m 4.0 s 9.0% 33.2 m 30.2 m INFO 13:31:12,100 ProgressMeter - chr12:44912778 4.8755304E7 3.5 m 4.0 s 11.2% 31.3 m 27.8 m INFO 13:31:42,107 ProgressMeter - chr12:104866935 5.6805392E7 4.0 m 4.0 s 13.4% 29.8 m 25.8 m INFO 13:32:12,113 ProgressMeter - chr12:121257414 5.9799881E7 4.5 m 4.0 s 14.1% 32.0 m 27.5 m

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 0 at org.broadinstitute.gatk.tools.walkers.rnaseq.ASEReadCounter.map(ASEReadCounter.java:216) at org.broadinstitute.gatk.tools.walkers.rnaseq.ASEReadCounter.map(ASEReadCounter.java:102) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

##### ERROR ------------------------------------------------------------------------------------------

Is there any reason for this error, or is it just a bug? My overall goal is to get the filtered count of reads that support a given allele for an individual sample. DepthPerAlleleBySample gives unfiltered counts instead of filtered ones. If there is a way to get the filtered counts besides using ASEReadCounter that would also be great. Thanks!

Created 2015-06-25 18:23:01 | Updated | Tags: haplotypecaller

Hello, While running HaplotypeCaller in genotyping mode, I am getting following error:

##### ERROR stack trace

java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:849) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:252) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:883) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:226) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

##### ERROR ------------------------------------------------------------------------------------------

The command line given by me was: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ./chr17.fa -I ./calpop2/1_recal.bam -L 17 --genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles 17dbsnp.vcf -stand_emit_conf 10 -stand_call_conf 30 -o ./calpop2/1_pop.vcf.

Created 2015-06-25 14:25:13 | Updated | Tags: haplotypecaller snp-calling

Using the following line I'm calling SNPs.
java -Xmx4g -jar ${GATKPath} -R$ref -T HaplotypeCaller -ploidy 1 -I downsample.bam -o hapreadyAll.vcf -bamout bamout.bam -dontUseSoftClippedBases

Shown in the screen shot it seems for this regions a bamout file along with a SNP should be output. Is here a HC command to force this region and others like it to output a SNP call?

Created 2015-06-24 18:55:35 | Updated | Tags: haplotypecaller multi-sample rnaseq pooled-calls

Hi everybody! I've recently started working with GATK and after reading documents, tutorial and forum discussions, I set a pipeline for my experiment. I'm dealing with multi-sample RNAseq for which GATK tools are less improved and verified than DNAseq, thus I 'd like to have your suggestions. Briefly, this is the experiment: 2 phenotypes of Sparus aurata, 8 libary per phenotype, each library consists on a pool (not barcoded) of 3 animals. Thus I have a total of** 16 samples**. My goal is to find the total number of variant sites and compare the allele frequencies between the two phenotypes. **I lack genome and SNPs database. **

Step by step: 1) I used STAR (not 2-pass) in order to map reads against my de novo assembly. STAR --runThreadN 16 --genomeDir ./GenomeIndex --readFilesIn XXX.fastq --alignIntronMax 19 --outSAMtype BAM SortedByCoordinate --outSAMmapqUnique 60 --outFilterMultimapNmax 5 --outFilterMismatchNmax 4 2) I used the picard-tools to Mark duplicates 3) I used the picard-tools to AddOrReplaceReadGroups 4) I used the picard-tools to BuildBamIndex 5) I called the haplotype for the 16 samples with the following command: GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample.bam -dontUseSoftClippedBases -ploidy 6 -ERC GVCF -o output.g.vcf 6) I used the GenotypeGVCFs to merge the samples from the same population in an unique vcf file as follow GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta -stand_call_conf 20 -stand_emit_conf 20 -ploidy 6 --variant sample1.g.vcf --variant sample2.g.vcf --variant sample3.g.vcf (8 samples) -o output_HC.vcf

Finally I'm going to filter the results with the Variant Filtration: GenomeAnalysisTK.jar -T VariantFiltration -R reference.fasta -V output.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o utputFiltered.vcf

What do you think?

Now I'd like to compare the two populations, but how??? manually in excel files? vcf tools seem not to handle ploidy higher than 2. Does anyone deal with these issues and can kindly give some tips?

Best Marianna

Created 2015-06-23 12:34:18 | Updated | Tags: haplotypecaller pcrmodel

What is the default value for pcrModel in HaplotypeCaller ?

NONE no specialized PCR error model will be applied; if base insertion/deletion qualities are present they will be used HOSTILE a most aggressive model will be applied that sacrifices true positives in order to remove more false positives AGGRESSIVE a more aggressive model will be applied that sacrifices true positives in order to remove more false positives CONSERVATIVE a less aggressive mo

Created 2015-06-19 19:20:58 | Updated | Tags: haplotypecaller rnaseq genotypegvcfs gvcf

Hello,

I was wondering if there is a way to output all annotations for all sites when running HaplotypeCaller with BP_RESOLUTION. Currently it outputs all annotations for only called variants. Thanks in advance.

Created 2015-06-05 07:41:51 | Updated | Tags: haplotypecaller gvcf

Hi,

I first use HaplotypeCaller to call variants with --emitRefConfidence GVCF, and then the tool GenotypeGVCFs on only the sample. In other words, I did not apply GenotypeGVCFs on cohort but on the sample itself. There was a record called in the first step but was not output in the second step. The particular record is showed below:

The command lines used are below:

java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /Software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /Data/bundle_2.8_hg19/ucsc.hg19.fasta -I /input/chr1.bam -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --dbsnp /Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf -A StrandOddsRatio -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -A RMSMappingQuality -o /output/chr1.gvcf.vcf -L chr1

java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /Software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /Data/bundle_2.8_hg19/ucsc.hg19.fasta --variant /output/chr1.gvcf.vcf -A StrandOddsRatio -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -A RMSMappingQuality -o /output/chr1.gatkHC.vcf --dbsnp /Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf -stand_call_conf 30.0 -stand_emit_conf 10.0 -L /BED/chr1.bed

I thought that this variant should be emitted into the final vcf, since the qual is 23.75 which is greater than 10.0 (set by -stand_emit_conf). Do I misunderstand something here?

Thank you!

Created 2015-06-03 10:26:13 | Updated | Tags: haplotypecaller indels

Hi,

I am wondering whether GATK HC uses read pairs information (such as reads that are mapped with unexpected separation distances or orientation) to detect large structural variations ?

Thank you, Fabrice

Created 2015-06-02 11:52:06 | Updated | Tags: bqsr haplotypecaller

I was wondering if GATK will behave correctly if I use the -BQSR option to HaplotypeCaller instead of PrintReads, thus saving a processing step in my pipeline. AFAIK, HaplotypeCaller is the only part of our pipeline that actually uses the base score values.

Created 2015-06-02 09:25:02 | Updated 2015-06-02 09:26:43 | Tags: haplotypecaller missing-genotype

Dear GATK community,

I am using HaplotypeCaller for variant discovery and I have found some strange results in my VCF. It appears that this walker is making no calls in positions with reads of support for them in the original BAM.

For instance this position:

chr7 302528 rs28436118 A G 31807.9 PASS . GT:AD:DP:GQ:PL 1/1:0,256:256:99:8228,767,0 ./.:98,0:98:.:. ./.:81,0:81:.:. 1/1:0,134:134:99:4287,401,0

was not called for 2 of the 4 samples available. However, in both samples where the genotype is missing there are many reads supporting an homozygous reference call (0/0).

Do you have any idea of why this is happening? Cheers JMFA> ********

Created 2015-05-29 16:00:04 | Updated | Tags: haplotypecaller

Hello, I am trying to run HaplotypeCaller on my processed bam file but it keeps running out of memory. I have tried increasing the heap size to 90G but it still crash. This might have to do with the type of analysis I am doing... The sample are pool of pigs (6 individual so a ploidy of 12 for this particular sample) that have been sequenced on targeted regions, I use the bed file that has been given with the kit to narrow done the calling to the targeted regions. I have also reduce the number of alternative allele from 6 to 3. But it still crash after a while. Is there any other parameters I should try to modify to reduce the memory usage? I have attached the log file if you want to have a look at all the parameters. Cheers,

Julien

Created 2015-05-23 21:49:38 | Updated | Tags: bqsr haplotypecaller best-practices dnaseq

Hi, I have ~150 WGS all sequenced in batches on Illumina HiSeq over a number of years, on a number of machines/flowcells.

I want to perform BQSR on my BAM files before calling variants. However for my organism I do not have access to a resource like dbSNP for true variants. So I am following the protocol by doing a first round of variant calling, subsetting to the SNPs with highest confidence and using these as the known variants for BQSR.

My question is, should I carry this out on samples individually, i.e. one BAM per sample, on which I use HaplotypeCaller for initial variant calling, then subset to best SNPs, use these for BaseRecalibrator and apply the calibration to the original sample before carrying on with single sample variant finding and then joint genotyping as per best practices....

OR

As I have multiple samples from the same flowcells and lanes, would I gain more information by combining those samples into a multisample BAM first, before carrying out BQSR? I'm a little unsure of how the covariates used by BQSR and actually calculated and whether I can increase the accuracy of my recalibration in this way? Each sample has between 500M and 5 billion nucleotides sequenced.

Many thanks.

Created 2015-05-23 05:08:22 | Updated 2015-05-23 05:12:25 | Tags: haplotypecaller bug gatk

Hi All,

I cannot perform joint genotype calling using gVCFs using the GenotypeGVCF command on my dataset containing 352 mtDNA samples.

Does any one has run into this problem before, or has suggestions on how to get this working.

I have attached the standard output for a hanging run.

I can generate VCF files individually no problems.

INFO 17:06:45,732 GenomeAnalysisEngine - Strictness is SILENT INFO 17:06:45,867 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 17:06:47,721 MicroScheduler - Running the GATK in parallel mode with 4 total threads, 1 CPU thread(s) for each of 4 data thread(s), of 20 processors available on this machine INFO 17:06:47,791 GenomeAnalysisEngine - Preparing for traversal INFO 17:06:47,793 GenomeAnalysisEngine - Done preparing for traversal INFO 17:06:47,794 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:06:47,794 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 17:06:47,795 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 17:06:48,436 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 17:07:17,800 ProgressMeter - gi|251831106|ref|NC_012920.1|:1 0.0 30.0 s 49.6 w 0.0% 15250.3 w 15250.3 w Waiting for data... (interrupt to abort) GATK VERSION: The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 java version "1.7.0_51"

Thanks for your time. James Boocock

Created 2015-05-22 18:55:21 | Updated | Tags: haplotypecaller

The HaplotypeCaller typically does a great job calling SNPs, but there are a few occasions where reads are realigned causing a SNP to be missed. Image 1 shows three samples where the UnifiedGenotyper called correctly an AC=2 SNP and the HaplotypeCaller called it an AC=1. Also, due to a misalignment, surrounding the real SNP, additional AC=1 SNPs show. The bamout file of the rearrangement is shown in image 2. Here the reads forced into alignment can be seen causing the misidentified AC=1 calls. Is this a HaplotypeCaller bug or is there an additional option that can be applied to prevent this misalignment?

Thank you,

Running v3.4-0-g7e26428

Created 2015-05-22 08:00:24 | Updated 2015-05-22 08:01:11 | Tags: haplotypecaller dp

Hi,

currently I'm running GATK version 3.3.0 and run in the problem, that quite often I don't get a value for DP after the HaplotypeCaller. There actually is nothing (just a dot) in the VCF file.

chr11 57569573 . A ACC 1640.8 VQSRTrancheINDEL99.90to100.00 . GT:AD:DP:GQ:PGT:PID:PL 0/1:57,16:.:99:0|1:57569569_CAGG_C:480,0,7262 chr11 57569575 . A AT 1686.05 VQSRTrancheINDEL99.90to100.00 . GT:AD:DP:GQ:PGT:PID:PL 0/1:60,16:.:99:0|1:57569569_CAGG_C:492,0,7092

What woks is to annotate afterwards using the VariantAnnotator. But in the documentation it states, that the VariantAnnotator defines the depth by the total amount of reads at that specific position. In contrast the DP after HaplotypeCaller should be the number of informative reads for that position. I can find that difference when I check for the position in GEMINI prior and after Annotation with Variant annotator.

Patient 1 (after reannotation using VariantAnnotator) Position depth variant 57563990 75 C/C 57569568 81 CAGG/C 57569572 78 A/ACC 57569574 81 A/AT

Patient 2 (after reannotation using VariantAnnotator) Postition depth variant 57563990 73 C/T 57569568 104 CAGG/CAGG 57569572 106 A/A 57569574 106 A/A

After HaplotypeCaller Postion patient1 patient2 ...(other patients) 57563990 48 -1 -1 -1 -1 -1 C/C C/T C/T C/T C/T C/T 57569568 -1 66 52 -1 56 60 CAGG/C CAGG/CAGG CAGG/CAGG CAGG/C CAGG/CAGG CAGG/CAGG 57569572 -1 101 88 -1 101 77 A/ACC A/A A/A A/ACC A/A A/A 57569574 -1 97 87 -1 94 78 A/AT A/A A/A A/AT A/A A/A 57576040 -1 -1 -1 -1 -1 8 ./. ./. ./. ./. ./. C/C 57578937 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. TTG/T 57578941 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. C/CTG 57578942 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. C/G

E.g. patient 2 has in the position 57569568 66 informative reads, but 104 reads in total. To check how valid a Variant is, I think it is way more interesting to know the number of informative reads.

In this specific case we checked for those mutation (57569572 and 57569574) via sanger and didn't find anything. If we would know that only 2 or 3 reads would be informative we might have gone to another variant instead.

Am I doing something wrong, or is there a bug in the software?

One example how I call the HaplotypeCaller.

java -Xmx10g -jar /data/ngs/bin/GATK/3.3-0/GenomeAnalysisTK.jar -l INFO -et NO_ET \ -T HaplotypeCaller \ -R /data/ngs/programs/bundle_2.8/ucsc.hg19.fasta \ -D /data/ngs/programs/bundle_2.8/dbsnp_138.hg19.vcf \ -L /data/ngs/exome_targets/SureSelect_V4_S03723314_Regions.bed \ -ERC GVCF \ -variant_index_type LINEAR \ -variant_index_parameter 128000 \ -stand_call_conf 30 \ -stand_emit_conf 30 \ -A Coverage -A AlleleBalance \ -A QualByDepth -A HaplotypeScore \ -nct 4 \ -I out/06-BQSR/21_383_1.bam \ -o out/08-variant-calling/21_383_1/21_383_1-raw.vcf \ -log out/08-variant-calling/21_383_1/21_383_1-raw_calling.log

Anselm

Created 2015-05-21 06:27:17 | Updated | Tags: haplotypecaller genotypeconcordance

Dear GATK Team,

I performed a variant caller comparison, and the genotype concordance analysis of HaplotypeCaller's results a little strange to me, and I can't understand it at all.

ALLELES_MATCH: 0 EVAL_SUPERSET_TRUTH: 2438 EVAL_SUBSET_TRUTH: 0 ALLELES_DO_NOT_MATCH: 636 EVAL_ONLY: 20 TRUTH_ONLY: 1014546

The last and first value could be true? Or anyone can tell what could cause this? I got very different (and reasonable) results with other callers.

Many thanks!

Created 2015-05-20 13:06:46 | Updated | Tags: haplotypecaller java genotypegvcfs

Dear GATK,

I used the HaplotypeCaller with "-dcov 500 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000" to produce 60 gvcf files, that worked fine. However, GenotypeGVCFs gets stuck on a position and runs out of memory after about 24hours, even when I allocate 240Gb. Testing a short region of 60kb does not help. Here was my command line: software/jre1.7.0_25/bin/java -Xmx240g -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Reference.fasta -L chrom14:2240000-2300000 --variant 60samples_gvcf.list -o output.vcf

If I split my list of 60 gvcf files into two lists of 30 samples each, GenotypeGVCFs works fine for both batches within 15 minutes (~10Gb of memory).
I tested with 47 samples, it took 8 hours (31gb of memory) for a 60kb region. Once I use more than ~55 samples, it takes forever and crashes.

Any help will be much appreciated! Thanks,

Antoine

Created 2015-05-20 05:27:42 | Updated | Tags: haplotypecaller

Hello, I want to ask about the step in GATK for variant calling with HaplotypeCaller. There are 2 steps for that in the guide, which result in gvcf or vcf. I still don't understand what is the difference even though it is written that the gvcf is for cohort analysis workflow. How do I choose which analysis workflow I should use, the vcf of the gvcf? My target is to identify SNP in every sample I have and my analysis don't require to compare one sample to each other. Just to get the variation per sample.

Created 2015-05-18 16:18:23 | Updated 2015-05-18 16:18:53 | Tags: haplotypecaller gatk3-4 warnings

GATK3.4 is throwing lots of warnings. Should I be concerned about any of them?

I'm not sure, which --variant_index_type to set, so I will make sure to use the correct file extension going forward:

WARN  17:02:08,480 GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values  for --variant_index_type and --variant_index_parameter
WARN  17:02:08,484 GATKVCFUtils - Creating Tabix index for out_HaplotypeCaller/20/APP5339451.vcf.gz, ignoring user-specified index type and parameter


I also got this warning with earlier versions:

WARN  17:02:10,495 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation


Is this something to be concerned about?

WARN  17:02:18,771 PairHMMLikelihoodCalculationEngine$1 - Failed to load native library for VectorLoglessPairHMM - using Java implementation of LOGLESS_CACHING  I didn't ask for these annotations: WARN 17:02:19,078 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper WARN 17:02:19,079 InbreedingCoeff - Annotation will not be calculated, must provide a valid PED file (-ped) from the command line.  My command line looked like this: -A Coverage -A FisherStrand -A StrandOddsRatio -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest  Created 2015-05-14 15:47:19 | Updated | Tags: haplotypecaller Hi there, I've been looking at some WES data where I am comparing pre and post GT refinement workflow data and came across where it seems like to me that HC does not take into account overlapping reads...I thought I read a year a two ago that Unified Genotyper does take the consensus when reads overlap (as opposed to counting each base in the overlap individually), but I'm guessing that HC does not? I had a variant that was called a denovo prior to refinement, but not after refinement (the reason why I am looking at the variant). When I looked at the data in IGV I saw that the site had 2 alternate alleles and 4 reference alleles so I could see why HC made the call, but when I switched to "viewed as pairs" I can see that it really is only 3 molecules because both ends overlapped and thus if you took the consensus there would only be one molecule with the variant base. So if the basic logic is that you need two reads supporting the alternate haplotype for it to be considered, then if you took into account overlapping pairs, then this would not have been considered. I also went back to the gvcf file and saw that it was counting all 6 reads (instead of 3). 1 3426401 rs368993654 A G 47.50 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=1.364e-03;AN=2;AN_Orig=1466;BaseQRankSum=0.00;CSQ=G|ENSG00000162591|ENST00000356575|Transcript|intron_variant||||||TMP_ESP_1_3426401|11/36|MEGF6|||||||YES||||protein_coding|ENSP00000348982||CCDS41237.1|||||,G|ENSG00000162591|ENST00000294599|Transcript|intron_variant||||||TMP_ESP_1_3426401||8/29|MEGF6|||||||||||protein_coding|ENSP00000294599|||||||,G|ENSG0000012591|ENST00000485002|Transcript|intron_variant&NMD_transcript_variant||||||TMP_ESP_1_3426401||11/36|MEGF6|||||||||||nonsense_mediated_decay|ENSP00000419033|||||||;ClippingRankSum=0.937;DB;DP=6;FS=0.000;GC=72.28;GQ_MEAN=25.64;GQ_SDDEV=14.21;InbreedingCoeff=-0.0209;MQ=60.00;MQ0=0;MQRankSum=0.00;NCC=34;NDA=1;QD=3.65;ReadPosRankSum=0.00;SOR=1.329;Samples=AnotherUnrelatedGuy,ThisGuy;VQSLOD=0.905;VariantType=SNP;culprit=QD;hiConfDeNovo=ThisGuy;set=variant GT:AB:AD:DP:GQ:PL 0/1:0.670:4,2:6:31:31,0,81 I am attaching the two plots as well (before and after viewing as pairs). The GVCF files were created with 3.3, everything downstream of that was done with a Jan 15 nightly (CombineGVCF, GenotypeGVCF and CGP). I realize that I am splitting hairs here, but just wanted to send this for my edification/confirm what I am thinking if nothing else...or just as a forum for my incoherent babbling. Thanks Kurt Created 2015-05-11 13:57:37 | Updated | Tags: unifiedgenotyper depthofcoverage haplotypecaller mendelianviolations Hi, Two questions, which relate to Unified Genotyper or Haplocaller: 1. Detecting and determining the exact genotype of an individual seems to be inaccurate at read depths less than 10X. This is because there aren't enough reads to accurately say whether it could be heterozygous. Is there an option in either Unified Genotyper or Haplocaller, that excludes sites in individuals that have below 10X? Alternatively how can these sites be removed from the vcf - or set to missing ? 2. I'm sure I've read somewhere that pedigree information can be supplied to a variant caller (such as Unified Genotyper or Haplocaller), and used to improve the calling accuracy/speed/efficiency. I am calling variants on one parent and multiple offspring. Apologies if these questions are answered in the user manual. I regularly look it but have not found much to answer my questions. Cheers, Blue Created 2015-05-07 09:22:47 | Updated 2015-05-07 10:09:41 | Tags: haplotypecaller format genotypegvcfs info-field I have a potential bug running GATK GenotypeGVCFs. It complains that there is a DP in the INFO field, but in my haplotypecaller-generated -mg.g.vcf.gz's I do not have a DP in the info, I do have DP in the FORMAT field though, but that's present in the headers as shown below the error output. Any idea what could be the problem? INFO 18:30:12,694 HelpFormatter - -------------------------------------------------------------------------------- INFO 18:30:12,698 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-geee94ec, Compiled 2015/03/09 14:27:22 INFO 18:30:12,699 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 18:30:12,699 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 18:30:12,706 HelpFormatter - Program Args: -l INFO -T GenotypeGVCFs -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -o /dev/stdout -ploidy 2 --num_threads 32 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SeqCap/ex100/110624_MM10_exome_L2R_D02_EZ_HX1-ex100.bed --max_alternate_alleles 20 -V:3428_10_14_SRO_185_TGGCTTCA-mg,VCF 3428_10_14_SRO_185_TGGCTTCA-mg.g.vcf.gz -V:3428_11_14_SRO_186_TGGTGGTA-mg,VCF 3428_11_14_SRO_186_TGGTGGTA-mg.g.vcf.gz -V:3428_12_13_SRO_422_TTCACGCA-mg,VCF 3428_12_13_SRO_422_TTCACGCA-mg.g.vcf.gz -V:3428_13_13_SRO_492_AACTCACC-mg,VCF 3428_13_13_SRO_492_AACTCACC-mg.g.vcf.gz -V:3428_14_13_SRO_493_AAGAGATC-mg,VCF 3428_14_13_SRO_493_AAGAGATC-mg.g.vcf.gz -V:3428_15_14_SRO_209_AAGGACAC-mg,VCF 3428_15_14_SRO_209_AAGGACAC-mg.g.vcf.gz -V:3428_16_14_SRO_218_AATCCGTC-mg,VCF 3428_16_14_SRO_218_AATCCGTC-mg.g.vcf.gz -V:3428_17_14_SRO_201_AATGTTGC-mg,VCF 3428_17_14_SRO_201_AATGTTGC-mg.g.vcf.gz -V:3428_18_13_SRO_416_ACACGACC-mg,VCF 3428_18_13_SRO_416_ACACGACC-mg.g.vcf.gz -V:3428_19_14_SRO_66_ACAGATTC-mg,VCF 3428_19_14_SRO_66_ACAGATTC-mg.g.vcf.gz -V:3428_1_13_SRO_388_GTCGTAGA-mg,VCF 3428_1_13_SRO_388_GTCGTAGA-mg.g.vcf.gz -V:3428_20_14_SRO_68_AGATGTAC-mg,VCF 3428_20_14_SRO_68_AGATGTAC-mg.g.vcf.gz -V:3428_21_14_SRO_210_AGCACCTC-mg,VCF 3428_21_14_SRO_210_AGCACCTC-mg.g.vcf.gz -V:3428_22_14_SRO_256_AGCCATGC-mg,VCF 3428_22_14_SRO_256_AGCCATGC-mg.g.vcf.gz -V:3428_23_14_SRO_270_AGGCTAAC-mg,VCF 3428_23_14_SRO_270_AGGCTAAC-mg.g.vcf.gz -V:3428_24_13_SRO_452_ATAGCGAC-mg,VCF 3428_24_13_SRO_452_ATAGCGAC-mg.g.vcf.gz -V:3428_2_13_SRO_399_GTCTGTCA-mg,VCF 3428_2_13_SRO_399_GTCTGTCA-mg.g.vcf.gz -V:3428_3_13_SRO_461_GTGTTCTA-mg,VCF 3428_3_13_SRO_461_GTGTTCTA-mg.g.vcf.gz -V:3428_4_13_SRO_462_TAGGATGA-mg,VCF 3428_4_13_SRO_462_TAGGATGA-mg.g.vcf.gz -V:3428_5_13_SRO_465_TATCAGCA-mg,VCF 3428_5_13_SRO_465_TATCAGCA-mg.g.vcf.gz -V:3428_6_13_SRO_402_TCCGTCTA-mg,VCF 3428_6_13_SRO_402_TCCGTCTA-mg.g.vcf.gz -V:3428_7_13_SRO_474_TCTTCACA-mg,VCF 3428_7_13_SRO_474_TCTTCACA-mg.g.vcf.gz -V:3428_8_13_SRO_531_TGAAGAGA-mg,VCF 3428_8_13_SRO_531_TGAAGAGA-mg.g.vcf.gz -V:3428_9_14_SRO_166_TGGAACAA-mg,VCF 3428_9_14_SRO_166_TGGAACAA-mg.g.vcf.gz INFO 18:30:12,714 HelpFormatter - Executing as roel@utonium.nki.nl on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13. INFO 18:30:12,714 HelpFormatter - Date/Time: 2015/05/06 18:30:12 INFO 18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- INFO 18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- INFO 18:30:15,963 GenomeAnalysisEngine - Strictness is SILENT INFO 18:30:16,109 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 18:30:29,705 IntervalUtils - Processing 101539431 bp from intervals WARN 18:30:29,726 IndexDictionaryUtils - Track 3428_10_14_SRO_185_TGGCTTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,727 IndexDictionaryUtils - Track 3428_11_14_SRO_186_TGGTGGTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,727 IndexDictionaryUtils - Track 3428_12_13_SRO_422_TTCACGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,728 IndexDictionaryUtils - Track 3428_13_13_SRO_492_AACTCACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,728 IndexDictionaryUtils - Track 3428_14_13_SRO_493_AAGAGATC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,728 IndexDictionaryUtils - Track 3428_15_14_SRO_209_AAGGACAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,729 IndexDictionaryUtils - Track 3428_16_14_SRO_218_AATCCGTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,729 IndexDictionaryUtils - Track 3428_17_14_SRO_201_AATGTTGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,730 IndexDictionaryUtils - Track 3428_18_13_SRO_416_ACACGACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,730 IndexDictionaryUtils - Track 3428_19_14_SRO_66_ACAGATTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,730 IndexDictionaryUtils - Track 3428_1_13_SRO_388_GTCGTAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,731 IndexDictionaryUtils - Track 3428_20_14_SRO_68_AGATGTAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,731 IndexDictionaryUtils - Track 3428_21_14_SRO_210_AGCACCTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,731 IndexDictionaryUtils - Track 3428_22_14_SRO_256_AGCCATGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,732 IndexDictionaryUtils - Track 3428_23_14_SRO_270_AGGCTAAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,732 IndexDictionaryUtils - Track 3428_24_13_SRO_452_ATAGCGAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,732 IndexDictionaryUtils - Track 3428_2_13_SRO_399_GTCTGTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,733 IndexDictionaryUtils - Track 3428_3_13_SRO_461_GTGTTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,733 IndexDictionaryUtils - Track 3428_4_13_SRO_462_TAGGATGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,733 IndexDictionaryUtils - Track 3428_5_13_SRO_465_TATCAGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,734 IndexDictionaryUtils - Track 3428_6_13_SRO_402_TCCGTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,734 IndexDictionaryUtils - Track 3428_7_13_SRO_474_TCTTCACA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,734 IndexDictionaryUtils - Track 3428_8_13_SRO_531_TGAAGAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,735 IndexDictionaryUtils - Track 3428_9_14_SRO_166_TGGAACAA-mg doesn't have a sequence dictionary built in, skipping dictionary validation INFO 18:30:29,749 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine INFO 18:30:29,878 GenomeAnalysisEngine - Preparing for traversal INFO 18:30:29,963 GenomeAnalysisEngine - Done preparing for traversal INFO 18:30:29,964 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 18:30:29,965 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 18:30:29,966 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 18:30:30,562 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 18:31:00,420 ProgressMeter - 1:4845033 0.0 30.0 s 50.3 w 0.0% 46.7 h 46.7 h ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalStateException: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:176) at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115) at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:221) at org.broadinstitute.gatk.engine.io.storage.VariantContextWriterStorage.add(VariantContextWriterStorage.java:182) at org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub.add(VariantContextWriterStub.java:269) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:351) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:119) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:279) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR
##### ERROR MESSAGE: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
##### ERROR ------------------------------------------------------------------------------------------


for f in *.g.vcf.gz; do echo -e "\n-- $f --"; zcat "$f" | sed -n -r "/^#.*DP/p;/^1\t4839315\t/{p;q;}"; done

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:21:0,0,432
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:20:0:20:0,0,410
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,773
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:25:2:25:0,3,790
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,837
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:23:31:23:0,31,765
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-0.578;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:21:39:0,39,488,60,491,512:20,0,0,0
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:18:0:18:0,0,514
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,810
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,812
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:25:0,0,624
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0.08    .       ClippingRankSum=-0.189;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:17:20:20,0,311,62,320,382:14,0,3,0
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     6.76    .       ClippingRankSum=-0.374;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:25:43:43,0,401,102,417,519:20,0,3,2
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-1.095;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:23:1:0,1,395,56,406,460:19,0,0,0
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:28:0,0,626
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     5.99    .       ClippingRankSum=-0.584;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:18:42:42,0,293,84,305,388:13,1,3,1
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:22:0,0,558
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       GA,<NON_REF>    6.76    .       ClippingRankSum=0.850;MLEAC=1,0;MLEAF=0.500,0.00        GT:DP:GQ:PL:SB  0/1:19:43:43,0,262,87,274,361:12,3,4,0
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     16.82   .       ClippingRankSum=-0.784;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:21:54:54,0,352,102,367,470:16,0,4,1
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:26:0:25:0,0,419
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:30:0:30:0,0,771
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:34:77:34:0,78,1136
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:26:0:20:0,0,397
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GAA     G,GA,<NON_REF>  22.75   .       ClippingRankSum=-2.181;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00        GT:DP:GQ:PL:SB  0/2:11:22:60,22,209,0,87,104,63,153,113,176:4,2,3,0


Created 2015-05-05 00:40:58 | Updated | Tags: haplotypecaller queue gatk

Question

Hey folks,

We've run in to a situation where we have hundreds of queue managed GATK HaplotypeCaller.scala jobs which we'd like to run. This has led to hundreds of instances of Queue.jar in their post scatter watching state holding on to cores in our cluster. I'm curious about plans for Queue and whether or not a job dependency tree model has been considered for future versions, or whether or not this is already available.

By that, I mean something along the lines of the scatter kicks off the child jobs and a check/resubmit_faileds/gather job is queued on the cluster with a qsub -W depend=after:child1jobid[:child2jobid: ...:childNjobid]. The initial scatter job would end, freeing up a core, and resources would be available to the sub jobs. Then, when the child jobs finish up, the dependencies would be met and the next job would execute (when resources are available) and pick up the successful, resubmit the failed, lather, rinse , repeat.

Thanks in advance for your patience with me in the phrasing of my question, as I am approaching this as a cluster admin, not as the developer who has implemented the queue.

Cheers, Scott

Created 2015-04-28 08:17:21 | Updated | Tags: unifiedgenotyper haplotypecaller coverage bias

Hi,

I am doing joint variant calling for Illumina paired end data of 150 monkeys. Coverage varies from 3-30 X with most individuals having around 4X coverage.

I was doing all the variant detection and hard-filtering (GATK Best Practices) process with both UnifiedGenotyper and Haplotype caller.

My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?

In particular, consider pairwise differences across individuals: The absolute values are lower for low coverage individuals than for high coverage, for both methods, since it is more difficult to make calls for them. However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals by substracting from the total reference length all the filtered sites and sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (If both pairs consist of members of similar populations).

However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.

Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would trade in more false positive if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity) which is the average cross individuals diversity and thus already at the higher-end for within individual heterozygosity. I would expect the problem to be even worse if I chose lower values.

Can you give me any recommendation, should I go back using UnifiedGenotyper or is there any way to solve this problem?

Created 2015-04-24 14:29:30 | Updated 2015-04-24 15:01:33 | Tags: haplotypecaller genotype-given-allele gga

Hi Team,

I want to call variants for some samples using Genotype Given Allele mode(GGA) of HaplotypeCaller. I have used UnifiedGenotyper earlier in GGA mode.

I am not sure about the right approach to do GGA mode in HaplotypeCaller.

You make GVCFs for all BAMs using HaplotypeCaller in GGA mode and later use the GenotypeGVCFs walker to make VCF from these GVCFs. Is this the way to do this ?

Or Since GVCFs has all the sites from BAMs, can GenotypeGVCF can do a variant calling using GGA mode. I know that the GenotypeGVCF doesn't have the GGA option currently. But just checking the right approach.

Created 2015-04-22 21:11:25 | Updated 2015-04-22 21:14:24 | Tags: haplotypecaller

Hi GATK team,

I am using the GATK 3.3 HaplotypeCaller. I found if i use different -L i will have different genotype on the same location. My para is very simple: -T HaplotypeCaller -R ucsc.hg19.fasta -mbq 20 --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -I test.bam -L chr17 -o output.vcf

I check the same site. If i set -L chr17:36070200-36071000, there is a reported SNV. if i set -L chr17:36000000-36140000, there is no SNV report. if i set larger: -L chr17:30000000-40000000, it just show up again. if i use the whole chr17, it gone.

This is very confusing me. it is a C->G variant at chr17. The snp is looks like: GT 0/1 AB 0.84 AD 257,49,0 DP 306 GQ 99 PGT 0|1 PID 36070590_GTCACCAT_G PL 2966,0,10470,3755,10800,14555 SB 14,243,49,0

While I checked the bam file with IGV. There are two wired things: (1) there is no read supporting the non-reference allele. (2) the depth is about 600, far more deeper than the HaplotypeCaller reported. Why would this happen?

Created 2015-04-20 10:57:05 | Updated 2015-04-20 11:03:28 | Tags: unifiedgenotyper combinevariants haplotypecaller genotype-given-alleles

Question: Is it possible to have CV merge like bcftools does it?

I get this warning, when running UG in GGA mode using an -alleles vcf generated with CV:

WARN  10:17:21,394 GenotypingGivenAllelesUtils - Multiple valid VCF records detected in the alleles input file at site 20:106089, only considering the first record


I made this call with HC from 10 samples:

20  106089  .   CA  C


And this call with UG from 10 other samples:

20  106089  .   C   A


CV merges like this:

20  106089  .   C   A
20  106089  .   CA  C


bcftools merges like this:

20  106089  .   CA  AA,C


The UG recall from the CV generated -alleles vcf is incomplete:

20  106089  .   C   A


The UG recall from the bcftools generated -alleles vcf is complete:

20  106089  .   CA  AA,C


Is it possible to have CV merge like bcftools does it?

I'm really not sure. It's not a use case that UG was designed for (with UG we kept SNPs and indels separate until post-analysis), so I would recommend being cautious with it.

I checked the genotypes and UG seems to handle merged MNPs and indels just fine; see below. But I will do some additional testing. Or I might just take the safe path and do the recalling separately for SNPs and indels as suggested. The reason I have UG and HC calls in the first place is because I have low and high coverage data for different cohorts. I want to create a merged dataset.

Despite --interval_padding 100 helping to recall more sites with HC in GGA mode as per previous recommendation, some sites still fail to be called with HC in GGA mode. Hence I opted for UG.

UG calls on samples 1-10:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578
20  106089  .   C   A   16.19   .   AC=2;AF=0.125;AN=16;BaseQRankSum=-0.854;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=1.5282;MLEAC=2;MLEAF=0.125;MQ=58.74;MQ0=0;MQRankSum=-0.560;QD=2.70;ReadPosRankSum=-1.797;SOR=0.935;VariantType=SNP  GT:AD:DP:GQ:PL  0/0:3,0:3:6:0,6,76  0/0:4,2:6:9:0,9,115 0/1:3,1:4:24:24,0,80    0/0:6,0:6:12:0,12,130   0/1:1,1:2:29:30,0,29    ./. 0/0:7,0:7:15:0,15,188   0/0:3,1:4:6:0,6,74  ./. 0/0:5,0:5:12:0,12,142


HC calls on samples 11-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  585 590 622 625 628 640 655 668 687 693

20  106089  .   CA  C   47.95   .   AC=5;AF=0.250;AN=20;BaseQRankSum=0.925;DP=36;FS=1.850;InbreedingCoeff=0.0646;MLEAC=5;MLEAF=0.250;MQ=59.48;MQ0=0;MQRankSum=0.175;QD=3.00;ReadPosRankSum=-1.725;SOR=0.387 GT:AD:GQ:PL 0/0:2,0:6:0,6,49    0/0:2,0:6:0,6,49    0/0:3,0:12:0,12,130 0/0:5,0:15:0,15,122 0/0:2,0:6:0,6,46    0/1:2,1:14:14,0,39  0/1:2,1:15:15,0,38  0/0:4,0:12:0,12,93  0/1:3,1:12:12,0,46  1/1:0,3:9:67,9,0


UG GGA recalls on samples 1-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578 585 590 622 625 628 640 655 668 687 693
20  106089  .   CA  AA,C    110.56  .   AC=0,8;AF=0.00,0.222;AN=36;DP=81;FS=0.000;InbreedingCoeff=0.5076;MLEAC=0,6;MLEAF=0.00,0.167;MQ=58.56;MQ0=0;QD=3.45;SOR=0.859;VariantType=MULTIALLELIC_MIXED GT:AD:DP:GQ:PL:SB   0/0:0,0,0:3:0:0,0,0,6,6,52:0,0,0,0  0/2:0,0,1:6:0:5,5,5,0,0,109:0,0,1,0 0/2:0,0,1:4:0:12,12,12,0,0,47:0,0,1,0   0/0:0,0,0:6:0:0,0,0,17,17,123:0,0,0,0   0/0:0,0,0:2:0:0,0,0,3,3,10:0,0,0,0  ./. 0/0:0,0,0:7:0:0,0,0,9,9,60:0,0,0,0  0/2:0,0,1:4:0:12,12,12,0,0,61:0,0,0,1   ./. 0/0:0,0,1:5:0:0,0,0,4,4,30:0,0,0,1  0/0:0,0,0:3:0:0,0,0,6,6,49:0,0,0,0  0/0:0,0,0:3:0:0,0,0,9,9,76:0,0,0,0  0/0:0,0,1:4:0:0,0,0,1,1,22:0,0,1,0  0/0:0,0,0:7:0:0,0,0,18,18,149:0,0,0,0   0/0:0,0,0:4:0:0,0,0,11,11,76:0,0,0,0    0/2:0,0,1:5:0:9,9,9,0,0,65:0,0,0,1  0/2:0,0,1:4:0:12,12,12,0,0,60:0,0,0,1   0/0:0,0,0:5:0:0,0,0,15,15,116:0,0,0,0   0/2:0,0,1:6:0:12,12,12,0,0,47:0,0,0,1   2/2:0,0,3:3:9:67,67,67,9,9,0:0,0,3,0


http://gatkforums.broadinstitute.org/discussion/5249/overcalling-deletion-in-unifiedgenotyper-genotype-given-alleles-mode


P.S. I might gate crash your Cambridge party this week despite not being invited :D The course was already fully booked, when you announced it. I don't have a time machine!

Created 2015-04-16 15:31:12 | Updated | Tags: haplotypecaller

Hi, I am interested in calling SNPs for a set of 150 bacterial genomes (genome size ~1Mb). I'm attempting to use the HaplotypeCaller and am running into errors with the java memory: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java.

There is an estimated run time of ~11 days. I have increased the memory to 20g and am limiting the max_alternate_alleles as well as shown below:

java -d64 -Xmx20g -jar $EXECGATK \ -T HaplotypeCaller \ -R$REF \ -I $DATAPATH${BAMLIST} \ -stand_call_conf 20 \ -stand_emit_conf 20 \ --sample_ploidy 1 \ --maxNumHaplotypesInPopulation 198 \ --max_alternate_alleles 3 \ -L "gi|15594346|ref|NC_001318.1|" \ -o ${OUTPATH}${BASE}.chr.snps.indels.vcf

Is there a way to call only SNPs as my understanding is that indel calling is memory intensive and I am going to focus on SNPs for this part of my analysis? Or is there another way to make this analysis more efficient?

Thank you!

Created 2015-04-15 10:37:45 | Updated | Tags: haplotypecaller targeted-sequencing

Dear all,

we use the Broad best practice pipeline to call germline variants on Panel-seq data, i.e. our chip comprises the Omimom. Hence, it is not a small target set but significantly smaller than for instance common exome-seq.

The results are good so far, the mixture model plots look decent to me. Here are my questions: 1. Is there any best practice for such panels? 2. I attached a file with the tranches plot for the SNPs, what do you think? Shall we increase the --ts_filter_level for SNPs to, say, 99.99 (it is now 99.9 as in the best practice for exome-seq)?

Thank you, Chris

Created 2015-04-14 12:49:42 | Updated 2015-04-14 12:53:11 | Tags: haplotypecaller

Hi GATK team,

I have recently performed an analysis to try and select a GQ cut-off to be applied to WES data post-VQSR (applied 99.9% to the data). The WES data was called using HaplotypeCaller GenomeAnalysisTK-3.2-2 (over ~3000) samples and VQSR was applied (using the same GATK version). To decide on a GQ threshold, I looked at the correlation (over different GQ filters applied to the WES data) of chip genotypes and the sequencing genotypes (~350 samples were both genotyped and sequenced). The genotype data has been QC'ed s normally is in GWAS. The correlation is just the r squared (r2) for each variant between 2 vectors: one with the 350 chip genotypes and the other with the 350 sequencing genotypes. I finally estimated the average r2 per GQ quality filter applied and also counted how many genotypes were being dropped (ie., no longer variant sites). The result of this is the following figure, which I think looks a bit odd and suggests that the GQ is perhaps multi-modal. Have you ever seen this or have any suggestions as to why this might be? The blue line is the correlation (left y axis) and the green is the proportion of GTs dropped (right y axis). The x axis is the GQ filters applied to the data from 0 to 50.

Many thanks, Eva

Created 2015-04-14 03:01:09 | Updated | Tags: haplotypecaller

Hi, There are some confused result found in the output gvcf of HaplotypeCaller. Why the genotypes is 0/0 but called a GC->G deletion? see below for example.

chr8 118915071 . GC G,<NON_REF> 0 . DP=37;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.29;MQ0=0 GT:AD:DP:GQ:PL:SB 0/0:35,0,0:35:99:0,104,1078,105,1079,1080:27,8,0,0


Thanks a lot! Hiu

Created 2015-04-07 19:11:23 | Updated | Tags: haplotypecaller allownonuniquekmersinref

Hi,

I'm calling SNPs for bacterial genomes. I've been using UnifiedGenotyper to call SNPs, and I'm looking to migrate to HaplotypeCaller. This change from UnifiedGenotyper to HaplotypeCaller requires validating SNPs being called. I've came across a situation where a SNP is being called in only some genomes using the HaplotypeCaller, but the SNP is clearer present in all genomes when visually validated in IGV. After trying many HaplotypeCaller arguments, only when using -allowNonUniqueKmersInRef was the position correctly called in all genomes. Can you describe what this flag is doing to allow the position to be called correctly in all genomes, and describe why when using the default HaplotypeCaller parameters this SNP is not found in all?

This is the position when called using -allowNonUniqueKmersInRef. gi|561108321|ref|NC_018143.2| 3336835 . T C 1278.77 . AC=2;AF=1.00;AN=2;DP=44;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.18;MQ0=0;QD=29.06;SOR=0.739 GT:AD:DP:GQ:PL 1/1:0,43:43:99:1307,128,0

Thank you, Tod

Created 2015-04-01 14:17:52 | Updated | Tags: vqsr haplotypecaller best-practices gvcf

I am currently processing ~100 exomes and following the Best Practice recommendations for Pre-processing and Variant Discovery. However, there are a couple of gaps in the documentation, as far as I can tell, regarding exactly how to proceed with VQSR with exome data. I would be grateful for some feedback, particularly regarding VQSR. The issues are similar to those discussed on this thread: http://gatkforums.broadinstitute.org/discussion/4798/vqsr-using-capture-and-padding but my questions aren't fully-addressed there (or elsewhere on the Forum as far as I can see).

Prior Steps: 1) All samples processed with same protocol (~60Mb capture kit) - coverage ~50X-100X 2) Alignment with BWA-MEM (to whole genome) 3) Remove duplicates, indel-realignment, bqsr 4) HC to produce gVCFs (-ERC) 5) Genotype gVCFs

This week I have been investigating VQSR, which has generated some questions.

Q1) Which regions should I use from my data for building the VQSR model?

Here I have tried 3 different input datasets:

a) All my variant positions (11Million positions) b) Variant positions that are in the capture kit (~326k positions) - i.e. used bedtools intersect to only extract variants from (1) c) Variant positions that are in the capture kit with padding of 100nt either side (~568k positions) - as above but bed has +/-100 on regions + uniq to remove duplicate variants that are now in more than one bed region

For each of the above, I have produced "sensitive" and "specific" datasets: "Specific" --ts_filter_level 90.0 \ for both SNPs and INDELs "Sensitive" --ts_filter_level 99.5 \ for SNPs, and --ts_filter_level 99.0 \ for INDELs (as suggested in the definitive FAQ https://www.broadinstitute.org/gatk/guide/article?id=1259 )

I also wanted to see what effect, if any, the "-tranche" argument has - i.e. does it just allow for ease of filtering, or does it affect the mother generated, since it was not clear to me. I applied either 5 tranches or 6:

5-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 \ for both SNPs and INDELs 6-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 \ for both SNPs and INDELs

To compare the results I then used bed intersect to get back to the variants that are within the capture kit (~326k, as before). The output is shown in the spreadsheet image below.

http://i58.tinypic.com/25gc4k7.png

What the table appears to show me, is that at the "sensitive" settings (orange background), the results are largely the same - the difference between "PASS" in the set at the bottom where all variants were being used, and the others is mostly accounted for by variants being pushed into the 99.9-100 tranche.

However, when trying to be specific (blue background), the difference between using all variants, or just the capture region/capture+100 is marked. Also surprising (at least for me) is the huge difference in "PASS" in cells E15 and E16, where the only difference was the number of tranches given to the model (note that there is very little difference in the analogous cells in Rows 5/6 andRows 10/11.

Q2) Can somebody explain why there is such a difference in "PASS" rows between All-SPEC and the Capture(s)-Spec Q3) Can somebody explain why 6 tranches resulted in ~23k more PASSes than 5 tranches for the All-SPEC Q4) What does "PASS" mean in this context - a score =100? Is it an observation of a variant position in my data that has been observed in the "truth" set? It isn't actually described in the header of the VCF, though presumably the following corresponds: FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -38682.8777"> Q5) Similarly, why do no variants fall below my lower tranche threshold of 90? Is it because they are all reliable at least to this level?

Q6) Am I just really confused? :-(

Created 2015-03-25 20:33:49 | Updated 2015-03-25 20:38:23 | Tags: haplotypecaller

Hi,

The variant was called by MiSeq reporter and GATK2.7 UnifiedGenotyper with VAF 19% after soft-clipping the primer sequences, as also seen in BAM file got according to GATK best practice (BWA, indel realignment, base recalibration, but no dedup due to amplicon-based sequencing). But GATK 3.3 HaplotypeCaller called a single base pair deletion instead (A: 4 (1%); C: 527) :

variants-HC3.3.vcf: chr13 28592642 . C . 1356.23 . AN=2;DP=469;MQ=59.95 GT:AD:DP 0/0:469:469 variants-UG_.vcf: chr13 28592642 rs121913488 C A 1785.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-7.477;DB;DP=731;Dels=0.00;FS=0.713;HaplotypeScore=62.1894;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.522;QD=2.44;ReadPosRankSum=0.174 GT:AD:DP:GQ:PL 0/1:596,134:731:99:1814,0,11941

The variant is at the end of one amplicon of lower coverage, close to another amplicon of much higher coverage (please see attached beforeHC.png, the soft-clipped part is primer sequences). If I trim 30 bp of the primer sequence instead of soft-clipping, the variant was called with strand bias under low stringency: 8 (12%: 1+, 7-), C: 59 (6+, 53-) chr13 28592642 rs121913488 C A 20.80 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.750;ClippingRankSum=0.250;DB;DP=20;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-1.950;QD=1.04;ReadPosRankSum=-1.450;SOR=1.085 GT:AD:DP:GQ:PL 0/1:15,4:19:49:49,0,354 The variant was not called at all if I trim 22, 24, 27 or 32 bp instead of 30 bp from both ends.

Why the variant was not called by HaplotypeCaller after soft-clipping primer sequences? How can I adjust anything during the HaplotypeCaller to make this variant called? Thanks!

Created 2015-03-19 14:50:22 | Updated | Tags: haplotypecaller genotypegvcfs

Hello!

I would like to run GenotypeGVCFs on 209 WES, called with HC (--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000).

When I run GenotypeGVCFs, with this command (computing nodes have 8 cores and 24G of memory) :

java -Xmx24g -jar $GATK_JAR \ -R Homo_sapiens.GRCh37_decoy.fa \ -T GenotypeGVCFs \ -nt 8 \ -V gvcf.all.list \ -o calls.vcf  It estimates a huge runtime and just dies hanging: INFO 10:27:00,790 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:27:00,795 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 10:27:00,796 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 10:27:00,796 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 10:27:00,800 HelpFormatter - Program Args: -R Homo_sapiens.GRCh37_decoy.fa -T GenotypeGVCFs -nt 8 -V gvcf.all.list -o calls.vcf INFO 10:27:00,810 HelpFormatter - Executing as emixaM@r107-n50 on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-ea-b07. INFO 10:27:00,810 HelpFormatter - Date/Time: 2015/03/19 10:27:00 INFO 10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:27:04,719 GenomeAnalysisEngine - Strictness is SILENT INFO 10:27:04,882 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 10:27:41,565 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 8 processors available on this machine INFO 10:27:43,169 GenomeAnalysisEngine - Preparing for traversal INFO 10:27:43,179 GenomeAnalysisEngine - Done preparing for traversal INFO 10:27:43,179 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:27:43,180 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:27:43,180 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 10:27:44,216 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 10:28:15,035 ProgressMeter - 1:1000201 0.0 31.0 s 52.7 w 0.0% 27.0 h 27.0 h INFO 10:29:17,386 ProgressMeter - 1:1068701 0.0 94.0 s 155.8 w 0.0% 76.7 h 76.6 h INFO 10:30:18,055 ProgressMeter - 1:1115101 0.0 2.6 m 256.1 w 0.0% 5.0 d 5.0 d  What did I do wrong? Cheers! Created 2015-03-11 13:48:14 | Updated | Tags: haplotypecaller gvcf scaffolds Hi Team, 1 BAM = 1 individual my question is regarding the HaplotypeCaller and scaffolds in a BAM file. When I want to do the individual SNP-calling procedure (--emitRefConfidence GVCF) before the Joint Genotyping, I found that with my number of scaffolds the process is computationally quite costy. I now ran for every BAM the HaplotypeCaller just for a single scafflod (by using -L) Question is: Do you see any downside in this approach regarding the result quality? Or are the scaffolds treated independently anyways and my approach is fine? The next step would be to combine the gvcfs to a single one again (corresponding to the original BAM) and then do joint genotyping on a cohort of gvcfs (-> cohort of individuals) Thanks a lot! Alexander Created 2015-03-10 18:27:56 | Updated | Tags: haplotypecaller genotypegvcfs gvcf I run the following command for "GenotypeGVCFs" for 3 VCF files output of HaplotypeCaller as below: java data/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \ -R data/ucsc.hg19.fasta \ -T GenotypeGVCFs \ --variant data/47V_post.ERC.vcf \ --variant data/48V_post.ERC.vcf \ --variant data/49V_post.ERC.vcf \ --out data/Combined_geno_3files.vcf but in a final VCF output there is no rsID information and all rows are "." what is the problem? I am really confused. Could you please advise how to get SNP-ID in the output VCF Thanks Created 2015-03-09 18:47:53 | Updated | Tags: haplotypecaller combinegvcfs gvcf I used the following command to combine 3 VCF files which are outputs of HaplotypeCaller: java -jar data/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \ -R data/ucsc.hg19.fasta \ -T CombineGVCFs \ --variant data/47V_post.ERC.vcf \ --variant data/48V_post.ERC.vcf \ --variant data/49V_post.ERC.vcf \ --out data/Combined_3files.vcf However, after combined all 3 files, in the output final VCF, I can only see ./. genotypes. What is the problem? how I can to fix this? Thanks Created 2015-03-08 09:52:48 | Updated | Tags: haplotypecaller dp Hello GATK team, 1. I've noticed a strange variant in the gvcf output of HC: Raw vcf: 6 32552140 rs17882663 T A, 108.18 . DB;DP=0;MLEAC=2,0;MLEAF=1.00,0.00;MQ=0.00;MQ0=0 GT:GQ:PGT:PID:PL:SB 1/1:9:0|1:32552059_G_T:135,9,0,135,9,135:0,0,0,0 After GenortpeGVCFs: 6 32552140 rs17882663 T A 107.28 . AC=2;AF=1.00;AN=2;DB;DP=0;FS=0.000;GQ_MEAN=9.00;MLEAC=2;MLEAF=1.00;MQ=0.00;MQ0=0;NCC=0;SOR=0.693 GT:AD:GQ:PGT:PID:PL 1/1:0,0:9:1|1:32552059_G_T:135,9,0 The DP and AD are 0, but there is a variant - A. What do you think? Why does it happened? 1. What is the difference between the DP in the format part and in the info part? I looked for an answer in the documentation, but couldn't find one. Bellow is an example of a big difference between the values of this two. 3 195511214 . G GACCTGTGGATGCTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGTGTC 673.77 . AC=1;AF=0.500;AN=2;DP=169;FS=0.000;GQ_MEAN=38.00;MLEAC=1;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:0:38:702,0,38 Maya Created 2015-03-04 13:47:51 | Updated 2015-03-04 14:17:40 | Tags: haplotypecaller mappingqualityzero mq Hi, I am using the standard HaplotypeCaller/GenotypeGVCFs pipeline with the --includeNonVariantSites option. I always get MQ0=0 for all SNPs and don't get it reported for non-variants, even if I add "--annotation MappingQualityZero" to both commands. With UnifiedGenotyper, MQ0 values are reported correctly. Is this a bug? Also, with HaplotypeCaller, MQ is only reported for variants, not for non-variants. Is there a way to get this annotation reported for non-variants, such as with UnifiedGenotpyer? I am using Version 3.3-0, but collegues have a similar problem with version 3.2-2. Thanks, Hannes Created 2015-03-03 21:03:41 | Updated | Tags: haplotypecaller selectvariants combinegvcfs GATK team, I currently have many WES gVCFs called with GATK 3.x HaplotypeCaller, and I'm now looking to combine them and run GenotypeGVCFs. Unfortunately, I forgot to add the "-L" argument to HC to reduce the size of the resulting gVCFs, and CombineGVCFs looks like it's taking much longer than I expect it to. Is there any potential problem with using the "-L" argument to SelectVariants to reduce the size of my gVCFs and then use those smaller gVCFs in the CombineGVCFs stage (and beyond), or do I have to re-call HaplotypeCaller again? Would it be better to extend the boundaries of the target file by a certain amount to avoid recalling HaplotypeCaller? Thanks, John Wallace Created 2015-03-02 11:20:33 | Updated | Tags: haplotypecaller reference-bias Hi there, I am fairly new to using the GATK and was hoping someone could answer a little question I have. I have been using HaplotypeCaller to call two variants at a locus (A and B). As I understand it, my calls may be subject to reference bias if the reference genome I use to call variants is based individuals carrying allele A and allele B is fairly diverged from A. This will result in an underestimate of GQ and perhaps undercall variants for B. My question is, how does HaplotypeCaller treat Ns in the reference genome? Does it still estimate genotypes for these? Or does it treat the site as invariant? Cheers Created 2015-03-02 09:39:51 | Updated | Tags: haplotypecaller genotypegvcfs Hi GATK team, In the documentation of GenotypeGVCFs it is writen: "Input - One or more Haplotype Caller gVCFs to genotype" I have 3 questions regarding this tool: 1. I wonder, what is the meaning of running it on one sample? 2. I tried to run it on one sample, and noticed that the genotype quality is different than the one in the original gvcf file from HC. What is causing this difference? I'm asking this since running the tool on one sample means that there are no other samples to consider in recalculating the quality. 3. Last question - From your experience, what is the best way to analyze one exome sample? Should I run HC with the default genotype_mode parameter and do hard filtering? Should I run HC in GVCF mode, run GenotypeGVCFs and than do hard filtering? Any other suggestion? Thank you for the answer, Maya Created 2015-02-25 02:12:39 | Updated | Tags: vqsr baserecalibrator haplotypecaller knownsites resources variant-recalibration Hi, I have a general question about the importance of known VCFs (for BQSR and HC) and resources file (for VQSR). I am working on rice for which the only known sites are the dbSNP VCF files which are built on a genomic version older than the reference genomic fasta file which I am using as basis. How does it affect the quality/accuracy of variants? How important is to have the exact same build of the genome as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use the version which is built on a different version of the genome for the same species? In other words, which steps (BQSR, HC, VQSR etc) can be performed without the known sites/resource file? If the answers to the above questions are too detailed, can you please point me to any document, if available, which might address this issue? Thanks, NB Created 2015-02-24 13:42:22 | Updated | Tags: haplotypecaller sex-chromosomes Hi GATK team, I have exome samples, some males and some females. I mapped the female to a reference genome without the Y chromosome, and continued with each sample the Best Practice steps. The reason for that mapping is that we don't want to lose some of the females reads to the homologous regions of chro Y. Will I be able to run GenotypeGVCFs on those samples? Is this a good way to do the genotyping on the sex chromosomes? thank in advanced, Maya Created 2015-02-23 13:25:09 | Updated 2015-02-23 14:06:33 | Tags: haplotypecaller gvcf stand-emit-conf stand-call-conf I am using HC 3.3-0-g37228af to generate GVCFs, including the parameters (full command below): stand_emit_conf 10 stand_call_conf 30 The process completes fine, but when I look at the header of the gvcf produced, they are shown as follows: standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0 After trying various tests, it appears that setting these values is incompatible with -ERC GVCF (which requires "-variant_index_type LINEAR" and "-variant_index_parameter 128000" ) 1) Can you confirm if this is expected behaviour, and why this should be so? 2) Is this another case where the GVCF is in intermediate file, and hence every possible variant is emitted initially? 3) Regardless of the answers above, is stand_call_conf equivalent to requiring a GQ of 30?  java -Xmx11200m -Djava.io.tmpdir=$TMPDIR -jar /apps/GATK/3.3-0/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-I /E000007/target_indel_realignment/E000007.6.bqsr.bam \
-R /project/production/Indexes/samtools/hsapiens.hs37d5.fasta \
-et NO_ET \
-K /project/production/DAT/apps/GATK/2.4.9/ourkey \
-dt NONE \
-L 10 \
-A AlleleBalance \
-A BaseCounts \
-A BaseQualityRankSumTest \
-A ChromosomeCounts \
-A ClippingRankSumTest \
-A Coverage \
-A DepthPerAlleleBySample \
-A DepthPerSampleHC \
-A FisherStrand \
-A GCContent \
-A HaplotypeScore \
-A HardyWeinberg \
-A HomopolymerRun \
-A ClippingRankSumTest \
-A LikelihoodRankSumTest \
-A LowMQ \
-A MappingQualityRankSumTest \
-A MappingQualityZero \
-A MappingQualityZeroBySample \
-A NBaseCount \
-A QualByDepth \
-A RMSMappingQuality \
-A StrandBiasBySample \
-A StrandOddsRatio \
-A VariantType \
-ploidy 2 \
--min_base_quality_score 10 \
-ERC GVCF \
-variant_index_type LINEAR \
-variant_index_parameter 128000 \
--GVCFGQBands 20 \
--standard_min_confidence_threshold_for_calling 30 \
--standard_min_confidence_threshold_for_emitting 10


Created 2015-02-19 14:29:26 | Updated | Tags: haplotypecaller genotypegvcfs

Hi Sheila and Geraldine

When I run HaplotypeCaller (v3.3-0-g37228af) in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF I get the error:

Invalid command line: Argument ERC/gt_mode has a bad value: you cannot request reference confidence output and GENOTYPE_GIVEN_ALLELES at the same time

It is however strange that GENOTYPE_GIVEN_ALLELES is mentioned in the --max_alternate_alleles section of the GenotypeGVCFs documentation.

Maybe I'm missing something?

Thanks, Gerrit

Created 2015-02-19 11:51:41 | Updated | Tags: haplotypecaller

Hello,

I was wondering whether there is any paper or document that describes the mathematical models of Haplotype caller?

Created 2015-02-18 15:22:27 | Updated | Tags: haplotypecaller vcf transcript-information gqx

Dear all,

I need to generate vcf file with GATK and I need to have TI (transcript information) in INFO field and VF and GQX in FORMAT field.

Could you help me please with arguments.

My agruments are to call variants:

java -jar $gatk -T$variant_caller -R $reference -I in.bam -D dbsnp -L bed_file -o .raw.vcf I still have no TI info. and in FORMAT i have GT:AD:DP:GQ:PL and i NEED GT:AD:DP:GQ:PL:VF:GQX. Thank you for any help with command line arguments. Paul. Created 2015-02-18 12:54:06 | Updated | Tags: haplotypecaller non-human vcf-filtering no-truth-set Hey GATK team, Are the guidelines suggested in this forum page applicable for filtering calls made by GATK-3.3-0's HaplotypeCaller? Cheers, Mika Created 2015-02-13 17:15:34 | Updated | Tags: haplotypecaller gvcf I have 13 whole exome sequencing samples, and unfortunately, I'm having a hard time getting HaplotypeCaller to complete within the time frame the cluster I use allows (150 hours). I use 10 nodes at a time with 10gb ram with 8 cores per node. Is there any way to speed up this rate? I tried using HaplotypeCaller in GVCF mode with the following command: java -d64 -Xmx8g -jar$GATKDIR/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R $REF --dbsnp$DBSNP \ -I 7-27_realigned.bam \ -o 7-27_hg19.vcf \ -U ALLOW_UNSET_BAM_SORT_ORDER \ -gt_mode DISCOVERY \ -mbq 20 \ -stand_emit_conf 20 -G Standard -A AlleleBalance -nct 16 \ --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Am I doing something incorrectly? Is there anything I can tweak to minimize the runtime? What is the expected runtime for WES on a standard setup (a few cores and some ram)?

Created 2015-02-11 21:42:59 | Updated | Tags: haplotypecaller

Hi,

We are running the best practices pipeline for variant discovery with GATK 3-2.2. When running the HaplotypeCaller with the flags -A AlleleBalance & -A HomopolymerRun to generate a gVCF, there are no ABHet/ABHom or HRun annotations showing up in the gVCF. I tried running VariantAnnotator on the gVCF and still no annotations.

The documentation on both of these annotations state that they only work for biallelic variants. I suspected that the , tag that is on every variant might be causing the tool to treat biallelic variants as multiallelic. So I stripped the , from the gVCF using sed and reran the VariantAnnotator and voila...I got the annotations.

Is there a way to either generate the gVCF without the , tag (and probabilities that go with it), or instruct the VariantAnnotator to ignore the , tag.

Thanks for you help.

Tom Kolar

Created 2015-02-10 21:12:01 | Updated | Tags: haplotypecaller activityprofile

I'm working on an association mapping project in a non-model bird (so there is a reference genome, but it may have problems). We're looking for SNPs linked to a phenotype using paired end GBS data and using the haplotypecaller for SNP calling. We found a single SNP that explains a significant portion of the phenotypic variation. When we actually look at the SNP, it turns out to be an artifact derived from the haplotype caller reassembly. There is 10 bp insertion in the reference not found in this population and when the reassembly happens, it realigns it to produce a SNP. Just looking at the sequence of the reads themselves, there is no SNP.

So the question is, what is causing the reassembly in some samples and not others. I tried outputting the activity profile but since it is smoothed it hard to figure exactly where the difference is happening. Is it possible output an unsmoothed activity profile? Are there other ways to figure out how exactly the active region is being picked?

Thanks.

Created 2015-02-05 21:44:46 | Updated 2015-02-05 21:45:35 | Tags: haplotypecaller gvcf

Hi, I have noticed that every time I repeat a gVCF call on the same sample (~same Bam file), the output gVCF files are not exactly same. They are almost similar, but there will be a few differences here and there & there will be a minute difference in unix-file-sizes as well. Is that something that is expected??

Shalabh Suman

Created 2015-02-04 08:52:18 | Updated 2015-02-04 08:53:33 | Tags: haplotypecaller illumina genotypegvcfs combinegvcfs

Hi GATK-ers,

I have been given ~2000 gVCFs generated by Illumina (one sample per gVCF). Though they are in standard gVCF format, they were generated by an Illumina pipeline (https://support.basespace.illumina.com/knowledgebase/articles/147078-gvcf-file if you're really curious) and not the Haplotype Caller. As a result (I think ... ), the GATK doesn't want to process them (I have tried CombineGVCFs and GenotypeGVCFs to no avail). Is there a GATK walker or some other tool that will make my gVCFs GATK-friendly? I need to be able to merge this data together to make it analyze-able because in single-sample VCF format it's pretty useless at the moment.

My only other thought has been to expand all the ref blocks of data and then merge everything together, but this seems like it will result in the creation of a massive amount of data.

Any suggestions you may have are greatly appreciated!!!

Sara

Created 2015-02-02 10:18:45 | Updated | Tags: haplotypecaller ploidy error-message

Hi,

I have a number of samples that consist of multiple individuals from the same population pooled together, and have been truing to use HaplotypeCaller to call the variants. I have set the (ploidy to 2 * number of individuals) but keep getting the same or similar error message, after running for several hours or days:

##### ERROR ------------------------------------------------------------------------------------------

and I'm not sure what I can do to rectify it... Obviously I can't limit the ploidy, it is what it is, and I thought that HC only allows a maximum of six alleles anyway?

My code is below, and any help would be appreciated.

java -Xmx24g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 6 \ -R ~/my_ref_sequence \ --intervals ~/my_intervals_file \ -ploidy 180 \ -log my_log_file \ -I ~/my_input_bam \ -o ~/my_output_vcf

Created 2015-01-30 11:27:05 | Updated | Tags: haplotypecaller readbackedphasing phasing allele-counts

I'm a bit confused regarding the new GATK version and new HC-functions. I'm trying to call haplotypes in a family of plants. I call Haplotypes using haplotype caller, then I want to run Read-backed phasing on the raw vcfs and then CalculateGenotypePosterios to add pedigree information. The CalculateGenotypePosterios-Walker seems to need the format Fields AC and AN, but they are not produced by the HaplotypeCaller. They used to be in earlier HC-Versions though...(?). How can I fix this? And is this a proper workflow at all? Is Read-backed phasing needed or has it become redundant with the new HC-Version being able to do physical phasing. Would it be "enough" to run HC for phasing and CalculateGenotypePosterios to add pedigree information? Anyhow the problem of missing ac and an fields remains. I would be greatful for some help on this.

Thsi is how a raw vcf produced by HC looks like

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GfGa4742

GSVIVT01012145001 1 . G . . END=113 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 GSVIVT01012145001 114 . C . . END=164 GT:DP:GQ:MIN_DP:PL 0/0:172:99:164:0,120,1800 GSVIVT01012145001 165 . T C, 7732.77 . DP=175;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,173,0:173:99:0|1:165_T_C:7761,521,0,7761,521,7761:0,0,165,8 GSVIVT01012145001 166 . G . . END=166 GT:DP:GQ:MIN_DP:PL 0/0:174:72:174:0,72,1080 GSVIVT01012145001 167 . T . . END=175 GT:DP:GQ:MIN_DP:PL 0/0:174:66:174:0,60,900 GSVIVT01012145001 176 . T . . END=191 GT:DP:GQ:MIN_DP:PL 0/0:174:57:173:0,57,855 GSVIVT01012145001 192 . A . . END=194 GT:DP:GQ:MIN_DP:PL 0/0:173:54:173:0,54,810 GSVIVT01012145001 195 . T . . END=199 GT:DP:GQ:MIN_DP:PL 0/0:174:51:173:0,51,765

And this is the Error Message I get

##### ERROR MESSAGE: Key AC found in VariantContext field INFO at GSVIVT01012145001:1 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

Created 2015-01-29 16:09:51 | Updated 2015-01-29 16:19:12 | Tags: haplotypecaller

Hi there,

This is maybe a silly question but I want to know if it is normal or not. I have just run HC in GVCF mode for each of my bam files. When I see the output...:

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1

Is it normal to have "sample1" instead of "bam file name" or something like that? I have realized because when I was running GenotypeGVCFs, I was only getting one column named also "sample1", instead of one column per input vcf (what I think that is the normal output, isn't it? So, what I've done is to change "sample1" of each vcf file with the corresponding sample name to run GenotypeGVCF.

I hope I've explained myself more or less.

Created 2015-01-28 14:33:40 | Updated | Tags: haplotypecaller ad combinegvcfs

Hi,

I am using GATK v3.2.2 following the recommended practices (...HC -> CombineGVCFs -> GenotypeGVCFs ...) and while looking through suspicious variants I came across a few hetz with AD=X,0. Tracing them back I found two inconsistencies (bugs?);

1) Reordering of genotypes when combining gvcfs while the AD values are kept intact, which leads to an erronous AD for a heterozygous call. Also, I find it hard to understand why the 1bp insertion is emitted in the gvcf - there is no reads supporting it:

• single sample gvcf 1 26707944 . A AG,G,<NON_REF> 903.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9

• combined gvcf 1 26707944 . A G,AG,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:66,0,36,0:102:.:1057,0,1856,1039,2052,4115,941,1925,3051,2847:51,15,27,9 [other_samples]

• vcf
1 26707944 . A G 3169.63 . [INFO] [other_samples] 0/1:66,0:102:99:1057,0,1856 [other_samples]

2) Incorrect AD is taken while genotyping gvcf files:

• single sample gvcf: 1 1247185 rs142783360 AG A,<NON_REF> 577.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/1:13,20,0:33:99:615,0,361,654,421,1075:7,6,17,3
• combined gvcf 1 1247185 rs142783360 AG A,<NON_REF> . . [INFO] [other_samples] ./.:13,20,0:33:.:615,0,361,654,421,1075:7,6,17,3 [other_samples]

• vcf
1 1247185 . AG A 569.95 . [INFO] [other_samples] 0/1:13,0:33:99:615,0,361 [other_samples]

I have found multiple such cases here, and no errors nor warnings in the logs. I checked also with calls that I had done before on these samples, but in a smaller batch. There the AD values were correct, but there were plenty of other hetz with AD=X,0... I haven't looked closer into those.

Are these bugs that have been fixed in 3.3? Or maybe my brain is not working properly today and I miss sth obvious?

Best regards, Paweł

Created 2015-01-26 11:01:54 | Updated 2015-01-26 11:47:17 | Tags: fisherstrand haplotypecaller jexl strand-bias filtering hardfilters vcf-filtering

Hi, I need to apply hard filters to my data. In cases where I have lower coverage I plan to use the Fisher Strand annotation, and in higher coverage variant calls, SOR (using a JEXL expression to switch between them: DP < 20 ? FS > 50.0 : SOR > 3).

The variant call below (some annotations snipped), which is from a genotyped gVCF from HaplotypeCaller (using a BQSR'ed BAM file), looks well supported (high QD, high MQ, zero MQ0). However, there appears to be some strand bias (SOR=3.3):

788.77 . DP=34;FS=5.213;MQ=35.37;MQ0=0;QD=25.44;SOR=3.334 GT:AD:DP:GQ:PL 1/1:2,29:31:35:817,35,0

In this instance the filter example above would be applied.

## My Question

Is this filtering out a true positive? And what kind of cut-offs should I be using for FS and SOR?

The snipped annotations ReadPosRankSum=-1.809 and BaseQRankSum=-0.8440 for this variant also indicate minor bias that the evidence to support this variant call also has some bias (the variant appears near the end of reads in low quality bases, compared to the reads supporting the reference allele).

## My goal

This is part of a larger hard filter I'm applying to a set of genotyped gVCFs called from HaplotypeCaller.

I'm filtering HomRef positions using this JEXL filter:

vc.getGenotype("%sample%").isHomRef() ? ( vc.getGenotype("%sample%").getAD().size == 1 ? (DP < 10) : ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || MQRankSum > 3.2905 || ReadPosRankSum > 3.2905 || BaseQRankSum > 3.2905 ) ) : false

And filtering HomVar positions using this JEXL:

vc.getGenotype("%sample%").isHomVar() ? ( vc.getGenotype("%sample%").getAD().0 == 0 ? ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || MQ < 30.0 ) : ( BaseQRankSum < -3.2905 || MQRankSum < -3.2905 || ReadPosRankSum < -3.2905 || (MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || (DP < 20 ? FS > 60.0 : SOR > 3.5) || MQ < 30.0 || QUAL < 100.0 ) ) : false

My goal is true positive variants only and I have high coverage data, so the filtering should be relatively stringent. Unfortunately I don't have a database I could use to apply VQSR, henceforth the comprehensive filtering strategy.

Created 2015-01-23 16:55:57 | Updated | Tags: vqsr haplotypecaller bam gatk genotypegvcfs

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Alva

Created 2015-01-23 08:37:49 | Updated | Tags: haplotypecaller bug

Hello,

I have a pb with HaplotypeCaller.
I saw that this error has been reported as linked with -nct option, but I did not use it.

Strangely enough, I think my command worked some times ago (before server restart ?)

All reference files I used are from the GATK bundle.

Sletort

Created 2015-01-21 19:53:26 | Updated | Tags: baserecalibrator haplotypecaller vcf bam merge rnaseq

Hi, I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.

I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when haplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.

When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?

Alternatively, can BaseRecallibrator and/or HaplotypeCaller simultaneously analyze multiple bam files?

Is it common practice to combine bam files for discovering sequence variants?

Created 2015-01-19 14:58:38 | Updated | Tags: haplotypecaller

Hi there, maybe this is a very basic question but, I've read several posts and sections of GATK Best Practices and tutorials but I can not get the point. What is the main difference between using GATK HC in gVCF mode instead of in multi-sample mode? I know that HC in GVCF mode is used to do variant discovery analysis on cohorts of samples, but what is the meaning of "cohorts of samples"? If I have 2 groups of samples, one WT and the other mutant, should I use GVCF mode? I've read almost all the tutorials and Howto's of GATK and I can not understand at all.

And another question, can I give to HC more than one bam file? Maybe using -I several times? And if I introduce several files, in the raw vcf output file, what I'm going to find? Several columns one per sample?

Created 2015-01-16 06:37:45 | Updated | Tags: haplotypecaller best-practices next-generation-sequencing

Excuse me:

After calling the genotype using Haplotyper Caller in gatk, i manually check the reads covering the variant sites. But i found one exception: The genotype of one sample was 0/0，wild type,but the reads for this site were .,,,..,,c.,c.c,.cccc. I think almost all the reads should be . or ,. Is this case normal?

Created 2015-01-15 18:03:12 | Updated | Tags: haplotypecaller

Hi, I have been using the best practices pipeline for illumnia samples for quite long time. Everything is fine. Recently, I am using the same pipeline for IONTORRENT samples. Everything is fine until the haplotypecaller step. I only got two samples to do the haplotypecaller. One bam file is around 9G, and the other is around 13G. But the speed of this step is really slow. It need around 24 weeks to finish haplotypecaller step. It's really strange. Cause if I run the haplotypecaller for a family with 9 persons (Illumina), it only took around 9 days to finish it. How come these two files would need 24 weeks. Does anyone have any idea about what may cause this problem? The bam file of those 9 person are around the same size of the two IONTORRENT samples.

Thank you.

Created 2015-01-15 10:42:39 | Updated | Tags: haplotypecaller pooled-calls

Hello,

I am having difficulty in understanding the vcf output of haplotype caller when I use it for a pool of two individuals. I set ploidy to 4 in this case.

When ploidy is 2, one get the heterozygote genotype as 0/1 which is understandable. In the case of ploidy equals to 4, how I can interpret such a genotype (GT:AD:DP:GQ:PL 0/0/0/1:8,3:11:5:95,0,5,24,232) given that this is a pooled sample so it is not possible to know the genotype. I just was wondering what this means.

Thank you, Homa

Created 2015-01-14 10:31:34 | Updated | Tags: haplotypecaller

Hello,

I tried to find this option but I was unsuccessful. I would like to call SNPs only on a subset of my bam files, that is only on specific chromosomes. I have now done so by getting a subset of bamfile, I was wondering if there is a way to provide Haplotype caller a bed file with the list of chromosomes I want for SNP calling in order to skip the subsetting step?

Thanks, Lucia

Created 2015-01-13 09:49:04 | Updated | Tags: haplotypecaller error variant-calling kras

Hi, I'm using GATK haplotypecaller in order to detect varianti in a tumor sample analyzed with Illumina WES 100X2 paired end mode. I know KRAS p.G12 variant (chr12:25398284) is present in my sample (because previously seen with Sanger sequencing, KRAS is the oncogenic event in this kind of tumor). I aligned the fastq file with bwa after quality control e adapter trimmig. In my bam file I'm able to view the variant at genomic coordinate chr12:25398284 as you can see in the picture (IGV):

however GATK haplotypecaller doesn't call the variant anyway. This is my basic command line:

java -Xmx4g -jar /opt/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I my.bam -stand_call_conf 0 -stand_emit_conf 0 -o output.file

and I tryed to tune a a lot of parameter such as stand_call_conf; -stand_emit_conf; --heterozygosity; --min_base_quality_score and so on.

This is the pileup string of the chr12:25398284 coordinate in my bam file 12 25398284 C 55 ,,T.,,...,,,,,,T.......,t,t.....,,,t..,,,,,,,,,,,,.^],^],^],^], BB@ECAFCECBCBBB@DBCCCDDCABADBDBCCDD@BBEADDEDBCADBB@@BAC

The base quality is good and the mapping quality too, but the haplotypecaller does not determine any active region around the position chr12:25398284

Any suggestion about the reason of misscalling??

Many thanks, Valentina

Created 2015-01-12 06:03:20 | Updated | Tags: unifiedgenotyper haplotypecaller downsample

Does UG and HC downsample before or after exclusion of marked duplicates? I guess code is king for the answer...

Created 2015-01-09 16:47:35 | Updated | Tags: bqsr haplotypecaller variantfiltration vcf bam workflows snps gatk3

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

• 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
• 02- BAM: MarkDuplicates
• 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
• 04- BAM: Calling variants using HaplotypeCaller
• 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
• 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
• 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
• 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Created 2015-01-09 14:07:52 | Updated | Tags: haplotypecaller

Hi,

I'm working on targeted resequencing data and I'm doing a multi-sample variant calling with the HaplotypeCaller. First, I tried to call the variants in all the targeted regions by doing the calling at one time on a cluster. I thus specified all the targeted regions with the -L option.

Then, as it was taking too long, I decided to cut my interval list, chromosome by chromosome and to do the calling on each chromosome. At the end, I merged the VCFs files that I had obtained for the callings on the different chromosomes.

Then, I compared this merged VCF file with the vcf file that I obtained by doing the calling on all the targeted regions at one time. I noticed 1% of variation between the two variants lists. And I can't explain this stochasticity. Any suggestion?

Thanks!

Maguelonne

Created 2015-01-07 21:16:08 | Updated | Tags: haplotypecaller genotypegvcfs

Hi

I am using GATK version 3.3 (latest release) and following the best practice for variant calling. Steps i am using is 1. Run haplotye caller and get the gVCF file 2. Run GenotypeGVCF across the samples to the get the VCF file

The Question is about the Genotype calling. Here is an example:

GVCF file looks like this for a particular position chr1 14677 . G . . END=14677 GT:DP:GQ:MIN_DP:PL 0/0:9:0:9:0,0,203

VCF file looks like this after genotyping with multiple samples. chr1 14677 . G A 76.56 PASS . . GT:AD:DP:GQ:PL 0/1:8,1:9:4:4,0,180

In GVCF file there was no support for the alternate allele, but VCF file shows that there are 1 read supporting the alternate call.

Thanks ! Saurabh

Created 2014-12-12 07:11:53 | Updated | Tags: haplotypecaller stack-trace-error

Here is my command I actually used.

##### ERROR ------------------------------------------------------------------------------------------

I am running HC3.3-0 with the following options (e.g. GENOTYPE_GIVEN_ALLELES):

$java7 -Djava.io.tmpdir=tmp -Xmx3900m \ -jar$jar \
--analysis_type HaplotypeCaller \
--reference_sequence $ref \ --input_file$BAM \
--intervals $CHROM \ --dbsnp$dbSNP \
--out $out \ -stand_call_conf 0 \ -stand_emit_conf 0 \ -A Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \ -L$allelesVCF \
-L 20:60000-70000 \
--interval_set_rule INTERSECTION \
--genotyping_mode GENOTYPE_GIVEN_ALLELES \
--alleles $allelesVCF \ --emitRefConfidence NONE \ --output_mode EMIT_ALL_SITES \  The file$allelesVCF contains these neighbouring SNPs:

20  60807   .   C   T   118.96  .
20  60808   .   G   A   46.95   .
20  61270   .   A   C   2870.18 .
20  61271   .   T   A   233.60  .


I am unable to call these neighbouring SNPs; despite reads being present in the file BAM, which shouldn't matter anyway. I also tried adding --interval_merging OVERLAPPING_ONLY to the command line, but that didn't solve the problem. What am I doing wrong? I should probably add GATK breaker/misuser to my CV... Thank you as always. P.S. The CommandLineGATK documentation does not say, what the default value for --interval_merging is. P.P.S. Iterative testing a bit slow, because HC always has to do this step: HCMappingQualityFilter - Filtering out reads with MAPQ < 20  Created 2014-12-02 16:32:32 | Updated | Tags: haplotypecaller bam gatk Hi, I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees? The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend? And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)? Thanks, Alva Created 2014-11-19 14:29:23 | Updated | Tags: haplotypecaller Hi I have been trying HaplotypeCaller to find SNPs and INDELS in viral read data (haploid) but am finding that it throws away around half of my reads and I don't understand why. A small proportion (8%) are filtered out duplicates and 0.05% fail on mapping quality but I can't account for the majority of lost reads. I appreciate that GATK wasn't built for viral sequences but would you have an idea of what could be causing this? I use the following command after marking duplicates and realigning around indels: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Ref.fasta -I realigned_reads.bam --genotyping_mode DISCOVERY -ploidy 1 -bamout reassembled.bam -o rawvariants.vcf I have also tried the same file with UnifiedGenotype and I get the result I expect i.e. most of my reads are retained and I have SNP calls that agree with a VCF constructed in a different program so I assume the reads are lost as part of the local realignment? Thanks Kirstyn Created 2014-11-18 02:40:18 | Updated | Tags: haplotypecaller Hi, I'd like to ask you if there is a paper describing the Haplotype Caller algorithm, if you could please send me the reference. I have tried to find it, but I only found the paper on GATK which is great, but it doesn't describe in detail the Haplotype Caller algorithm. thank you, Created 2014-11-17 11:59:46 | Updated | Tags: haplotypecaller genotypegvcfs combinegvcfs Hi all, I am attempting to use the HaplotyperCaller / CombineGVCFs / GenotypeGVCFs to call variants on chrom X and Y of 769 samples (356 males, 413 females) sequenced at 12x coverage (WG sequening, but right not only calling X and Y). I have called the samples according to the best practises using the HaplotypeCaller, using ploidy = 1 for males on X and Y and ploidy =2 for females on X, e.g.: INFO 16:28:45,750 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T HaplotypeCaller -L X -ploidy 1 -minPruning 3 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I /target/gpfs2/gcc/groups/gonl/projects/trio-analysis/rawdata_release2/A102.human_g1k_v37.trio_realigned.bam --sample_name A102a -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/A102a.chrX.hc.g.vcf Then I have used CombineGVCFs to combine my samples in batches of 100 samples. Now I am attempting to genotype them and I face the same issue on both X (males + females) and Y (males only): It starts running fine and then just hang on a certain position. At first it crashed asking for additional memory but now with 24Gb or memory it simply stays at a single position for 24hrs until my job eventually stops due to walltime. Here is the chrom X output: INFO 15:00:39,501 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T GenotypeGVCFs -ploidy 1 --dbsnp /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf -stand_call_conf 10 -stand_emit_conf 10 --max_alternate_alleles 4 -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.vcf -L X -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.1.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.2.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.3.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf INFO 15:00:39,507 HelpFormatter - Executing as lfrancioli@targetgcc15-mgmt on Linux 3.0.80-0.5-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 15:00:39,507 HelpFormatter - Date/Time: 2014/11/12 15:00:39 INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:40,951 GenomeAnalysisEngine - Strictness is SILENT INFO 15:00:41,153 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:57:53,416 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf.idx INFO 17:09:39,597 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf.idx INFO 18:21:00,656 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf.idx INFO 19:30:46,624 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf.idx INFO 20:22:38,368 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf.idx WARN 20:26:45,716 FSLockWithSharedLockAcquisitionTask - WARNING: Unable to lock file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx because an IOException occurred with message: No locks available. INFO 20:26:45,718 RMDTrackBuilder - Could not acquire a shared lock on index file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx, falling back to using an in-memory index for this GATK run. INFO 20:33:29,491 IntervalUtils - Processing 155270560 bp from intervals INFO 20:33:29,628 GenomeAnalysisEngine - Preparing for traversal INFO 20:33:29,635 GenomeAnalysisEngine - Done preparing for traversal INFO 20:33:29,636 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 20:33:29,637 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 20:33:29,638 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 20:33:29,948 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 20:33:59,642 ProgressMeter - X:65301 0.0 30.0 s 49.6 w 0.0% 19.8 h 19.8 h INFO 20:34:59,820 ProgressMeter - X:65301 0.0 90.0 s 149.1 w 0.0% 59.4 h 59.4 h ... INFO 20:52:01,064 ProgressMeter - X:177301 0.0 18.5 m 1837.7 w 0.1% 11.3 d 11.2 d INFO 20:53:01,066 ProgressMeter - X:177301 0.0 19.5 m 1936.9 w 0.1% 11.9 d 11.9 d ... INFO 14:58:25,243 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.0 w 95.9 w INFO 14:59:38,112 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.1 w 96.0 w INFO 15:00:47,482 ProgressMeter - X:177301 0.0 18.5 h 15250.3 w 0.1% 96.2 w 96.1 w =>> PBS: job killed: walltime 86440 exceeded limit 86400

I would really appreciate if you could give me some pointer as how to handle this situation.

Thanks! Laurent

Created 2014-11-03 13:43:38 | Updated | Tags: haplotypecaller rnaseq pooled-calls

Hello,

First of all, thank you for your detailed best practice pipeline for SNP calling from RNA-seq data.

I have pooled RNA seq data which I need to call SNP from. Each library consists of a pooled sample of 2-3 individuals of the same sex-tissue combination.

I was wondering if Haplotype caller can handle SNP calling from pooled sequences or is it better if I use FreeBayes?

I understand that these results come from experimenting with the data but it would be great if you could share your experiences with me on this.

Cheers, Homa

Created 2014-09-11 15:52:18 | Updated | Tags: vqsr haplotypecaller qualbydepth genotypegvcfs

Hey there,

How can it be possible that some of my snps or indels calls miss the QD tag? I'm doing the recommended workflow and I've tested if for both RNAseq (hard filters complains, that's how I saw those tags were missing) and Exome sequencing (VQSR). How can a hard filter applied on QD on a line without actually that tag can be considered to pass that threshold too? I'm seeing a lot more INDELs in RNAseq where this kind of scenario is happening as well.

Here's the command lines that I used :

# VQSR

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T VariantRecalibrator -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode SNP -resource:1000G,known=false,training=true,truth=false,prior=10.0 ~/References/hg19/VQSR/1000G_phase1.snps.high_confidence.b37.noGL.converted.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~/References/hg19/VQSR/hapmap_3.3.b37.noGL.nochrM.converted.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 ~/References/hg19/VQSR/1000G_omni2.5.b37.noGL.nochrM.converted.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 ~/References/hg19/VQSR/dbsnp_138.b37.excluding_sites_after_129.noGL.nochrM.converted.vcf -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -rscriptFile 96Exomes.HC.TruSeq.snv.plots.R -R ~/References/hg19/hg19.fasta --maxGaussians 4 java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T ApplyRecalibration -ts_filter_level 99.0 -mode SNP -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -o 96Exomes.HC.TruSeq.snv.recal.vcf -R ~/References/hg19/hg19.fasta

# HARD FILTER (RNASeq)

java -Djava.io.tmpdir=$LSCRATCH -Xmx2g -jar /home/apps/Logiciels/GATK/3.1-1/GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R ~/References/hg19/hg19.fasta -V 96RNAseq.STAR.q1.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o 96RNAseq.STAR.q1.FS30.QD2.vcf Here are some examples for RNAseq : chr1 6711349 . G A 79.10 PASS BaseQRankSum=-1.369e+00;ClippingRankSum=1.00;DP=635;FS=1.871;MLEAC=1;MLEAF=5.495e-03;MQ=60.00;MQ0=0;MQRankSum=-4.560e-01;ReadPosRankSum=-1.187e+00 GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,280 ./.:0,0:0 0/0:9,0:9:21:0,21,248 0/0:7,0:7:21:0,21,196 0/0:7,0:7:21:0,21,226 0/0:8,0:8:21:0,21,227 0/0:8,0:8:21:0,21,253 0/0:7,0:7:21:0,21,218 0/0:9,0:9:27:0,27,282 1/1:0,0:5:15:137,15,0 0/0:2,0:2:6:0,6,47 0/0:28,0:28:78:0,78,860 0/0:7,0:7:21:0,21,252 0/0:2,0:2:6:0,6,49 0/0:5,0:5:12:0,12,152 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,126 0/0:9,0:9:21:0,21,315 0/0:7,0:7:21:0,21,256 0/0:7,0:7:21:0,21,160 0/0:8,0:8:21:0,21,298 0/0:20,0:20:60:0,60,605 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,67 0/0:2,0:2:6:0,6,71 0/0:14,0:14:20:0,20,390 0/0:7,0:7:21:0,21,223 0/0:7,0:7:21:0,21,221 0/0:4,0:4:12:0,12,134 0/0:2,0:2:6:0,6,54 ./.:0,0:0 0/0:4,0:4:9:0,9,118 0/0:8,0:8:21:0,21,243 0/0:6,0:6:15:0,15,143 0/0:8,0:8:21:0,21,244 0/0:7,0:7:21:0,21,192 0/0:2,0:2:6:0,6,54 0/0:13,0:13:27:0,27,359 0/0:8,0:8:21:0,21,245 0/0:7,0:7:21:0,21,218 0/0:12,0:12:36:0,36,354 0/0:8,0:8:21:0,21,315 0/0:7,0:7:21:0,21,215 0/0:2,0:2:6:0,6,49 0/0:10,0:10:24:0,24,301 0/0:7,0:7:21:0,21,208 0/0:7,0:7:21:0,21,199 0/0:2,0:2:6:0,6,47 0/0:3,0:3:9:0,9,87 0/0:2,0:2:6:0,6,73 0/0:7,0:7:21:0,21,210 0/0:8,0:8:22:0,22,268 0/0:7,0:7:21:0,21,184 0/0:7,0:7:21:0,21,213 0/0:5,0:5:9:0,9,135 0/0:7,0:7:21:0,21,200 0/0:4,0:4:12:0,12,118 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,217 0/0:8,0:8:21:0,21,255 0/0:9,0:9:24:0,24,314 0/0:8,0:8:21:0,21,221 0/0:9,0:9:24:0,24,276 0/0:9,0:9:21:0,21,285 0/0:3,0:3:6:0,6,90 0/0:2,0:2:6:0,6,57 0/0:13,0:13:20:0,20,385 0/0:2,0:2:6:0,6,48 0/0:11,0:11:27:0,27,317 0/0:8,0:8:21:0,21,315 0/0:9,0:9:24:0,24,284 0/0:7,0:7:21:0,21,228 0/0:14,0:14:33:0,33,446 0/0:2,0:2:6:0,6,64 0/0:2,0:2:6:0,6,72 0/0:7,0:7:21:0,21,258 0/0:10,0:10:27:0,27,348 0/0:7,0:7:21:0,21,219 0/0:9,0:9:21:0,21,289 0/0:20,0:20:57:0,57,855 0/0:4,0:4:12:0,12,146 0/0:7,0:7:21:0,21,205 0/0:12,0:14:36:0,36,1030 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,60 0/0:7,0:7:21:0,21,226 0/0:7,0:7:21:0,21,229 0/0:8,0:8:21:0,21,265 0/0:4,0:4:6:0,6,90 ./.:0,0:0 0/0:7,0:7:21:0,21,229 0/0:2,0:2:6:0,6,59 0/0:2,0:2:6:0,6,56 chr1 7992047 . T C 45.83 SnpCluster BaseQRankSum=1.03;ClippingRankSum=0.00;DP=98;FS=0.000;MLEAC=1;MLEAF=0.014;MQ=60.00;MQ0=0;MQRankSum=-1.026e+00;ReadPosRankSum=-1.026e+00 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,70 0/0:2,0:2:6:0,6,45 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 ./.:1,0:1 ./.:0,0:0 0/0:2,0:2:6:0,6,55 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,61 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,69 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,37 ./.:0,0:0 0/0:2,0:2:6:0,6,67 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:11,0:11:24:0,24,360 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 0/0:2,0:2:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,50 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:4:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:7,0:7:21:0,21,231 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,70 ./.:0,0:0 0/0:6,0:6:15:0,15,148 ./.:0,0:0 ./.:0,0:0 1/1:0,0:2:6:90,6,0 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,74 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,58 0/0:2,0:2:6:0,6,71 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 For Exome Seq now : chr2 111878571 . C T 93.21 PASS DP=634;FS=0.000;MLEAC=1;MLEAF=5.319e-03;MQ=60.00;MQ0=0;VQSLOD=14.19;culprit=MQ GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,243 0/0:4,0:4:9:0,9,135 0/0:7,0:7:18:0,18,270 0/0:7,0:7:21:0,21,230 0/0:16,0:16:48:0,48,542 0/0:8,0:8:21:0,21,315 0/0:6,0:6:18:0,18,186 0/0:5,0:5:15:0,15,168 0/0:6,0:6:15:0,15,225 0/0:10,0:10:30:0,30,333 0/0:7,0:7:21:0,21,239 0/0:6,0:6:18:0,18,202 0/0:6,0:6:15:0,15,225 0/0:7,0:7:21:0,21,225 0/0:8,0:8:24:0,24,272 0/0:5,0:5:15:0,15,168 1/1:0,0:13:13:147,13,0 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,256 0/0:14,0:14:4:0,4,437 0/0:3,0:3:9:0,9,85 0/0:4,0:4:12:0,12,159 0/0:7,0:7:21:0,21,238 0/0:5,0:5:15:0,15,195 0/0:7,0:7:15:0,15,225 0/0:12,0:12:36:0,36,414 0/0:4,0:4:12:0,12,156 0/0:7,0:7:0:0,0,190 0/0:2,0:2:6:0,6,64 0/0:7,0:7:21:0,21,242 0/0:7,0:7:21:0,21,234 0/0:8,0:8:24:0,24,267 0/0:7,0:7:21:0,21,245 0/0:7,0:7:21:0,21,261 0/0:6,0:6:18:0,18,204 0/0:8,0:8:24:0,24,302 0/0:5,0:5:15:0,15,172 0/0:9,0:9:24:0,24,360 0/0:18,0:18:51:0,51,649 0/0:5,0:5:15:0,15,176 0/0:2,0:2:6:0,6,70 0/0:14,0:14:33:0,33,495 0/0:4,0:4:9:0,9,135 0/0:8,0:8:21:0,21,315 0/0:4,0:4:12:0,12,149 0/0:4,0:4:6:0,6,90 0/0:10,0:10:27:0,27,405 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,133 0/0:14,0:14:6:0,6,431 0/0:4,0:4:12:0,12,151 0/0:5,0:5:15:0,15,163 0/0:3,0:3:9:0,9,106 0/0:7,0:7:21:0,21,237 0/0:7,0:7:21:0,21,268 0/0:8,0:8:21:0,21,315 0/0:2,0:2:6:0,6,68 ./.:0,0:0 0/0:3,0:3:9:0,9,103 0/0:7,0:7:21:0,21,230 0/0:3,0:3:6:0,6,90 0/0:9,0:9:26:0,26,277 0/0:7,0:7:21:0,21,236 0/0:5,0:5:15:0,15,170 ./.:1,0:1 0/0:15,0:15:45:0,45,653 0/0:8,0:8:24:0,24,304 0/0:6,0:6:15:0,15,225 0/0:3,0:3:9:0,9,103 0/0:2,0:2:6:0,6,79 0/0:7,0:7:21:0,21,241 0/0:4,0:4:12:0,12,134 0/0:3,0:3:6:0,6,90 0/0:5,0:5:15:0,15,159 0/0:4,0:4:12:0,12,136 0/0:5,0:5:12:0,12,180 0/0:11,0:11:21:0,21,315 0/0:13,0:13:39:0,39,501 0/0:3,0:3:9:0,9,103 0/0:8,0:8:24:0,24,257 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,280 0/0:4,0:4:12:0,12,144 0/0:4,0:4:9:0,9,135 0/0:8,0:8:24:0,24,298 0/0:4,0:4:12:0,12,129 0/0:5,0:5:15:0,15,184 0/0:2,0:2:6:0,6,62 0/0:2,0:2:6:0,6,65 0/0:9,0:9:27:0,27,337 0/0:7,0:7:21:0,21,230 0/0:7,0:7:21:0,21,239 0/0:5,0:5:0:0,0,113 0/0:11,0:11:33:0,33,369 0/0:7,0:7:21:0,21,248 0/0:10,0:10:30:0,30,395 Thanks for your help. Created 2014-08-28 02:16:12 | Updated 2014-08-28 02:40:19 | Tags: haplotypecaller minpruning Hi GATK developer, I am working on a human WES project with an average sequencing depth 42X per individual. Recent version of GATK (v3.2-2) was used in my study. To investigate which "-minPruning" setting is better for my project, I ran HaplotypeCaller and tested the "-minPruning 3" and "-minPruning 2", so I generate two sets of gvcf files. I found some interesting cases when I compared these two sets of files. For example in the same individual (we name this individual as A): Locus "chr1 1654007" from the GVCF file for individual A (-minPruning 3) is listed as below: chr1 1654007 . A . . END=1654007 GT:DP:GQ:MIN_DP:PL 0/0:64:0:64:0,0,1922 Locus "chr1 1654007" from the GVCF file for individual A (-minPruning 2) is listed as below: chr1 1654007 rs199558549 A T, 155.90 . DB;DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=27.73;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,6,0:6:18:189,18,0,189,18,189:0,0,4,2 You can see it is quite different for the results in terms of the final genotyping call and the sequencing depth ("DP" tag) generated from "-minPruning 3" and "-minPruning 2" setting. I am wondering what exactly happens for this locus so I check this locus using IGV. I found at this locus, there are 6 reads supporting the ALT allele ("T") and 64 reads supporting the REF allele ("A"). So I believe the result from "-minPruning 3" should be correct. Just wondering is it normal considering the different setting of "-minPruning 3" and "-minPruning 2"or is it just a bug? If it is just because of the different setting of "-minPruning 3" and "-minPruning 2", I would say the setting of "-minPruning" is critical for the final results so a guideline of how to set the "-minPruning" based on the sequencing depth will be important. How do you think? Thanks, Qiongyi Created 2014-06-30 18:53:29 | Updated | Tags: haplotypecaller gvcf Hello, It seems the banded gvcf produced by HaplotypeCaller is missing many positions in the genome. We are not able to find half the specific positions we want to look at. To be clear, these positions are not even contained in any band, they are simply not there. For example, we were looking for position 866438 on chromosome 1 and these are the relevant lines in the gvcf: 1 866433 . G C, 412.77 . BaseQRankSum=-1.085;DP=16;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.519;ReadPosRankSum=0.651 GT:AD:GQ:PL:SB 0/1:15,1,0:99:441,0,23572,546,23612,24157:0,0,0,0 1 866434 . G . . END=866435 GT:DP:GQ:MIN_DP:PL 0/0:22:45:21:0,45,675 1 866436 . ACGTG A, 403.73 . BaseQRankSum=-0.510;DP=17;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.714;ReadPosRankSum=0.434 GT:AD:GQ:PL:SB 0/1:16,1,0:99:441,0,23572,546,23612,24157:0,0,0,0 1 866441 . A . . END=866441 GT:DP:GQ:MIN_DP:PL 0/0:21:0:21:0,0,402 We looked at the corresponding bam files and there seems to be good coverage for these positions. Moreover, if HaplotypeCaller is run in BP_RESOLUTION mode, then these positions are present in the resulting vcf. Any idea what might be going wrong? Thanks, Juber Created 2014-06-27 09:45:39 | Updated 2014-06-27 09:46:05 | Tags: depthofcoverage haplotypecaller Hi we've been looking at results of a recent run of GATK-HC (3.1-1) using the new N1 pipeline and we've been seeing something odd in the distribution of the Depth of Coverage (Using DP from the Genotype Fields) we're seeing for the raw unfiltered variants. All our samples are sequenced using PCR-Free libraries and have two lanes of sequence (~24x mapped depth) and looking at depth of coverage from bedtools we see a nice clean distribution (red in graph) but when we look at the data from the HaplotypeCaller sites (Blue in graph) we see a bimodal distribution with an excess of variants called at lower coverage (~16x) vs the most common coverage of around 24x. We've seen this in all the samples we've looked at so far, so it's not just a one off. I've had a quick look at read depth from another variant caller (Platypus) and there we see no evidence of this bimodal distribution in the variants it has called. Is this expected behaviour? If so why does it occur? If not any idea what is going on here, is it a bug in the variant caller or the depth statistics? Do you see the same thing in other datasets? Thanks! Created 2014-06-11 17:47:15 | Updated | Tags: haplotypecaller java multi-sample I'm running the HaplotypeCaller on a series of samples using a while loop in a bash script and for some samples the HaplotypeCaller is stopping part way through the file. My command was: java -Xmx18g -jar$Gpath/GenomeAnalysisTK.jar \ -nct 8 \ -l INFO \ -R $ref \ -log$log/$plate.$prefix.HaplotypeCaller.log \ -T HaplotypeCaller \ -I $bam/$prefix.realign.bam \ --emitRefConfidence GVCF \ -variant_index_type LINEAR \ -variant_index_parameter 128000 \ -o $gvcf/$prefix.GATK.gvcf.vcf

Most of the samples completed and the output looks good, but for some I only have a truncated gvcf file with no index. When I look at the log it looks like this:

INFO  17:25:15,289 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:25:15,291 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO  17:25:15,291 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:25:15,294 HelpFormatter - Program Args: -nct 8 -l INFO -R /home/owens/ref/Gasterosteus_aculeatus.BROADS1.73.dna.toplevel.fa -log /home/owens/SB/C31KCACXX05.log/C31KCACXX05.sb1Pax102L-S2013.Hap
INFO  17:25:15,296 HelpFormatter - Executing as owens@GObox on Linux 3.2.0-63-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_17-b02.
INFO  17:25:15,296 HelpFormatter - Date/Time: 2014/06/10 17:25:15
INFO  17:25:15,296 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:25:15,296 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:25:15,722 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:25:15,892 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO  17:25:15,898 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 17:25:15,942 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO  17:25:15,948 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO  17:25:15,993 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 12 processors available on this machine
INFO  17:25:16,097 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  17:25:16,114 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:25:16,114 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:25:16,114 ProgressMeter -        Location processed.active regions  runtime per.1M.active regions completed total.runtime remaining
INFO  17:25:16,114 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
INFO  17:25:16,116 HaplotypeCaller - All sites annotated with PLs force to true for reference-model confidence output
INFO  17:25:16,278 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO  17:25:46,116 ProgressMeter - scaffold_1722:1180        1.49e+05   30.0 s        3.3 m      0.0%        25.6 h    25.6 h
INFO  17:26:46,117 ProgressMeter - scaffold_279:39930        1.37e+07   90.0 s        6.0 s      3.0%        50.5 m    49.0 m
INFO  17:27:16,118 ProgressMeter - scaffold_139:222911        2.89e+07  120.0 s        4.0 s      6.3%        31.7 m    29.7 m
INFO  17:27:46,119 ProgressMeter - scaffold_94:517387        3.89e+07    2.5 m        3.0 s      8.5%        29.2 m    26.7 m
INFO  17:28:16,121 ProgressMeter - scaffold_80:591236        4.06e+07    3.0 m        4.0 s      8.9%        33.6 m    30.6 m
INFO  17:28:46,123 ProgressMeter - groupXXI:447665        6.07e+07    3.5 m        3.0 s     13.3%        26.4 m    22.9 m
INFO  17:29:16,395 ProgressMeter -  groupV:8824013        7.25e+07    4.0 m        3.0 s     17.6%        22.7 m    18.7 m
INFO  17:29:46,396 ProgressMeter - groupXIV:11551262        9.93e+07    4.5 m        2.0 s     24.0%        18.7 m    14.2 m
WARN  17:29:52,732 ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at groupX:1516679 has 8 alternate alleles so only the top alleles
INFO  17:30:19,324 ProgressMeter - groupX:14278234        1.15e+08    5.1 m        2.0 s     27.9%        18.1 m    13.0 m
INFO  17:30:49,414 ProgressMeter - groupXVIII:5967453        1.46e+08    5.6 m        2.0 s     33.0%        16.8 m    11.3 m
INFO  17:31:19,821 ProgressMeter - groupXI:15030145        1.63e+08    6.1 m        2.0 s     38.5%        15.7 m     9.7 m
INFO  17:31:50,192 ProgressMeter - groupVI:5779653        1.96e+08    6.6 m        2.0 s     43.8%        15.0 m     8.4 m
INFO  17:32:20,334 ProgressMeter - groupXVI:18115788        2.13e+08    7.1 m        1.0 s     50.1%        14.1 m     7.0 m
INFO  17:32:50,335 ProgressMeter - groupVIII:4300439        2.50e+08    7.6 m        1.0 s     55.1%        13.7 m     6.2 m
INFO  17:33:30,336 ProgressMeter - groupXIII:2378126        2.89e+08    8.2 m        1.0 s     63.1%        13.0 m     4.8 m
INFO  17:34:02,099 GATKRunReport - Uploaded run statistics report to AWS S3


It seems like it got half way through and stopped. I think it's a memory issue because when I increased the available ram to java, the problem happens less, although I can't figure out why some samples work and others don't (there isn't anything else running on the machine using ram and the biggest bam files aren't failing). It's also strange to me that there doesn't seem to be an error message. Any insight into why this is happening and how to avoid it would be appreciated.

Created 2014-03-25 19:06:06 | Updated 2014-03-25 19:07:22 | Tags: haplotypecaller

Hi,

In our downstream analysis we have to differentiate between non-variants because of no coverage and confident reference calls. Because of this, we need to pay close attention to HaplotypeCaller non-variant calls. Doing this we found calls like this one:

17 7127955 . G <NON_REF> . . END=7127955 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,468

Would it be possible to get a feeling why this initial call is giving the same PL(0) to 0/0 and 0/1. If you look at the alignment,after "Data Pre-processing" best practices, we can see there is strong evidence this is a 0/0 position.

GenotypeGVCFs won't call this position as variant, which is good.

Thanks, Carlos

PS: I tried to add a BAM file with the relevant region and I got "Uploaded file type is not allowed".

Created 2013-08-16 21:12:22 | Updated | Tags: haplotypecaller

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.

Thanks,

Elise

Created 2013-08-13 18:41:40 | Updated | Tags: haplotypecaller downsample downsampling

Hello,

I'm running the Haplotype Caller on 80 bams, multithreaded (-nt 14), with either vcfs or bed files as the intervals file. I am trying to downsample to approximately 200 reads (-dcov 200). However, my output vcfs clearly have much higher depth (over 300-500 reads) in some areas. Also, HC drastically slows down whenever it hits a region of very deep pile-ups (over 1000 reads).

Is there any way to speed up HC on regions of very deep pileups? Is there a way to make downsampling stricter?

Thank you for your help. Elise

Created 2013-02-07 11:05:27 | Updated 2013-02-08 20:22:14 | Tags: haplotypecaller ploidy

I'm trying to run HaplotypeCaller on a haploid organism. Is this possible? What argument should I use for this? My first attempt produced a diploid calls.

Sorry for the silly question