# Tagged with #haplotypecaller 5 documentation articles | 3 announcements | 91 forum discussions

This document describes the procedure used by HaplotypeCaller to evaluate the evidence for variant alleles based on candidate haplotypes determined in the previous step for a given ActiveRegion. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

### Overview

The previous step produced a list of candidate haplotypes for each ActiveRegion, as well as a list of candidate variant sites borne by the non-reference haplotypes. Now, we need to evaluate how much evidence there is in the data to support each haplotype. This is done by aligning each sequence read to each haplotype using the PairHMM algorithm, which produces per-read likelihoods for each haplotype. From that, we'll be able to derive how much evidence there is in the data to support each variant allele at the candidate sites, and that produces the actual numbers that will finally be used to assign a genotype to the sample.

### 1. Evaluating the evidence for each candidate haplotype

We originally obtained our list of haplotypes for the ActiveRegion by constructing an assembly graph and selecting the most likely paths in the graph by counting the number of supporting reads for each path. That was a fairly naive evaluation of the evidence, done over all reads in aggregate, and was only meant to serve as a preliminary filter to whittle down the number of possible combinations that we're going to look at in this next step.

Now we want to do a much more thorough evaluation of how much evidence we have for each haplotype. So we're going to take each individual read and align it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm (see Durbin et al., 1998). If you're not familiar with PairHMM, it's a lot like the BLAST algorithm, in that it's a pairwise alignment method that uses a Hidden Markov Model (HMM) and produces a likelihood score. In this use of the PairHMM, the output score expresses the likelihood of observing the read given the haplotype by taking into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores).

This produces a big table of likelihoods where the columns are haplotypes and the rows are individual sequence reads. (example figure TBD)

The table essentially represents how much supporting evidence there is for each haplotype (including the reference), itemized by read.
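As an illustration of the idea only, here is a toy pair-HMM forward algorithm in Python. This is a simplified sketch, not GATK's actual PairHMM implementation: the gap parameters are invented, indel qualities are ignored, and only base qualities inform the match/mismatch emission probabilities.

```python
def pairhmm_lik(read, quals, hap, gap_open=0.001, gap_ext=0.1):
    """Toy pair-HMM forward algorithm: P(read | haplotype), summing over
    all alignments with match (M), insert (I) and delete (D) states.
    Parameter values are illustrative, not GATK's."""
    n, m = len(read), len(hap)
    mm = 1.0 - 2.0 * gap_open   # M -> M transition
    gm = 1.0 - gap_ext          # I/D -> M transition (gap close)
    # fX[i][j]: prob of emitting read[:i] against hap[:j], ending in state X
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fI = [[0.0] * (m + 1) for _ in range(n + 1)]
    fD = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        fD[0][j] = 1.0 / m      # free start: alignment may begin anywhere
    for i in range(1, n + 1):
        err = 10 ** (-quals[i - 1] / 10.0)   # base quality -> error prob
        for j in range(1, m + 1):
            e = (1 - err) if read[i - 1] == hap[j - 1] else err / 3
            fM[i][j] = e * (mm * fM[i-1][j-1] + gm * fI[i-1][j-1] + gm * fD[i-1][j-1])
            fI[i][j] = 0.25 * (gap_open * fM[i-1][j] + gap_ext * fI[i-1][j])
            fD[i][j] = gap_open * fM[i][j-1] + gap_ext * fD[i][j-1]
    # free end: sum over all haplotype positions where the read may stop
    return sum(fM[n][j] + fI[n][j] for j in range(m + 1))
```

Running each read against each haplotype with a function like this fills in one cell of the read-by-haplotype likelihood table: a read that matches a haplotype exactly scores higher than one with a mismatch.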

### 2. Evaluating the evidence for each candidate site and corresponding alleles

Having per-read likelihoods for entire haplotypes is great, but ultimately we want to know how much evidence there is for individual alleles at the candidate sites that we identified in the previous step. To find out, we take the per-read likelihoods of the haplotypes and marginalize them over alleles, which produces per-read likelihoods for each allele at a given site. In practice, this means that for each candidate site, we're going to decide how much support each read contributes for each allele, based on the per-read haplotype likelihoods that were produced by the PairHMM.

This may sound complicated, but the procedure is actually very simple -- there is no real calculation involved, just cherry-picking appropriate values from the table of per-read likelihoods of haplotypes into a new table that will contain per-read likelihoods of alleles. This is how it happens. For a given site, we list all the alleles observed in the data (including the reference allele). Then, for each read, we look at the haplotypes that support each allele; we select the haplotype that has the highest likelihood for that read, and we write that likelihood in the new table. And that's it! For a given allele, the total likelihood will be the product of all the per-read likelihoods. (example fig TBD)
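The cherry-picking step can be sketched as follows. The likelihood values (log10) and the haplotype-to-allele mapping here are invented for illustration:

```python
# Hypothetical per-read haplotype likelihoods (log10) from the PairHMM step
read_hap_liks = {
    "read1": {"hapRef": -4.2, "hapAlt1": -1.1, "hapAlt2": -6.0},
    "read2": {"hapRef": -0.9, "hapAlt1": -5.3, "hapAlt2": -5.1},
}
# Which haplotypes carry which allele at the candidate site (assumed mapping)
allele_to_haps = {"T": ["hapRef"], "G": ["hapAlt1", "hapAlt2"]}

def marginalize(read_hap_liks, allele_to_haps):
    """For each read and allele, keep the best likelihood among the
    haplotypes carrying that allele: pure selection, no computation."""
    table = {}
    for read, liks in read_hap_liks.items():
        table[read] = {allele: max(liks[h] for h in haps)
                       for allele, haps in allele_to_haps.items()}
    return table
```

In this example, read1 ends up supporting the G allele (via hapAlt1) and read2 the reference T allele, exactly as the haplotype likelihoods suggest.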

At the end of this step, sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant, and a genotype will be assigned to the sample in the next (final) step.

This document describes the procedure used by HaplotypeCaller to define ActiveRegions on which to operate as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

### Summary

To define active regions, the HaplotypeCaller operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints.

### 1. Calculating the raw activity profile

Active regions are determined by calculating a profile function that characterizes “interesting” regions likely to contain variants. The raw profile is first calculated locus by locus.

In the normal case (no special mode is enabled) the per-position score is the probability that the position contains a variant as calculated using the reference-confidence model applied to the original alignment.

If using the mode for genotyping given alleles (GGA) or the advanced-level flag -useAlleleTrigger, and the site is overlapped by an allele in the VCF file provided through the -alleles argument, the score is set to 1. If the position is not covered by a provided allele, the score is set to 0.

This operation gives us a single raw value for each position on the genome (or within the analysis intervals requested using the -L argument).

### 2. Smoothing the activity profile

The final profile is calculated by smoothing the initial raw profile in three steps. The first two steps spread each position's raw profile value to contiguous bases; as a result, each position accumulates more than one raw profile value. These are added up in the third and last step to obtain a final, single smoothed value per position.

1. Unless one of the special modes is enabled (GGA or allele triggering), the position's profile value is copied over to adjacent positions if enough high-quality soft-clipped bases immediately precede or follow that position in the original alignment. At the time of writing, high-quality soft-clipped bases are those with a quality score of Q29 or more, and we consider there to be enough of them when the average number of high-quality bases per soft clip is 7 or more. In that case the site's profile value is copied to all bases within a radius of that position equal to the average soft-clip length, without exceeding a maximum of 50bp.

2. Each profile value is then divided and spread out using a Gaussian kernel covering up to a 50bp radius centered at its current position, with a standard deviation (sigma) set using the -bandPassSigma argument (current default: 17 bp). The larger the sigma, the broader the spread.

3. For each position, the final smoothed value is calculated as the sum of all its profile values after steps 1 and 2.
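Steps 2 and 3 can be sketched as a simple Gaussian band-pass filter. This is a toy version; GATK's actual implementation may differ in details such as kernel normalization and edge handling:

```python
import math

def smooth_profile(raw, sigma=17.0, max_radius=50):
    """Spread each raw activity value with a Gaussian kernel capped at a
    50bp radius, then sum the contributions landing on each position."""
    radius = min(max_radius, int(4 * sigma))
    # normalized Gaussian weights over the allowed radius
    w = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in range(-radius, radius + 1)]
    total = sum(w)
    w = [x / total for x in w]
    out = [0.0] * len(raw)
    for pos, val in enumerate(raw):
        if val == 0.0:
            continue   # nothing to spread from inactive positions
        for k, d in enumerate(range(-radius, radius + 1)):
            j = pos + d
            if 0 <= j < len(raw):
                out[j] += val * w[k]
    return out
```

A single active position thus becomes a smooth bump of activity up to 50bp on either side, which is what lets the threshold step carve out contiguous regions rather than isolated points.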

### 3. Setting the ActiveRegion thresholds and intervals

The resulting profile line is cut at the points where it crosses the inactive-to-active threshold (currently set to 0.002). Then we make some adjustments to these boundaries so that regions that are to be considered active, with a profile running over that threshold, fall within the minimum (fixed at 50bp) and maximum region size (customizable using -activeRegionMaxSize).

• If the region size falls within the limits we leave it untouched (it's good to go).

• If the region size is shorter than the minimum, it is greedily extended forward, ignoring that cut point, and we check the size constraints again. Only if this is not possible because we hit a hard limit (the end of the chromosome or of the requested analysis interval) do we accept the small region as it is.

• If it is too long, we find the lowest local minimum between the minimum and maximum region sizes. A local minimum is a profile value with a larger value immediately upstream (-1bp) and an equal or larger value immediately downstream (+1bp). In case of a tie, the minimum further downstream takes precedence. If there is no local minimum, we simply force the cut so that the region has the maximum active region size.
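The local-minimum rule for over-long regions can be sketched like this. It is a toy interpretation of the description above, not GATK's source code:

```python
def find_cut_point(profile, min_size, max_size):
    """Pick a cut point for an over-long region: the lowest local minimum
    between min_size and max_size; ties go to the position further
    downstream; fall back to max_size if no local minimum exists."""
    best = None
    for i in range(min_size, min(max_size, len(profile) - 1)):
        # local minimum: larger value at -1bp, equal-or-larger at +1bp
        if profile[i - 1] > profile[i] and profile[i + 1] >= profile[i]:
            if best is None or profile[i] <= profile[best]:
                best = i   # '<=' keeps the further-downstream tie
    return best if best is not None else max_size
```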

Of the resulting regions, those with a profile that runs over the threshold are considered active regions and progress to variant discovery and/or calling, whereas regions whose profile runs under the threshold are considered inactive and are discarded, unless we are running HC in reference confidence mode.

There is a final post-processing step to clean up and trim the ActiveRegion:

• Remove bases at each end of the read (hard-clipping) until there is a base with a call quality equal to or greater than the minimum base quality score (customizable parameter -mbq, 10 by default).

• Include or exclude remaining soft-clipped ends. Soft-clipped ends will be used for assembly and calling unless the user has requested their exclusion (using -dontUseSoftClippedBases), provided the read and its mate map to the same chromosome and are in the standard orientation (i.e. LR and RL).

• Discard all reads that no longer overlap with the ActiveRegion after the trimming operations described above.

• Downsample remaining reads to a maximum of 1000 reads per sample, while respecting a minimum of 5 reads starting per position. This is performed after any downsampling by the traversal itself (-dt, -dfrac, -dcov etc.) and cannot be overridden from the command line.

### Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

### Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

### General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

### The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.
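A minimal sketch of the banding idea, using the GQ boundaries shown in the example header below. The band limits are taken from the ##GVCFBlock lines; a real block record also tracks MIN_DP and representative PLs, which this sketch omits:

```python
# GQ bands as declared in the ##GVCFBlock header lines
BANDS = [(0, 5), (5, 20), (20, 60), (60, 2147483647)]  # [min, max) per band

def band_index(gq):
    """Return the index of the band containing this GQ value."""
    for i, (lo, hi) in enumerate(BANDS):
        if lo <= gq < hi:
            return i
    raise ValueError(gq)

def make_blocks(site_gqs):
    """Merge consecutive non-variant sites whose GQ falls in the same band
    into (start, end, band) block records (1-based, inclusive)."""
    blocks = []
    for pos, gq in enumerate(site_gqs, start=1):
        b = band_index(gq)
        if blocks and blocks[-1][2] == b and blocks[-1][1] == pos - 1:
            blocks[-1] = (blocks[-1][0], pos, b)   # extend the current block
        else:
            blocks.append((pos, pos, b))           # start a new block
    return blocks
```

Adjacent sites only merge when their GQ values fall in the same band, which is why the example records further down show several single-position blocks in a row where GQ fluctuates.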

### Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta


Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)


which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

#### Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0


Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
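When consuming a gVCF programmatically, the END tag is what lets you recover the footprint of each non-variant block. A minimal parser sketch (whitespace-delimited to match the snippet above; real VCF records are tab-delimited):

```python
def parse_gvcf_line(line):
    """Extract (chrom, start, end, is_variant) from a gVCF record line.
    Non-variant blocks carry an END=... key in INFO; variant records
    end where they start."""
    fields = line.split("\t") if "\t" in line else line.split()
    chrom, pos, _, ref, alt = fields[:5]
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    end = int(info.get("END", pos))
    is_variant = alt != "<NON_REF>"   # variant records list real ALT alleles too
    return chrom, int(pos), end, is_variant
```

Applied to the first record above, this yields a block spanning 20:10000000-10000116, matching the description in the text.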

### Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.

As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.

### Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

• If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
• If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
• If you want to analyze more than 100 samples at a time (for performance reasons; this applies to GATK versions 2.x)

#### Objective

Call variants on a diploid genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

#### Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

• TBD

#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

• Emission confidence threshold (--stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (--stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.
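The interplay of the two thresholds can be summarized in a few lines; this is an illustrative sketch using the tutorial's example values of 10 and 30, not GATK's internal code:

```python
def classify_site(qual, emit_conf=10.0, call_conf=30.0):
    """Apply the emission and calling confidence thresholds to a site's
    phred-scaled confidence score (threshold values are illustrative)."""
    if qual < emit_conf:
        return "not emitted"      # below the emission threshold: dropped
    if qual < call_conf:
        return "LowQual"          # emitted but marked filtered
    return "PASS"                 # emitted and called
```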

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, a site was called if it passed the confidence threshold test. Filtered is even more confusing: in ordinary language, saying that sites were filtered usually means they successfully passed a filtering test, whereas in the GATK's technical language it means they failed the filtering test. In effect, those sites would be filtered out if the filter were used to actually remove low-confidence calls from the callset, instead of just tagging them. Both usages are valid depending on the point of view of the person reporting the results, so it's always important to check the context when interpreting results that include these terms.

### 2. Call variants in your sequence data

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-I preprocessed_reads.bam \  # can be reduced or not
-L 20 \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o raw_variants.vcf


Note: This is an example command. Please look up what the arguments do and see if they fit your analysis before copying this. To see how the -L argument works, you can refer here: http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals#latest

#### Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinction made by the caller itself between low-confidence calls and the rest is very primitive and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering) in order to move on to downstream analyses with the highest-quality call set possible.

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

## Haplotype Caller

• Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
• Physical phasing information is output by default in new sample-level PID and PGT tags.
• Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Fixed IndexOutOfBounds error associated with tail merging.

## Variant Recalibrator

• New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.

## GenotypeGVCFs

• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
• Changed the default GVCF GQ Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution.
• Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.

## RNAseq analysis

• Bug fixes for working with unmapped reads.

## CalculateGenotypePosteriors

• New annotation for low- and high-confidence possible de novos (only annotates biallelics).
• FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
• Restricted population priors based on discovered allele count to be valid for 10 or more samples.

## DepthOfCoverage

• Fixed rare bug triggered by hash collision between sample names.

## SelectVariants

• Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.

• Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
• The performance penalty associated with filtering by read group has been essentially eliminated.

## Annotations

• StrandOddsRatio is now a standard annotation that is output by default.
• We used to output zero for FS if there was no data available at a site, now we omit FS.
• Extensive rewrite of the annotation documentation.

## Queue

• Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
• Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton)

## Documentation

• We now generate PHP files instead of HTML.
• We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.

## Miscellaneous

• Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
• htsjdk updated to version 1.120.1620

Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.

#### UPDATE:

We have added omniploidy support to the GVCF handling tools, with the following limitations:

• When running, you need to indicate the sample ploidy that was used to generate the GVCFs with -ploidy. As usual 2 is the default ploidy.

• The system does not support mixed ploidy across samples or positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have a genotype whose ploidy does not match the -ploidy argument.

#### LATEST UPDATE:

As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.

As reported here:

If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.

Thank you for your patience while we work to fix this issue.

### Latest update

We found that the three tools (PrintReads, HaplotypeCaller and UnifiedGenotyper) had different issues that only produced the same symptom (the ArrayIndexOutOfBoundsException error). The underlying issues have all been fixed in version 2.5. If you encounter an ArrayIndexOutOfBoundsException in later versions, it is certainly caused by a different problem, so please open a new thread to report the issue.

Hi, I have a general question about the importance of known-sites VCFs (for BQSR and HC) and resource files (for VQSR). I am working on rice, for which the only known sites are the dbSNP VCF files, which are built on a genomic version older than the reference FASTA file I am using as a basis. How does this affect the quality/accuracy of variants? How important is it to have the exact same build of the genome as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use a version built on a different build of the genome for the same species? In other words, which steps (BQSR, HC, VQSR etc.) can be performed without the known sites/resource file? If the answers to the above questions are too detailed, can you please point me to any document, if available, that might address this issue?

Thanks, NB

Hi GATK team,

I have exome samples, some male and some female. I mapped the females to a reference genome without the Y chromosome, and continued with the Best Practices steps for each sample. The reason for that mapping is that we don't want to lose some of the female reads to the homologous regions of chr Y. Will I be able to run GenotypeGVCFs on those samples? Is this a good way to do the genotyping on the sex chromosomes?

I am using HC 3.3-0-g37228af to generate GVCFs, including the parameters (full command below):

-stand_emit_conf 10 -stand_call_conf 30

The process completes fine, but when I look at the header of the gvcf produced, they are shown as follows:

standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0

After trying various tests, it appears that setting these values is incompatible with -ERC GVCF (which requires "-variant_index_type LINEAR" and "-variant_index_parameter 128000" )

1) Can you confirm whether this is expected behaviour, and why this should be so?
2) Is this another case where the GVCF is an intermediate file, and hence every possible variant is emitted initially?
3) Regardless of the answers above, is stand_call_conf 30 equivalent to requiring a GQ of 30?
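On question 3: stand_call_conf thresholds the site-level QUAL, while GQ measures confidence in the genotype assignment, so the two are not interchangeable even though both are Phred-scaled. A minimal sketch of the Phred scale (the conversion itself is standard; the interpretation comments are mine):

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled score Q into the probability the call is wrong."""
    return 10 ** (-q / 10.0)

def prob_to_phred(p):
    """Inverse conversion: error probability to Phred score."""
    return -10.0 * math.log10(p)

# QUAL >= 30 (stand_call_conf 30) bounds P(no variant exists at the site) at 0.001;
# GQ >= 30 bounds P(the assigned genotype is wrong) at 0.001 -- related,
# but thresholds on two different events.
assert abs(phred_to_error_prob(30) - 0.001) < 1e-12
assert round(prob_to_phred(0.01)) == 20
```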

    java -Xmx11200m -Djava.io.tmpdir=TMPDIR -jar /apps/GATK/3.3-0/GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -I /E000007/target_indel_realignment/E000007.6.bqsr.bam \
        -R /project/production/Indexes/samtools/hsapiens.hs37d5.fasta \
        -et NO_ET \
        -K /project/production/DAT/apps/GATK/2.4.9/ourkey \
        -dt NONE \
        -L 10 \
        -A AlleleBalance -A BaseCounts -A BaseQualityRankSumTest -A ChromosomeCounts \
        -A ClippingRankSumTest -A Coverage -A DepthPerAlleleBySample -A DepthPerSampleHC \
        -A FisherStrand -A GCContent -A HaplotypeScore -A HardyWeinberg -A HomopolymerRun \
        -A ClippingRankSumTest -A LikelihoodRankSumTest -A LowMQ -A MappingQualityRankSumTest \
        -A MappingQualityZero -A MappingQualityZeroBySample -A NBaseCount -A QualByDepth \
        -A RMSMappingQuality -A ReadPosRankSumTest -A StrandBiasBySample -A StrandOddsRatio \
        -A VariantType \
        -ploidy 2 \
        --min_base_quality_score 10 \
        -ERC GVCF \
        -variant_index_type LINEAR \
        -variant_index_parameter 128000 \
        --GVCFGQBands 20 \
        --standard_min_confidence_threshold_for_calling 30 \
        --standard_min_confidence_threshold_for_emitting 10

Hi Sheila and Geraldine, when I run HaplotypeCaller (v3.3-0-g37228af) in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF, I get the error:

    Invalid command line: Argument ERC/gt_mode has a bad value: you cannot request reference confidence output and GENOTYPE_GIVEN_ALLELES at the same time

It is however strange that GENOTYPE_GIVEN_ALLELES is mentioned in the --max_alternate_alleles section of the GenotypeGVCFs documentation. Maybe I'm missing something? Thanks, Gerrit

Hello, I was wondering whether there is any paper or document that describes the mathematical models of HaplotypeCaller? Thanks in advance, Homa

Dear all, I need to generate a VCF file with GATK, and I need to have TI (transcript information) in the INFO field and VF and GQX in the FORMAT field. Could you help me please with the arguments?
My arguments to call variants are:

    java -jar gatk -T $variant_caller -R $reference -I in.bam -D dbsnp -L bed_file -o .raw.vcf

I still have no TI info, and in FORMAT I have GT:AD:DP:GQ:PL but I need GT:AD:DP:GQ:PL:VF:GQX.

Thank you for any help with command line arguments.

Paul.
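For what it's worth, TI, VF and GQX are fields produced by Illumina's own analysis pipelines (e.g. CASAVA/MiSeq Reporter), not GATK annotations, so no HaplotypeCaller argument will emit them; they would have to be computed in post-processing. GQX is commonly defined as min(GQ, QUAL). A hypothetical sketch of that definition (the function name and the min(GQ, QUAL) formula are assumptions about the Illumina convention, not GATK behaviour):

```python
def gqx(gq, qual):
    # GQX as used by Illumina pipelines (assumed here): the minimum of the
    # genotype quality and the site QUAL -- "both the site and the genotype
    # are at least this confident".
    return min(int(gq), int(qual))

assert gqx(99, 788.77) == 99   # high site QUAL: GQ is the limiting factor
assert gqx(45, 30.5) == 30     # low site QUAL caps the combined confidence
```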

Hey GATK team,

Are the guidelines suggested in this forum page applicable to filtering calls made by GATK 3.3-0's HaplotypeCaller?

Cheers, Mika
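The guidelines in question are the generic hard-filtering thresholds for SNPs, which also appear verbatim in a workflow later in this thread (QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0). Their logic can be sketched in Python as below; this is an illustration only, not GATK's VariantFiltration, and the `.get()` defaults only roughly imitate how JEXL handles a missing annotation:

```python
def fails_snp_hard_filter(info):
    """Return True if a SNP record's INFO annotations trip any of the
    generic hard-filter thresholds. Missing annotations skip their clause."""
    return (
        info.get("QD", float("inf")) < 2.0
        or info.get("FS", 0.0) > 60.0
        or info.get("MQ", float("inf")) < 40.0
        or info.get("MQRankSum", 0.0) < -12.5
        or info.get("ReadPosRankSum", 0.0) < -8.0
    )

assert fails_snp_hard_filter({"QD": 1.5, "FS": 10.0, "MQ": 60.0}) is True
assert fails_snp_hard_filter({"QD": 25.4, "FS": 5.2, "MQ": 60.0,
                              "MQRankSum": 0.5, "ReadPosRankSum": -1.8}) is False
```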

I have 13 whole-exome sequencing samples, and unfortunately I'm having a hard time getting HaplotypeCaller to complete within the time frame the cluster I use allows (150 hours). I use 10 nodes at a time, with 10 GB RAM and 8 cores per node. Is there any way to speed this up? I tried using HaplotypeCaller in GVCF mode with the following command:

    java -d64 -Xmx8g -jar $GATKDIR/GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R $REF --dbsnp DBSNP \
        -I 7-27_realigned.bam \
        -o 7-27_hg19.vcf \
        -U ALLOW_UNSET_BAM_SORT_ORDER \
        -gt_mode DISCOVERY \
        -mbq 20 \
        -stand_emit_conf 20 -G Standard -A AlleleBalance -nct 16 \
        --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Am I doing something incorrectly? Is there anything I can tweak to minimize the runtime? What is the expected runtime for WES on a standard setup (a few cores and some RAM)?

Hi, we are running the best practices pipeline for variant discovery with GATK 3-2.2. When running HaplotypeCaller with the flags -A AlleleBalance and -A HomopolymerRun to generate a gVCF, no ABHet/ABHom or HRun annotations show up in the gVCF. I tried running VariantAnnotator on the gVCF and still got no annotations. The documentation for both of these annotations states that they only work for biallelic variants. I suspected that the <NON_REF> tag that is on every variant might be causing the tool to treat biallelic variants as multiallelic. So I stripped the <NON_REF> tag from the gVCF using sed, reran VariantAnnotator, and voila, I got the annotations. Is there a way either to generate the gVCF without the <NON_REF> tag (and the probabilities that go with it), or to instruct VariantAnnotator to ignore the <NON_REF> tag? Thanks for your help. Tom Kolar

I'm working on an association mapping project in a non-model bird (so there is a reference genome, but it may have problems). We're looking for SNPs linked to a phenotype using paired-end GBS data and using HaplotypeCaller for SNP calling. We found a single SNP that explains a significant portion of the phenotypic variation. When we actually look at the SNP, it turns out to be an artifact derived from the HaplotypeCaller reassembly. There is a 10 bp insertion in the reference not found in this population, and when the reassembly happens, it realigns reads to produce a SNP.
Just looking at the sequence of the reads themselves, there is no SNP. So the question is: what is causing the reassembly in some samples and not others? I tried outputting the activity profile, but since it is smoothed, it is hard to figure out exactly where the difference is happening. Is it possible to output an unsmoothed activity profile? Are there other ways to figure out how exactly the active region is being picked? Thanks.

Hi, I have noticed that every time I repeat a gVCF call on the same sample (same BAM file), the output gVCF files are not exactly the same. They are almost identical, but there are a few differences here and there, and there is a minute difference in the Unix file sizes as well. Is that something that is expected? Shalabh Suman

Hi GATK-ers, I have been given ~2000 gVCFs generated by Illumina (one sample per gVCF). Though they are in standard gVCF format, they were generated by an Illumina pipeline (https://support.basespace.illumina.com/knowledgebase/articles/147078-gvcf-file if you're really curious) and not HaplotypeCaller. As a result (I think ...), GATK doesn't want to process them (I have tried CombineGVCFs and GenotypeGVCFs to no avail). Is there a GATK walker or some other tool that will make my gVCFs GATK-friendly? I need to be able to merge this data together to make it analyzable, because in single-sample VCF format it's pretty useless at the moment. My only other thought has been to expand all the ref blocks and then merge everything together, but this seems like it will result in a massive amount of data. Any suggestions you may have are greatly appreciated!!! Sara

Hi, I have a number of samples that consist of multiple individuals from the same population pooled together, and I have been trying to use HaplotypeCaller to call the variants.
I have set the ploidy to 2 x (number of individuals), but I keep getting the same or similar error message after running for several hours or days:

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: the combination of ploidy (180) and number of alleles (9) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus
    ##### ERROR ------------------------------------------------------------------------------------------

and I'm not sure what I can do to rectify it. Obviously I can't limit the ploidy (it is what it is), and I thought that HC only allows a maximum of six alleles anyway? My code is below, and any help would be appreciated.

    java -Xmx24g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 6 \
        -R ~/my_ref_sequence \
        --intervals ~/my_intervals_file \
        -ploidy 180 \
        -log my_log_file \
        -I ~/my_input_bam \
        -o ~/my_output_vcf

I'm a bit confused regarding the new GATK version and the new HC functions. I'm trying to call haplotypes in a family of plants. I call haplotypes using HaplotypeCaller, then I want to run Read-backed Phasing on the raw VCFs and then CalculateGenotypePosteriors to add pedigree information. The CalculateGenotypePosteriors walker seems to need the INFO fields AC and AN, but they are not produced by HaplotypeCaller. They used to be there in earlier HC versions, though(?). How can I fix this? And is this a proper workflow at all?
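About the "very large number of genotypes" error above: the number of distinct unphased genotypes for ploidy P and A alleles (REF plus ALTs) is the multiset coefficient C(P + A - 1, A - 1), which overflows a signed 32-bit integer long before P = 180 and A = 9; this is why limiting the number of alternate alleles (or the ploidy) is the only way past the error. A quick check of the arithmetic:

```python
from math import comb

def n_genotypes(ploidy, n_alleles):
    """Number of distinct unphased genotypes: ways to pick `ploidy` alleles
    from `n_alleles` with repetition, order-insensitive (multiset coefficient)."""
    return comb(ploidy + n_alleles - 1, n_alleles - 1)

assert n_genotypes(2, 2) == 3            # diploid biallelic: 0/0, 0/1, 1/1
assert n_genotypes(4, 2) == 5            # tetraploid biallelic
assert n_genotypes(180, 9) > 2**31 - 1   # the failing locus in the error message
```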
Is Read-backed Phasing needed, or has it become redundant now that the new HC version can do physical phasing? Would it be enough to run HC for phasing and CalculateGenotypePosteriors to add pedigree information? Either way, the problem of the missing AC and AN fields remains. I would be grateful for some help on this. This is how a raw VCF produced by HC looks (long header lines abridged here for readability: the full GATKCommandLine options, the per-GQ-band GVCFBlock lines, and most contig lines are elided):

    ## fileformat=VCFv4.1
    ## ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
    ## FILTER=<ID=LowQual,Description="Low quality">
    ## FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
    ## FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
    ## FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ## FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ## FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
    ## FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
    ## FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
    ## FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
    ## FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
    ## GATKCommandLine=<ID=HaplotypeCaller,Version=3.3-0-g37228af,Date="Fri Jan 30 12:04:00 CET 2015",Epoch=1422615840668,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/prj/gf-grape/project_FTC_in_crops/members/Nadia/test/GfGa4742_CGATGT_vs_candidategenes.sorted.readgroups.deduplicated.realigned.recalibrated.bam] [...] emitRefConfidence=GVCF [...] standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0 max_alternate_alleles=6 [...] sample_ploidy=2 genotyping_mode=DISCOVERY [...]">
    ## GVCFBlock=minGQ=0(inclusive),maxGQ=1(exclusive)
    [... one GVCFBlock line per GQ band, up to minGQ=99(inclusive),maxGQ=2147483647(exclusive) ...]
    ## INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref base qualities">
    ## INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
    ## INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
    ## INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
    ## INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
    ## INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
    ## INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
    ## INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
    ## INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
    ## INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
    ## INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
    ## INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
    ## INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
    ## contig=<ID=GSVIVT01012145001,length=8683>
    [... 29 more contig lines ...]
    ## reference=file:///prj/gf-grape/project_FTC_in_crops/members/Nadia/amplicons_run3/GATK_new/RefSequences_all_candidate_genes.fasta
    # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GfGa4742
    GSVIVT01012145001 1 . G <NON_REF> . . END=113 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
    GSVIVT01012145001 114 . C <NON_REF> . . END=164 GT:DP:GQ:MIN_DP:PL 0/0:172:99:164:0,120,1800
    GSVIVT01012145001 165 . T C,<NON_REF> 7732.77 . DP=175;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,173,0:173:99:0|1:165_T_C:7761,521,0,7761,521,7761:0,0,165,8
    GSVIVT01012145001 166 . G <NON_REF> . . END=166 GT:DP:GQ:MIN_DP:PL 0/0:174:72:174:0,72,1080
    GSVIVT01012145001 167 . T <NON_REF> . . END=175 GT:DP:GQ:MIN_DP:PL 0/0:174:66:174:0,60,900
    GSVIVT01012145001 176 . T <NON_REF> . . END=191 GT:DP:GQ:MIN_DP:PL 0/0:174:57:173:0,57,855
    GSVIVT01012145001 192 . A <NON_REF> . . END=194 GT:DP:GQ:MIN_DP:PL 0/0:173:54:173:0,54,810
    GSVIVT01012145001 195 . T <NON_REF> . . END=199 GT:DP:GQ:MIN_DP:PL 0/0:174:51:173:0,51,765

And this is the error message I get:

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Key AC found in VariantContext field INFO at GSVIVT01012145001:1 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
    ##### ERROR ------------------------------------------------------------------------------------------

Hi there, this is maybe a silly question, but I want to know whether it is normal or not. I have just run HC in GVCF mode on each of my BAM files. When I look at the output:

    # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1

Is it normal to have "sample1" instead of the BAM file name or something like that? I noticed because when I ran GenotypeGVCFs, I got only one column, also named "sample1", instead of one column per input VCF (which I think is the normal output, isn't it?). So what I've done is change "sample1" in each VCF file to the corresponding sample name before running GenotypeGVCFs. I hope I've explained myself more or less. Thanks in advance

Hi, I am using GATK v3.2.2 following the recommended practices (...HC -> CombineGVCFs -> GenotypeGVCFs ...)
and while looking through suspicious variants I came across a few hets with AD=X,0. Tracing them back, I found two inconsistencies (bugs?):

1) Reordering of genotypes when combining gVCFs while the AD values are kept intact, which leads to an erroneous AD for a heterozygous call. Also, I find it hard to understand why the 1 bp insertion is emitted in the gVCF; there are no reads supporting it:

• single-sample gVCF: 1 26707944 . A AG,G,<NON_REF> 903.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9
• combined gVCF: 1 26707944 . A G,AG,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:66,0,36,0:102:.:1057,0,1856,1039,2052,4115,941,1925,3051,2847:51,15,27,9 [other_samples]
• vcf: 1 26707944 . A G 3169.63 . [INFO] [other_samples] 0/1:66,0:102:99:1057,0,1856 [other_samples]

2) Incorrect AD is taken while genotyping gVCF files:

• single-sample gVCF: 1 1247185 rs142783360 AG A,<NON_REF> 577.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/1:13,20,0:33:99:615,0,361,654,421,1075:7,6,17,3
• combined gVCF: 1 1247185 rs142783360 AG A,<NON_REF> . . [INFO] [other_samples] ./.:13,20,0:33:.:615,0,361,654,421,1075:7,6,17,3 [other_samples]
• vcf: 1 1247185 . AG A 569.95 . [INFO] [other_samples] 0/1:13,0:33:99:615,0,361 [other_samples]

I have found multiple such cases here, and no errors or warnings in the logs. I also checked calls I had done before on these samples, but in a smaller batch: there the AD values were correct, but there were plenty of other hets with AD=X,0... I haven't looked closer into those. Are these bugs that have been fixed in 3.3? Or maybe my brain is not working properly today and I'm missing something obvious? Best regards, Paweł

Hi, I need to apply hard filters to my data. In cases where I have lower coverage I plan to use the FisherStrand annotation, and for higher-coverage variant calls, SOR (using a JEXL expression to switch between them: DP < 20 ? FS > 50.0 : SOR > 3).

The variant call below (some annotations snipped), which is from a genotyped gVCF from HaplotypeCaller (using a BQSR'ed BAM file), looks well supported (high QD, high MQ, zero MQ0). However, there appears to be some strand bias (SOR=3.3):

    788.77 . DP=34;FS=5.213;MQ=35.37;MQ0=0;QD=25.44;SOR=3.334 GT:AD:DP:GQ:PL 1/1:2,29:31:35:817,35,0

In this instance the filter example above would be applied.

## My Question

Is this filtering out a true positive? And what kind of cut-offs should I be using for FS and SOR? The snipped annotations ReadPosRankSum=-1.809 and BaseQRankSum=-0.8440 also indicate that the evidence supporting this variant call has some minor bias (the variant appears near the end of reads, in low-quality bases, compared to the reads supporting the reference allele).

## My goal

This is part of a larger hard filter I'm applying to a set of genotyped gVCFs called from HaplotypeCaller. I'm filtering HomRef positions using this JEXL filter:

    vc.getGenotype("%sample%").isHomRef() ? ( vc.getGenotype("%sample%").getAD().size == 1 ? (DP < 10) : ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || MQRankSum > 3.2905 || ReadPosRankSum > 3.2905 || BaseQRankSum > 3.2905 ) ) : false

And filtering HomVar positions using this JEXL:

    vc.getGenotype("%sample%").isHomVar() ? ( vc.getGenotype("%sample%").getAD().0 == 0 ? ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || MQ < 30.0 ) : ( BaseQRankSum < -3.2905 || MQRankSum < -3.2905 || ReadPosRankSum < -3.2905 || (MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || (DP < 20 ? FS > 60.0 : SOR > 3.5) || MQ < 30.0 || QUAL < 100.0 ) ) : false

My goal is true-positive variants only, and I have high-coverage data, so the filtering should be relatively stringent. Unfortunately I don't have a database I could use to apply VQSR, hence the comprehensive filtering strategy.
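The coverage-dependent strand-bias switch in the question (DP < 20 ? FS > 50.0 : SOR > 3) is easy to sanity-check outside GATK. A sketch of the same ternary applied to the quoted call (illustration only; the function name is mine):

```python
def fails_strand_bias_filter(dp, fs, sor):
    # Mirrors the JEXL ternary `DP < 20 ? FS > 50.0 : SOR > 3`:
    # judge low-coverage sites by FisherStrand, higher-coverage ones by SOR.
    return fs > 50.0 if dp < 20 else sor > 3.0

# The call quoted above (DP=34, FS=5.213, SOR=3.334) is filtered on SOR,
# even though its FS looks unremarkable.
assert fails_strand_bias_filter(34, 5.213, 3.334) is True
# At DP < 20 the same annotations would pass, because FS takes over.
assert fails_strand_bias_filter(15, 5.213, 3.334) is False
```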
Hi, I have recal.bam files for all the individuals in my study (these constitute 4 families), and each BAM file contains information for one chromosome for one individual. I was wondering whether it is best to pass all the files for a single individual together when running HaplotypeCaller, whether that will increase the accuracy of the calling, or whether I can just run HaplotypeCaller on each individual BAM file separately. Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and whether it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of? Thanks very much in advance, Alva

Hello, I have a problem with HaplotypeCaller. I saw that this error has been reported as linked to the -nct option, but I did not use it. Strangely enough, I think my command worked some time ago (before a server restart?). All the reference files I used are from the GATK bundle. Sletort

Hi, I am working with RNA-seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered VCF file for each sample and would now like to combine these into a single VCF. I am concerned about sites that were either not covered by the RNA-seq analysis or were no different from the reference allele in some individuals but not others. These sites will be 'missed' when HaplotypeCaller analyzes each sample individually and will not be represented in the downstream VCF files. When the files are combined, what happens to these 'missed' sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome? Alternatively, can BaseRecalibrator and/or HaplotypeCaller simultaneously analyze multiple BAM files? Is it common practice to combine BAM files for discovering sequence variants?

Hi there, maybe this is a very basic question, but I've read several posts and sections of the GATK Best Practices and tutorials and I cannot get the point. What is the main difference between using GATK HC in gVCF mode instead of in multi-sample mode? I know that HC in GVCF mode is used for variant discovery on cohorts of samples, but what is the meaning of "cohorts of samples"? If I have 2 groups of samples, one WT and the other mutant, should I use GVCF mode? I've read almost all the tutorials and how-tos of GATK and I cannot understand it at all. And another question: can I give HC more than one BAM file, maybe using -I several times? And if I introduce several files, what will I find in the raw VCF output file? Several columns, one per sample? Thank you in advance.

Excuse me: after calling genotypes using HaplotypeCaller in GATK, I manually checked the reads covering the variant sites. But I found one exception: the genotype of one sample was 0/0 (wild type), but the reads for this site were .,,,..,,c.,c.c,.cccc. I would expect almost all the reads to be . or , in that case. Is this normal? Many thanks in advance!

Hi, I have been using the best practices pipeline for Illumina samples for quite a long time, and everything is fine. Recently, I started using the same pipeline for Ion Torrent samples, and everything is fine until the HaplotypeCaller step. I only have two samples for HaplotypeCaller; one BAM file is around 9 GB, and the other is around 13 GB. But this step is really slow: it would need around 24 weeks to finish. That's really strange, because if I run HaplotypeCaller on a family of 9 people (Illumina), it only takes around 9 days. How come these two files would need 24 weeks? Does anyone have any idea what may cause this problem? The BAM files of those 9 people are around the same size as the two Ion Torrent samples. Thank you.
Hello, I am having difficulty understanding the VCF output of HaplotypeCaller when I use it for a pool of two individuals. I set the ploidy to 4 in this case. With ploidy 2, one gets the heterozygote genotype as 0/1, which is understandable. With ploidy 4, how can I interpret a genotype such as (GT:AD:DP:GQ:PL 0/0/0/1:8,3:11:5:95,0,5,24,232), given that this is a pooled sample, so it is not possible to know the genotype? I was just wondering what this means. Thank you, Homa

Hello, I tried to find this option but was unsuccessful. I would like to call SNPs on only a subset of my BAM files, that is, only on specific chromosomes. I have done so by subsetting the BAM files; I was wondering whether there is a way to provide HaplotypeCaller with a BED file listing the chromosomes I want for SNP calling, in order to skip the subsetting step? Thanks, Lucia

Hi, I'm using GATK HaplotypeCaller to detect variants in a tumor sample analyzed with Illumina WES (100x2, paired-end mode). I know the KRAS p.G12 variant (chr12:25398284) is present in my sample (because it was previously seen with Sanger sequencing; KRAS is the oncogenic event in this kind of tumor). I aligned the FASTQ files with bwa after quality control and adapter trimming. In my BAM file I'm able to view the variant at genomic coordinate chr12:25398284, as you can see in the IGV picture; however, HaplotypeCaller doesn't call the variant. This is my basic command line:

    java -Xmx4g -jar /opt/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I my.bam -stand_call_conf 0 -stand_emit_conf 0 -o output.file

and I tried to tune a lot of parameters, such as -stand_call_conf, -stand_emit_conf, --heterozygosity, --min_base_quality_score and so on.
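On interpreting the ploidy-4 record above (GT 0/0/0/1 with PL 95,0,5,24,232): the five PL values correspond to the five possible allele configurations of a biallelic tetraploid site, listed in the VCF specification's genotype ordering, and PL=0 marks the most likely one (here, one ALT allele out of four, i.e. an estimated in-pool allele frequency of 1/4). A sketch of that ordering (the sort key is my shorthand for the spec's recursive ordering; it reproduces it for these cases):

```python
from itertools import combinations_with_replacement

def genotype_order(n_alleles, ploidy):
    """Enumerate unphased genotypes in VCF PL order (higher allele indices
    vary slowest), as sorted allele tuples."""
    gts = combinations_with_replacement(range(n_alleles), ploidy)
    return sorted(gts, key=lambda g: tuple(reversed(g)))

# Biallelic, ploidy 4: five genotypes, matching the five PL values above.
assert genotype_order(2, 4) == [
    (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 1, 1, 1), (1, 1, 1, 1)
]
# PL = 95,0,5,24,232 -> the second entry (0/0/0/1) has PL 0, the best fit.
```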
This is the pileup string of the chr12:25398284 coordinate in my bam file:

12 25398284 C 55 ,,T.,,...,,,,,,T.......,t,t.....,,,t..,,,,,,,,,,,,.^],^],^],^], BB@ECAFCECBCBBB@DBCCCDDCABADBDBCCDD@BBEADDEDBCADBB@@BAC

The base quality is good and the mapping quality too, but HaplotypeCaller does not determine any active region around the position chr12:25398284. Any suggestion about the reason for the miscalling? Many thanks, Valentina

Do UG and HC downsample before or after exclusion of marked duplicates? I guess code is king for the answer...

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS samples of a ~24Mb fungal pathogen. The end product of my GATK workflow would be a high-quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high-quality reads in one or more samples will be eliminated). Therefore my workflow (starting from a sorted, indexed BAM file of reads from a single sample, mapped to the reference with bwa mem) is this:

• 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
• 02- BAM: MarkDuplicates
• 03- BAM: Local INDEL realignment, second pass (RealignerTargetCreator/IndelRealigner)
• 04- BAM: Calling variants using HaplotypeCaller
• 05- VCF: Hard-filter variants for a truth set for BQSR (there is no known variant site database, so we use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0", and we also filter out heterozygous positions using "isHet == 1".
• 06- VCF: Calculate the BQSR table using the high-quality hard-filtered variants from step 05.
• 07- BAM: Apply BQSR recalibration from the previous step to the BAM file from step 04.
• 08- BAM: Calling variants on the recalibrated BAM file from the previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES and --emitRefConfidence GVCF.

Does this sound like a reasonable thing to do? What options should I use in step 08 in order for HC to tell me how confident it is, site by site, about its calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES and --emitRefConfidence GVCF, I am missing a lot of the annotations I get when just outputting variant sites (e.g. QD).

Hi, I'm working on targeted resequencing data and I'm doing multi-sample variant calling with the HaplotypeCaller. First, I tried to call the variants in all the targeted regions in one go on a cluster, so I specified all the targeted regions with the -L option. Then, as it was taking too long, I decided to cut my interval list chromosome by chromosome and to do the calling on each chromosome. At the end, I merged the VCF files that I had obtained from the callings on the different chromosomes. Then I compared this merged VCF file with the VCF file that I obtained by doing the calling on all the targeted regions at one time. I noticed 1% of variation between the two variant lists, and I can't explain this stochasticity. Any suggestion? Thanks! Maguelonne

Hello, I have a general question. Why is it recommended to use a single sample for SNP calling from RNA-seq data when using HaplotypeCaller? Isn't it possible to provide all the samples together for SNP calling? Is it a technical aspect of HaplotypeCaller, or does it help to reduce errors? Thanks a lot in advance for your help, Homa

Hi, I am using GATK version 3.3 (latest release) and following the best practices for variant calling. The steps I am using are: 1. Run HaplotypeCaller and get the gVCF file 2.
Run GenotypeGVCFs across the samples to get the VCF file. The question is about the genotype calling. Here is an example. The GVCF file looks like this for a particular position:

chr1 14677 . G . . END=14677 GT:DP:GQ:MIN_DP:PL 0/0:9:0:9:0,0,203

The VCF file looks like this after genotyping with multiple samples:

chr1 14677 . G A 76.56 PASS . . GT:AD:DP:GQ:PL 0/1:8,1:9:4:4,0,180

In the GVCF file there was no support for the alternate allele, but the VCF file shows that there is 1 read supporting the alternate call. Thanks! Saurabh

Is this likely to be something wrong with my bam.bai file?

INFO 00:33:22,970 HelpFormatter - --------------------------------------------------------------------------------
INFO 00:33:22,975 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 00:33:22,975 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 00:33:22,975 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 00:33:22,980 HelpFormatter - Program Args: -T HaplotypeCaller -R /scratch2/vyp-scratch2/reference_datasets/human_reference_sequence/human_g1k_v37.fasta -I Project_1410AHP-0004_ElenaSchiff_211samples/align/data//61_sorted_unique.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -stand_call_conf 30.0 -stand_emit_conf 10.0 -L 22 --downsample_to_coverage 200 --GVCFGQBands 10 --GVCFGQBands 20 --GVCFGQBands 50 -o Project_1410AHP-0004_ElenaSchiff_211samples/gvcf/data//61_chr22.gvcf.gz
INFO 00:33:22,989 HelpFormatter - Executing as rmhanpo@groucho-1-2.local on Linux 2.6.18-371.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18.
INFO 00:33:22,989 HelpFormatter - Date/Time: 2014/12/28 00:33:22
INFO 00:33:22,989 HelpFormatter - --------------------------------------------------------------------------------
INFO 00:33:22,989 HelpFormatter - --------------------------------------------------------------------------------
WARN 00:33:23,015 GATKVCFUtils - Creating Tabix index for Project_1410AHP-0004_ElenaSchiff_211samples/gvcf/data/61_chr22.gvcf.gz, ignoring user-specified index type and parameter
INFO 00:33:23,649 GenomeAnalysisEngine - Strictness is SILENT
INFO 00:33:23,823 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 200
INFO 00:33:23,831 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 00:33:23,854 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO 00:33:23,862 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 00:33:23,869 IntervalUtils - Processing 51304566 bp from intervals
INFO 00:33:24,114 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 00:33:25,977 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Index: unable to reposition file channel of index file Project_1410AHP-0004_ElenaSchiff_211samples/align/data/61_sorted_unique.bam.bai
at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.skipBytes(GATKBAMIndex.java:438)
at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.skipToSequence(GATKBAMIndex.java:295)
at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.readReferenceSequence(GATKBAMIndex.java:127)
at org.broadinstitute.gatk.engine.datasources.reads.BAMSchedule.<init>(BAMSchedule.java:104)
at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.getNextOverlappingBAMScheduleEntry(BAMScheduler.java:295)
at
org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.advance(BAMScheduler.java:184)
at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.populateFilteredIntervalList(BAMScheduler.java:104)
at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.createOverIntervals(BAMScheduler.java:78)
at org.broadinstitute.gatk.engine.datasources.reads.IntervalSharder.shardOverIntervals(IntervalSharder.java:59)
at org.broadinstitute.gatk.engine.datasources.reads.SAMDataSource.createShardIteratorOverIntervals(SAMDataSource.java:1173)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getShardStrategy(GenomeAnalysisEngine.java:623)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Index: unable to reposition file channel of index file Project_1410AHP-0004_ElenaSchiff_211samples/align/data/61_sorted_unique.bam.bai
##### ERROR ------------------------------------------------------------------------------------------

Hi, I'm currently trying to use GATK to call variants from human RNA-seq data. So far, I've managed to do variant calling in all my samples following the GATK best practice guidelines (using HaplotypeCaller in DISCOVERY mode on each sample separately). But I'd like to go further and try to get the genotype, in every sample, of each variant found in at least one sample. This is to differentiate, for each variant, samples where that variant is absent (homozygous for the reference allele) from samples where it is not covered (and therefore not genotyped). To do so, I first used CombineVariants to merge the variants from all my samples and to create the list of variants to be genotyped: ${ALLELES}.vcf

I then try to re-genotype my samples with HaplotypeCaller using the GENOTYPE_GIVEN_ALLELES mode and the same settings as before. My command is the following:

java -jar ${GATKPATH}/GenomeAnalysisTK.jar -T HaplotypeCaller -R ${GENOMEFILE}.fa -I ${BAMFILE_CALIB}.bam --genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles ${ALLELES}.vcf -out_mode EMIT_ALL_SITES -dontUseSoftClippedBases -stand_emit_conf 20 -stand_call_conf 20 -o ${SAMPLE}_genotypes_all_variants.vcf -mbq 25 -L ${CDNA_BED}.bed --dbsnp ${DBSNP}.vcf

In doing so I invariably get the same error after calling 0.2% of the genome:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Index: 3, Size: 3
##### ERROR ------------------------------------------------------------------------------------------

Because the problem seemed to originate from getAlternateAllele, I tried to play with --max_alternate_alleles by setting it to 2 or 10, without success. I also checked my ${ALLELES}.vcf file to look for malformed alternate alleles in the region where GATK crashes (chr 1, somewhere after 78Mb), but I couldn't identify any (I searched for alternate alleles that would not match the extended regexp '[ATGC,]+').

Is there any way to turn off the ClippingRankSum? When I attempt to use --excludeAnnotation ClippingRankSum it doesn't seem to do anything. I'm trying to locate the source of some inconsistencies between runs and think this may be the culprit.
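For background on the annotation being discussed: ClippingRankSum is a rank-sum (Mann-Whitney U) test comparing the number of clipped bases on reads supporting the reference allele versus reads supporting the alternate allele. The sketch below is a generic rank-sum z-score in pure Python — not GATK's exact implementation, which has its own tie handling and normal approximation — just to illustrate what the number measures:

```python
import math

def rank_sum_z(ref_vals, alt_vals):
    """Approximate Mann-Whitney z-score; positive when alt_vals rank higher
    (e.g. more clipped bases on alt-supporting reads)."""
    combined = sorted(
        (v, grp)
        for grp, vals in (("ref", ref_vals), ("alt", alt_vals))
        for v in vals
    )
    values = [v for v, _ in combined]
    # assign ranks, averaging over ties
    rank_list = []
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_list.extend([(i + 1 + j) / 2] * (j - i))  # mean of ranks i+1..j
        i = j
    alt_rank_sum = sum(r for r, (_, grp) in zip(rank_list, combined) if grp == "alt")
    n1, n2 = len(alt_vals), len(ref_vals)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (alt_rank_sum - mean) / sd

# e.g. ref reads with 0-1 clipped bases vs alt reads with 5-7: strongly positive
print(rank_sum_z([0, 0, 1], [5, 6, 7]))
```

Because the statistic depends on exactly which reads end up in the active region, small run-to-run differences in read selection can shift it, which may be relevant to the inconsistency described above.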

Hi,

I'm using 3.3-0-g37228af.

This command gets around a feature that looks like a bug to me:

 gatk -T HaplotypeCaller -R ref.fa -I in.bam -o /dev/stdout 2> log.txt | grep -v -e PairHMM -e "JNI call" | bgzip -c > out.vcf.gz


As you can see, one has to remove three lines that PairHMM writes to the stdout stream. Like all other logging information, they should go to stderr.

The three lines look like this:

Using un-vectorized C++ implementation of PairHMM
Time spent in setup for JNI call : 3.3656100000000003E-4
Total compute time in PairHMM computeLikelihoods() : 0.008854789


Regards, Ari

Hey,

currently, I am working on 454 sequencing data. I am using GATK to call SNPs and indels, which works relatively well for the SNPs with the recommended filtering options. However, I was interested in raising sensitivity, which is why I started experimenting with the QD value. I restarted the whole GATK pipeline (including the genotyping) with the filtering option "QD<0.0" instead of "QD<2.0". As I did not change any other parameters, I was expecting just to see more SNPs being called. However, I came across a list of SNPs which did contain some new SNPs, but which was also lacking one of the previously called SNPs (it was not even present in the raw HaplotypeCaller output). How could this have happened?

This is an extract from the vcf file containing the SNP that was previously called:

I am very curious to hear whether anyone might have an explanation for my findings. Many thanks in advance!

Here is the command I actually used.

##### ERROR ------------------------------------------------------------------------------------------

Hello

I am using GATK 3.3-0 HaplotypeCaller for variant calling. When I run HaplotypeCaller with the command

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R all_chr_reordered.fa -I 30_S30.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 50 -o 30_S30_control1.vcf -L brca.intervals

I get all the variants I want; however, I also want the number of forward and reverse reads that support the REF and ALT alleles. Therefore I use the StrandBiasBySample annotation when running HaplotypeCaller with the command:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R all_chr_reordered.fa -I 30_S30.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 50 -o 30_S30_control2.vcf -L brca.intervals -A StrandBiasBySample

The SB field is added, but a variant that was in 30_S30_control1.vcf is absent from 30_S30_control2.vcf. All the remaining variants are there. The only difference between the two variant calls was adding -A StrandBiasBySample. What I'm wondering is why this one variant is absent.

the missing variant: 17 41276152 . CAT C 615.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.147;ClippingRankSum=0.564;DP=639;FS=15.426;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-1.054;QD=0.96;ReadPosRankSum=0.698;SOR=2.181 GT:AD:DP:GQ:PL 0/1:565,70:635:99:653,0,18310

So I decided to run HaplotypeCaller without -A StrandBiasBySample and later add the annotations with VariantAnnotator. Here is the command:

java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R all_chr_reordered.fa -I 30_S30.bam --variant 30_S30_control1.vcf -L 30_S30_control1.vcf -o 30_S30_control1_SBBS.vcf -A StrandBiasBySample

However, the output vcf file 30_S30_control1_SBBS.vcf was not different from the input variant file 30_S30_control1.vcf except for the header; the SB field wasn't added. Why was the SB field not added? Is there any other way to get the number of forward and reverse reads?

Please find 30_S30_control1.vcf, 30_S30_control2.vcf and 30_S30_control1_SBBS.vcf attached.

Thanks
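On getting forward/reverse read counts: for a biallelic site, the four SB values in the sample column are generally read as ref-forward, ref-reverse, alt-forward, alt-reverse (their meaning with multiple alternate alleles is less clear, as another post on this page notes). A small Python sketch of pulling them out of a sample column — the example values here are invented, not taken from the files above:

```python
def strand_counts(format_field, sample_field):
    """Split a per-sample SB annotation into per-allele (forward, reverse)
    read counts. Assumes a biallelic site, where SB holds four integers:
    ref-forward, ref-reverse, alt-forward, alt-reverse."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    ref_fwd, ref_rev, alt_fwd, alt_rev = map(int, values["SB"].split(","))
    return {"ref": (ref_fwd, ref_rev), "alt": (alt_fwd, alt_rev)}

# Example with a made-up sample column:
counts = strand_counts("GT:AD:DP:GQ:PL:SB", "0/1:1,2:3:23:55,0,23:1,0,2,0")
print(counts)  # {'ref': (1, 0), 'alt': (2, 0)}
```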

I am running HC3.3-0 with the following options (e.g. GENOTYPE_GIVEN_ALLELES):

$java7 -Djava.io.tmpdir=tmp -Xmx3900m \
-jar $jar \
--analysis_type HaplotypeCaller \
--reference_sequence $ref \
--input_file $BAM \
--intervals $CHROM \
--dbsnp $dbSNP \
--out $out \
-stand_call_conf 0 \
-stand_emit_conf 0 \
-A Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \
-L $allelesVCF \
-L 20:60000-70000 \
--interval_set_rule INTERSECTION \
--genotyping_mode GENOTYPE_GIVEN_ALLELES \
--alleles $allelesVCF \
--emitRefConfidence NONE \
--output_mode EMIT_ALL_SITES

The file $allelesVCF contains these neighbouring SNPs:

20  60807   .   C   T   118.96  .
20  60808   .   G   A   46.95   .
20  61270   .   A   C   2870.18 .
20  61271   .   T   A   233.60  .


I am unable to call these neighbouring SNPs, despite reads being present in the BAM file (which shouldn't matter anyway). I also tried adding --interval_merging OVERLAPPING_ONLY to the command line, but that didn't solve the problem. What am I doing wrong? I should probably add GATK breaker/misuser to my CV... Thank you as always. P.S. The CommandLineGATK documentation does not say what the default value for --interval_merging is. P.P.S. Iterative testing is a bit slow, because HC always has to do this step: HCMappingQualityFilter - Filtering out reads with MAPQ < 20

Hi GATK team, I would like to seek your team's opinion on the workflow that best fits my data. Previously I explored both variant calling algorithms, UnifiedGenotyper and HaplotypeCaller, and I would love to go for UnifiedGenotyper considering the sensitivity and the analysis runtime. Because my experimental cohort grows from time to time, I opted for single-sample calling followed by joint analysis using CombineVariants instead of doing multi-sample variant calling. However, by doing so I experienced a few drawbacks (this issue was discussed in a few forums). For a particular SNP locus, we wouldn't know whether the "./." reported for some of the samples is due to no reads covering that locus, or to it not passing certain criteria during the variant calling performed previously, or to it being a homozygous-reference base (this last case concerns me most, as I can't afford to lose that information). Then I found this "gvcf", and it can potentially solve my problem (thanks, GATK team, for always understanding us researchers' needs!). Again, I want to opt for UnifiedGenotyper instead of HaplotypeCaller to generate the gvcf, and reading the forum at https://www.broadinstitute.org/gatk/guide/tagged?tag=gvcf, I assumed that VCFs produced by "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" would be something like the gvcf file produced by HaplotypeCaller.
However, I couldn't join them up using either "CombineVariants" or "CombineGVCFs"; most probably "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" doesn't generate gvcf format. Can you please give me some advice on what would best fit my need with minimum runtime (UnifiedGenotyper would be my best choice)? Is there any method to join the ALL_SITES vcf files produced by UnifiedGenotyper that I might have missed on the GATK page?

Dear GATK Team, if I specify a read filter using the -rf option, is that read filter added to the filters applied by default, or will it then be the only filter that is applied (so I would also need to specify the defaults to ensure they were all run)? E.g. I want to add a bad cigar filter: -rf BadCigar. But I also want the default filters applied, namely:
- NotPrimaryAlignmentFilter
- FailsVendorQualityCheckFilter
- DuplicateReadFilter
- UnmappedReadFilter
- MappingQualityUnavailableFilter
- HCMappingQualityFilter
- MalformedReadFilter

HaplotypeCaller has a --min_base_quality_score flag to specify the minimum "base quality required to consider a base for calling". It also has -stand_call_conf and -stand_emit_conf, both of which use a "minimum phred-scaled confidence threshold". What is the difference between the base quality flag and the phred-scaled confidence threshold flags in deciding whether or not to call a variant? Also, why are there separate flags for calling variants and emitting variants? To my way of thinking, calling and emitting variants are one and the same.

Is there any way to specify a minimum minor-allele frequency for calling variants in HaplotypeCaller? Under the default settings, the program expects a 1:1 ratio of each allele in a single diploid organism. However, stochastic variance in which RNA fragments are sequenced and retained could lead to departures from this ratio.
Finally, is there any way to specify the minimum level of coverage for a given variant for it to be considered for calling/genotyping?

Hi, I used HaplotypeCaller in GVCF mode to generate a single-sample GVCF, but when I checked my vcf file I saw that the reference allele is not showing up:

22 1 . N <NON_REF> . . END=16050022 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
22 16050023 . C <NON_REF> . . END=16050023 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,37
22 16050024 . A <NON_REF> . . END=16050026 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,73
22 16050027 . A <NON_REF> . . END=16050035 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,110
22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0
22 16050037 . G <NON_REF> . . END=16050037 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,109
22 16050038 . A <NON_REF> . . END=16050039 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,153

I am not sure where to start troubleshooting this, since none of the steps prior to HaplotypeCaller generated any obvious errors. The basic command that I used was: java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 Have you encountered this problem before? Where should I start troubleshooting? Thanks very much in advance, Alva

I am identifying new sequence variants/genotypes from RNA-seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants. For BaseRecalibrator, it is recommended that when lacking a reliable set of sequence variants: "You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data.
Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence." Setting up a script to run HaplotypeCaller and BaseRecalibrator in a loop should be fairly straightforward. What is a good strategy for comparing VCF files and assessing convergence?

Specifically, what does the 'start' component of this flag mean? Do the reads all have to start in exactly the same location? Alternatively, does the flag specify the total number of reads that must overlap a putative variant before that variant will be considered for calling?

If I want variants to be called only if they fit the following criteria:
1) Min. total coverage for consideration of heterozygous is 10.
2) Min. coverage of each of the two observed major basecalls to be called heterozygous is 5.
3) Min. percentage of each of the two observed major basecalls in order to be called heterozygous is 20.
4) Min. coverage in order for a position to be called homozygous is 6.
which command-line arguments in which tools (HaplotypeCaller or GenotypeGVCFs) can I use to accomplish these? I cannot seem to find the proper arguments in the documentation; I apologize if I overlooked them. Thank you

Hi, I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (we do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all 4 families together to maximize the power of the calling.
However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other GATK tools that I could use to call complex pedigrees? The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but I'm not sure how to do that either with just -ped. What would you recommend? And finally, I get an error message saying that --emitRefConfidence only works in single-sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)? Thanks, Alva

Hi, I am performing RNA-seq to identify new polymorphisms in a species of sea star. Our short-term goal is to generate novel DNA sequences of coding genes for phylogenetic analysis. It is therefore important that polymorphisms be called accurately and that they can be phased. Our reference genome is poorly assembled and comprises over 60,000 scaffolds and contigs. Consequently, when paired-end RNA-seq reads are aligned to this reference genome (using TopHat), the two halves of a pair are often mapped to different scaffolds or contigs. This seems to greatly lower the mapping quality score, which in turn leads to HaplotypeCaller missing well-supported polymorphisms, because the reads that support them have mapping quality values between 1 and 3. The obvious solution is to set --min-mapping-quality-score to 1 or 2 rather than the default of 20, and to raise --min_base_quality_score from the default of 10 to maybe 25 or 30. This does, however, increase the risk of calling false positives from poorly aligned regions. Has this situation been considered by the GATK development team, and is there a recommended way to account for it?

Hi team!
I am testing HaplotypeCaller with VectorLoglessPairHMM on a single BAM. There are two weird things: 1. There is no speedup going from -nct 1 to -nct 10. 2. There is no speedup from implementing VectorLoglessPairHMM. I am very sorry, but here are the first lines of the log file. I hope you have a suggestion for how I can speed up HaplotypeCaller successfully.

INFO 21:37:58,043 HelpFormatter - --------------------------------------------------------------------------------
INFO 21:37:58,045 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03
INFO 21:37:58,045 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 21:37:58,045 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 21:37:58,048 HelpFormatter - Program Args: -T HaplotypeCaller -R /mnt/users/torfn/Projects/BosTau/Reference/Bos_taurus.UMD3.1.74.dna_rm.chromosome.ALL.fa -I /mnt/users/tikn/old_Backup2/cigene-pipeline-snp-detection/align_all/2052/2052_aln.posiSrt.withRG.dedup.bam --genotyping_mode DISCOVERY --dbsnp /mnt/users/torfn/Projects/BosTau/Reference/vcf_chr_ALL-dbSNP138.vcf -stand_emit_conf 10 -stand_call_conf 30 -minPruning 3 -o test.gatk.31.vcf -nct 10 --pair_hmm_implementation VECTOR_LOGLESS_CACHING
INFO 21:37:58,052 HelpFormatter - Executing as tikn@m620-7 on Linux 2.6.32-504.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_71-mockbuild_2014_10_17_22_23-b00.
INFO 21:37:58,052 HelpFormatter - Date/Time: 2014/11/27 21:37:58
INFO 21:37:58,052 HelpFormatter - --------------------------------------------------------------------------------
INFO 21:37:58,053 HelpFormatter - --------------------------------------------------------------------------------
INFO 21:37:58,331 GenomeAnalysisEngine - Strictness is SILENT
INFO 21:37:58,521 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 21:37:58,538 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 21:37:58,866 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.33
INFO 21:37:58,892 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 21:37:59,211 MicroScheduler - Running the GATK in parallel mode with 10 total threads, 10 CPU thread(s) for each of 1 data thread(s), of 32 processors available on this machine
INFO 21:37:59,338 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 21:38:00,229 GenomeAnalysisEngine - Done preparing for traversal
INFO 21:38:00,230 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 21:38:00,231 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 21:38:00,232 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 21:38:00,446 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 21:38:00,448 PairHMM - Performance profiling for PairHMM is disabled because HaplotypeCaller is being run with multiple threads (-nct>1) option Profiling is enabled only when running in single thread mode Using AVX accelerated implementation of PairHMM
INFO 21:38:04,922 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 21:38:04,923 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
INFO 21:38:30,237 ProgressMeter - 1:656214 0.0 30.0 s 49.6 w 0.0% 33.8 h 33.8 h
INFO 21:39:30,239 ProgressMeter -
1:2160900 0.0 90.0 s 148.8 w 0.1% 30.8 h 30.8 h
INFO 21:40:30,241 ProgressMeter - 1:3789347 0.0 2.5 m 248.0 w 0.1% 29.3 h 29.2 h
INFO 21:41:30,242 ProgressMeter - 1:5347891 0.0 3.5 m 347.2 w 0.2% 29.0 h 29.0 h

Kind regards, Tim Knutsen

Hi, I just finished running HaplotypeCaller version 3.3-0 separately on 6 exome samples with the new best practices: java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19/hg19_Ordered.fa -I K87/HG19_Analysis/K87-929_final.recalibrated_final.bam --dbsnp dbsnp_138_hg19_Ordered.vcf --pair_hmm_implementation VECTOR_LOGLESS_CACHING -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --output_mode EMIT_VARIANTS_ONLY -gt_mode DISCOVERY --pcr_indel_model CONSERVATIVE -o ./Haplotypes_929.vcf

Many variant sites are called as homozygous alt (1/1), but none of the sites that are processed to infer haplotypes are called as homozygous alt in their PGT field; they are all called as hets, PGT=0|1. For example: GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,29,0:29:93:0|1:121483392_C_G:1331,93,0,1331,93,1331:0,0,13,16 The allelic depths agree with the phased genotype, but out of all 6 exomes processed, not a single 1/1 is also phased as 1|1. I checked all output vcfs with a simple grep combo:

grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '0|1' - | wc -l = 19046
grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|0' - | wc -l = 79
grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|1' - | wc -l = 0

This seemed odd, but I continued with GenotypeGVCFs: java -Xmx32g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg19/hg19_Ordered.fa -V Haplotypes_450.vcf -V Haplotypes_452.vcf -V Haplotypes_925.vcf -V Haplotypes_926.vcf -V Haplotypes_927.vcf -V Haplotypes_929.vcf -D dbsnp_138_hg19_Ordered.vcf -ped K87/HG19_Analysis/K87_6.ped -o Haplotypes_K87_GVCFs.vcf I'm looking at the output vcf as it's being generated, and now there are homozygous alt calls, but they conflict with the associated allelic depths: .... GT:AD:DP:GQ:PGT:PID:PL ....
1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 .....

Full line: chr1 33957152 rs4403594 T G 3166.96 . AC=12;AF=1.00;AN=12;DB;DP=99;FS=0.000;GQ_MEAN=48.50;GQ_STDDEV=27.55;MLEAC=12;MLEAF=1.00;MQ=39.65;MQ0=0;NCC=0;QD=32.32;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,9:9:27:.:.:330,27,0 1/1:0,5:5:15:.:.:141,15,0 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 1/1:0,20:20:60:.:.:722,60,0 1/1:0,24:24:71:.:.:685,71,0 1/1:0,11:11:33:.:.:366,33,0

Can you help me interpret what seem to me to be conflicting results? Cheers, Patrick

Hi. I used HaplotypeCaller for variant calling. After variant recalibration, I have a VCF that contains both SNPs and indels. Is there any quick way to split it into two VCFs, one of SNPs and one of indels? Thanks. Lei

The SB field of HaplotypeCaller output is not described terribly well as far as I can find:

##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">

What exactly happens when there are multiple alternate alleles? For example:

scaffold_1 2535 . T C,TTC,<NON_REF> 611.83 . DP=15;MLEAC=1,1,0;MLEAF=0.500,0.500,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/2:0,5,10,0:15:99:630,397,379,210,0,180,607,394,210,604:0,0,12,3

It doesn't seem to be particularly informative in this case (a case which is rather common for our data). If it isn't already part of the possible annotations, perhaps the most sensible approach would be to output a field with num-fwd, num-rev for each allele (ref, alt1, alt2, ...). SDP for "strand depth" might be a reasonable name.

Hi, I have been trying HaplotypeCaller to find SNPs and indels in viral read data (haploid) but am finding that it throws away around half of my reads and I don't understand why. A small proportion (8%) are filtered out as duplicates and 0.05% fail on mapping quality, but I can't account for the majority of lost reads. I appreciate that GATK wasn't built for viral sequences, but would you have an idea of what could be causing this?
I use the following command after marking duplicates and realigning around indels:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Ref.fasta -I realigned_reads.bam --genotyping_mode DISCOVERY -ploidy 1 -bamout reassembled.bam -o rawvariants.vcf

I have also tried the same file with UnifiedGenotyper and I get the result I expect, i.e. most of my reads are retained and I have SNP calls that agree with a VCF constructed in a different program, so I assume the reads are lost as part of the local realignment? Thanks Kirstyn

Hi, Is there any parameter to set the minimum variant frequency? What is the parameter's name? I have read the documentation, but I cannot find it. Thank you,

Hi, I'd like to ask if there is a paper describing the HaplotypeCaller algorithm; if so, could you please send me the reference? I have tried to find it, but I only found the paper on GATK, which is great, but it doesn't describe the HaplotypeCaller algorithm in detail. Thank you,

Hi, I'd like to know whether HaplotypeCaller uses linkage disequilibrium (LD) to decide whether or not to call a variant as part of the variant calling algorithm. I've read the documentation about HaplotypeCaller, but I cannot find anything about LD, so should I assume that it doesn't use LD? Thank you,

Hi all, I am attempting to use HaplotypeCaller / CombineGVCFs / GenotypeGVCFs to call variants on chromosomes X and Y of 769 samples (356 males, 413 females) sequenced at 12x coverage (WG sequencing, but right now only calling X and Y).
I have called the samples according to the best practices using HaplotypeCaller, using ploidy = 1 for males on X and Y and ploidy = 2 for females on X, e.g.:

INFO 16:28:45,750 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T HaplotypeCaller -L X -ploidy 1 -minPruning 3 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I /target/gpfs2/gcc/groups/gonl/projects/trio-analysis/rawdata_release2/A102.human_g1k_v37.trio_realigned.bam --sample_name A102a -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/A102a.chrX.hc.g.vcf

Then I used CombineGVCFs to combine my samples in batches of 100. Now I am attempting to genotype them and I face the same issue on both X (males + females) and Y (males only): it starts running fine and then just hangs at a certain position. At first it crashed asking for additional memory, but now with 24 GB of memory it simply stays at a single position for 24 hrs until my job eventually stops due to walltime.
Here is the chrom X output: INFO 15:00:39,501 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T GenotypeGVCFs -ploidy 1 --dbsnp /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf -stand_call_conf 10 -stand_emit_conf 10 --max_alternate_alleles 4 -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.vcf -L X -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.1.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.2.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.3.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf INFO 15:00:39,507 HelpFormatter - Executing as lfrancioli@targetgcc15-mgmt on Linux 3.0.80-0.5-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. 
INFO 15:00:39,507 HelpFormatter - Date/Time: 2014/11/12 15:00:39 INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:40,951 GenomeAnalysisEngine - Strictness is SILENT INFO 15:00:41,153 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:57:53,416 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf.idx INFO 17:09:39,597 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf.idx INFO 18:21:00,656 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf.idx INFO 19:30:46,624 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf.idx INFO 20:22:38,368 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf.idx WARN 20:26:45,716 FSLockWithSharedLockAcquisitionTask - WARNING: Unable to lock file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx because an IOException occurred with message: No locks available. INFO 20:26:45,718 RMDTrackBuilder - Could not acquire a shared lock on index file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx, falling back to using an in-memory index for this GATK run. 
INFO 20:33:29,491 IntervalUtils - Processing 155270560 bp from intervals INFO 20:33:29,628 GenomeAnalysisEngine - Preparing for traversal INFO 20:33:29,635 GenomeAnalysisEngine - Done preparing for traversal INFO 20:33:29,636 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 20:33:29,637 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 20:33:29,638 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 20:33:29,948 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 20:33:59,642 ProgressMeter - X:65301 0.0 30.0 s 49.6 w 0.0% 19.8 h 19.8 h INFO 20:34:59,820 ProgressMeter - X:65301 0.0 90.0 s 149.1 w 0.0% 59.4 h 59.4 h ... INFO 20:52:01,064 ProgressMeter - X:177301 0.0 18.5 m 1837.7 w 0.1% 11.3 d 11.2 d INFO 20:53:01,066 ProgressMeter - X:177301 0.0 19.5 m 1936.9 w 0.1% 11.9 d 11.9 d ... INFO 14:58:25,243 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.0 w 95.9 w INFO 14:59:38,112 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.1 w 96.0 w INFO 15:00:47,482 ProgressMeter - X:177301 0.0 18.5 h 15250.3 w 0.1% 96.2 w 96.1 w =>> PBS: job killed: walltime 86440 exceeded limit 86400

I would really appreciate if you could give me some pointer as how to handle this situation.

Thanks! Laurent

I ran GenotypeGVCFs on all the VCF files generated by HC using the following command (all these files are indexed by HC):

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R /ref/human_g1k_v37.fasta -o HC_raw.vcf -V gVCF_n1.vcf -V gVCF_n2.vcf -V gVCF_n3.vcf -V gVCF_n4.vcf -V gVCF_n5.vcf -V gVCF_n6.vcf

##### ERROR ------------------------------------------------------------------------------------------

Hi, I called a VCF using HaplotypeCaller, following the general guidelines, with GATK version 3.2-2. I encountered a possible bug where a sample is shown to have a HET genotype, yet the AD column returns a '0' for the alt allele. However, there is evidence of the alt allele in the original BAM file for the sample. This error is seen for some SNPs and indels alike. A variant example is pasted below (IGV screenshot attached for the BAM).

2 46839487 . G GA 45.66 PASS AC=3;AF=0.042;AN=72;BaseQRankSum=0.218;ClippingRankSum=0.894;DP=749;FS=0.000;GQ_MEAN=31.19;GQ_STDDEV=29.02;InbreedingCoeff=-0.1296;MLEAC=2;MLEAF=0.028;MQ=60.00;MQ0=0;MQRankSum=-4.700e-02;NCC=0;QD=3.04;ReadPosRankSum=0.00;VQSLOD=-6.710e-02;culprit=FS GT:AD:DP:GQ:PL 0/0:20,0:20:60:0,60,823 0/0:41,0:43:99:0,122,1117 0/0:22,0:22:63:0,63,945 0/0:20,0:20:60:0,60,774 0/0:22,2:27:26:0,26,590 0/0:20,0:20:60:0,60,833 0/0:30,0:30:62:0,62,1077 0/0:20,0:20:60:0,60,847 0/0:42,0:42:16:0,16,1323 0/0:25,0:25:51:0,51,765 0/0:27,0:27:63:0,63,945 0/0:27,0:27:63:0,63,945 0/0:16,0:18:0:0,0,373 0/0:24,0:24:0:0,0,610 0/1:15,0:17:23:23,0,607 0/1:12,3:18:28:28,0,312 0/0:8,0:8:0:0,0,145 0/0:13,0:16:38:0,38,348 0/0:11,0:11:21:0,21,315 0/0:13,0:13:27:0,27,405 0/0:13,1:16:23:0,23,469 0/0:19,0:21:8:0,8,418 0/0:23,0:23:33:0,33,589 0/1:8,0:12:71:71,0,187 0/0:19,0:19:0:0,0,438 0/0:7,0:7:21:0,21,248 0/0:9,0:9:0:0,0,199 0/0:8,0:8:21:0,21,315 0/0:18,0:18:0:0,0,396 0/0:26,0:26:0:0,0,585 0/0:24,0:24:0:0,0,469 0/0:14,0:16:41:0,41,375 0/0:6,0:6:0:0,0,133 0/0:17,0:19:51:0,51,460 0/0:15,0:15:11:0,11,373 0/0:12,0:12:0:0,0,333

Is this a bug?

-Uma
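Records like the one above can be triaged in bulk with a small script that flags het genotypes whose AD field reports zero alt-supporting reads. This is a hedged sketch, not a GATK tool; it only assumes the standard VCF FORMAT conventions for GT and AD, and the sample calls below are taken from the record in the post:

```python
def flag_het_zero_ad(format_keys, sample_field):
    """Return True if the sample call is het (0/1 or 1/0) but its AD
    field reports zero reads supporting the alt allele."""
    keys = format_keys.split(":")
    vals = dict(zip(keys, sample_field.split(":")))
    gt = vals.get("GT", "./.").replace("|", "/")
    if sorted(gt.split("/")) != ["0", "1"]:
        return False  # only interested in ref/alt hets
    ad = vals.get("AD")
    if ad is None:
        return False
    depths = [int(x) for x in ad.split(",")]
    return len(depths) > 1 and depths[1] == 0

# Two calls from the record above: both 0/1, one with zero alt depth.
print(flag_het_zero_ad("GT:AD:DP:GQ:PL", "0/1:15,0:17:23:23,0,607"))  # True
print(flag_het_zero_ad("GT:AD:DP:GQ:PL", "0/1:12,3:18:28:28,0,312"))  # False
```

Applied per sample column, this isolates exactly the suspicious calls for comparison against the BAM in IGV.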

I have 23 samples and I want to look over a 63,807,197 bp region. Many thanks in advance.

Kind regards, Angelica

Hi,

I have the variant below from GATK HaplotypeCaller, annotated as 1/1, which means homozygous for the alternate allele.

chr1    10023229    .   G   A   101.03  .   AC=2;AF=1.00;AN=2;BaseQRankSum=-0.742;DP=49;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=8.21;MQ0=42;MQRankSum=0.742;QD=2.06;ReadPosRankSum=-0.742 GT:AD:DP:GQ:PL 1/1:42,7:47:12:129,12,0


However, the AD column shows there are 42 reads for the reference and 7 reads for the alternate allele. Could someone comment on this SNP being reported as homozygous for the alternate allele despite having very few reads supporting it?

Dear GATK team,

I produced VCF files from RNA-Seq data as described in https://www.broadinstitute.org/gatk/guide/article?id=3891. My VCF files contain genomic coordinates but not dbSNP rs IDs. Is this because I omitted the optional --dbsnp parameter when performing the -T HaplotypeCaller step? If so, can I add the IDs to the VCF files after they have been generated? If not, and I have to repeat the HaplotypeCaller step, what is the correct dbSNP.vcf file for hg38 that you guys recommend?

Thanks, Joe
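The rsIDs can indeed be added after the fact without recalling; in GATK 3 terms this is the job of VariantAnnotator with a --dbsnp resource. The lookup itself is simple enough to sketch in Python. This is a hedged illustration, not a replacement for the tool: matching here is by CHROM/POS/REF/ALT only, and the file contents and the rs999 ID below are made up for the example:

```python
def load_dbsnp_ids(dbsnp_lines):
    """Map (chrom, pos, ref, alt) -> rsID from a dbSNP-style VCF."""
    ids = {}
    for line in dbsnp_lines:
        if line.startswith("#"):
            continue
        chrom, pos, rsid, ref, alt = line.split("\t")[:5]
        for a in alt.split(","):  # dbSNP records can be multiallelic
            ids[(chrom, pos, ref, a)] = rsid
    return ids

def annotate(vcf_lines, ids):
    """Yield VCF lines with the ID column filled in where dbSNP matches."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        for a in cols[4].split(","):
            rsid = ids.get((cols[0], cols[1], cols[3], a))
            if rsid:
                cols[2] = rsid  # replace the '.' placeholder
                break
        yield "\t".join(cols)

dbsnp = ["chr1\t10023229\trs999\tG\tA\t.\t.\t."]           # hypothetical dbSNP record
calls = ["chr1\t10023229\t.\tG\tA\t101.03\t.\tAC=2"]       # call without an ID
print(next(annotate(calls, load_dbsnp_ids(dbsnp))).split("\t")[2])  # rs999
```

For real data, allele representation (indel normalization) matters, which is one reason to prefer the dedicated tool over a position join like this.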

Hi:

I'm running HaplotypeCaller on whole-genome data. I'm interested in only several thousand specific positions across the genome, all of which are in dbSNP. Some of these are indels, which makes me more inclined to use HaplotypeCaller. However, I want to know the quality/likelihood of the bases called at each position in downstream analysis. So, my goal is to obtain a VCF file containing a call at every dbSNP position, whether or not it's variant with respect to the reference. I need each associated with a likelihood from VQSR, with VQSLOD having the same meaning whether or not the site is variant.

It seems that I could do this with UnifiedGenotyper --output_mode EMIT_ALL_CONFIDENT_SITES. How can I replicate this with HaplotypeCaller? Second, is there a better way to achieve this? Should I not be using HaplotypeCaller at all?

So far, I ran HaplotypeCaller with --emitRefConfidence BP_RESOLUTION --dbsnp dbsnp.vcf, followed by GenotypeGVCFs --includeNonVariantSites --dbsnp dbsnp.vcf. However, the rsIDs aren't used to populate the ID column in the resulting VCF (is this a bug?), so I can't then remove non-dbSNP positions. Secondly, annotations aren't added to positions matching the reference.

Hello,

First of all, thank you for your detailed best practice pipeline for SNP calling from RNA-seq data.

I have pooled RNA seq data which I need to call SNP from. Each library consists of a pooled sample of 2-3 individuals of the same sex-tissue combination.

I was wondering whether HaplotypeCaller can handle SNP calling from pooled sequences, or whether it would be better to use FreeBayes?

I understand that these results come from experimenting with the data but it would be great if you could share your experiences with me on this.

Cheers, Homa

Hi,

We have some exomes processed with GATK from one exome capture platform (Nimblegen SeqCap on HiSeq), and now I am going to analyze a small batch of exomes sequenced using a different platform (Nextera on NextSeq). I was wondering if and how GenotypeGVCFs can cope with gvcfs produced on two (or more) different exome capture platforms? Is joint-calling of such "heterogeneous" samples advised, or should I rather genotype the small set of equally processed bams separately?

The problems I could foresee were:

• differing targets (each batch of samples will have gaps in coverage due to target-constrained (-L) haplotype calling). If this causes problems for GenotypeGVCFs, the haplotypes could be called again on a superset of targets, or, more easily, the calling could be restricted to the intersection.
• the alignment algorithm is different (BWA BWTSW vs BWA MEM) so the mapping quality could potentially differ (haven't checked). Is this something GATK is compensating for, like base qualities?
• the sequencer and chemistry, but here I hope that the BQSR should help in removing the variation

Thanks, Paweł
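On the first point above, restricting calling to the intersection of the two capture target sets is mechanical once both interval lists are sorted. A minimal sketch follows; the coordinates and the list names are illustrative only, and intervals are assumed half-open, non-overlapping, and on the same contig:

```python
def intersect_targets(a, b):
    """Intersect two sorted, non-overlapping interval lists
    [(start, end), ...] in half-open coordinates on one contig."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

nimblegen = [(100, 200), (300, 400)]   # hypothetical platform-1 targets
nextera = [(150, 350), (380, 500)]     # hypothetical platform-2 targets
print(intersect_targets(nimblegen, nextera))  # [(150, 200), (300, 350), (380, 400)]
```

The same sweep generalizes per contig for BED-style target files.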

I would like to run HaplotypeCaller on some WGS samples from 24 individuals. I have attempted to use the multithreading option but would instead like to use the scatter-gather approach with Queue on our cluster as this will hopefully run much quicker. I see that there was a patch written for Queue to work with PBS scheduler some time back, however, I am struggling to find the associated documentation for it. Could you please point me in the right direction or provide some advice on how to do this.

I am trying to use the HaplotypeCaller with a ploidy setting of 1 to genotype sex chromosomes in a single individual using the following command:

java -Xmx4g -jar ~/applications/GenomeAnalysisTK-3.2-2.jar -T HaplotypeCaller -R /data/genome_references/Hmel1-1_primaryScaffolds_mtDNA.fasta -L Z_chromosome.intervals -ploidy 1 -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -I realigned09-201_Hel_par_ser_HEL_4_CGATGT_L005.bam -hets 0.01 -indelHeterozygosity 0.01 -out_mode EMIT_ALL_CONFIDENT_SITES -maxAltAlleles 1 -o realigned09-201_Hel_par_ser_HEL_4_CGATGT_L005.HC_sex.vcf

This is failing. Error message at the bottom.

Setting ploidy to 2 works fine.

The following command with the UnifiedGeonotyper is also fine:

java -Xmx4g -jar ~/applications/GenomeAnalysisTK-3.2-2.jar -T UnifiedGenotyper -R /data/genome_references/Hmel1-1_primaryScaffolds_mtDNA.fasta -ploidy 1 -L Z_chromosome.intervals -I realigned09-201_Hel_par_ser_HEL_4_CGATGT_L005.bam -glm BOTH -hets 0.01 -indelHeterozygosity 0.01 -out_mode EMIT_ALL_CONFIDENT_SITES -o realigned09-201_Hel_par_ser_HEL_4_CGATGT_L005.HC_sex.vcf

Any ideas what the problem is?

Thanks, Kanchon

INFO 12:42:20,180 AFCalcFactory - Requested ploidy 1 maxAltAlleles 6 not supported by requested model EXACT_INDEPENDENT looking for an option INFO 12:42:20,181 AFCalcFactory - Selecting model EXACT_GENERAL_PLOIDY INFO 12:42:21,550 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace

java.lang.NullPointerException at org.broadinstitute.gatk.tools.walkers.genotyper.afcalc.GeneralPloidyExactAFCalc.assignGenotype(GeneralPloidyExactAFCalc.java:559) at org.broadinstitute.gatk.tools.walkers.genotyper.afcalc.GeneralPloidyExactAFCalc.subsetAlleles(GeneralPloidyExactAFCalc.java:527) at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:286) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:335) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:320) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:310) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isActive(HaplotypeCaller.java:844) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.addIsActiveResult(TraverseActiveRegions.java:618) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.access$800(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$ActiveRegionIterator.hasNext(TraverseActiveRegions.java:378) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at 
org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

##### ERROR ------------------------------------------------------------------------------------------

Hi, I ran HaplotypeCaller with the following command

java -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T HaplotypeCaller -minPruning 3 -R genome.fasta -I reduced.bam -o Chr10.vcf -stand_call_conf 50 -mbq 20 --dbsnp /home/bioinfo/data/genomes/rice/nipponbare_v7.0/variations/all_sorted.vcf

and it gave the following message with the error:

##### ERROR MESSAGE: Argument with name 'mbq' isn't defined.

I was wondering what I am doing wrong. I see that -mbq is defined on the HaplotypeCaller page. Was it not defined in version 2.8-1?

Thanks

I am attempting to use the HaplotypeCaller to call indels and SNPs in genomes from the Illumina Platinum Genomes pedigree (http://www.illumina.com/platinumgenomes/). I did these steps before running HaplotypeCaller:

• alignment with BWA mem
• mark duplicates with Picard
• Indel realignment

I am now trying to run HaplotypeCaller (version 3.2-2-gec30cee) with the realigned BAM files and am getting the following error:

##### ERROR stack trace
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Somehow the requested coordinate is not covered by the read. Alignment 10357 | 86M15H
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR
##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Alignment 10357 | 86M15H
##### ERROR ------------------------------------------------------------------------------------------


Notably, I don’t get this error when analyzing the same BAMs with UnifiedGenotyper. I have isolated a BAM file with ~1300 reads that can recreate the problem but couldn’t narrow it down further. The command I used is:

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller  \
-R Homo_sapiens_assembly19_withchr.fasta \
-I del.bam \
-o del.hc.vcf \
--output_mode EMIT_ALL_SITES


Hi, I have noticed on a number of occasions that the CIGAR strings of deletions in the haplotype file generated with -bamout are incorrectly shifted by 1 or 2 base pairs. It does not affect variant calling (the calls in the VCF file are correct). Looks like a bug? See below the IGV screenshot (top panel - haplotype bam, middle - realigned recalibrated bam, bottom - vcf file), and below that a couple of reads from the bam files:

From haplotype bam: HISEQ:410:C1UVYACXX:2:1107:15486:59236 99 1 11740287 54 41M12D59M = 11740353 166 AGGTGATCCGCCCGCCTCGGCCTCCCAAAGTGCTAGGATTACAGGCGCCCGACCAAAGAATTTTTAAAAGTCCACCCTGTACCCACTCTGTGGTTCTGGG @CACFEGHECGEEBECEG>DFDGFDCEDDFDGEFDFCCEDDEFBDFAFDEBDGG5555555555555555555555555555555555555555555555 HC:i:476358318 MC:Z:100M BD:Z:MMPOPSQRPOOPLNNONMONOONMNKOLCMOKOOMNONNMLNKPOOMNOKMOONOLCMMMMMDDDLLCCMOONOKOLOPLOOOLPLPOOQNNRRQRNPPL RG:Z:130809_SN172_0410_BC1UVYACXX_TGACCA_L002_R1_001 BI:Z:PPQRRTSSQRRQMQQPRPQPRPRPPMSOGQRORSQRPQQOPQNRPRPQPNRSSRTPHRPQRPJJJQOHHRSSQTPROTUQTSSOUQTSTVRSUUTWRTRM NM:i:12 MQ:i:60 AS:i:82 XS:i:47

Same read from realigned recalibrated bam: HISEQ:410:C1UVYACXX:2:1107:15486:59236 99 1 11740287 54 43M12D57M = 11740353 166 AGGTGATCCGCCCGCCTCGGCCTCCCAAAGTGCTAGGATTACAGGCGCCCGACCAAAGAATTTTTAAAAGTCCACCCTGTACCCACTCTGTGGTTCTGGG @CACFEGHECGEEBECEG>DFDGFDCEDDFDGEFDFCCEDDEFBDFAFDEBDGGFFEAD@AFCFFEDGGJFIFGIABGJEFDGGHFHJGGDGFFDFFBEE MC:Z:100M BD:Z:MMPOPSQRPOOPLNNONMONOONMNKOLCMOKOOMNONNMLNKPOOMNOKMOONOLCMMMMMDDDLLCCMOONOKOLOPLOOOLPLPOOQNNRRQRNPPL RG:Z:130809_SN172_0410_BC1UVYACXX_TGACCA_L002_R1_001 BI:Z:PPQRRTSSQRRQMQQPRPQPRPRPPMSOGQRORSQRPQQOPQNRPRPQPNRSSRTPHRPQRPJJJQOHHRSSQTPROTUQTSSOUQTSTVRSUUTWRTRM NM:i:12 MQ:i:60 AS:i:82 XS:i:47

Thank you, Dalia

Hi,

I am using HaplotypeCaller for variant calling on GATK version 3.2.2 on whole genome Illumina reads. I used the following command as per best practices with and without multithreading option (-nct).

java -jar /GenomeAnalysisTK-3-2-2/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 10 -I infile.re.recal.bam -R /genome/human_g1k_v37.fasta -o outfile_raw.vcf -stand_call_conf 30 -stand_emit_conf 10 -minPruning 3

Without the -nct option, variants found: 207828. With the -nct option, variants found: 207850.

-SK
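To see whether the 22 extra calls are genuinely different sites or borderline duplicates, a first step is to diff the two VCFs' site sets. A hedged sketch (plain text parsing only; the site records below are made up for the example):

```python
def site_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from non-header VCF lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        c = line.split("\t")
        keys.add((c[0], c[1], c[3], c[4]))
    return keys

# Hypothetical excerpts from the single-threaded and -nct runs.
single = ["1\t100\t.\tA\tT\t.\t.\t."]
multi = ["1\t100\t.\tA\tT\t.\t.\t.", "1\t250\t.\tG\tC\t.\t.\t."]
only_in_multi = site_keys(multi) - site_keys(single)
print(sorted(only_in_multi))  # [('1', '250', 'G', 'C')]
```

Inspecting the QUAL and genotype fields of only the differing sites usually shows whether they are low-confidence calls near a threshold.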

Hi,

Is there a way that HaplotypeCaller (or another tool) can output phased genotypes in the VCF file?

Thanks! Andrew

Following the recommended best practices and using GATK v3.2, the raw VCF file created by HaplotypeCaller/GenotypeGVCFs does not contain any of the AC, AN and AF tags. Is this expected behaviour?

This is a line of the g.vcf file generated from HaplotypeCaller:
chr21 36104982 . AT A,ATT,<NON_REF> 0.39 . BaseQRankSum=-1.693;DP=35;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=1.874;ReadPosRankSum=-0.023 GT:AD:GQ:PL:SB 0/1:23,6,2,0:27:27,0,499,67,411,529,96,459,496,525:10,13,2,4

And the raw vcf file from GenotypeGVCFs: chr21 36122885 . A AT 7871.94 . BaseQRankSum=0.151;DP=493;FS=5.433;MLEAC=24;MLEAF=0.750;MQ=60.00;MQ0=0;MQRankSum=0.129;QD=19.53;ReadPosRankSum=0.257 GT:AD:DP:GQ:PL 0/1:13,8:.:99:162,0,238 1/1:0,21:.:62:536,62,0 1/1:0,24:.:71:610,71,0 0/1:12,12:.:99:245,0,207 1/1:0,27:.:80:693,80,0 1/1:0,34:.:99:877,101,0 0/1:31,13:.:99:210,0,624 0/0:30,0:30:65:0,65,1125 0/1:6,11:.:96:225,0,96 0/1:10,5:.:87:87,0,196 0/1:4,15:.:39:340,0,39 1/1:2,35:.:66:895,66,0 1/1:0,53:.:99:1376,161,0 1/1:0,13:.:39:328,39,0 1/1:0,24:.:74:662,74,0 1/1:1,29:.:65:702,65,0

My command for HaplotypeCaller:
java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R myRef.fa -I my.bam -L chr21 --dbsnp dbSNP_chr21.vcf --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -stand_emit_conf 10 -stand_call_conf 30 -mbq 20 -nct 8 -A FisherStrand -A QualByDepth -A HaplotypeScore --annotation Coverage -o raw.g.vcf

And GenotypeGVCFs:
java -Xmx8g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R myRef.fa -V myVCF.list --dbsnp dbSNP_chr21.vcf -A FisherStrand -A QualByDepth -A HaplotypeScore -A Coverage -L chr21 -o myraw.vcf
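For what it's worth, AC, AN and AF are pure functions of the sample genotypes, so they can be recomputed from a record even when the tags are absent (in GATK this is normally handled by the ChromosomeCounts annotation). A hedged sketch for a biallelic record; the genotypes passed in are the first four samples from the GenotypeGVCFs line above:

```python
def chromosome_counts(genotypes):
    """Compute AC, AN, AF for a biallelic site from GT strings like '0/1'."""
    ac = an = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # no-calls don't count toward AN
            an += 1
            if allele == "1":
                ac += 1
    return ac, an, (ac / an if an else 0.0)

print(chromosome_counts(["0/1", "1/1", "1/1", "0/1"]))  # (6, 8, 0.75)
```

If the tags are missing from an otherwise valid run, recomputing them like this (or re-annotating with the tool) is cheaper than recalling.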

I have WES data for 235 samples, aligned using bwa aln. I followed GATK's Best Practices for marking duplicates, realignment around indels, and BQSR until I got the recalibrated BAM files. All of this was done with GATK 2.7-2. After that I generated gvcf files from the recalibrated BAM files using HaplotypeCaller from GATK 3.1-1, followed by GenotypeGVCFs and VQSR.

I came across one INDEL, which passes VQSR,

The variant is chr14 92537378 . G GC 11330.02 PASS AC=11;AF=0.023;AN=470;BaseQRankSum=0.991;ClippingRankSum=-6.820e-01;DP= 13932;FS=0.000;GQ_MEAN=137.20;GQ_STDDEV=214.47;InbreedingCoeff=-0.0311;MLEAC=11;MLEAF=0.023;MQ=43.67;MQ0=0;MQRankSum=-1.073e+00;NCC=0;QD=18.16; ReadPosRankSum=-1.760e-01;VQSLOD=3.70;culprit=FS

Individual genotypes seem very good. To name a few: 1800 0/0:65,0:65:99:0,108,1620 0/0:86,0:86:99:0,120,1800 0/1:23,25:48:99:1073,0,1971 0/1:51,39:90:99:1505,0,4106

But when I checked the variants in IGV, something strange popped out.

The pictures (and the complete post) can be viewed here: http://alittlebitofsilence.blogspot.hk/2014/09/to-be-solved-gatk-haplotypecaller-indel.html

Judging from the alignment visualized by IGV, there are no insertions at that site, but an SNP in the next position, in all samples. But HaplotypeCaller calls G->GC insertion in some samples while not in other samples. My questions are: 1) In variant calling HaplotypeCaller would do a local de-novo assembly in some region. Is the inconsistency between bamfiles and gvcf because of the re-assembly? 2) Why is the insertion called in some samples but not others, although the position in all samples looks similar? 3) There are other variants called adjacent or near this site, but later filtered by VQSR. Does that indicate the variants from this region cannot be trusted? chr14 92537353 . C CG 303342.41 VQSRTrancheINDEL0.00to90.00+ chr14 92537354 . C CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG 142809.26 VQSRTrancheINDEL0.00to90.00+ chr14 92537378 . G GC 11330.02 PASS chr14 92537379 rs12896583 T TGCTGCTGCTGCTGC,C 13606.25 VQSRTrancheINDEL0.00to90.00+ chr14 92537385 rs141993435 CTTT C 314.64 VQSRTrancheINDEL0.00to90.00+ chr14 92537387 rs12896588 T G 4301.60 VQSRTrancheSNP99.90to100.00 chr14 92537388 rs12896589 T C 4300.82 VQSRTrancheSNP99.90to100.00 chr14 92537396 . G GC,GCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGC 4213.61 VQSRTrancheINDEL0.00to90.00+

chr14 92537397 . T TGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,TGCTGCTGCTGCTG,TGCTGCTGCTG,TGCTGCTG 2690.79 VQSRTrancheINDEL0.00to90.00+

4) I used GATK 2.7 for processing before calling the gvcf files. Is there any possibility that if I change to 3.2, the BAM files would look more similar to the gvcf result? I have tried GATK 3.2-2 for generating a gvcf from the GATK 2.7 BAM files; the results are the same.

Hi,

I've tried my best to find the problem but without luck, so my first post...

First the background:

I have used HaplotypeCaller to make GVCFs per individual for 2 cohorts of 270 and 300 individuals each.
I then used CombineGVCFs to merge all the individuals in each cohort across 10Mb chunks of the genome.

These steps appear to have worked OK! Then I have been attempting to call genotypes from pairs of GVCFs (one from each cohort) per region using GenotypeGVCFs with a command such as:

java -jar -Xmx6G /gpfs0/apps/well/gatk/3.2-2/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ../hs37d5/hs37d5.fa \ --variant /well/hill/kate/exomes//haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf \ --variant /well/hill/kate/exomes//haploc/sepsis300/sepsis300_1:230000000-240000000_merged_gvcf.vcf \ -o /well/hill/kate/exomes//haploc/cap270-sepsis300/cap270-sepsis300_1:230000000-240000000_raw.vcf

This seems to start fine (and works perfectly to completion for some regions), but then throws an error for some of the files. For this example the error is:

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 519534: ./.:29:75:25:0,63,945 is not a valid start position in the VCF format, for input source: /well/hill/kate/exomes/haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf

So to try and solve this I have grepped "./.:29:75:25:0,63,945 " from the VCF file (expecting to find a truncated line), but could find nothing wrong with the lines containing this expression. I looked carefully at the first line containing this after the last position that was written in the output - nothing out of the ordinary that I can see.
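When grep on the offending genotype string turns up nothing, the culprit is often a line wrapped or truncated mid-record, which a column-level check catches faster than eyeballing. A hedged sketch of a minimal per-line validator (column rules per the VCF 4.x layout; the third input line simulates a record broken mid-way):

```python
def find_bad_lines(vcf_lines, min_cols=8):
    """Return (line_number, reason) for lines whose leading columns
    don't look like a valid VCF record start."""
    bad = []
    for n, line in enumerate(vcf_lines, start=1):
        if line.startswith("#"):
            continue  # headers are exempt
        cols = line.rstrip("\n").split("\t")
        if len(cols) < min_cols:
            bad.append((n, "too few columns"))
        elif not cols[1].isdigit():
            bad.append((n, "POS is not an integer: %r" % cols[1]))
    return bad

lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t230000100\t.\tA\tG\t50\tPASS\t.",
    "./.:29:75:25:0,63,945",  # genotype fragment stranded on its own line
]
print(find_bad_lines(lines))  # [(3, 'too few columns')]
```

Running something like this over the merged GVCF pinpoints the physical line to inspect with a hex-aware pager, which also exposes invisible characters grep would miss.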

Hello everyone, I have recently sequenced 15 exome samples using the Ion Proton system, which directly generates VCFs after sequencing runs. However, since I will be using GATK for downstream analysis, I have decided to recall the variants using UG or HC. I read a document stating that HC should not be used in the case of pooled samples. Can someone clarify what exactly this means? For Ion Proton, three samples are barcoded and pooled together prior to sequencing. Also, are 15 samples enough to obtain good discovery power? I was thinking of using external, publicly available BAM files of exome sequencing data if 15 samples is not sufficient for joint variant calling.

Hey there,

How is it possible that some of my SNP or indel calls are missing the QD tag? I'm doing the recommended workflow and I've seen it for both RNA-seq (hard-filter complaints are how I noticed the tags were missing) and exome sequencing (VQSR). How can a hard filter on QD, applied to a line that doesn't actually have that tag, be considered to pass that threshold too? I'm seeing a lot more indels in RNA-seq where this kind of scenario is happening as well.

Here are the command lines that I used:

# VQSR

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T VariantRecalibrator -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode SNP -resource:1000G,known=false,training=true,truth=false,prior=10.0 ~/References/hg19/VQSR/1000G_phase1.snps.high_confidence.b37.noGL.converted.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~/References/hg19/VQSR/hapmap_3.3.b37.noGL.nochrM.converted.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 ~/References/hg19/VQSR/1000G_omni2.5.b37.noGL.nochrM.converted.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 ~/References/hg19/VQSR/dbsnp_138.b37.excluding_sites_after_129.noGL.nochrM.converted.vcf -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -rscriptFile 96Exomes.HC.TruSeq.snv.plots.R -R ~/References/hg19/hg19.fasta --maxGaussians 4

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T ApplyRecalibration -ts_filter_level 99.0 -mode SNP -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -o 96Exomes.HC.TruSeq.snv.recal.vcf -R ~/References/hg19/hg19.fasta

# HARD FILTER (RNASeq)

java -Djava.io.tmpdir=$LSCRATCH -Xmx2g -jar /home/apps/Logiciels/GATK/3.1-1/GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R ~/References/hg19/hg19.fasta -V 96RNAseq.STAR.q1.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o 96RNAseq.STAR.q1.FS30.QD2.vcf

Here are some examples for RNAseq:

chr1 6711349 . G A 79.10 PASS BaseQRankSum=-1.369e+00;ClippingRankSum=1.00;DP=635;FS=1.871;MLEAC=1;MLEAF=5.495e-03;MQ=60.00;MQ0=0;MQRankSum=-4.560e-01;ReadPosRankSum=-1.187e+00 GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,280 ./.:0,0:0 0/0:9,0:9:21:0,21,248 0/0:7,0:7:21:0,21,196 0/0:7,0:7:21:0,21,226 0/0:8,0:8:21:0,21,227 0/0:8,0:8:21:0,21,253 0/0:7,0:7:21:0,21,218 0/0:9,0:9:27:0,27,282 1/1:0,0:5:15:137,15,0 0/0:2,0:2:6:0,6,47 0/0:28,0:28:78:0,78,860 0/0:7,0:7:21:0,21,252 0/0:2,0:2:6:0,6,49 0/0:5,0:5:12:0,12,152 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,126 0/0:9,0:9:21:0,21,315 0/0:7,0:7:21:0,21,256 0/0:7,0:7:21:0,21,160 0/0:8,0:8:21:0,21,298 0/0:20,0:20:60:0,60,605 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,67 0/0:2,0:2:6:0,6,71 0/0:14,0:14:20:0,20,390 0/0:7,0:7:21:0,21,223 0/0:7,0:7:21:0,21,221 0/0:4,0:4:12:0,12,134 0/0:2,0:2:6:0,6,54 ./.:0,0:0 0/0:4,0:4:9:0,9,118 0/0:8,0:8:21:0,21,243 0/0:6,0:6:15:0,15,143 0/0:8,0:8:21:0,21,244 0/0:7,0:7:21:0,21,192 0/0:2,0:2:6:0,6,54 0/0:13,0:13:27:0,27,359 0/0:8,0:8:21:0,21,245 0/0:7,0:7:21:0,21,218 0/0:12,0:12:36:0,36,354 0/0:8,0:8:21:0,21,315 0/0:7,0:7:21:0,21,215 0/0:2,0:2:6:0,6,49 0/0:10,0:10:24:0,24,301 0/0:7,0:7:21:0,21,208 0/0:7,0:7:21:0,21,199 0/0:2,0:2:6:0,6,47 0/0:3,0:3:9:0,9,87 0/0:2,0:2:6:0,6,73 0/0:7,0:7:21:0,21,210 0/0:8,0:8:22:0,22,268 0/0:7,0:7:21:0,21,184 0/0:7,0:7:21:0,21,213 0/0:5,0:5:9:0,9,135 0/0:7,0:7:21:0,21,200 0/0:4,0:4:12:0,12,118 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,217 0/0:8,0:8:21:0,21,255 0/0:9,0:9:24:0,24,314 0/0:8,0:8:21:0,21,221 0/0:9,0:9:24:0,24,276 0/0:9,0:9:21:0,21,285 0/0:3,0:3:6:0,6,90 0/0:2,0:2:6:0,6,57 0/0:13,0:13:20:0,20,385 0/0:2,0:2:6:0,6,48 
0/0:11,0:11:27:0,27,317 0/0:8,0:8:21:0,21,315 0/0:9,0:9:24:0,24,284 0/0:7,0:7:21:0,21,228 0/0:14,0:14:33:0,33,446 0/0:2,0:2:6:0,6,64 0/0:2,0:2:6:0,6,72 0/0:7,0:7:21:0,21,258 0/0:10,0:10:27:0,27,348 0/0:7,0:7:21:0,21,219 0/0:9,0:9:21:0,21,289 0/0:20,0:20:57:0,57,855 0/0:4,0:4:12:0,12,146 0/0:7,0:7:21:0,21,205 0/0:12,0:14:36:0,36,1030 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,60 0/0:7,0:7:21:0,21,226 0/0:7,0:7:21:0,21,229 0/0:8,0:8:21:0,21,265 0/0:4,0:4:6:0,6,90 ./.:0,0:0 0/0:7,0:7:21:0,21,229 0/0:2,0:2:6:0,6,59 0/0:2,0:2:6:0,6,56

chr1 7992047 . T C 45.83 SnpCluster BaseQRankSum=1.03;ClippingRankSum=0.00;DP=98;FS=0.000;MLEAC=1;MLEAF=0.014;MQ=60.00;MQ0=0;MQRankSum=-1.026e+00;ReadPosRankSum=-1.026e+00 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,70 0/0:2,0:2:6:0,6,45 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 ./.:1,0:1 ./.:0,0:0 0/0:2,0:2:6:0,6,55 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,61 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,69 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,37 ./.:0,0:0 0/0:2,0:2:6:0,6,67 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:11,0:11:24:0,24,360 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 0/0:2,0:2:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,50 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:4:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:7,0:7:21:0,21,231 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,70 ./.:0,0:0 0/0:6,0:6:15:0,15,148 ./.:0,0:0 ./.:0,0:0 1/1:0,0:2:6:90,6,0 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,74 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,58 0/0:2,0:2:6:0,6,71 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 
For Exome Seq now:

chr2 111878571 . C T 93.21 PASS DP=634;FS=0.000;MLEAC=1;MLEAF=5.319e-03;MQ=60.00;MQ0=0;VQSLOD=14.19;culprit=MQ GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,243 0/0:4,0:4:9:0,9,135 0/0:7,0:7:18:0,18,270 0/0:7,0:7:21:0,21,230 0/0:16,0:16:48:0,48,542 0/0:8,0:8:21:0,21,315 0/0:6,0:6:18:0,18,186 0/0:5,0:5:15:0,15,168 0/0:6,0:6:15:0,15,225 0/0:10,0:10:30:0,30,333 0/0:7,0:7:21:0,21,239 0/0:6,0:6:18:0,18,202 0/0:6,0:6:15:0,15,225 0/0:7,0:7:21:0,21,225 0/0:8,0:8:24:0,24,272 0/0:5,0:5:15:0,15,168 1/1:0,0:13:13:147,13,0 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,256 0/0:14,0:14:4:0,4,437 0/0:3,0:3:9:0,9,85 0/0:4,0:4:12:0,12,159 0/0:7,0:7:21:0,21,238 0/0:5,0:5:15:0,15,195 0/0:7,0:7:15:0,15,225 0/0:12,0:12:36:0,36,414 0/0:4,0:4:12:0,12,156 0/0:7,0:7:0:0,0,190 0/0:2,0:2:6:0,6,64 0/0:7,0:7:21:0,21,242 0/0:7,0:7:21:0,21,234 0/0:8,0:8:24:0,24,267 0/0:7,0:7:21:0,21,245 0/0:7,0:7:21:0,21,261 0/0:6,0:6:18:0,18,204 0/0:8,0:8:24:0,24,302 0/0:5,0:5:15:0,15,172 0/0:9,0:9:24:0,24,360 0/0:18,0:18:51:0,51,649 0/0:5,0:5:15:0,15,176 0/0:2,0:2:6:0,6,70 0/0:14,0:14:33:0,33,495 0/0:4,0:4:9:0,9,135 0/0:8,0:8:21:0,21,315 0/0:4,0:4:12:0,12,149 0/0:4,0:4:6:0,6,90 0/0:10,0:10:27:0,27,405 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,133 0/0:14,0:14:6:0,6,431 0/0:4,0:4:12:0,12,151 0/0:5,0:5:15:0,15,163 0/0:3,0:3:9:0,9,106 0/0:7,0:7:21:0,21,237 0/0:7,0:7:21:0,21,268 0/0:8,0:8:21:0,21,315 0/0:2,0:2:6:0,6,68 ./.:0,0:0 0/0:3,0:3:9:0,9,103 0/0:7,0:7:21:0,21,230 0/0:3,0:3:6:0,6,90 0/0:9,0:9:26:0,26,277 0/0:7,0:7:21:0,21,236 0/0:5,0:5:15:0,15,170 ./.:1,0:1 0/0:15,0:15:45:0,45,653 0/0:8,0:8:24:0,24,304 0/0:6,0:6:15:0,15,225 0/0:3,0:3:9:0,9,103 0/0:2,0:2:6:0,6,79 0/0:7,0:7:21:0,21,241 0/0:4,0:4:12:0,12,134 0/0:3,0:3:6:0,6,90 0/0:5,0:5:15:0,15,159 0/0:4,0:4:12:0,12,136 0/0:5,0:5:12:0,12,180 0/0:11,0:11:21:0,21,315 0/0:13,0:13:39:0,39,501 0/0:3,0:3:9:0,9,103 0/0:8,0:8:24:0,24,257 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,280 0/0:4,0:4:12:0,12,144 0/0:4,0:4:9:0,9,135 0/0:8,0:8:24:0,24,298 0/0:4,0:4:12:0,12,129 
0/0:5,0:5:15:0,15,184 0/0:2,0:2:6:0,6,62 0/0:2,0:2:6:0,6,65 0/0:9,0:9:27:0,27,337 0/0:7,0:7:21:0,21,230 0/0:7,0:7:21:0,21,239 0/0:5,0:5:0:0,0,113 0/0:11,0:11:33:0,33,369 0/0:7,0:7:21:0,21,248 0/0:10,0:10:30:0,30,395

Thanks for your help.

Hi,

I encountered a ConcurrentModificationException on several BAM files using HaplotypeCaller 3.2-2 that ran just fine on 3.1 using the same number of threads. Here is the stack trace:

java.util.ConcurrentModificationException
    at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(LinkedHashMap.java:390)
    at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:409)
    at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:408)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.addMiscellaneousAllele(HaplotypeCallerGenotypingEngine.java:292)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:255)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:941)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:218)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

And here are the program args for one of them: Program Args: -T HaplotypeCaller -R human_genome/human_g1k_v37.fasta -I <my_bam> -ERC GVCF -nct 16 --dbsnp /tmp/4437712/dbsnp_135.b37.vcf --variant_index_type LINEAR --variant_index_parameter 128000 -o /tmp/4437712/<my_sample>.vcf.gz

Hi,

I encountered a NullPointerException in 3.2-2 using HaplotypeCaller on a BAM that ran just fine on 3.1. Here is the stack trace:

java.lang.NullPointerException
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:421)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:395)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.calculateGLsForThisEvent(HaplotypeCallerGenotypingEngine.java:421)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:257)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:941)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:218)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

And here are the program args:

Program Args: -T HaplotypeCaller -R human_genome/human_g1k_v37.fasta -I <my_bam> -ERC GVCF -nct 16 --dbsnp /tmp/4437712/dbsnp_135.b37.vcf --variant_index_type LINEAR --variant_index_parameter 128000 -o /tmp/4437712/<my_sample>.vcf.gz

Hi GATK developer,

I am working on a human WES project with an average sequencing depth of 42X per individual, using a recent version of GATK (v3.2-2). To investigate which "-minPruning" setting is better for my project, I ran HaplotypeCaller with both "-minPruning 3" and "-minPruning 2", generating two sets of gvcf files. I found some interesting cases when I compared the two sets.

For example in the same individual (we name this individual as A):

Locus "chr1 1654007" from the GVCF file for individual A (-minPruning 3) is shown below:

chr1 1654007 . A . . END=1654007 GT:DP:GQ:MIN_DP:PL 0/0:64:0:64:0,0,1922

Locus "chr1 1654007" from the GVCF file for individual A (-minPruning 2) is shown below:

chr1 1654007 rs199558549 A T,<NON_REF> 155.90 . DB;DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=27.73;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,6,0:6:18:189,18,0,189,18,189:0,0,4,2

You can see the results are quite different in terms of the final genotype call and the sequencing depth (the "DP" tag) between the "-minPruning 3" and "-minPruning 2" settings. Wondering what exactly happens at this locus, I checked it in IGV. I found that at this locus there are 6 reads supporting the ALT allele ("T") and 64 reads supporting the REF allele ("A"), so I believe the result from "-minPruning 3" is the correct one.
Is this normal given the different "-minPruning" settings, or is it a bug? If it is simply a consequence of the setting, then "-minPruning" is critical to the final results, and a guideline on how to choose it based on sequencing depth would be important. What do you think?
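For intuition: -minPruning prunes low-support paths from the assembly graph, and the support it counts is per graph edge, which can be much lower than the apparent read depth at the locus. A toy sketch (hypothetical edge-support value, not GATK internals) of why the two settings can disagree:

```shell
# Toy sketch (not GATK internals): -minPruning N drops assembly-graph
# paths whose weakest edge is supported by fewer than N reads/kmers.
# Assume a hypothetical ALT path whose weakest edge has support 2:
alt_edge_support=2
for minPruning in 2 3; do
  if [ "$alt_edge_support" -ge "$minPruning" ]; then
    echo "minPruning=$minPruning: ALT path kept, variant can be called"
  else
    echo "minPruning=$minPruning: ALT path pruned, only the reference haplotype remains"
  fi
done
```

Under this assumption the ALT path survives -minPruning 2 but is pruned at -minPruning 3, which matches the pattern of a variant call appearing only in the -minPruning 2 gvcf.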

Thanks,

Qiongyi

Hi,

I am using HC in GATK v3.2-2-gec30cee and trying to visualize the locally reassembled reads in IGV. For one sample, which is called hetRA, the BAM file produced by -bamout shows the reads and the variant as expected. For another, which is called homRR at this position, I don't see any reads at all, although MIN_DP in this block is 26 according to the GVCF. Viewing the input BAM to HC also shows a bunch of reads supporting the reference allele.

I tried using -AR and -forceActive to ensure the locus is processed, and different options for bamWriteType and output_mode. Still no success in getting the assembled reads in the bamout file. When running with the -debug flag, I can see that the site is processed and called homRR ("Found only the reference haplotype in the assembly graph"), but no reads are printed to the bamout. Is this the intended behaviour? Is there a way of getting these reads out into the bamout file?

Thank you, Pawel

Hi,

I get an error when using HaplotypeCaller (GATK version 3.2 and latest nightly build) on a specific BAM File:

java.lang.IndexOutOfBoundsException: Index: 28, Size: 6
    at java.util.LinkedList.checkElementIndex(Unknown Source)
    at java.util.LinkedList.get(Unknown Source)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.DanglingChainMergingGraph.mergeDanglingTail(DanglingChainMergingGraph.java:272)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.DanglingChainMergingGraph.recoverDanglingTail(DanglingChainMergingGraph.java:184)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.DanglingChainMergingGraph.recoverDanglingTails(DanglingChainMergingGraph.java:131)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.createGraph(ReadThreadingAssembler.java:202)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.assemble(ReadThreadingAssembler.java:114)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.LocalAssemblyEngine.runLocalAssembly(LocalAssemblyEngine.java:164)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.assembleReads(HaplotypeCaller.java:1022)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:882)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:218)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

It occurs in both normal VCF and GVCF mode:

java -XX:ParallelGCThreads=1 -Xmx4g -jar GenomeAnalysisTK.jar \
    -R hg19.fa -I error.bam \
    -L test2.bed \
    -T HaplotypeCaller -o test.gatk.ontarget.vcf

java -XX:ParallelGCThreads=1 -Xmx4g -jar GenomeAnalysisTK.jar \
    -R hg19.fa -I error.bam \
    -L test2.bed \
    -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o test.gatk.ontarget.gvcf

The BAM file was produced according to the best practice guide. I narrowed the error down to 15 reads (see attached files).

Best Regards, Thomas

Hello,

It seems the banded gvcf produced by HaplotypeCaller is missing many positions in the genome. We are not able to find half the specific positions we want to look at. To be clear, these positions are not even contained in any band, they are simply not there. For example, we were looking for position 866438 on chromosome 1 and these are the relevant lines in the gvcf:

1 866433 . G C,<NON_REF> 412.77 . BaseQRankSum=-1.085;DP=16;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.519;ReadPosRankSum=0.651 GT:AD:GQ:PL:SB 0/1:15,1,0:99:441,0,23572,546,23612,24157:0,0,0,0
1 866434 . G . . END=866435 GT:DP:GQ:MIN_DP:PL 0/0:22:45:21:0,45,675
1 866436 . ACGTG A,<NON_REF> 403.73 . BaseQRankSum=-0.510;DP=17;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.714;ReadPosRankSum=0.434 GT:AD:GQ:PL:SB 0/1:16,1,0:99:441,0,23572,546,23612,24157:0,0,0,0
1 866441 . A . . END=866441 GT:DP:GQ:MIN_DP:PL 0/0:21:0:21:0,0,402

We looked at the corresponding bam files and there seems to be good coverage for these positions. Moreover, if HaplotypeCaller is run in BP_RESOLUTION mode, then these positions are present in the resulting vcf. Any idea what might be going wrong?
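One thing worth checking before concluding a position is absent: a record whose REF allele is several bases long spans all of those bases, so position 866438 may in fact fall inside the ACGTG>A deletion record at 866436 (which covers 866436-866440). A toy sketch of that span check, using simplified versions of the records above (whitespace-delimited, INFO trimmed):

```shell
# Determine which gvcf record covers a target position: a record spans
# POS..END when an END= key is present, otherwise POS..POS+len(REF)-1
# (so a deletion like ACGTG>A at 866436 covers 866436-866440).
cat > toy.gvcf <<'EOF'
1 866433 . G C,<NON_REF> 412.77 . DP=16
1 866434 . G . . END=866435 .
1 866436 . ACGTG A,<NON_REF> 403.73 . DP=17
1 866441 . A . . END=866441 .
EOF
awk -v t=866438 '{
  start = $2
  end = start + length($4) - 1                 # default span: REF allele length
  if (match($0, /END=[0-9]+/))                 # banded reference block
    end = substr($0, RSTART + 4, RLENGTH - 4) + 0
  if (t >= start && t <= end)
    print "position " t " is covered by the record at " $1 ":" start
}' toy.gvcf
```

On this toy input the target 866438 is reported as covered by the record starting at 866436.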

Thanks,

Juber

Hi we've been looking at results of a recent run of GATK-HC (3.1-1) using the new N1 pipeline and we've been seeing something odd in the distribution of the Depth of Coverage (Using DP from the Genotype Fields) we're seeing for the raw unfiltered variants.

All our samples are sequenced using PCR-free libraries and have two lanes of sequence (~24x mapped depth). Looking at depth of coverage from bedtools, we see a nice clean distribution (red in the graph), but when we look at the data from the HaplotypeCaller sites (blue in the graph) we see a bimodal distribution, with an excess of variants called at lower coverage (~16x) versus the most common coverage of around 24x. We've seen this in all the samples we've looked at so far, so it's not just a one-off.

I've had a quick look at read depth from another variant caller (Platypus) and there we see no evidence of this bimodal distribution in the variants it has called.

Is this expected behaviour? If so why does it occur? If not any idea what is going on here, is it a bug in the variant caller or the depth statistics? Do you see the same thing in other datasets?
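For anyone wanting to reproduce the blue distribution, the genotype-level DP (not the INFO DP) can be pulled out of a VCF by looking up its index in the FORMAT column. A toy sketch with made-up, whitespace-delimited records:

```shell
# Extract the genotype-field DP for the first sample of each record.
# Field 9 is FORMAT, field 10 the first sample column; the position of
# DP within the colon-separated keys can vary per record.
cat > toy.vcf <<'EOF'
chr1 100 . G A 50 PASS DP=30 GT:AD:DP:GQ 0/1:12,12:24:99
chr1 200 . T C 60 PASS DP=20 GT:DP:GQ 0/1:16:87
EOF
awk '{
  n = split($9, fmt, ":")
  for (i = 1; i <= n; i++) if (fmt[i] == "DP") dpIdx = i
  split($10, gt, ":")
  print gt[dpIdx]
}' toy.vcf
```

Piping that output into `sort | uniq -c` (or a histogram tool) gives the per-site depth distribution to compare against bedtools.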

Thanks!

I'm running the HaplotypeCaller on a series of samples using a while loop in a bash script, and for some samples the HaplotypeCaller is stopping part way through the file. My command was:

java -Xmx18g -jar $Gpath/GenomeAnalysisTK.jar \
    -nct 8 \
    -l INFO \
    -R $ref \
    -log $log/$plate.$prefix.HaplotypeCaller.log \
    -T HaplotypeCaller \
    -I $bam/$prefix.realign.bam \
    --emitRefConfidence GVCF \
    -variant_index_type LINEAR \
    -variant_index_parameter 128000 \
    -o gvcf/$prefix.GATK.gvcf.vcf

Most of the samples completed and the output looks good, but for some I only have a truncated gvcf file with no index. When I look at the log it looks like this:

INFO 17:25:15,289 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:15,291 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 17:25:15,291 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:25:15,291 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:25:15,294 HelpFormatter - Program Args: -nct 8 -l INFO -R /home/owens/ref/Gasterosteus_aculeatus.BROADS1.73.dna.toplevel.fa -log /home/owens/SB/C31KCACXX05.log/C31KCACXX05.sb1Pax102L-S2013.Hap
INFO 17:25:15,296 HelpFormatter - Executing as owens@GObox on Linux 3.2.0-63-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_17-b02.
INFO 17:25:15,296 HelpFormatter - Date/Time: 2014/06/10 17:25:15
INFO 17:25:15,296 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:15,296 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:15,722 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:25:15,892 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 17:25:15,898 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 17:25:15,942 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO 17:25:15,948 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 17:25:15,993 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 12 processors available on this machine
INFO 17:25:16,097 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 17:25:16,114 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:25:16,114 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:25:16,114 ProgressMeter - Location processed.active regions runtime per.1M.active regions completed total.runtime remaining
INFO 17:25:16,114 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
INFO 17:25:16,116 HaplotypeCaller - All sites annotated with PLs force to true for reference-model confidence output
INFO 17:25:16,278 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 17:25:46,116 ProgressMeter - scaffold_1722:1180 1.49e+05 30.0 s 3.3 m 0.0% 25.6 h 25.6 h
INFO 17:26:46,117 ProgressMeter - scaffold_279:39930 1.37e+07 90.0 s 6.0 s 3.0% 50.5 m 49.0 m
INFO 17:27:16,118 ProgressMeter - scaffold_139:222911 2.89e+07 120.0 s 4.0 s 6.3% 31.7 m 29.7 m
INFO 17:27:46,119 ProgressMeter - scaffold_94:517387 3.89e+07 2.5 m 3.0 s 8.5% 29.2 m 26.7 m
INFO 17:28:16,121 ProgressMeter - scaffold_80:591236 4.06e+07 3.0 m 4.0 s 8.9% 33.6 m 30.6 m
INFO 17:28:46,123 ProgressMeter - groupXXI:447665 6.07e+07 3.5 m 3.0 s 13.3% 26.4 m 22.9 m
INFO 17:29:16,395 ProgressMeter - groupV:8824013 7.25e+07 4.0 m 3.0 s 17.6% 22.7 m 18.7 m
INFO 17:29:46,396 ProgressMeter - groupXIV:11551262 9.93e+07 4.5 m 2.0 s 24.0% 18.7 m 14.2 m
WARN 17:29:52,732 ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at groupX:1516679 has 8 alternate alleles so only the top alleles
INFO 17:30:19,324 ProgressMeter - groupX:14278234 1.15e+08 5.1 m 2.0 s 27.9% 18.1 m 13.0 m
INFO 17:30:49,414 ProgressMeter - groupXVIII:5967453 1.46e+08 5.6 m 2.0 s 33.0% 16.8 m 11.3 m
INFO 17:31:19,821 ProgressMeter - groupXI:15030145 1.63e+08 6.1 m 2.0 s 38.5% 15.7 m 9.7 m
INFO 17:31:50,192 ProgressMeter - groupVI:5779653 1.96e+08 6.6 m 2.0 s 43.8% 15.0 m 8.4 m
INFO 17:32:20,334 ProgressMeter - groupXVI:18115788 2.13e+08 7.1 m 1.0 s 50.1% 14.1 m 7.0 m
INFO 17:32:50,335 ProgressMeter - groupVIII:4300439 2.50e+08 7.6 m 1.0 s 55.1% 13.7 m 6.2 m
INFO 17:33:30,336 ProgressMeter - groupXIII:2378126 2.89e+08 8.2 m 1.0 s 63.1% 13.0 m 4.8 m
INFO 17:34:02,099 GATKRunReport - Uploaded run statistics report to AWS S3

It seems like it got half way through and stopped. I think it's a memory issue, because when I increase the RAM available to java the problem happens less often, although I can't figure out why some samples work and others don't (there isn't anything else running on the machine using RAM, and it isn't the biggest BAM files that are failing). It's also strange to me that there doesn't seem to be an error message. Any insight into why this is happening and how to avoid it would be appreciated.

Hi, GATK Team

I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel:

6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214

This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position:

6 16327909 . ATGCTGATGCTGC ATGC,CTGCTGATGCTGC,A 1390.70 . 
AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358 GT:AD:DP:GQ:PL

However, this mode of representing a SNP is causing processing and analysis problems further downstream, after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way?

Thanks,

Grace

When I run GATK with identical settings on some amplicon sequencing data from MiSeq (150KB region), I get different numbers of variants (approximately 10% difference), even after setting -dfrac to 1. What is the cause of this variation, and how can I make the results reproducible? Thank you so much!

RESULT

8.1.vcf and 8.2.vcf are the raw VCFs from two runs with filters applied:

java -Xmx4g -jar gatk.jar -T CombineVariants -R human_g1k_v37.fasta -o 8merge_union.vcf -V 8.1.vcf -V 8.2.vcf
grep Intersection 8merge_union.vcf > 8merge_intersection.vcf
grep -v Intersection 8merge_union.vcf > 8merge_NOintersection.vcf
grep PASS 8merge_NOintersection.vcf > 8merge_NOintersection.PASS.vcf
wc -l *vcf
711 8.1.vcf
642 8.2.vcf
308 8merge_intersection.vcf
86 8merge_NOintersection.PASS.vcf
462 8merge_NOintersection.vcf
770 8merge_union.vcf

Please find 8.1.vcf and 8.2.vcf attached.
COMMAND:

##realign
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T RealignerTargetCreator -I 8.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -known ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf -o 8.intervals -nt 6 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T IndelRealigner -I 8.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -targetIntervals 8.intervals --out 8.realign.bam -known ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf --consensusDeterminationModel USE_READS -LOD 0.4 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

##base recal
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T BaseRecalibrator -I 8.realign.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --default_platform ILLUMINA -knownSites ~public/project/seqlib/gatk/dbsnp_137.b37.vcf -knownSites ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf -nct 6 -o 8.recal_data -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx6g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T PrintReads -I 8.realign.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -o 8.recal.bam -BQSR 8.recal_data -nct 1 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

#variant calling
java -Xmx8g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -I 8.recal.bam -o 8.raw.vcf --dbsnp ~public/project/seqlib/gatk/dbsnp_137.b37.vcf -nct 6 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

#variant filtering
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T SelectVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.vcf -o 8.raw.indel.vcf -selectType INDEL -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T SelectVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.vcf -o 8.raw.snp.vcf -selectType SNP -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

#use hard filtering due to small input file
#SNP
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T VariantFiltration -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.snp.vcf --filterName QDFilter --filterExpression 'QD<2.0' --filterName FSFilter --filterExpression 'FS>60.0' --filterName MQFilter --filterExpression 'MQ<40.0' --filterName MaqQualRankSumFilter --filterExpression 'MappingQualityRankSum<-12.5' --filterName ReadPosFilter --filterExpression 'ReadPosRankSum<-8.0' -o 8.filter.snp.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
#indel
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T VariantFiltration -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.indel.vcf --clusterWindowSize 10 --filterExpression 'QD<2.0' --filterName QDFilter --filterExpression 'ReadPosRankSum<-20.0' --filterName ReadPosFilter --filterExpression 'FS>200.0' --filterName FSFilter -o 8.filter.indel.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T CombineVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.filter.snp.vcf --variant 8.filter.indel.vcf -o 8.filter.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is
reporting ~10x lower DP and AD values compared to the reads visible in the IGV browser and reported by the UnifiedGenotyper. I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired-end. The coverage is ~5,000x. My pipeline is: Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    --dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
    -R /gatk_bundle/ucsc.hg19.fasta \
    -I sample1.rg.bam \
    -o sample1.HC.vcf \
    -L ROI.bed \
    -dt NONE \
    -nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

...

In comparison, using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    --dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
    -R /gatk_bundle/ucsc.hg19.fasta \
    -I sample1.rg.bam \
    -o sample1.UG.vcf \
    -L ROI.bed \
    -nct 4 \
    -dt NONE \
    -glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of alignments in the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100)) print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0
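That MAPQ histogram can be reproduced as a single self-contained sketch on made-up SAM lines (column 3 is the contig, column 4 the position, column 5 the MAPQ; the window check and the counting are folded into one awk):

```shell
# Histogram of mapping qualities (SAM column 5) for reads starting
# within a 200nt window around chr17:41245466 -- toy data, not the
# original sample1.rg.sam.
cat > toy.sam <<'EOF'
r1 0 chr17 41245400 70 100M
r2 0 chr17 41245410 70 100M
r3 0 chr17 41245420 0 100M
r4 0 chr17 41245999 70 100M
EOF
awk '$3 == "chr17" && $4 > (41245466-100) && $4 < (41245466+100) {
       count[$5]++
     }
     END { for (q in count) print count[q], q }' toy.sam | sort -nr
```

On this toy input, two reads at MAPQ 70 and one at MAPQ 0 fall inside the window; the fourth read lies outside it.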


With other data generated in our lab that has ~200x coverage and the same assay principle (just more amplicons), the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and IGV.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Sam

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.
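A quick way to enumerate exactly which requested sites are absent from the output is to compare chrom:pos keys from the two files. A toy sketch with made-up, whitespace-delimited records (real VCFs would need header lines stripped first):

```shell
# List positions present in the intervals VCF but absent from the
# output VCF, by comparing sorted chrom:pos keys with comm.
cat > requested.vcf <<'EOF'
chr1 100 . G A
chr1 200 . T C
chr2 300 . A G
EOF
cat > emitted.vcf <<'EOF'
chr1 100 . G A
chr2 300 . A G
EOF
awk '{print $1":"$2}' requested.vcf | sort > req.keys
awk '{print $1":"$2}' emitted.vcf  | sort > emit.keys
comm -23 req.keys emit.keys        # keys only in the requested file
```

On the toy data this prints the single missing site, chr1:200; applied to a real run it gives the exact list of sites to investigate.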

Thanks,

Elise

Hey there!

I've been using HaplotypeCaller as part of a new whole genome variant calling pipeline I'm working on and I had a question about the number of samples to use. From my tests, it seems like increasing the number of samples to run HaplotypeCaller on simultaneously improves the accuracy no matter how many samples I add (I've tried 1, 4, and 8 samples at a time). Before I tried 16 samples, I was wondering if you could tell me if there's a point of diminishing returns for adding samples to HaplotypeCaller. It seems like with every sample I add, the time/sample increases, so I don't want to keep adding samples if it's not going to result in an improved call set, but if it does improve the results I'll deal with it and live with the longer run times. I should note that I'm making this pipeline for an experiment where there will be up to 50 individuals, and of those, there are family groups of 3-4 people. If running HaplotypeCaller on all 50 simultaneously would result in the best call set, that's what I'll do. Thanks! (By the way, I love the improvements you made with 2.5!)

• Grant

I'm trying to run HaplotypeCaller on a haploid organism. Is this possible? What argument should I use for this? My first attempt produced diploid calls.

Sorry for the silly question