# Tagged with #haplotypecaller 3 documentation articles | 2 announcements | 94 forum discussions

Created 2015-04-26 02:02:50 | Updated 2015-04-26 02:08:43 | Tags: tutorials haplotypecaller bamout debugging

### 1. Overview

As you may know, HaplotypeCaller performs a local reassembly and realignment of the reads in the region surrounding potential variant sites (see the HaplotypeCaller method docs for more details on why and how this is done). So it often happens that during the calling process, the reads get moved to different mapping positions than what you can observe in the BAM file that you originally provided to HC as input.

These remappings usually explain most discordances between calls that are expected based on the original data and actual calls made by HaplotypeCaller, so it's very useful to be able to visualize what rearrangements the tool has made.

### 2. Generating the bamout for a single site or interval

To generate the bamout file for a specific site or interval, just run HaplotypeCaller on the region around the site or interval of interest using the -L argument to restrict the analysis to that region (adding about 500 bp on either side) and using the -bamout argument to specify the name of the bamout file that will be generated.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -L 20:10255630-10255840 -bamout bamout.bam


If you were using any additional parameters in your original variant calling (including -ERC and related arguments), make sure to include them in this command as well so that you can make an apples-to-apples comparison.

Then you open up both the original bam and the bamout file together in a genome browser such as IGV. On some test data from our favorite sample, NA12878, this is what you would see:

You can see that the bamout file, on top, contains data only for the ActiveRegion that was within the analysis interval specified by -L. The two blue reads represent the artificial haplotypes constructed by HaplotypeCaller (you may need to adjust your IGV settings to see the same thing on your machine).

You can see a whole group of reads neatly aligned, with an insertion in the middle. In comparison, the original data shown in the lower track has fewer reads with insertions, but has several reads with mismapped ends. This is a classic example of a site where realignment through reassembly has provided additional evidence for an indel, allowing HaplotypeCaller to call it confidently. In contrast, UnifiedGenotyper was not able to call this insertion confidently.

### 3. Generating the bamout for multiple intervals or the whole genome

Although we don't recommend doing this by default because it will cause slower performance and take up a lot of storage space, you can generate a bamout that contains many more intervals, or even covers the whole genome. To do so, just run the same command, but this time, pass your list of intervals to -L, or simply omit it if you want the entire genome to be included.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -bamout bamout.bam


This time, if you zoom out a bit in IGV, you will see multiple stacks of reads corresponding to the various ActiveRegions that were identified and processed.

### 4. Forcing an output in a region that is not covered in the bamout

In some cases HaplotypeCaller does not complete processing on an ActiveRegion that it has started. This is typically because there is either almost no evidence of variation once the remapping has been done, or on the contrary, the region is very messy and there is too much complexity. In both cases, the program is designed to give up in order to avoid wasting time. This is a good thing most of the time, but it does mean that sometimes you will have no output in the bamout for the site you are trying to troubleshoot.

The good news is that in most cases it is possible to force HaplotypeCaller to go through with the full processing so that it will produce bamout output for your site of interest. To do so, simply add the flags -forceActive and -disableOptimizations to your command line, in addition to the -bamout argument of course.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -L 20:10371667-10375021 -o hc_forced.vcf -bamout force_bamout.bam -forceActive -disableOptimizations


In this other region, you can see that the original mapping (middle track) was a bit messy with some possible evidence of variation, and in fact UnifiedGenotyper called a SNP in this region (top variant track). But HaplotypeCaller did not call the SNP, and did not output anything in our first bamout file (top read track). When you force an output in that region using the two new flags, you see in the forced bamout (bottom read track) that the remapped data is a lot cleaner and the evidence for variation is essentially gone.

It is also possible to force an ActiveRegion to be triggered at specific intervals; see the HaplotypeCaller tool docs for more details on how this is done.

Created 2014-04-03 20:20:08 | Updated 2014-10-22 19:22:34 | Tags: haplotypecaller genotypegvcfs combinegvcfs gvcf joint-analysis

### Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

### Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

### General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

### The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

### Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta


Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)


which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

#### Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0


Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Created 2012-07-30 17:37:12 | Updated 2015-04-26 02:30:27 | Tags: unifiedgenotyper haplotypecaller snp bamout debugging

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

### 1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

### 2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.

### 3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

### 4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.

### 5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

### 6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site, you are probably seeing false positives.

Created 2015-05-15 04:52:05 | Updated | Tags: haplotypecaller release-notes genotypegvcfs gatk3

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights to be published soon.

Note that the release is in progress at time of posting -- it may take a couple of hours before the new GATK jar file is updated on the downloads page.

### New tool

• ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

### HaplotypeCaller & GenotypeGVCFs

• Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
• Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
• Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
• Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
• Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
• Initialize annotations so that --disableDithering actually works.
• Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
• Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
• Added the RGQ (Reference GenotypeQuality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic ohm-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
• Removed GenotypeSummaries from default annotations.
• Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
• Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
• Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
• Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
• Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
• Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
• Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
• Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).

### CombineGVCFs

• Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
• Fixed a bug that caused the tool to stop processing after the first contig.
• Fixed a bug where the wrong REF allele was output to the combined gVCF.

### VariantRecalibrator

• Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
• VQSR VCF header command line now contains annotations and tranche levels.

### SelectVariants

• Added -trim argument to trim (simplify) alleles to a minimal representation.
• Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
• Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
• Added --keepOriginalDP argument.

### VariantAnnotator

• Improvements to the allele trimming functionalities.
• Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.

### CalculateGenotypePosteriors

• Fixed user-reported bug featuring "trio" family with two children, one parent.
• Added error handling for genotypes that are called but have no PLs.

### Various tools

• BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
• DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
• SplitSamFile: Fixed a bug that caused a NullPointerException.
• SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
• IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
• CombineVCFs: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
• VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
• GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
• ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
• CatVariants: Now allows different input / output file types, and spaces in directory names.
• VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.

• Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
• Handle X and = CIGAR operators appropriately
• Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).

### Annotations

• Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
• Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
• Changed GC Content value type from Integer to Float.
• Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
• Made annotators emit a warning if they can't be applied.

### GATK Engine & common features

• Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
• Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
• Handle X and = CIGAR operators appropriately.
• Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
• Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now going to stderr.
• Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
• Fixed BCF writing when FORMAT annotations contain arrays.

### Queue

• Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
• Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
• Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.

### Documentation

• Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

### For developers

• Upgraded SLF4J to allow new convenient logging syntaxes.
• Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
• Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
• Collected VCF IDs and header lines into one place (GATKVCFConstants).

Created 2014-08-27 18:39:39 | Updated 2014-12-16 03:19:38 | Tags: haplotypecaller ploidy haploid polyploid beta

Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.

#### UPDATE:

We have added omniploidy support to the GVCF handling tools, with the following limitations:

• When running, you need to indicate the sample ploidy that was used to generate the GVCFs with -ploidy. As usual 2 is the default ploidy.

• The system does not support mixed ploidy across samples nor positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have some genotype whose ploidy does not match the -ploidy argument.

#### LATEST UPDATE:

As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.

Created 2015-10-03 00:45:25 | Updated | Tags: haplotypecaller rna-seq

Hi, I am trying to call SNPs from RNA-seq data. The data that I have is a pooled sample (from the larvae of shellfish 1000s of larvae pooled together to get enough RNA) I have 6 of those samples. Can I use GATK to call SNPs in these pooled samples ?

Hamdi

Created 2015-09-30 17:06:21 | Updated | Tags: haplotypecaller gvcf quality-score

After applying the standard RNA-Seq pipeline (with STAR, etc) I called varients with the command:

java -jar GenomeAnalysisTK.jar
-T HaplotypeCaller
-R chromosome.fa
-I ./final.bam
-dontUseSoftClippedBases
--variant_index_type LINEAR
--variant_index_parameter 128000
--emitRefConfidence GVCF -o ./final.gvcf


On the resultant gVCF file, I ran a little python script to see the distribution of calling quality across the different called genotypes:

• x-axis is quality score rounded to the nearest integer
• y-axis is the number of variants at that quality score 

As you can see, its mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice. What i didn't expect however is the periodicity. Is that normal? Now I presumably I need to filter these variants on some number of quality score, and from this I really dont know where. 0? 50? 75?

Code to generate this data:

#!/usr/bin/env python2.7
import collections
with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf', 'rb') as f:
data = {}
for line in f:
if line[0] == '#': continue
line = line.split('\t')
if line[5] == '.': continue
gt = line[9][:3]
try: data[gt][int(float(line[5]))] += 1
except KeyError: data[gt] = collections.defaultdict(int)
for gt,qualities in data.items():
print '\n',gt
for qual,count in sorted(qualities.items()):
print qual,count


Created 2015-09-27 22:16:57 | Updated | Tags: haplotypecaller fasta

Hi,

I have tried to solve several issues which came up while trying to run the HaplotypeCaller. For this one, I didn't find anything on google and to be honest when pasting the error, google doesn't even find something similar.

ERROR MESSAGE: Badly formed genome loc: Contig NC_007605 given as location, but this contig isn't present in the Fasta sequence dictionary

Can anyone please tell me what's the problem here? The fasta file I got was the one downloaded from the bundle: human_g1k_v37.fasta.gz

Any help would be really appreciated. Thank you!!

Created 2015-09-22 08:56:30 | Updated | Tags: haplotypecaller strandoddsratio sor

I am using GATK HaplotypeCaller to call variation with the following command: java -Xmx20g -jar GenomeAnalysisTK.jar -l INFO -R hg19.fa -T HaplotypeCaller -nct 16 -I D-2.realigned.recal.bam -I D-3.realigned.recal.bam -I D-4.realigned.recal.bam --dbsnp hg19_GATK_snp137.vcf -o D-2_D-3_D-4.raw.vcf -A StrandOddsRatio -A AlleleBalance -A BaseCounts -A StrandBiasBySample -A FisherStrand

However, there is a problem in the VCF file generated by this command. The SOR information in the header definition line did not exist in the mutation list. ##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias"> chr1 13116 rs201725126 T G 94.57 . AC=1;AF=0.250;AN=4;BaseQRankSum=1.754;DB;DP=7;FS=0.000;MLEAC=1;MLEAF=0.250;MQ=29.47;MQ0=0;MQRankSum=-1.754;QD=23.64;ReadPosRankSum=-0.550 GT:AD:GQ:PL:SB 0/0:3,0:9:0,9,191:0,0,0,0 0/1:1,3:44:123,0,44:0,0,0,0 ./.

I have tried VariantAnnotator but still got the same problem.

Could you please tell me where the problem exist and how to solve it?

Thanks !

Created 2015-09-17 00:02:05 | Updated 2015-09-17 00:02:45 | Tags: haplotypecaller rnaseq gatk-walkers

Hi, I have been running HP for my RNA-seq data

java -Xmx16g -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R $ref \ -I INPUT.bam \ -stand_call_conf 50.0 \ -stand_emit_conf 10.0 \ -o output.vcf my process is killed when it was 82% completed. Is there a way to resume the run without running from the beginning ? Thanks Best Regards T. Hamdi Kitapci Created 2015-09-16 13:37:51 | Updated | Tags: haplotypecaller dbsnp Hi, I am having the following problem: I use the HaplotypeCaller (GATK 3.3.0) for variant calling. To identify variants that are known according to dbSNP, I use the "--dbsnp" statement and define a dbSNP file (vcf file). I thought, that everything would work fine, but by coincidence I observed a (in my eyes really serious) problem: The same call is recognized in the case of one sample, but not in the case of another sample. These are the two important lines of the vcf files that get reported: 17 7579643 . CCCCCAGCCCTCCAGGT C 5066.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=4.819;ClippingRankSum=-1.054;DP=231;FS=78.565;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-0.994;QD=21.93;ReadPosRankSum=-5.473;SOR=1.639;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:23,207:230:99:5104,251,0 17 7579643 rs59758982 CCCCCAGCCCTCCAGGT C 2868.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=3.120;ClippingRankSum=0.256;DB;DP=134;FS=1.120;MLEAC=2;MLEAF=1.00;MQ=59.91;MQ0=0;MQRankSum=1.849;QD=21.41;ReadPosRankSum=-1.285;SOR=0.704;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:13,121:134:96:2906,96,0 As we exclude known variants for our analysis, it is essential that this step works correctly. Yet, I am pretty insecure what to do no. The variant seems to be well known (according to information on the ncbi homepage). Yet, why was it not identified in the other sample??? It would be great if anyone could help me. Many thanks in advance! Sarah Created 2015-09-16 13:26:09 | Updated | Tags: haplotypecaller bamout Hi all, I used multi-threading mode on HaplotypeCaller hoping to save some time. But seemed like bamout can not be emitted in multi-threading mode. I searched the answers. But I am still not sure if the latest 3.4-46 version can support multi-threading with bamout. BTW, I am still using the old 3.3-0 version. If you say yes, now 3.4 version can support multi-threading bam, then I will ask the computing core to update gatk for me. Or maybe I just delete the bamout option to save some time. But I really prefer not to do so because I need to check the depth and coverage of mapping results actually finally used for variant calling. My command line: java -Xmx12g -jar$GATK_JARS/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -nct 12 \ -R human_g1k_v37.fasta \ --dbsnp dbsnp_138.b37.vcf \ -I recal_realigned_b37.dedup.sorted.bam \ --genotyping_mode DISCOVERY \ -stand_emit_conf 10 \ -stand_call_conf 20 \ --emitRefConfidence GVCF \ --variant_index_type LINEAR \ --variant_index_parameter 128000 \ -o raw_var_TKDOME.g.vcf \ -bamout force_bamout_TKDOME_b37.bam -forceActive -disableOptimizations BTW, is it necessary to add --variant_index_type LINEAR and --variant_index_parameter 128000 in Version 3.3? Thank you very much!

Created 2015-09-16 07:43:20 | Updated | Tags: haplotypecaller splitncigarreads-error

First of all I thank for the Tool , I am using this GATK var calling for my RNA-seq data.. I have been following the commands said in the site but its stopping me at the splitting BAM file step with the following error,

ERROR ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ERROR ERROR MESSAGE: Badly formed genome loc: Contig * given as location, but this contig isn't present in the Fasta sequence dictionary

The command I used is, /opt/husar/bin/java-1.7 -jar /GenomeAnalysisTK-3.2-2.jar -T SplitNCigarReads -R /human_genome37_gatk.fa -I BM_ID_reorder.bam -o BM_ID_split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

I tried do variant calling on the duplicate removed BAM file, which also throwed error message as,

##### ERROR -

The command line I used for this, /opt/husar/bin/java-1.7 -jar -Xincgc -Xmx1586M $NGSUTILDIR/java/GenomeAnalysisTK-3.2-2.jar -T HaplotypeCaller -R /human_genome37_gatk.fa -I BM_ID_reorder.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o BM_ID.vcf Created 2015-09-15 01:47:02 | Updated | Tags: unifiedgenotyper haplotypecaller @Geraldine_VdAuwera and @Sheila - Please help! I've read through many of your posts/responses regarding HaplotypeCaller not calling variants, and tried many of the suggestions you've made to others, but I'm still missing variants. My situation is a little different (I'm trying to identify variants from Sanger sequence reads) but I'm hoping you might have additional ideas or can see something I've overlooked. I hope I haven't given you too much information below, but I've seen it mentioned that too much info is better than not enough. A while back, I generated a variant call set from Illumina Next Gen Sequencing data using UnifiedGenotyper (circa v2.7.4), identifying ~46,000 discordant variants between the genomes of two haploid strains of S. cerevisiae. Our subsequent experiments included Sanger sequencing ~95 kb of DNA across 17 different loci in these two strains. I don't think any of the SNP calls were false positives, and there were very, very few were false negatives. Since then, we've constructed many strains by swapping variants at these loci between these two strains of yeast. To check if a strain was constructed successfully, we PCR the loci of interest, and Sanger sequencing the PCR product. I'm trying to use GATK (version 3.4-46) HaplotypeCaller (preferably, or alternatively UnifiedGenotyper) in a variant detection pipeline to confirm a properly constructed strain. I convert the .ab1 files to fastqs using EMBOSS seqret, map the Sanger reads using bwa mem ($ bwa mem -R $RG$refFasta $i >${outDir}/samFiles/{fileBaseName}.sam), merge the sam files for each individual, and then perform the variant calling separately for each individual. I do not dedup (I actually intentionally leave out the -M flag in bwa), nor do I realign around indels (I plan to in the future, but there aren't any indels in any of the regions we are currently looking at), or do any BQSR in this pipeline. Also, when I do the genotyping after HaplotypCaller, I don't do joint genotyping, each sample (individual) gets genotyped individually. In general, this pipeline does identify many variants from the Sanger reads, but I'm still missing many variant calls that I can clearly see in the Sanger reads. Using a test set of 36 individuals, I examined the variant calls made from 364 Sanger reads that cover a total of 63 known variant sites across three ~5kb loci (40 SNPs in locus 08a-s02, 9 SNPs in locus 10a-s01, 14 SNPs in locus 12c-s02). Below are some example calls to HaplotypeCaller and UnifiedGenotyper, as well as a brief summary statement of general performance using the given command. I've also included some screenshots from IGV showing the alignments (original bam files and bamOut files) and SNP calls from the different commands. Ideally, I'd like to use the HaplotypeCaller since not only can it give me a variant call with a confidence value, but it can also give me a reference call with a confidence value. And furthermore, I'd like to stay in DISCOVERY mode as opposed to Genotype Given Alleles, that way I can also assess whether any experimental manipulations we've performed might have possibly introduced new mutations. Again, I'm hoping someone can advice me on how to make adjustments to reduce the number of missed calls. Call 1: The first call to HaplotypeCaller I'm showing produced the least amount of variant calls at sites where I've checked the Sanger reads. java -Xmx4g -jargatkJar \
-R $refFasta \ -T HaplotypeCaller \ -I$inBam \
-ploidy 1 \
-nct 1 \
-bamout {inBam%.bam}_hapcallRealigned.bam \ -forceActive \ -disableOptimizations \ -dontTrimActiveRegions \ --genotyping_mode DISCOVERY \ --emitRefConfidence BP_RESOLUTION \ --interval_padding 500 \ --intervalsoutDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_hc_bp_raw.g.vcf


Call 2: I tried a number of different -kmerSize values [(-kmerSize 10 -kmerSize 25), (-kmerSize 9), (-kmerSize 10), (-kmerSize 12), (-kmerSize 19), (-kmerSize 12 -kmerSize 19), (maybe some others). I seemed to have the best luck when using -kmerSize 12 only; I picked up a few more SNPs (where I expected them), and only lost one SNP call as compared Call 1.

java -Xmx4g -jar $gatkJar \ -R$refFasta \
-T HaplotypeCaller \
-I $inBam \ -ploidy 1 \ -nct 1 \ -bamout${inBam%.bam}_kmer_hapcallRealigned.bam \
-forceActive \
-disableOptimizations \
-dontTrimActiveRegions \
--genotyping_mode DISCOVERY \
--emitRefConfidence BP_RESOLUTION \
--intervals $outDir/tmp.intervals.bed \ --min_base_quality_score 5 \ --standard_min_confidence_threshold_for_calling 0 \ --standard_min_confidence_threshold_for_emitting 0 \ -kmerSize 12 \ -A VariantType \ -A SampleList \ -A AlleleBalance \ -A BaseCounts \ -A AlleleBalanceBySample \ -o$outDir/vcfFiles/${fileBaseName}_hc_bp_kmer_raw.g.vcf  Call 3: I tried adjusting --minPruning 1 and --minDanglingBranchLength 1, which helped more than playing with kmerSize. I picked up many more SNPs compared to both Call 1 and Call 2 (but not necessarily the same SNPs I gained in Call 2). java -Xmx4g -jar$gatkJar \
-R $refFasta \ -T HaplotypeCaller \ -I$inBam \
-ploidy 1 \
-nct 1 \
-bamout {inBam%.bam}_adv_hapcallRealigned.bam \ -forceActive \ -disableOptimizations \ -dontTrimActiveRegions \ --genotyping_mode DISCOVERY \ --emitRefConfidence BP_RESOLUTION \ --interval_padding 500 \ --intervalsoutDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
--minPruning 1 \
--minDanglingBranchLength 1 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_hc_bp_adv_raw.g.vcf


Call 4: I then tried adding both --minPruning 1 --minDanglingBranchLength 1 and -kmerSize 12 all at once, and I threw in a --min_mapping_quality_score 5. I maybe did slightly better... than in Calls 1-4. I did actually lose 1 SNP compared to Calls 1-4, but I got most of the additional SNPs I got from using Call 3, as well as some of the SNPs I got from using Call 2.

java -Xmx4g -jar $gatkJar \ -R$refFasta \
-T HaplotypeCaller \
-I $inBam \ -ploidy 1 \ -nct 1 \ -bamout${inBam%.bam}_hailMary_raw.bam \
-forceActive \
-disableOptimizations \
-dontTrimActiveRegions \
--genotyping_mode DISCOVERY \
--emitRefConfidence BP_RESOLUTION \
--intervals $outDir/tmp.intervals.bed \ --min_base_quality_score 5 \ --min_mapping_quality_score 10 \ --standard_min_confidence_threshold_for_calling 0 \ --standard_min_confidence_threshold_for_emitting 0 \ --minPruning 1 \ --minDanglingBranchLength 1 \ -kmerSize 12 \ -A VariantType \ -A SampleList \ -A AlleleBalance \ -A BaseCounts \ -A AlleleBalanceBySample \ -o$outDir/vcfFiles/${fileBaseName}_hailMary_raw.g.vcf  Call 5: As I mentioned above, I've experience better performance (or at least I've done a better job executing) with UnifiedGenotyper. I actually get the most SNPs called at the known SNP sites, in individuals where manual examination confirms a SNP. java -Xmx4g -jar$gatkJar \
-R $refFasta \ -T UnifiedGenotyper \ -I$inBam \
-ploidy 1 \
--output_mode EMIT_ALL_SITES \
-glm BOTH \
-dt NONE -dcov 0 \
-nt 4 \
-nct 1 \
--intervals $outDir/tmp.intervals.bed \ --interval_padding 500 \ --min_base_quality_score 5 \ --standard_min_confidence_threshold_for_calling 0 \ --standard_min_confidence_threshold_for_emitting 0 \ -minIndelCnt 1 \ -A VariantType \ -A SampleList \ -A AlleleBalance \ -A BaseCounts \ -A AlleleBalanceBySample \ -o$outDir/vcfFiles/${fileBaseName}_ug_emitAll_raw.vcf  I hope you're still with me :) None of the above commands are calling all of the SNPs that I (maybe naively) would expect them to. "Examples 1-3" in the first attached screenshot are three individuals with reads (two reads each) showing the alternate allele. The map quality scores for each read are 60, and the base quality scores at this position for individual #11 are 36 and 38, and for the other individuals, the base quality scores are between 48-61. The reads are very clean upstream of this position, the next upstream SNP is about ~80bp away, and the downstream SNP at the position marked for "Examples 4-6" is ~160 bp away. Commands 1 and 2 do not elicit a SNP call for Examples 1-6, Command 3 get the calls at both positions for individual 10, Command 4 for gets the both calls for individuals 10 and the upstream SNP for individual 11. Command 5 (UnifiedGenotyper) gets the alt allele called in all 3 individuals at the upstream position, and the alt allele called for individuals 10 and 12 at the downstream position. Note that in individual 11, there is only one read covering the downstream variant position, where UnifiedGenotyper missed the call. Here is the vcf output for those two positions from each command. Note that there are more samples in the per-sample breakdown for the FORMAT tags. The last three groups of FORMAT tags correspond to the three individuals I've shown in the screenshots. Command 1 output Examples 1-3 649036 . G . . . AN=11;DP=22;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ . . . . . . . . . . .:0:0:0 0:0:2:0 0:2:2:89 0:0:2:0 0:2:2:84 0:0:2:0 0:2:2:89 0:0:2:0 0:2:2:89 0:0:2:0 0:0:2:0 0:0:2:0 Examples 4-6 649160 . C . . . AN=11;DP=21;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ . . . . . . . . . . .:0:0:0 0:0:2:0 0:2:2:89 0:0:2:0 0:2:2:0 0:0:2:0 0:2:2:71 0:0:2:0 0:2:2:44 0:0:2:0 0:0:1:0 0:0:2:0  Command 2 output Examples 1-3 649036 . G A 26.02 . ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hc_bp_kmer_raw.vcf GT:AD:DP:GQ:PL:RGQ . . . . . . . . .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89 1:0,1:.:45:45,0 0:2:2:.:.:84 1:0,1:.:45:45,0 0:2:2:.:.:89 1:0,1:.:45:45,0 0:2:2:.:.:89 1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0 Examples 4-6 649160 . C A 13.22 . AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf GT:AD:DP:GQ:PL:RGQ . . . . . . . . . . . . . . . . . . . .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89 1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71 1:0,0,1:.:37:37,0 0:2:2:.:.:44 1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0  Command 3 output Examples 1-3 649036 . G A 36.01 . ABHom=1.00;AC=3;AF=0.273;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_adv_raw.vcf GT:AD:DP:GQ:PL:RGQ . . . . . . . . . . . . . . . . . . . .:0:0:.:.:0 1:0,2:.:66:66,0 0:2:2:.:.:89 1:0,1:.:45:45,0 0:2:2:.:.:84 1:0,1:.:45:45,0 0:2:2:.:.:89 0:0:2:.:.:0 0:2:2:.:.:89 0:0:2:.:.:0 0:0:2:.:.:0 0:0:2:.:.:0 Examples 4-6 649160 . C A 13.22 . ABHom=1.00;AC=1;AF=0.091;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf GT:AD:DP:GQ:PL:RGQ . . . . . . . . . . . . . . . . . .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89 1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71 0:0:2:.:.:0 0:2:2:.:.:44 0:0:2:.:.:0 0:0:1:.:.:0 0:0:2:.:.:0  Command 4 output Examples 1-3 649036 . G A 26.02 . ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hailMary_raw.vcf GT:AD:DP:GQ:PL:RGQ . . . . . . . . . .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89 1:0,1:.:45:45,0 0:2:2:.:.:84 1:0,1:.:45:45,0 0:2:2:.:.:89 1:0,1:.:45:45,0 0:2:2:.:.:89 1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0 Examples 4-6 649160 . C A 13.22 . AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf GT:AD:DP:GQ:PL:RGQ . . . . . . . . . . . . . . . . . . . . .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89 1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71 1:0,0,1:.:37:37,0 0:2:2:.:.:44 1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0  Command 5 output Examples 1-3 649036 . G A 26.02 . ABHom=1.00;AC=7;AF=0.636;AN=11;DP=22;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=2.303;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf GT:AD:DP:GQ:PL ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../. ./. ./. ./. ./. ./. ./. 1:0,2:2:56:56,0 0:.:2 1:0,2:2:99:117,0 0:.:2 1:0,2:2:99:122,0 0:.:2 1:0,2:2:67:67,0 0:.:2 1:0,2:2:99:110,0 1:0,2:2:84:84,0 1:0,2:2:99:127,0 Examples 4-6 649160 . C A 46 . ABHom=1.00;AC=5;AF=0.455;AN=11;DP=21;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf GT:AD:DP:GQ:PL ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../. ./. ./. ./. ./. ./. ./. 0:.:2 0:.:2 1:0,2:2:76:76,0 0:.:2 1:0,2:2:70:70,0 0:.:2 1:0,1:2:37:37,0 0:.:2 1:0,2:2:60:60,0 0:.:1 1:0,2:2:75:75,0  There are many more examples of missed SNP calls. When using the HaplotypeCaller, I'm missing ~23% of the SNP calls. So...what can I do to tweak my variant detection pipeline so that I don't miss so many SNP calls? As I mentioned, I'm currently getting better results with the UnifiedGenotyper walker. I'm only missing about 2% of all Alt SNP calls. Also, about half of that 2% are improperly being genotyped as Ref by Command #5. It appears to me that most of the variant calls I'm missing using the UnifiedGenotyper are at positions where I only have a single Sanger read covering the base, and the base quality score starts to fall below 25 (such as in individual #11 in the first attached screen shot, base quality score was 20). Attached is a second IGV screenshot of a different locus where I've also missed SNP calls using Command 5 (Examples 7-9). I've also included the read details for those positions, as well as the VCF file output from Command 5. I have seen at least one instance where I had two Sanger reads reporting an alternate allele, however, UG did not call the variant. In that case though, the base quality scores in both reads were very low (8); mapping quality was 60 for both reads. Does anyone have any suggestions as to how I might alter any of the parameters to reduce (hopefully eliminate) the missed SNP calls. I think I would accept false positives over false negatives in this case. Or does anyone have any other idea as to what my problem might be? Thanks so much! Matt Maurer Command 5 output for second screen shot file: VCF output The samples shown in the second attached screen shot correspond to the 11th and 12th groupings in the per-sample breakdown of the FORMAT tags. Examples 7-8 163422 . G C 173 . ABHom=1.00;AC=2;AF=0.182;AN=11;DP=14;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf GT:AD:DP:GQ:PL 0:.:1 1:0,4:4:99:203,0 0:.:1 0:.:1 0:.:1 0:.:1 0:.:1 1:0,1:1:54:54,0 0:.:1 ./. 0:.:1 0:.:1 ./. ./. ./. ./. ./. ./. ./. ./. ./../. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. Example 9 163476 . A G 173 . ABHom=1.00;AC=2;AF=0.167;AN=12;DP=15;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf GT:AD:DP:GQ:PL 0:.:1 1:0,4:4:99:203,0 0:.:1 0:.:1 0:.:1 0:.:1 0:.:1 1:0,1:1:57:57,0 0:.:1 0:.:1 0:.:1 0:.:1 . . . . . . . . . . .  Also, why are the GT's sometimes "./." as they are for site163422, and sometimes "." as they are for site 163476? Read Details: Example#7 Read name = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11 Sample = qHZT-08a-s02_r2657_p4094_dJ-011 Read group = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11 ---------------------- Location = 163,422 Alignment start = 163,293 (+) Cigar = 34S833M1D72M1I50M1I9M Mapped = yes Mapping quality = 60 Secondary = no Supplementary = no Duplicate = no Failed QC = no ---------------------- Base = C Base phred quality = 23 ---------------------- RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11 NM = 20 AS = 858 XS = 0 -------------------  Example 8 Read name = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11 Sample = qHZT-08a-s02_r2657_p4094_dJ-011 Read group = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11 ---------------------- Location = 163,476 Alignment start = 163,293 (+) Cigar = 34S833M1D72M1I50M1I9M Mapped = yes Mapping quality = 60 Secondary = no Supplementary = no Duplicate = no Failed QC = no ---------------------- Base = G Base phred quality = 15 ---------------------- RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11 NM = 20 AS = 858 XS = 0 -------------------  Example #9 Read name = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12 Sample = qHZT-08a-s02_r2657_p4094_dJ-012 Read group = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12 ---------------------- Location = 163,422 Alignment start = 163,329 (+) Cigar = 67S16M1D181M1D634M1D9M1I8M1I62M1D17M4S Mapped = yes Mapping quality = 60 Secondary = no Supplementary = no Duplicate = no Failed QC = no ---------------------- Base = C Base phred quality = 18 ---------------------- RG = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12 NM = 87 AS = 480 XS = 0 -------------------  Created 2015-09-14 16:27:46 | Updated | Tags: haplotypecaller vcf bam rnaseq Hello, I'm using GATK to call variants in my RNA-Seq data. I'm noticing something strange, perhaps someone can help? For a number of sites the VCF is reporting things I cannot replicate from BAMs. How can I recover the reads that contribute to a variant call? Here is an example for 1 site in 1 sample, but I've observed this at many sites/samples: $ grep 235068463 file.vcf chr1 235068463 . T C 1795.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.530;ClippingRankSum=-0.535;DP=60;FS=7.844;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.401;QD=29.93;ReadPosRankSum=3.557 GT:AD:DP:GQ:PL 0/1:5,55:60:44:1824,0,44 60 reads, 5 T, 55 C.
But loading the bam in IGV, I do not see any T reads. Similarly: samtools view -uh file.md.realn.bam chr1:235068463-235068463 |samtools mpileup - |grep 235068463 [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 chr1 235068463 N 60 cCCccccCCCcccccCcccccccccCCCccCCCCCcCcccccCCCcCcCCccCCCCccCC >CA@B@>A>BA@BCABACCC:@@ACABBBCAACBBCABCB@CABBAB?>A?CBBAAAABA There are just 60 C's at that location. How do I decide what the genotype here is? C/C or C/T ? For methodology I'm using gatk/3.2.0. I tried using HC from gatk/3.3.1 and got the same result. The bam and vcf files come from the final two lines: -2 pass STAR -Mark Dups -SplitNCigarReads -RealignerTargetCreator -IndelRealigner -BaseRecalibrator -PrintReads -MergeSamFiles.jar -Mark Dups -RealignerTargetCreator -IndelRealigner -HaplotyeCaller Thanks, Kipp Created 2015-09-14 10:00:46 | Updated | Tags: haplotypecaller indels Hi all, I have the below INDEL call from GATK-3.3 Haplotype caller. chr17 39190954 . G GCAGCAGCTTGGCTGGCAGCAGCTGGTCTCA 770.52 PASS AC=1;AF=0.500;AN=2;DP=138; FS=0.000;MLEAC=1;MLEAF=0.500;MQ=58.33;MQ0=0;QD=5.58;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:0:7:807,0,7  The command used: java -Xmx10G -jar GenomeAnalysisTK.jar -R %s -T HaplotypeCaller -I %s -L %s -stand_emit_conf 10 -stand_call_conf 30 --genotyping_mode DISCOVERY -o %s  DP in the INFO field is 138 and AD from the FORMAT field is 0,0. I understand that DP and AD are unfiltered and filtered depths. However, having 0 reads is something alarming. Could someone help me to understand the differing read depths. Created 2015-09-11 12:29:24 | Updated | Tags: haplotypecaller gq-pl genotypegvcf rgq I work with non-human genomes and commonly need the confidence of the reference sites, so I was happy to see the inclusion of the RGQ score in the format field of GenotypeGVCFs. However, I am a little confused as to what this score means (how it is calculated). Out of curiosity I plotted the distribution of RGQ and GQ scores over ~1Mbp. A few things jumped out that I was hoping you could explain: (1) There are two peaks of GQ and RGQ scores, one at 99 - which is obviously just the highest confidence score and another at exactly GQ/RGQ=45. You can see this in the GQ/RGQ distribution below. I've excluded the sites where RGQ/GQ = 0 or 99 (RGQ = blue, GQ=red) is there some reason why so many GT calls == 45? (2) There are very few GQ = 0 calls and ~96% are GQ=99 - but in the RGQ ~42% == 0 and 54%=99. Is there any explanation why so many RGQ scores == 0? I fear that filtering on RGQ will bias the data against reference calls and include a disproportionate number of variant calls. Created 2015-09-11 10:31:16 | Updated | Tags: haplotypecaller gvcf I came across some unusual variants called by HaplotypeCaller running in gvcf mode while working on human WGS data (the example gvcf line can be seen below). The genotype in almost all samples is undefined i.e. "./.", despite the good coverage reported in DP field (only one sample is identified as 0/1). Moreover, in "./." genotyped samples all reads fall into reference allele group of AD field, therefore I would anticipate "0/0" genotype rather than "./.". I have also inspected several bam files visually and did not find any obvious mapping problems. I have attached two IGV snapshots of the variant region: first is from an example "./." genotyped patient and second one is from the only patient with variant. The region seems to have good 25-30x coverage with majority of mapping qualities equal to 60. However, apparently there is some other insertion nearby. The GATK version I am using is 2015.1-3.4.0-1-ga5ca3fc and reference genome is GRCh38. Could you please explain why the inferred genotype is "./." instead of "0/0" ? Best, Ewa chr1 100474610 rs568102277 T TG 358.91 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.54;ClippingRankSum=0.419;DB;DP=4026;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=1.36;QD=13.29;ReadPosRankSum=0.814;SOR=0.551 GT:AD:DP:GQ:PGT:PID:PL ./.:36,0:36 ./.:37,0:37 ./.:33,0:33 ./.:30,0:30 ./.:36,0:36 ./.:32,0:32 ./.:36,0:36 ./.:32,0:32 ./.:31,0:31 ./.:37,0:37 ./.:27,0:27 ./.:34,0:34 ./.:38,0:38 ./.:28,0:28 ./.:29,0:29 ./.:31,0:31 ./.:25,0:25 ./.:24,0:24 ./.:19,0:19 ./.:41,0:41 ./.:24,0:24 ./.:27,0:27 ./.:26,0:26 ./.:28,0:28 ./.:31,0:31 ./.:38,0:38 ./.:27,0:27 ./.:22,0:22 ./.:31,0:31 ./.:27,0:27 ./.:29,0:29 ./.:28,0:28 ./.:34,0:34 ./.:20,0:20 ./.:26,0:26 ./.:33,0:33 ./.:26,0:26 ./.:26,0:26 ./.:31,0:31 ./.:32,0:32 ./.:34,0:34 ./.:27,0:27 ./.:28,0:28 ./.:37,0:37 ./.:38,0:38 ./.:25,0:25 ./.:31,0:31 ./.:37,0:37 ./.:31,0:31 ./.:32,0:32 ./.:30,0:30 ./.:38,0:38 ./.:36,0:36 ./.:32,0:32 ./.:40,0:40 ./.:32,0:32 ./.:42,0:42 ./.:37,0:37 ./.:29,0:29 ./.:42,0:42 ./.:31,0:31 ./.:36,0:36 ./.:35,0:35 ./.:31,0:31 ./.:35,0:35 ./.:32,0:32 ./.:30,0:30 ./.:30,0:30 ./.:36,0:36 ./.:34,0:34 ./.:28,0:28 ./.:37,0:37 ./.:34,0:34 ./.:24,0:24 ./.:31,0:31 ./.:33,0:33 ./.:36,0:36 ./.:37,0:37 ./.:48,0:48 ./.:25,0:25 ./.:39,0:39 ./.:26,0:26 ./.:23,0:23 ./.:39,0:39 ./.:29,0:29 ./.:33,0:33 ./.:37,0:37 ./.:27,0:27 ./.:29,0:29 ./.:42,0:42 ./.:28,0:28 ./.:29,0:29 ./.:30,0:30 ./.:39,0:39 ./.:39,0:39 ./.:35,0:35 ./.:31,0:31 ./.:29,0:29 ./.:23,0:23 ./.:30,0:30 ./.:24,0:24 ./.:29,0:29 ./.:26,0:26 ./.:19,0:19 ./.:26,0:26 ./.:16,0:16 ./.:27,0:27 ./.:24,0:24 ./.:34,0:34 ./.:28,0:28 ./.:41,0:41 ./.:41,0:41 ./.:39,0:39 ./.:24,0:24 0/1:11,16:27:99:1|0:100474609_G_GT:381,0,245 ./.:36,0:36 ./.:26,0:26 ./.:27,0:27 ./.:29,0:29 ./.:29,0:29 ./.:28,0:28 ./.:24,0:24 ./.:19,0:19 ./.:31,0:31 ./.:33,0:33 ./.:23,0:23 ./.:25,0:25 ./.:31,0:31 ./.:34,0:34 ./.:26,0:26 Created 2015-08-31 03:14:54 | Updated 2015-08-31 03:16:12 | Tags: haplotypecaller Hi, I'm hoping you can help resolve the behaviour of HaplotypeCaller with respect to a certain position. Here's the IGV screenshot, with these filters: MQ>30, filter secondaries and dups. The DP is 14-18 across this deletion. HC called a TATA deletion in this proband, with this gvcf call: 5 67597220 rs71655141 GTATA G,<NON_REF> 95.14 . DB;DP=12;MLEAC=2,0;MLEAF=1.00,0.00;MQ=57.93;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:10:132,10,0,132,10,132:0,0,1,2 It's calling this as GT=1/1 with AD=0,3. Clearly this is likely all noise, and a tough region of the genome to make a call in, but i'm curious why the depth is 3, how HC handles the multiple overlapping deletions - ie how it only makes the delTATA call. I'm using GATK 3.3, and following best practices. cheers, Mark Created 2015-08-30 22:27:03 | Updated 2015-08-30 22:30:48 | Tags: haplotypecaller ploidy pooled-calls Hello everyone, I was reading the haplotype caller documentation and noticed the "--sample_ploidy/-ploidy" flag. The description reads "Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy)." My question is, what exactly is a pooled experiment? Is it when I have multiple samples? I have separate files for each of my 8 samples and the organism only has one chromosome. So would the number I set be 8*1? Or is this pooled number for multiple samples within a file, and in which case, I would specify 1 instead of 8. Thanks! Raymosrunerx Created 2015-08-27 18:54:14 | Updated 2015-08-27 18:54:44 | Tags: haplotypecaller bug Haplotype Caller output this record, how can it have an AD of 0,0? 7 21584892 . T TAA 257.77 . AC=1;AF=0.500;AN=2;DP=118;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=70.00;SOR=4.804 GT:AD:GQ:PL 0/1:0,0:99:125,0,112 GATK version is 3.4-46 Created 2015-08-26 01:25:39 | Updated | Tags: haplotypecaller ploidy runtime-error Greetings! I am running HaplotypeCaller on a haploid data set. It works perfectly leaving the default ploidy at 2, but I get the following error when I set it to 1: INFO 21:05:57,644 HelpFormatter - -------------------------------------------------------------------------------- INFO 21:05:57,648 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03 INFO 21:05:57,648 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 21:05:57,651 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 21:05:57,658 HelpFormatter - Program Args: -T HaplotypeCaller -R Ref.fa -I aln-pe_RG_sorted_dedup_realigned.bam -o aln-pe_RG_sorted_dedup_realigned_HC_ploidy1.vcf -ploidy 1 -maxAltAlleles 2 INFO 21:05:57,666 HelpFormatter - Executing as mzilvers@f2c06 on Linux 2.6.32-504.8.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_75-mockbuild_2015_01_20_23_39-b00. INFO 21:05:57,667 HelpFormatter - Date/Time: 2015/08/25 21:05:57 INFO 21:05:57,667 HelpFormatter - -------------------------------------------------------------------------------- INFO 21:05:57,668 HelpFormatter - -------------------------------------------------------------------------------- INFO 21:05:57,913 GenomeAnalysisEngine - Strictness is SILENT INFO 21:05:58,027 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 21:05:58,040 SAMDataSourceSAMReaders - Initializing SAMRecords in serial INFO 21:05:58,067 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 INFO 21:05:58,079 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 21:05:58,213 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 21:05:58,302 GenomeAnalysisEngine - Done preparing for traversal INFO 21:05:58,303 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 21:05:58,305 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 21:05:58,306 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime INFO 21:05:58,520 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units INFO 21:05:58,828 AFCalcFactory - Requested ploidy 1 maxAltAlleles 2 not supported by requested model EXACT_INDEPENDENT looking for an option INFO 21:05:58,829 AFCalcFactory - Selecting model EXACT_GENERAL_PLOIDY INFO 21:06:28,312 ProgressMeter - Starting 0.0 30.0 s 49.6 w 100.0% 30.0 s 0.0 s ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.NullPointerException at org.broadinstitute.gatk.tools.walkers.genotyper.afcalc.GeneralPloidyExactAFCalc.assignGenotype(GeneralPloidyExactAFCalc.java:559) at org.broadinstitute.gatk.tools.walkers.genotyper.afcalc.GeneralPloidyExactAFCalc.subsetAlleles(GeneralPloidyExactAFCalc.java:527) at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:286) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:335) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:320) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:310) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isActive(HaplotypeCaller.java:844) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.addIsActiveResult(TraverseActiveRegions.java:618) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.access$800(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegionsActiveRegionIterator.hasNext(TraverseActiveRegions.java:378) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee): Any help is appreciated! Created 2015-08-24 09:35:24 | Updated | Tags: unifiedgenotyper vqsr bqsr haplotypecaller I'm a bit uncertain as to the optimal pipeline for calling variants. I've sequenced a population sample of ~200 at high coverage ~30X, with no prior information on nucleotide variation. The most rigorous pipeline would seem to be: 1. Call variants with UG on 'raw' (realigned) bams. 2. Extract out high-confidence variants (high QUAL, high DP, not near indels or repeats, high MAF) 3. Perform BQSR using the high-confidence variants. 4. Call variants with HaplotypeCaller on recalibrated bams. 5. Perform VQSR using high-confidence variants. 6. Any other hard filters. Is this excessive? Does using HaplotypeCaller negate the use of *QSR? Is it worthwhile performing VQSR if BQSR hasn't been done? Otherwise I'm just running HaplotyperCaller on un-recalibrated bams, and then hard-filtering. Created 2015-08-19 12:35:40 | Updated 2015-08-19 12:37:10 | Tags: haplotypecaller Hi there, I've been playing around with RNA-seq data and came across what seems to be a high quality single nucleotide variant but that isn't detected by HaplotypeCaller. I followed the Calling variants in RNAseq method, except for the alignment which was carried out by tophat2 (and looks good). Then I used MarkDuplicates, ReorderSam, SplitNCigarReads, BaseRecalibrator, PrintReads and finally HaplotypeCaller. When I look at the reads that come out of the PrintReads steps with IGV, I can see a lot of high quality reads with the variant sequence (see attachment). However the HaplotypeCaller doesn't report this site as variant in the VCF output. I first thought it was filtered but setting stand_call_conf and stand_emit_conf to 0.0, and adding EMIT_ALL_SITES doesn't make a difference on this site (although I can see lots of low quality sites appear as expected). java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -I Cet1_2.OS9_rg.bam -dontUseSoftClippedBases -stand_call_conf 0.0 -stand_emit_conf 0.0 -o Cet1_2.OS9_all_sites.vcf -R Homo_sapiens.GRCh38.dna.primary_assembly.fa -out_mode EMIT_ALL_SITES This site is called in a replicate experiment (same cell line, different sample prep) and is assigned a very high quality (QD = 25.34, DP=1298). The two samples look very similar and I can't understand what's going on and why the variant isn't called? Any idea on how to tackle this issue would be welcome. Created 2015-08-18 20:05:07 | Updated | Tags: haplotypecaller gvcf wgs Hi, I am doing gVCF calls for whole genome samples and I would notice that the gvcf-calling jobs for some of the samples would fail at random genomic locations and if I resubmit those failed jobs, they would either finish successfully or fail again at a different genomic location ('genomic location' info from "ProgressMeter" line inside logs). • I am doing one gVCF job per WGS sample. Right now there are more than 70% of jobs that are failing. Is there anything that should be changed on the parameters? • Do you have something like a SOP for best practises on doing HaplotypeCaller calling for WGS samples? I understand the process is very similar to exome sequencing gVCF calling but somehow I see many more job failures with gVCF calling on WGS samples. I am using the following parameters for gVCF call: sh java -Xmx128g -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit -jar GenomeAnalysisTK.jar -T HaplotypeCaller -I file.bam -nct 8 -R human_g1k_v37.fasta -o /ttemp/file.g.vcf -L b37_wgs.intervals —emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -dcov 250 -minPruning 3 -stand_call_conf 30 -stand_emit_conf 30 -G Standard -A AlleleBalance -A Coverage -A HomopolymerRun -A QualByDepth Compute: One full node (“256GB RAM, 20 cores” per node) per single sample WGS gvcf job. GATK version being used is "3.1” P.S. I am also testing out the latest version of GATK (3.4) without “-dcov” option to see if that resolves the issue. Thanks, Shalabh Created 2015-08-18 13:57:14 | Updated | Tags: haplotypecaller cohort multiple-bam multiple-samples Hi GATK team, I was calling variants individually to generate gVCF files for each sample. Then I thought maybe I should also try using multiple samples as input from the ‘RealignerTargetCreator ‘ step to generate a joint bam file that contained alignment from different samples (I used different SM tag for each sample). I paste my command below. So in the ‘HaplotypeCaller’ step, I got the message ‘emitRefConfidence has a bad value’ because my bam file is mixed with different SM tag. I suppose I shouldn't change SM tags into the same, or I will not be able to identify what variants are from which individual. I think that I misunderstood the meaning of cohort calling from the beginning. So I want to clarify two points to see if I am understanding correctly now. 1.The’ HaplotypeCaller’ will only use the SM tag information to identify different individuals. Other tags like ID, PL, LB will not be considered. If all samples from different individuals have the same SM tag,HC will treat it as from one individual. Is it correct? 2. My target is to find variants between individuals, rather than in all individuals. Then I should call variant one sample per time and run ‘GenotypeGVCFs’ to join all samples together before hard filtering. Do I still misunderstand something? Thanks. The following is the command line I used. java -Xmx32g -jarGATK_JARS/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R ucsc.hg19.fasta \ -I aligned_TKSAHB.dedup.sorted.bam \ -I aligned_TKSAHV.dedup.sorted.bam \ -I aligned_TKSASA.dedup.sorted.bam \ -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -known 1000G_phase1.indels.hg19.sites.vcf \ -o target_interval_TKSA.list \ && java -Xmx32g -jar GATK_JARS/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R ucsc.hg19.fasta \ -I aligned_TKSAHB.dedup.sorted.bam \ -I aligned_TKSAHV.dedup.sorted.bam \ -I aligned_TKSASA.dedup.sorted.bam \ -targetIntervals target_interval_TKSA.list \ -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -known 1000G_phase1.indels.hg19.sites.vcf \ -o realigned_TKSA.dedup.sorted.bam Created 2015-08-18 12:42:10 | Updated | Tags: haplotypecaller variantfiltration Hi I am new to using HC. I am just using the HapMap sample NA12878 to validate my pipeline. I ran the following command on a BAM file that I generated. I noticed that certain SNPs were filtered out most of which I can understand the rationale. java -Xms454m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/home/tx/tmpYdAKAC -jar /usr/local/share/gatk/3.4-46-gbc02625/GenomeAnalysisTK.jar \ -R /usr/local/share/genomes/Hsapiens/GRCh37/seq/GRCh37.fa \ -I /home/INDELanalysis/validationrun/NA12878-sort-11_0_15994663-prep.bam \ --dbsnp /usr/local/share/genomes/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz \ -L 11:1628982-1651500 \ -T HaplotypeCaller \ -o /home/INDELanalysis/validationrun/NA12878-11_0_15994663-raw.vcf.gz I ran the same analysis using Samtools and the following SNP was in the final filtered output Samtools output 11 1651228 rs71454096 C G 117 PASS AC=1;AN=2;BQB=0.61207;BaseQRankSum=-0.458;DB;DP=21;DP4=11,0,5,3;FS=7.225;GC=71.29;HOB=0.5;HRun=3;ICB=1;MQ=56.97;MQ0=0;MQ0F=0;MQB=0.395294;MQRankSum=-1.091;MQSB=0.798516;QD=5.57;RPB=0.902014;ReadPosRankSum=-1.162;SGB=-0.651104;VDB=0.0103095;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCg/gGg|A53G|237|KRTAP5-5|protein_coding|CODING|ENST00000399676|1|G) GT:DP:PL 0/1:19:150,0,158 However when I run it through HC with the above settings the above variant is filtered out of the analysis. When I ran HC with--emitRefConfidence GVCF I got the following output for this variant: HC output 11 1651228 rs71454096 C G,<NON_REF> 0 . DB;DP=20;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.22 GT:AD:DP:GQ:PGT:PID:PL:SB 0/0:10,0,0:10:34:0|1:1651120_G_T:0,34,442,34,442,442:10,0,0,0 I am a little confused why it got a QUAL score of 0. So looking nearby I also noticed that there were some sizeable INDELS 11 1651198 . GAGGCTGTGGGGGCTGTGGCTCCGGCTGTGC G,<NON_REF> 0 . DP=25;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.38 GT:AD:DP:GQ:PL:SB 0/0:14,0,0:14:37:0,37,585,42,589,594:12,2,0,0 11 1651199 rs71025763 A AGGCTGTGGCTCC,<NON_REF> 0 . DB;DP=18;MLEAC=0,0;MLEAF=0.00,0.00;MQ=57.74 GT:AD:DP:GQ:PL:SB 0/0:6,0,0:6:28:0,28,332,28,332,332:5,1,0,0 However neither of these INDELS were called either. Can anyone shed some light as to why nothing is called in this region? If I look at the site 11 1651228 in the bam file I can see that most of the calls were made on the forward strand so there is likely to be some strand bias here and in actual fact we can see that HC called the genotype 0/0 with only 10 reads contributing to this genotype call. I presume the remaining reads were filtered out. However when I look at the pileup for this position I don't understand why over half the reads were filtered 11 1651228 C CCCCGGGGCCCGGCCCGGCGG BBFIFB<FFIIFFFFFFFBBB Any advice help would be much appreciated. Created 2015-08-17 20:48:32 | Updated | Tags: haplotypecaller sureselect range I am preparing to run the Best Practices variant calling pipeline on a large set of samples. These samples were captured with several technologies, some with SureSelect, some with Nextera, and some with HaloPlex. When running the HaplotypeCaller on the GVCFs, do you have any recommendations for how to handle the overlapping and non-overlapping regions between these capture technologies? Option A is to set the range to only those regions covered by all technologies. I see this as the safest, most conservative. However, this abandons potential data by limiting to the smallest set of capture ranges. Option B is to set the range to the largest capture ranges. Basically, analyze across the largest range captured by any technology. My question here is whether the statistics used by HC, and then later VQSR are going to be affected by having regions that are of differing coverage across sets of samples. For example given Samples 1, 2, and 3, captured by SureSelect, Nextera, and HaloPlex, respectively. And given regions A, B and C, where A is SureSelect unique, B is across all platforms and C is HaloPlex unique. Will it be a problem that Sample1 has coverage in A-B, Sample2 has coverage only in B, and Sample3 has coverage in B-C? Will the statistics run properly for variants falling in A, B and C? Option C is the most annoying. I calculate the Venn diagram of the capture ranges and turn the crank on the best practices once for each Venn cell, then combine the VCFs after VQSR filtering. Basically, with SS, N, HP ranges I run the best practices for the overlapping ranges in: 1. SS+N+HP (same as option A ranges) 2. SS+N 3. SS+NP 4. N+HP 5. SS 6. N 7. HP One of my concerns with this method is that for the singleton ranges the sample numbers X capture range may get too small for VQSR to be effective. What are your thoughts on how to handle this? Thanks Alex Created 2015-08-13 15:47:04 | Updated | Tags: indelrealigner haplotypecaller If HaplotypeCaller re-assembles all reads in a region Why is it recommend to run IndelRealigner first? Created 2015-08-11 18:52:55 | Updated | Tags: haplotypecaller gvcf Dear GATK team, I wish to get gVCF files for each data set. But I am not sure if I should still use --output_mode EMIT_ALL_SITES argument in my command lines. In your previous thread, I found you mentioned that "HaplotypeCaller used to have that option, but it was removed when we introduced the reference model (gVCF) option. Have a look at the documentation that explains this here: http://www.broadinstitute.org/gatk/guide/article?id=2940". I clicked in the link. But the link was not accessible. So I wish to confirm if I am using the right arguments in my command. Here is my command. I have removed the --output_mode option. Will that be all right? java -Xmx12g -jarGATK_JARS/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R ucsc.hg19.fasta \ -I sample1.realigned.dedup.sorted.bam \ --genotyping_mode DISCOVERY \ -stand_emit_conf 10 \ -stand_call_conf 20 \ --emitRefConfidence GVCF \ --variant_index_type LINEAR \ --variant_index_parameter 128000 \ -o raw_var_sample1.g.vcf

Created 2015-08-11 14:01:44 | Updated | Tags: haplotypecaller softclip soft-clipping

Dear GATK team, From what I read on the forum, it seems that HaplotypeCaller call variants (and perhaps calculates RGQ) using soft-clipped parts of the reads? Is there a parameter to suppress this behavior, making it throw out the soft-clipped proportions before calling? Best regards, Ray

Created 2015-08-10 18:17:31 | Updated | Tags: haplotypecaller vcf-format combinegvcfs

Hi,

I've a problem with a VCF when combiningGVCFs, gatk/v3.4. "headerKey END found in VariantContext field INFO at chr2L:64003 but this key isn't defined in the VCFHeader"

HaplotypeCaller -nct 30 --analysis_type HaplotypeCaller --reference_sequence ../../reference_sequences/dmel/v6.0/dm6.fa \ --input_file ../../read_mapping/reference_line/RGnb/RGnb.bam \ --variant_index_type LINEAR --variant_index_parameter 128000 -mbq 10 -L chr2L \ --out ../../read_mapping/reference_line/RGnb/RGnb.g.vcf

VCF CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RGnb chr2L 43265 . A G 2147.77 . AC=2;AF=1.00;AN=2;DP=54;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.59;SOR=1.118 GT:AD:DP:GQ:PL 1/1:0,54:54:99:2176,162,0 chr2L 64003 . GT G 57.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.987;ClippingRankSum=-0.189;DP=45;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.85;MQRankSum=0.525;QD=1.28;ReadPosRankSum=1.149;SOR=0.622 GT:AD:DP:GQ:PL 0/1:25,8:33:95:95,0,521

Combine three gvcfs --analysis_type CombineGVCFs --reference_sequence ../../reference_sequences/dmel/v6.0/dm6.fa \ --variant rg_GVCF.list --out ./RG.g.vcf

Error message: java.lang.IllegalStateException: Key END found in VariantContext field INFO at chr2L:64003 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF

Could anyone advise me on how to fix this?

Sincerely,

Blue

Created 2015-08-10 16:15:23 | Updated | Tags: haplotypecaller

Hello,

I am having an issue with haplotypecaller omitting true heterozygotes. Attached is an IGV image of the VCF (top track), de novo reassembled BAM file (middle) and input BAM file (bottom). This appears to be happening all over. I was wondering what I can do to address this issue.

commandline:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I sample.realigned.marked.sorted.bqsr.unique.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o sample.hapcal.raw.vcf -nct 12

p.s. I get the same results when running without multiple threads and when outputting the rearranged BAM file used for variant calling.

Created 2015-08-10 00:02:58 | Updated 2015-08-10 00:14:35 | Tags: haplotypecaller phasing beagle read-backed-phasing

Hi,

I have been using HaplotypeCaller 3.4 on five hundred cattle genomes. I am wondering how to pass the physical phasing information, now generated by Haplotype Caller in N+1 mode, through GenotypeGVCF applying pedigree and then out to Beagle 4.0 as a vcf.gz for imputation. My goal is to make a phased reference that is as accurate as possible to be used as an imputation resource. hence I would like to exploit physical phasing information.

Do you have an example work flow? It seems that the recommendations for read-backed phasing have changed since haplotype caller 3.3 came up with the N+1 workflow.

Thanks

Mike

Created 2015-08-07 14:25:49 | Updated | Tags: depthofcoverage haplotypecaller dp solid igv

Hello Everyone!

I'm using the whole GATK workflow to analyze Target Resequencing data coming from SOLID platforms. I followed the Best Practices for analysis and used the proper SOLID flags when using BaseRecalibrator (--solid_recal_mode SET_Q_ZERO_BASE_N --solid_nocall_strategy PURGE_READ), however, when looking at the VCF files after Haplotype Caller something does not add up.

I checked some of the variants inside some of my samples and i found that the DP field does not report the same per base coverage value than the one that are reported by the bam (using the --bamOutput to produce a bam for Haplotype Caller) when looking at them using the IGV. As far as I understand, for each position there's a downsampling, but I'm see a lower DP value compared to the ones that are stored in the BAM I'm attaching an IGV screenshots of one of the variants in which i'm encountering this problem. I deactivated all filtering alignment options in IGV, as well as downsampling. Here's the line Reported in the VCF for this variant:

As you can see from the screenshot, not only the covers differ, but a lot of reads that maps according to the reference are missing- Does somebody has an idea of what happened to the coverage inside the VCF?

Thanks a lot for your time!

Daniele

Created 2015-08-04 23:35:08 | Updated 2015-08-04 23:36:54 | Tags: haplotypecaller vcf genotypegvcfs

It appears that there are some cases in which an upstream site is heterozygous for a deletion, and a downstream site should be heterozygous for something like 1/2 (where 2 is the * allele), but GATK is still making erroneous 0/1 calls. For instance, in my dataset I have a deletion at position 773 (I'm leaving out some fields and GT info for additional samples for brevity).

scaffold_1 773 . CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA C ... 0/1:102,70:172:99:0|1:773_CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA_C:283 9,0,4099

And just 2 bases downstream, there's a SNP for the same sample that has non-* calls on both strands, which seems impossible. Shouldn't there be a * allele here, e.g (ALT==G,* ; GT==1/2)? Here's the call: scaffold_1 775 . A G ... 0/1:60,134:194:99:4507,0,1642

Could this be a bug, such that when there's a heterozygous ALT call on the strand without the deletion, the half of the GT field for deleted strand is still getting a reference (0) call when it should be getting a * ? At another site overlapping with the same deletion, at which the strand without the deletion for this sample is a REF call , I get something that's more like what I'd expect.

scaffold_1 805 . G T,* ... 0/2:235,0,33:268:99:0|1:773_CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA_C:676,1384,11251,0,9868,9762

This was done with GenotypeGVCFs from GATK 3.4-46.

Created 2015-08-04 16:25:30 | Updated | Tags: haplotypecaller

Hi, so I have some data that I ran through both Unified Genotype AND Haplotype Caller (because I wanted to see the difference in the number and quality of SNPs returned). I have merged both vcf files (using a simple awk command) to produce one that contains concordant SNPs between the two. However, I am now about to filter on GQ and DP except that I don't have a DP field in the file containing the SNPs that both programs called. Neither does it appear in the Haplotype Caller file. It DOES appear in the Unified Genotyper file though. I really would like to continue using my concordant file: is there a way that I can force the addition of the DP field into the Haplotype Caller vcf? I have the AD field showing the read numbers for both alleles, and the DP field is included in the INFO column but I can't filter depth at present so I'd really appreciate some guidance with this. PS. I'm a newbie to bioinformatics processing and computing. Thanks for any help given.

Created 2015-07-31 22:11:45 | Updated | Tags: commandlinegatk haplotypecaller gatk error

Hello,

I am receiving the following error. I am working with SAM files that were exported from CLC, then edited with Picard-tools to addReadGroups. I am not sure if I need to add an additional step to solve this problem, I cannot find any documentation regarding this error.

Please let me know what I need to do to correct this issue.

Thank you!

gatk -T HaplotypeCaller -R spinach_assembly-repeatdetect_PACBIO_V1.3_formated_60.fa -I .sam.list -drf DuplicateRead --alleles Unfiltered_Spinach_PacBio_Reseq_12_Geno_Assay_SNP.fixed.noblanks.vcf --genotyping_mode GENOTYPE_GIVEN_ALLELES --output_mode EMIT_ALL_SITES -o output_raw_unfiltered_spinach_snps_gbs.vcf INFO 14:48:44,450 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:48:44,453 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 INFO 14:48:44,454 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 14:48:44,454 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 14:48:44,458 HelpFormatter - Program Args: -T HaplotypeCaller -R spinach_assembly-repeatdetect_PACBIO_V1.3_formated_60.fa -I .sam.list -drf DuplicateRead --alleles Unfiltered_Spinach_PacBio_Reseq_12_Geno_Assay_SNP.fixed.noblanks.vcf --genotyping_mode GENOTYPE_GIVEN_ALLELES --output_mode EMIT_ALL_SITES -o output_raw_unfiltered_spinach_snps_gbs.vcf INFO 14:48:44,468 HelpFormatter - Executing as ahulse@jalapeno.genomecenter.ucdavis.edu on Linux 2.6.18-348.12.1.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_05-b13. INFO 14:48:44,469 HelpFormatter - Date/Time: 2015/07/31 14:48:44 INFO 14:48:44,469 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:48:44,470 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:48:45,102 GenomeAnalysisEngine - Strictness is SILENT INFO 14:48:45,385 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 14:48:45,394 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 14:48:48,432 SAMDataSource$SAMReaders - Init 50 BAMs in last 3.04 s, 50 of 80 in 3.04 s / 0.05 m (16.46 tasks/s). 30 remaining with est. completion in 1.82 s / 0.03 m INFO 14:48:50,052 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 4.66 INFO 14:48:50,164 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 14:48:54,742 RMDTrackBuilder - Writing Tribble index to disk for file /local/scratch/scratch/Amanda/Spinach_GBS/Unfiltered_Spinach_PacBio_Reseq_12_Geno_Assay_SNP.fixed.noblanks.vcf.idx INFO 14:48:58,784 GenomeAnalysisEngine - Preparing for traversal over 80 BAM files INFO 14:49:00,054 GATKRunReport - Uploaded run statistics report to AWS S3  ERROR ------------------------------------------------------------------------------------------ ERROR A BAM ERROR has occurred (version 3.4-46-gbc02625): ERROR ERROR This means that there is something wrong with the BAM file(s) you provided. ERROR The error message below tells you what is the problem. ERROR ERROR Visit our website and forum for extensive documentation and answers to ERROR commonly asked questions http://www.broadinstitute.org/gatk ERROR ERROR Please do NOT post this error to the GATK forum until you have followed these instructions: ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details) ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index' ERROR ERROR MESSAGE: Cannot retrieve file pointers within SAM text files.  ##### ERROR ------------------------------------------------------------------------------------------ Created 2015-07-31 00:11:55 | Updated | Tags: haplotypecaller kmer bamout Hello, I am using HaplotypeCaller in order to get haplotype sequences from individual samples (several samples per species) for gene tree/species tree analysis. The reads are from an exome capture experiment. Because I am running individual samples I have limited the max # of haplotypes to 2. However, the default behavior of using two kmer size (10 and 25) results in up to four haplotypes per exon (interval) in the bamout file. I have found that if I supply a kmerSize parameter I get only 2 haplotypes but these differ depending on the kmer I supply. The difference is not only subsetting of the snps found with multiple kmer sizes but distinct snps called with different kmer sizes as well. I would like to run the analysis with multiple kmerSizes specified and have the caller only output the two most likely haplotypes. Is this possible and, if so, how can I do it? Or, am I misunderstanding how the caller works? I think I understand why different kmer sizes would result in different snps called but if anyone could explain it to me I'd love confirmation. Here is my original command line before experimenting with kmer sizes: java -jar /opt/local/NGS/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R /Users/bdorsey/Documents/Dioon/Capture_seqs_assembly/captured_seqs_uniq.fa -I /Volumes/HD2/Capture_assembly/Dioon1/contigs/Dioon1_m1n350r.10x.sp5.bam -L /Volumes/HD2/Capture_assembly/Dioon1/exonsCov10sp5.list --activeRegionIn /Volumes/HD2/Capture_assembly/Dioon1/exonsCov10sp5.list --maxNumHaplotypesInPopulation 2 --minReadsPerAlignmentStart 5 -out_mode EMIT_ALL_SITES -ERC BP_RESOLUTION --forceActive --dontTrimActiveRegions --activeRegionMaxSize 10000 -bamWriterType CALLED_HAPLOTYPES --disableOptimizations -bamout /Volumes/HD2/Capture_assembly/Dioon1/haplo/Dioon1.haplos.bam -o /Volumes/HD2/Capture_assembly/Dioon1/haplo/Dioon1.haplos.g.vcf Thanks very much for any help. Cheers, Brian D Created 2015-07-30 12:26:51 | Updated | Tags: haplotypecaller Hello, I am reading the documentation of HaplotypeCaller in order to use this tool for "Single-sample all-sites calling on DNAseq (for -ERC GVCF cohort analysis workflow)". The table in the middle of the page indicates that the default value of --output_mode is NA. However, in the details below are listed 3 possible values: EMIT_VARIANTS_ONLY, EMIT_ALL_CONFIDENT_SITES and EMIT_ALL_SITES. Which one is used as the default? Sincerely, TF Created 2015-07-28 12:20:30 | Updated | Tags: haplotypecaller vcf ngs variant-calling allele-frequency Hi! As part of a variant calling pipeline i'm interested in lowering the threshold for allele frequency tolerance in GATK's HaplotypeCaller variant caller to 0.01 (1%), if possible. If not, is there another variant calling tool that doesn't fliter out variants with low allele frequencies? Background: I know for a fact that there's at least one variant in my sample that doesn't appear in my VCF file if allele frequencies below 0.1 (10%) are filtered out during the variant calling as it's the default in some programs. I can see the variant when I inspect the corresponding bam file with samtools tview and another lab called that specific variant itself. Thanks in advance, Alon Created 2015-07-24 06:38:11 | Updated 2015-07-24 07:03:39 | Tags: haplotypecaller gatk rice I am working on Rice to call variants using GATK3.3. I have used BWA (bwa mem -M options) to map reads to reference genome. Followed by variant calling (HaplotypeCaller) by following Best Practices of GATK3.3. I have compared the results of 20 genotypes and in all genotypes INDELs are more (2x) as compared to SNPs. Please suggest me on this Thanks Mahesh HB Created 2015-07-23 20:43:09 | Updated | Tags: haplotypecaller phasebytransmission Hi all, Can I use ReadBackedPhasing or some other computational tools to distinguish between paternal and maternal chromosomes/reads based on pair-end DNA sequence data (fastq files)? Thank you. Created 2015-07-23 14:16:25 | Updated | Tags: haplotypecaller heterozygotes homozygote Hi all, Second question of the day, I am sorry. I have been facing an issue that I saw many times on the forum but each time it seemed that you fixed it with new versions. Problem is I am using the very last version of GATK = 3.4 (downloaded yesterday). I am using HaplotypeCaller single sample mode with all parameters by default as indicated in the "DNASeq" example in the HaplotypeCaller documentation page. I have a problem with one variant rs206076 on BRCA2 chromosome 13 position 32915005. I have Sanger validation for this variant in all my samples and Unified Genotyper was calling it correctly. Using HaplotypeCaller, there are some variants where the HOM call is made correctly and with very high quality (22 000) and others where HC calls the position as heterozygote with very low quality when it is obvious from IGV that the position is homozygote. I put in the IGV screenshot on individual that fails (up) and the other that is ok (down). I do have the same problem with another variant close to this one (not shown) whereas a couple of variants in between are called correctly in all samples. Any idea about this issue ? thanks a lot Manon Created 2015-07-23 11:26:42 | Updated | Tags: fisherstrand haplotypecaller downsampling strand-bias Hi GATK team, Again thanks a lot for the wonderful tools you're offering to the community. I have recently switched from UnifiedGenotyper to Haplotype Caller (1 sample at a time, DNASeq). I was planning to use the same hard filtering procedure that I was using previously, including the filter of the variants with FS > 60. However I am facing an issue probably due to the downsampling done by HC. I should have 5000 reads, but DP is around 500/600 which I understood is due to downsampling (even with -dt NONE). I did understand that it does not impact in the calling itself. However it is annoying me for 2 reasons 1) Calculating frequency of the variant using the AD field is not correct (not based on all reads) 2) I get variants with FS >60 whereas when you look at the entire set of reads, there is absolutely no strand bias. Example with this variant chr17 41245466 rs1799949 G A 7441.77 STRAND_BIAS; AC=1;AF=0.500;AN=2;BaseQRankSum=7.576;DB;DP=1042;FS=63.090;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.666;QD=7.14;ReadPosRankSum=-11.896;SOR=5.810 GT:AD:GQ:PL:SB 0/1:575,258:99:7470,0,21182:424,151,254,4 When I observe all reads I have the following counts, well shared on the + and - strands Allele G : 1389 (874+, 515-) Allele A : 1445 (886+, 559-) Could you please tell me how to avoid such an issue ? (By the way, this variant is a true one and should not be filtered out). Thanks a lot. Created 2015-07-23 08:14:30 | Updated | Tags: haplotypecaller gvcf wgs Hello, Does anyone know a rough estimate of the file size of a gvcf produced at BP_RESOLUTION by the HaplotypeCallerfor a whole genome sequencing experiment. Perhaps a rather simple question, but i cannot find it elsewhere on the forum or other places like seqanswers. Thanks in advance, Created 2015-07-22 12:08:20 | Updated 2015-07-22 12:09:10 | Tags: unifiedgenotyper haplotypecaller snp-calling input-prior I'm using HaplotypeCaller (but it could be also possible to use this option with UnifiedGenotyper) for a very special experimental design in a no-human species, where we have an expectation for the prior probabilities of each genotype. I'm planning to call SNPs for single diploid individuals using HaplotypeCaller and afterwards for the whole dataset with GenotypeGVCFs. Nevertheless, I'm confused about the structure of the prior probabilities command line. In the documentation, it says: "Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one. f) If user-defined values add to more than one, an error will be produced". So I'll require to provide two prior probabilities out of the 3 for each genotype (0/0, 0/1 and 1/1). My first guess is that the prior that I don't need to provide is for the reference homozygous (0/0) based on the Pr(AC=0) specified in the documentation. I would like to know if this idea is correct. My second problem if is the two input_prior options are positional parameters. If so, and if my first guess for the Pr(AC=0) is correct, do they represent the probability of 0/1 and 1/1, that is, Pr(AC=1) and Pr(AC=2)? More concretely, I'm going to provide an example where you don't expect any heterozygous call. In that case, is it correct that the argument will be --input_prior 0.5 --input_prior 0? Thank you very much. Created 2015-07-22 11:46:00 | Updated | Tags: combinevariants haplotypecaller best-practices vcf gatk genotypegvcfs combinegvcfs gvcf I was trying to do combine sets of vcf files for all my samples so that I have one single vcf output using this command option below java -d64 -Xmx48g -jar${GATK}/GenomeAnalysisTK.jar \ -R ${REF} \ -T GenotypeGVCFs \ --variant A.g.vcf \ --variant B.g.vcf \ --variant C.g.vcf \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -o genotype.vcf but I got this error message “The following invalid GT allele index was encountered in the file: END=21994810”. I have tried to locate where the problem could be coming from but I do not understand this. Could you please advise me. Created 2015-07-17 22:01:12 | Updated 2015-07-17 22:14:12 | Tags: haplotypecaller ploidy haploid genotypegvcfs combinegvcfs chromosome-x Hi, I'm attempting to run GenotypeGVCFs on a cohort of ~4200 human male samples with targeted sequencing. I'm following the current DNA-Seq guidelines for cohort genotyping, with GATK v3.4-0. For each sample, I ran HaplotypeCaller separately for diploid and haploid (i.e., chrX non-PAR) regions, specifying --ploidy 1 for the haploid regions, then combined the resulting two GVCFs with CombineGVCFs. I then combined the per-sample GVCFs into groups of 64 samples using CombineGVCFs. Finally, I ran GenotypeGVCFs with all samples separately for groups of ~100 small target intervals (baits). Every group of target intervals ran fine without error in about 4 hours with ~5GB of RAM, except for the non-PAR chrX regions, which were haploid for all samples. For the haploid regions, GATK hangs on the very first base, slowly increasing memory usage, then eventually runs out of memory and exits. The estimated runtime keeps increasing without making any progress. The last run exited after 12 hours without making any progress. This happens no matter how much memory I specify (up to 128 GB). Interestingly, a PAR region of chromosome X run with --ploidy 2 in HaplotypeCaller worked with no problem. The inputted GVCF files to GenotypeGVCFs are uncompressed and were indexed by CombineGVCFs. I'm using default settings for GenotypeGVCFs, except for the following:  --standard_min_confidence_threshold_for_calling 20 --standard_min_confidence_threshold_for_emitting 10 --dbsnp dbSNP142.20150416.GRCh37.for-GATK.chr1-MT.vcf.gz  I tried running GenotypeGVCFs with the latest v3.4-46 release, but the same problem occurred. Below is example output: INFO 10:57:51,803 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:57:51,810 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 INFO 10:57:51,811 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 10:57:51,811 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 10:57:51,818 HelpFormatter - Program Args: -T GenotypeGVCFs -R /tmp/12715944.hpc-pbs.hpcc.usc.edu/hs37m.fa --dbsnp dbSNP142.20150416.GRCh37.for-GATK.chr1-MT.vcf.gz --standard_min_confidence_threshold_for_calling 20 --standard_min_confidence_threshold_for_emitting 10 [LONG LIST OF VARIANT FILES OMITTED] --out gatk.hc.combined.genotyped.chunk117.vcf.gz -L split_117.intervals --log_to_file gatk.hc.combined.genotyped.chunk117.log INFO 10:57:51,824 HelpFormatter - Executing as cedlund@hpc1130.m10g.hpcc.usc.edu on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 10:57:51,825 HelpFormatter - Date/Time: 2015/07/17 10:57:51 INFO 10:57:51,826 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:57:51,826 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:57:56,331 GenomeAnalysisEngine - Strictness is SILENT INFO 10:57:56,671 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 10:59:03,550 IntervalUtils - Processing 154370 bp from intervals WARN 10:59:03,615 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation INFO 10:59:03,766 GenomeAnalysisEngine - Preparing for traversal INFO 10:59:03,768 GenomeAnalysisEngine - Done preparing for traversal INFO 10:59:03,768 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:59:03,769 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:59:03,770 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 10:59:04,283 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 10:59:57,328 ProgressMeter - X:51033766 216.0 53.0 s 68.9 h 0.2% 7.2 h 7.2 h INFO 11:00:27,330 ProgressMeter - X:51035366 216.0 83.0 s 4.5 d 1.2% 111.5 m 110.1 m INFO 11:00:57,353 ProgressMeter - X:51035366 216.0 113.0 s 6.1 d 1.2% 2.5 h 2.5 h INFO 11:01:27,354 ProgressMeter - X:51035366 216.0 2.4 m 7.7 d 1.2% 3.2 h 3.2 h INFO 11:01:57,356 ProgressMeter - X:51035366 216.0 2.9 m 9.3 d 1.2% 3.9 h 3.8 h INFO 11:02:27,358 ProgressMeter - X:51035366 216.0 3.4 m 10.9 d 1.2% 4.5 h 4.5 h INFO 11:02:57,882 ProgressMeter - X:51035366 216.0 3.9 m 12.5 d 1.2% 5.2 h 5.2 h INFO 11:03:27,884 ProgressMeter - X:51035366 216.0 4.4 m 14.2 d 1.2% 5.9 h 5.8 h INFO 11:03:57,885 ProgressMeter - X:51035366 216.0 4.9 m 15.8 d 1.2% 6.6 h 6.5 h INFO 11:04:27,887 ProgressMeter - X:51035366 216.0 5.4 m 17.4 d 1.2% 7.3 h 7.2 h INFO 11:04:58,976 ProgressMeter - X:51035366 216.0 5.9 m 19.0 d 1.2% 7.9 h 7.8 h INFO 11:05:28,977 ProgressMeter - X:51035366 216.0 6.4 m 2.9 w 1.2% 8.6 h 8.5 h INFO 11:05:58,979 ProgressMeter - X:51035366 216.0 6.9 m 3.2 w 1.2% 9.3 h 9.2 h INFO 11:06:28,981 ProgressMeter - X:51035366 216.0 7.4 m 3.4 w 1.2% 10.0 h 9.8 h INFO 11:06:58,982 ProgressMeter - X:51035366 216.0 7.9 m 3.6 w 1.2% 10.6 h 10.5 h INFO 11:07:28,984 ProgressMeter - X:51035366 216.0 8.4 m 3.9 w 1.2% 11.3 h 11.2 h INFO 11:07:58,986 ProgressMeter - X:51035366 216.0 8.9 m 4.1 w 1.2% 12.0 h 11.8 h INFO 11:08:30,497 ProgressMeter - X:51035366 216.0 9.4 m 4.3 w 1.2% 12.7 h 12.5 h INFO 11:09:30,568 ProgressMeter - X:51035366 216.0 10.4 m 4.8 w 1.2% 14.0 h 13.8 h INFO 11:10:32,779 ProgressMeter - X:51035366 216.0 11.5 m 5.3 w 1.2% 15.4 h 15.2 h INFO 11:11:33,479 ProgressMeter - X:51035366 216.0 12.5 m 5.7 w 1.2% 16.8 h 16.6 h INFO 11:12:35,360 ProgressMeter - X:51035366 216.0 13.5 m 6.2 w 1.2% 18.2 h 17.9 h INFO 11:13:35,445 ProgressMeter - X:51035366 216.0 14.5 m 6.7 w 1.2% 19.5 h 19.3 h INFO 11:14:39,689 ProgressMeter - X:51035366 216.0 15.6 m 7.2 w 1.2% 20.9 h 20.7 h INFO 11:15:40,505 ProgressMeter - X:51035366 216.0 16.6 m 7.6 w 1.2% 22.3 h 22.0 h INFO 11:16:41,140 ProgressMeter - X:51035366 216.0 17.6 m 8.1 w 1.2% 23.7 h 23.4 h INFO 11:17:41,956 ProgressMeter - X:51035366 216.0 18.6 m 8.6 w 1.2% 25.0 h 24.7 h INFO 11:18:41,958 ProgressMeter - X:51035366 216.0 19.6 m 9.0 w 1.2% 26.4 h 26.0 h INFO 11:19:44,493 ProgressMeter - X:51035366 216.0 20.7 m 9.5 w 1.2% 27.8 h 27.4 h INFO 11:20:49,749 ProgressMeter - X:51035366 216.0 21.8 m 10.0 w 1.2% 29.2 h 28.8 h INFO 11:21:53,414 ProgressMeter - X:51035366 216.0 22.8 m 10.5 w 1.2% 30.6 h 30.3 h INFO 11:22:58,174 ProgressMeter - X:51035366 216.0 23.9 m 11.0 w 1.2% 32.1 h 31.7 h INFO 11:24:01,211 ProgressMeter - X:51035366 216.0 25.0 m 11.5 w 1.2% 33.5 h 33.1 h INFO 11:25:05,051 ProgressMeter - X:51035366 216.0 26.0 m 12.0 w 1.2% 34.9 h 34.5 h INFO 11:26:07,782 ProgressMeter - X:51035366 216.0 27.1 m 12.4 w 1.2% 36.3 h 35.9 h INFO 11:27:10,933 ProgressMeter - X:51035366 216.0 28.1 m 12.9 w 1.2% 37.8 h 37.3 h INFO 11:28:20,854 ProgressMeter - X:51035366 216.0 29.3 m 13.5 w 1.2% 39.3 h 38.8 h INFO 11:29:28,165 ProgressMeter - X:51035366 216.0 30.4 m 14.0 w 1.2% 40.8 h 40.3 h INFO 11:30:28,575 ProgressMeter - X:51035366 216.0 31.4 m 14.4 w 1.2% 42.2 h 41.6 h INFO 11:31:36,673 ProgressMeter - X:51035366 216.0 32.5 m 14.9 w 1.2% 43.7 h 43.1 h INFO 11:32:45,497 ProgressMeter - X:51035366 216.0 33.7 m 15.5 w 1.2% 45.2 h 44.7 h INFO 11:33:49,205 ProgressMeter - X:51035366 216.0 34.8 m 16.0 w 1.2% 46.7 h 46.1 h INFO 11:34:49,226 ProgressMeter - X:51035366 216.0 35.8 m 16.4 w 1.2% 48.0 h 47.4 h INFO 11:35:54,571 ProgressMeter - X:51035366 216.0 36.8 m 16.9 w 1.2% 49.5 h 48.8 h INFO 11:36:59,402 ProgressMeter - X:51035366 216.0 37.9 m 17.4 w 1.2% 50.9 h 50.3 h INFO 11:38:03,427 ProgressMeter - X:51035366 216.0 39.0 m 17.9 w 1.2% 52.3 h 51.7 h INFO 11:39:12,036 ProgressMeter - X:51035366 216.0 40.1 m 18.4 w 1.2% 53.9 h 53.2 h INFO 11:40:15,472 ProgressMeter - X:51035366 216.0 41.2 m 18.9 w 1.2% 55.3 h 54.6 h INFO 11:41:22,184 ProgressMeter - X:51035366 216.0 42.3 m 19.4 w 1.2% 56.8 h 56.1 h INFO 11:42:24,992 ProgressMeter - X:51035366 216.0 43.4 m 19.9 w 1.2% 58.2 h 57.5 h INFO 11:43:30,745 ProgressMeter - X:51035366 216.0 44.4 m 20.4 w 1.2% 59.7 h 58.9 h INFO 11:44:41,392 ProgressMeter - X:51035366 216.0 45.6 m 21.0 w 1.2% 61.3 h 60.5 h INFO 11:45:51,136 ProgressMeter - X:51035366 216.0 46.8 m 21.5 w 1.2% 62.8 h 62.0 h INFO 11:46:59,056 ProgressMeter - X:51035366 216.0 47.9 m 22.0 w 1.2% 64.3 h 63.5 h INFO 11:48:09,266 ProgressMeter - X:51035366 216.0 49.1 m 22.5 w 1.2% 65.9 h 65.1 h INFO 11:49:16,701 ProgressMeter - X:51035366 216.0 50.2 m 23.1 w 1.2% 67.4 h 66.6 h INFO 11:50:24,150 ProgressMeter - X:51035366 216.0 51.3 m 23.6 w 1.2% 68.9 h 68.1 h INFO 11:51:31,883 ProgressMeter - X:51035366 216.0 52.5 m 24.1 w 1.2% 70.5 h 69.6 h INFO 11:52:40,234 ProgressMeter - X:51035366 216.0 53.6 m 24.6 w 1.2% 72.0 h 71.1 h INFO 11:53:46,785 ProgressMeter - X:51035366 216.0 54.7 m 25.1 w 1.2% 73.5 h 72.6 h INFO 11:54:53,194 ProgressMeter - X:51035366 216.0 55.8 m 25.6 w 1.2% 75.0 h 74.0 h Here is an example of the GVCF for 3 samples in one of the problem haploid regions: X 51035345 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:78:2:0,78 .:9:99:3:0,112 X 51035346 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035347 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035348 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035349 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035350 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035351 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035352 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035353 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035354 . T <NON_REF> . . END=51035355 GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035356 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:41:1:0,41 .:9:99:3:0,112 X 51035357 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:40:1:0,40 .:9:99:3:0,112 X 51035358 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:39:1:0,39 .:9:99:3:0,112 X 51035359 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:38:1:0,38 .:9:99:3:0,112 X 51035360 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:41:1:0,41 .:9:99:3:0,112 X 51035361 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:39:1:0,39 .:9:99:3:0,112 X 51035362 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:38:1:0,38 .:9:99:3:0,112 X 51035363 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:38:1:0,38 .:9:99:3:0,112 X 51035364 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:1:38:1:0,38 .:9:99:3:0,112 X 51035365 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035366 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035367 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035368 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035369 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035370 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035371 . A C,<NON_REF> . . DP=246;MQ=60.00 GT:AD:DP:GQ:MIN_DP:PL:SB .:0,4,0:4:99:.:126,0,126:0,0,1,3 .:.:2:72:2:0,72,72 .:.:9:99:3:0,112,112 X 51035372 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035373 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035374 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:72:2:0,72 .:9:99:3:0,112 X 51035375 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:72:2:0,72 .:9:99:3:0,112 X 51035376 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:54:2:0,54 .:9:99:3:0,112 X 51035377 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:71:2:0,71 .:9:99:3:0,112 X 51035378 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:71:2:0,71 .:9:99:3:0,112 X 51035379 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:81:2:0,81 .:9:99:3:0,112 X 51035380 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:78:2:0,78 .:9:99:3:0,112 X 51035381 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:45:2:0,45 .:9:99:3:0,112 X 51035382 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:45:2:0,45 .:9:99:3:0,112 Any help is greatly appreciated. Please let me know if you need any other information. Kindest regards, Chris Created 2015-07-13 05:04:48 | Updated | Tags: haplotypecaller Hi, We have several close-related samples each contain 3 million variants relative to the reference genome (both UG and HC give this results for most samples), while HC only give less than 1 million for a few samples. I searched the forum and did some test, and failed to figure out why this happen. Here goes some detailed descriptions: I tested two samples, those two look very similar, and the mapping results both seem proper with good mapping quality (>20) HC gives proper varaint results and output bam file for sample1: However, the sample2 gives the similar mapping results, but HC did not give any variants, the output bam file contain no sequence, after adding "--disableOptimizations --forceActive" options, the output bam file contain one sequence without any variants The strangest thing is when I tried joint-calling of two samples, those variants called in sample1 also get disappeared, and the output bam file is sample as the sample2 Those reads were generated from Hiseq4000 platform, most base qualities are also high (Illumina seems changed the highest quality value to 42, I'm not sure whether this could affect, however other samples seem proper ...). Thank you very much in advance! Wang Long Created 2015-07-09 14:27:00 | Updated 2015-07-09 14:30:20 | Tags: haplotypecaller pacbio gatk-unified-genotyper We have PacBio data where we want to do variant calling. I tried both UnifiedGenotyper and Haplotype caller. I was not very successfull doing that. When I used UnifiedGenotyper I got some output, but just SNPs but NO indels... (I skipped the realigment part there). I tried to play arround with the parameters (indelGapContinuationPenalty, indGapOpenPenalty, min_base_quality_score). Setting the "min_base_quality_score" to a lower value is giving at least this output with only SNPs as mentioned above. The only manual for PacBio data on GATK I got was this: https://www.broadinstitute.org/gatk/guide/topic?name=methods But this document is pretty old. Are there newer developments regarding GATK for PacBio? Or any more detailed tutorials? Or would you suggest PBHoney as the right tool to use? Anything else? Just to mention: For Illumina your toolkit worked like a charm! So basically we are able to work with it... Thanks! Michael Created 2015-07-07 15:31:49 | Updated | Tags: haplotypecaller code-exception Dear GATK Team, I'm trying to run HaplotypeCaller, on a Bowtie2 bam file, but I'm having a Code exception. Any idea how to fix it? Best regards, Miguel Machado Command: /scratch/mpmachado/NGStools/java_oracle/jre1.8.0_45/bin/java -jar /scratch/mpmachado/NGStools/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller -R /scratch/mpmachado/trabalho1diversidade/genomaReferencia/dadosGenoma/NC_004368.fna -I /scratch/mpmachado/trabalho1diversidade/genomaReferencia/outputs/bowtie/GBS_01.sorted_position.bam --out output.raw.snps.indels.vcf --genotyping_mode DISCOVERY --output_mode EMIT_ALL_SITES --sample_ploidy 1 INFO 16:21:59,507 HelpFormatter - -------------------------------------------------------------------------------- INFO 16:21:59,511 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 INFO 16:21:59,511 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 16:21:59,511 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 16:21:59,516 HelpFormatter - Program Args: -T HaplotypeCaller -R /scratch/mpmachado/trabalho1diversidade/genomaReferencia/dadosGenoma/NC_004368.fna -I /scratch/mpmachado/trabalho1diversidade/genomaReferencia/outputs/bowtie/GBS_01.sorted_position.bam --out output.raw.snps.indels.vcf --genotyping_mode DISCOVERY --output_mode EMIT_ALL_SITES --sample_ploidy 1 INFO 16:21:59,521 HelpFormatter - Executing as mpmachado@dawkins on Linux 2.6.32-5-amd64 i386; Java HotSpot(TM) Server VM 1.8.0_45-b14. INFO 16:21:59,521 HelpFormatter - Date/Time: 2015/07/07 16:21:59 INFO 16:21:59,521 HelpFormatter - -------------------------------------------------------------------------------- INFO 16:21:59,521 HelpFormatter - -------------------------------------------------------------------------------- INFO 16:22:00,257 GenomeAnalysisEngine - Strictness is SILENT INFO 16:22:00,360 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 16:22:00,370 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 16:22:00,401 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 INFO 16:22:00,410 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 16:22:01,869 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.NullPointerException at java.util.TreeMap.compare(TreeMap.java:1290) at java.util.TreeMap.put(TreeMap.java:538) at java.util.TreeSet.add(TreeSet.java:255) at org.broadinstitute.gatk.utils.sam.ReadUtils.getSAMFileSamples(ReadUtils.java:70) at org.broadinstitute.gatk.engine.samples.SampleDBBuilder.addSamplesFromSAMHeader(SampleDBBuilder.java:66) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeSampleDB(GenomeAnalysisEngine.java:846) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:296) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-0-g7e26428): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Code exception (see stack trace for error itself) ##### ERROR ------------------------------------------------------------------------------------------ Created 2015-06-29 17:57:31 | Updated 2015-06-29 17:58:38 | Tags: depthperallelebysample haplotypecaller asereadcounter Hi, I was running ASEReadCounter and came across a "stack trace" error after completion of 14.1%. I am running the command like this: java -jar ../GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -R /results2/indexes/genomes/mm9/fasta/mm9.fa -T ASEReadCounter -o filterReadCounts.chr13.58264978.txt -I dedupFolder/D4.indel.recal.dedup.reordered.bam -sites variants/gatk/D4.allBases.chr13.vcf -minDepth 20 --minMappingQuality 30 --minBaseQuality 20 The output is as follows: INFO 13:27:40,125 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:27:40,130 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 INFO 13:27:40,130 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:27:40,131 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:27:40,136 HelpFormatter - Program Args: -R /results2/indexes/genomes/mm9/fasta/mm9.fa -T ASEReadCounter -o filterReadCounts.chr13.58264978.txt -I dedupFolder/D4.indel.recal.dedup.reordered.bam -sites variants/gatk/D4.allBases.chr13.vcf -minDepth 20 --minMappingQuality 30 --minBaseQuality 20 -U ALLOW_N_CIGAR_READS -rf BadCigar INFO 13:27:40,143 HelpFormatter - Executing as szimmerm@al2.aecom.yu.edu on Linux 2.6.18-371.12.1.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_09-b05. INFO 13:27:40,144 HelpFormatter - Date/Time: 2015/06/29 13:27:40 INFO 13:27:40,145 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:27:40,145 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:27:40,937 GenomeAnalysisEngine - Strictness is SILENT INFO 13:27:41,075 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 10000 INFO 13:27:41,091 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 13:27:41,186 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.09 INFO 13:27:41,474 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 13:27:42,051 GenomeAnalysisEngine - Done preparing for traversal INFO 13:27:42,052 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 13:27:42,053 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 13:27:42,054 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 13:28:12,062 ProgressMeter - chr10:61439995 7224296.0 30.0 s 4.0 s 2.3% 21.6 m 21.1 m INFO 13:28:42,066 ProgressMeter - chr10:107123365 1.4729978E7 60.0 s 4.0 s 4.0% 24.8 m 23.8 m INFO 13:29:12,078 ProgressMeter - chr11:29429929 2.1979505E7 90.0 s 4.0 s 6.0% 25.0 m 23.5 m INFO 13:29:42,085 ProgressMeter - chr11:67787354 2.8914135E7 120.0 s 4.0 s 7.4% 26.8 m 24.8 m INFO 13:30:12,090 ProgressMeter - chr11:88702416 3.4725261E7 2.5 m 4.0 s 8.2% 30.3 m 27.8 m INFO 13:30:42,095 ProgressMeter - chr11:110188026 4.0515208E7 3.0 m 4.0 s 9.0% 33.2 m 30.2 m INFO 13:31:12,100 ProgressMeter - chr12:44912778 4.8755304E7 3.5 m 4.0 s 11.2% 31.3 m 27.8 m INFO 13:31:42,107 ProgressMeter - chr12:104866935 5.6805392E7 4.0 m 4.0 s 13.4% 29.8 m 25.8 m INFO 13:32:12,113 ProgressMeter - chr12:121257414 5.9799881E7 4.5 m 4.0 s 14.1% 32.0 m 27.5 m ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.ArrayIndexOutOfBoundsException: 0 at org.broadinstitute.gatk.tools.walkers.rnaseq.ASEReadCounter.map(ASEReadCounter.java:216) at org.broadinstitute.gatk.tools.walkers.rnaseq.ASEReadCounter.map(ASEReadCounter.java:102) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-0-g7e26428): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: 0 ##### ERROR ------------------------------------------------------------------------------------------ Is there any reason for this error, or is it just a bug? My overall goal is to get the filtered count of reads that support a given allele for an individual sample. DepthPerAlleleBySample gives unfiltered counts instead of filtered ones. If there is a way to get the filtered counts besides using ASEReadCounter that would also be great. Thanks! Created 2015-06-25 18:23:01 | Updated | Tags: haplotypecaller Hello, While running HaplotypeCaller in genotyping mode, I am getting following error: ##### ERROR stack trace java.lang.IndexOutOfBoundsException: Index: 6, Size: 6 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:849) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:252) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:883) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:226) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-0-g7e26428): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Index: 6, Size: 6 ##### ERROR ------------------------------------------------------------------------------------------ The command line given by me was: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ./chr17.fa -I ./calpop2/1_recal.bam -L 17 --genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles 17dbsnp.vcf -stand_emit_conf 10 -stand_call_conf 30 -o ./calpop2/1_pop.vcf. Can you please help me understand the reason of error. Created 2015-06-25 14:25:13 | Updated | Tags: haplotypecaller snp-calling Using the following line I'm calling SNPs. java -Xmx4g -jar${GATKPath} -R ref -T HaplotypeCaller -ploidy 1 -I downsample.bam -o hapreadyAll.vcf -bamout bamout.bam -dontUseSoftClippedBases Shown in the screen shot it seems for this regions a bamout file along with a SNP should be output. Is here a HC command to force this region and others like it to output a SNP call? Created 2015-06-24 18:55:35 | Updated | Tags: haplotypecaller multi-sample rnaseq pooled-calls Hi everybody! I've recently started working with GATK and after reading documents, tutorial and forum discussions, I set a pipeline for my experiment. I'm dealing with multi-sample RNAseq for which GATK tools are less improved and verified than DNAseq, thus I 'd like to have your suggestions. Briefly, this is the experiment: 2 phenotypes of Sparus aurata, 8 libary per phenotype, each library consists on a pool (not barcoded) of 3 animals. Thus I have a total of** 16 samples**. My goal is to find the total number of variant sites and compare the allele frequencies between the two phenotypes. **I lack genome and SNPs database. ** Step by step: 1) I used STAR (not 2-pass) in order to map reads against my de novo assembly. STAR --runThreadN 16 --genomeDir ./GenomeIndex --readFilesIn XXX.fastq --alignIntronMax 19 --outSAMtype BAM SortedByCoordinate --outSAMmapqUnique 60 --outFilterMultimapNmax 5 --outFilterMismatchNmax 4 2) I used the picard-tools to Mark duplicates 3) I used the picard-tools to AddOrReplaceReadGroups 4) I used the picard-tools to BuildBamIndex 5) I called the haplotype for the 16 samples with the following command: GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample.bam -dontUseSoftClippedBases -ploidy 6 -ERC GVCF -o output.g.vcf 6) I used the GenotypeGVCFs to merge the samples from the same population in an unique vcf file as follow GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta -stand_call_conf 20 -stand_emit_conf 20 -ploidy 6 --variant sample1.g.vcf --variant sample2.g.vcf --variant sample3.g.vcf (8 samples) -o output_HC.vcf Finally I'm going to filter the results with the Variant Filtration: GenomeAnalysisTK.jar -T VariantFiltration -R reference.fasta -V output.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o utputFiltered.vcf What do you think? Now I'd like to compare the two populations, but how??? manually in excel files? vcf tools seem not to handle ploidy higher than 2. Does anyone deal with these issues and can kindly give some tips? Best Marianna Created 2015-06-23 12:34:18 | Updated | Tags: haplotypecaller pcrmodel What is the default value for pcrModel in HaplotypeCaller ? NONE no specialized PCR error model will be applied; if base insertion/deletion qualities are present they will be used HOSTILE a most aggressive model will be applied that sacrifices true positives in order to remove more false positives AGGRESSIVE a more aggressive model will be applied that sacrifices true positives in order to remove more false positives CONSERVATIVE a less aggressive mo Created 2015-06-19 19:20:58 | Updated | Tags: haplotypecaller rnaseq genotypegvcfs gvcf Hello, I was wondering if there is a way to output all annotations for all sites when running HaplotypeCaller with BP_RESOLUTION. Currently it outputs all annotations for only called variants. Thanks in advance. Created 2015-06-05 07:41:51 | Updated | Tags: haplotypecaller gvcf Hi, I first use HaplotypeCaller to call variants with --emitRefConfidence GVCF, and then the tool GenotypeGVCFs on only the sample. In other words, I did not apply GenotypeGVCFs on cohort but on the sample itself. There was a record called in the first step but was not output in the second step. The particular record is showed below: chr1 78435701 rs202224025 TA T,TAA, 23.75 . BaseQRankSum=1.312;DB;DP=62;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=1.101;ReadPosRankSum=0.773 GT:AD:GQ:PL:SB 0/2:31,7,8,0:36:61,36,882,0,568,734,154,759,709,863:1,30,0,8 The command lines used are below: java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /Software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /Data/bundle_2.8_hg19/ucsc.hg19.fasta -I /input/chr1.bam -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --dbsnp /Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf -A StrandOddsRatio -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -A RMSMappingQuality -o /output/chr1.gvcf.vcf -L chr1 java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /Software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /Data/bundle_2.8_hg19/ucsc.hg19.fasta --variant /output/chr1.gvcf.vcf -A StrandOddsRatio -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -A RMSMappingQuality -o /output/chr1.gatkHC.vcf --dbsnp /Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf -stand_call_conf 30.0 -stand_emit_conf 10.0 -L /BED/chr1.bed I thought that this variant should be emitted into the final vcf, since the qual is 23.75 which is greater than 10.0 (set by -stand_emit_conf). Do I misunderstand something here? Thank you! Created 2015-06-03 10:26:13 | Updated | Tags: haplotypecaller indels Hi, I am wondering whether GATK HC uses read pairs information (such as reads that are mapped with unexpected separation distances or orientation) to detect large structural variations ? Thank you, Fabrice Created 2015-06-02 11:52:06 | Updated | Tags: bqsr haplotypecaller I was wondering if GATK will behave correctly if I use the -BQSR option to HaplotypeCaller instead of PrintReads, thus saving a processing step in my pipeline. AFAIK, HaplotypeCaller is the only part of our pipeline that actually uses the base score values. Created 2015-06-02 09:25:02 | Updated 2015-06-02 09:26:43 | Tags: haplotypecaller missing-genotype Dear GATK community, I am using HaplotypeCaller for variant discovery and I have found some strange results in my VCF. It appears that this walker is making no calls in positions with reads of support for them in the original BAM. For instance this position: chr7 302528 rs28436118 A G 31807.9 PASS . GT:AD:DP:GQ:PL 1/1:0,256:256:99:8228,767,0 ./.:98,0:98:.:. ./.:81,0:81:.:. 1/1:0,134:134:99:4287,401,0 was not called for 2 of the 4 samples available. However, in both samples where the genotype is missing there are many reads supporting an homozygous reference call (0/0). Do you have any idea of why this is happening? Cheers JMFA> ******** Created 2015-05-29 16:00:04 | Updated | Tags: haplotypecaller Hello, I am trying to run HaplotypeCaller on my processed bam file but it keeps running out of memory. I have tried increasing the heap size to 90G but it still crash. This might have to do with the type of analysis I am doing... The sample are pool of pigs (6 individual so a ploidy of 12 for this particular sample) that have been sequenced on targeted regions, I use the bed file that has been given with the kit to narrow done the calling to the targeted regions. I have also reduce the number of alternative allele from 6 to 3. But it still crash after a while. Is there any other parameters I should try to modify to reduce the memory usage? I have attached the log file if you want to have a look at all the parameters. Cheers, Julien Created 2015-05-23 21:49:38 | Updated | Tags: bqsr haplotypecaller best-practices dnaseq Hi, I have ~150 WGS all sequenced in batches on Illumina HiSeq over a number of years, on a number of machines/flowcells. I want to perform BQSR on my BAM files before calling variants. However for my organism I do not have access to a resource like dbSNP for true variants. So I am following the protocol by doing a first round of variant calling, subsetting to the SNPs with highest confidence and using these as the known variants for BQSR. My question is, should I carry this out on samples individually, i.e. one BAM per sample, on which I use HaplotypeCaller for initial variant calling, then subset to best SNPs, use these for BaseRecalibrator and apply the calibration to the original sample before carrying on with single sample variant finding and then joint genotyping as per best practices.... OR As I have multiple samples from the same flowcells and lanes, would I gain more information by combining those samples into a multisample BAM first, before carrying out BQSR? I'm a little unsure of how the covariates used by BQSR and actually calculated and whether I can increase the accuracy of my recalibration in this way? Each sample has between 500M and 5 billion nucleotides sequenced. Many thanks. Created 2015-05-23 05:08:22 | Updated 2015-05-23 05:12:25 | Tags: haplotypecaller bug gatk Hi All, I cannot perform joint genotype calling using gVCFs using the GenotypeGVCF command on my dataset containing 352 mtDNA samples. Does any one has run into this problem before, or has suggestions on how to get this working. I have attached the standard output for a hanging run. I can generate VCF files individually no problems. INFO 17:06:45,732 GenomeAnalysisEngine - Strictness is SILENT INFO 17:06:45,867 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 17:06:47,721 MicroScheduler - Running the GATK in parallel mode with 4 total threads, 1 CPU thread(s) for each of 4 data thread(s), of 20 processors available on this machine INFO 17:06:47,791 GenomeAnalysisEngine - Preparing for traversal INFO 17:06:47,793 GenomeAnalysisEngine - Done preparing for traversal INFO 17:06:47,794 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:06:47,794 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 17:06:47,795 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 17:06:48,436 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 17:07:17,800 ProgressMeter - gi|251831106|ref|NC_012920.1|:1 0.0 30.0 s 49.6 w 0.0% 15250.3 w 15250.3 w Waiting for data... (interrupt to abort) GATK VERSION: The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 java version "1.7.0_51" Thanks for your time. James Boocock Created 2015-05-22 18:55:21 | Updated | Tags: haplotypecaller The HaplotypeCaller typically does a great job calling SNPs, but there are a few occasions where reads are realigned causing a SNP to be missed. Image 1 shows three samples where the UnifiedGenotyper called correctly an AC=2 SNP and the HaplotypeCaller called it an AC=1. Also, due to a misalignment, surrounding the real SNP, additional AC=1 SNPs show. The bamout file of the rearrangement is shown in image 2. Here the reads forced into alignment can be seen causing the misidentified AC=1 calls. Is this a HaplotypeCaller bug or is there an additional option that can be applied to prevent this misalignment? Thank you, Running v3.4-0-g7e26428 Created 2015-05-22 08:00:24 | Updated 2015-05-22 08:01:11 | Tags: haplotypecaller dp Hi, currently I'm running GATK version 3.3.0 and run in the problem, that quite often I don't get a value for DP after the HaplotypeCaller. There actually is nothing (just a dot) in the VCF file. chr11 57569573 . A ACC 1640.8 VQSRTrancheINDEL99.90to100.00 . GT:AD:DP:GQ:PGT:PID:PL 0/1:57,16:.:99:0|1:57569569_CAGG_C:480,0,7262 chr11 57569575 . A AT 1686.05 VQSRTrancheINDEL99.90to100.00 . GT:AD:DP:GQ:PGT:PID:PL 0/1:60,16:.:99:0|1:57569569_CAGG_C:492,0,7092 What woks is to annotate afterwards using the VariantAnnotator. But in the documentation it states, that the VariantAnnotator defines the depth by the total amount of reads at that specific position. In contrast the DP after HaplotypeCaller should be the number of informative reads for that position. I can find that difference when I check for the position in GEMINI prior and after Annotation with Variant annotator. Patient 1 (after reannotation using VariantAnnotator) Position depth variant 57563990 75 C/C 57569568 81 CAGG/C 57569572 78 A/ACC 57569574 81 A/AT Patient 2 (after reannotation using VariantAnnotator) Postition depth variant 57563990 73 C/T 57569568 104 CAGG/CAGG 57569572 106 A/A 57569574 106 A/A After HaplotypeCaller Postion patient1 patient2 ...(other patients) 57563990 48 -1 -1 -1 -1 -1 C/C C/T C/T C/T C/T C/T 57569568 -1 66 52 -1 56 60 CAGG/C CAGG/CAGG CAGG/CAGG CAGG/C CAGG/CAGG CAGG/CAGG 57569572 -1 101 88 -1 101 77 A/ACC A/A A/A A/ACC A/A A/A 57569574 -1 97 87 -1 94 78 A/AT A/A A/A A/AT A/A A/A 57576040 -1 -1 -1 -1 -1 8 ./. ./. ./. ./. ./. C/C 57578937 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. TTG/T 57578941 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. C/CTG 57578942 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. C/G E.g. patient 2 has in the position 57569568 66 informative reads, but 104 reads in total. To check how valid a Variant is, I think it is way more interesting to know the number of informative reads. In this specific case we checked for those mutation (57569572 and 57569574) via sanger and didn't find anything. If we would know that only 2 or 3 reads would be informative we might have gone to another variant instead. Am I doing something wrong, or is there a bug in the software? One example how I call the HaplotypeCaller. java -Xmx10g -jar /data/ngs/bin/GATK/3.3-0/GenomeAnalysisTK.jar -l INFO -et NO_ET \ -T HaplotypeCaller \ -R /data/ngs/programs/bundle_2.8/ucsc.hg19.fasta \ -D /data/ngs/programs/bundle_2.8/dbsnp_138.hg19.vcf \ -L /data/ngs/exome_targets/SureSelect_V4_S03723314_Regions.bed \ -ERC GVCF \ -variant_index_type LINEAR \ -variant_index_parameter 128000 \ -stand_call_conf 30 \ -stand_emit_conf 30 \ -A Coverage -A AlleleBalance \ -A QualByDepth -A HaplotypeScore \ -nct 4 \ -I out/06-BQSR/21_383_1.bam \ -o out/08-variant-calling/21_383_1/21_383_1-raw.vcf \ -log out/08-variant-calling/21_383_1/21_383_1-raw_calling.log Thanks in advance, Anselm Created 2015-05-21 06:27:17 | Updated | Tags: haplotypecaller genotypeconcordance Dear GATK Team, I performed a variant caller comparison, and the genotype concordance analysis of HaplotypeCaller's results a little strange to me, and I can't understand it at all. ALLELES_MATCH: 0 EVAL_SUPERSET_TRUTH: 2438 EVAL_SUBSET_TRUTH: 0 ALLELES_DO_NOT_MATCH: 636 EVAL_ONLY: 20 TRUTH_ONLY: 1014546 The last and first value could be true? Or anyone can tell what could cause this? I got very different (and reasonable) results with other callers. Many thanks! Created 2015-05-20 13:06:46 | Updated | Tags: haplotypecaller java genotypegvcfs Dear GATK, I used the HaplotypeCaller with "-dcov 500 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000" to produce 60 gvcf files, that worked fine. However, GenotypeGVCFs gets stuck on a position and runs out of memory after about 24hours, even when I allocate 240Gb. Testing a short region of 60kb does not help. Here was my command line: software/jre1.7.0_25/bin/java -Xmx240g -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Reference.fasta -L chrom14:2240000-2300000 --variant 60samples_gvcf.list -o output.vcf If I split my list of 60 gvcf files into two lists of 30 samples each, GenotypeGVCFs works fine for both batches within 15 minutes (~10Gb of memory). I tested with 47 samples, it took 8 hours (31gb of memory) for a 60kb region. Once I use more than ~55 samples, it takes forever and crashes. Any help will be much appreciated! Thanks, Antoine Created 2015-05-20 05:27:42 | Updated | Tags: haplotypecaller Hello, I want to ask about the step in GATK for variant calling with HaplotypeCaller. There are 2 steps for that in the guide, which result in gvcf or vcf. I still don't understand what is the difference even though it is written that the gvcf is for cohort analysis workflow. How do I choose which analysis workflow I should use, the vcf of the gvcf? My target is to identify SNP in every sample I have and my analysis don't require to compare one sample to each other. Just to get the variation per sample. Created 2015-05-18 16:18:23 | Updated 2015-05-18 16:18:53 | Tags: haplotypecaller gatk3-4 warnings GATK3.4 is throwing lots of warnings. Should I be concerned about any of them? I'm not sure, which --variant_index_type to set, so I will make sure to use the correct file extension going forward: WARN 17:02:08,480 GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values for --variant_index_type and --variant_index_parameter WARN 17:02:08,484 GATKVCFUtils - Creating Tabix index for out_HaplotypeCaller/20/APP5339451.vcf.gz, ignoring user-specified index type and parameter  I also got this warning with earlier versions: WARN 17:02:10,495 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation  Is this something to be concerned about? WARN 17:02:18,771 PairHMMLikelihoodCalculationEngine1 - Failed to load native library for VectorLoglessPairHMM - using Java implementation of LOGLESS_CACHING


I didn't ask for these annotations:

WARN  17:02:19,078 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper
WARN  17:02:19,079 InbreedingCoeff - Annotation will not be calculated, must provide a valid PED file (-ped) from the command line.


My command line looked like this:

-A Coverage -A FisherStrand -A StrandOddsRatio -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest


Created 2015-05-14 15:47:19 | Updated | Tags: haplotypecaller

Hi there,

I've been looking at some WES data where I am comparing pre and post GT refinement workflow data and came across where it seems like to me that HC does not take into account overlapping reads...I thought I read a year a two ago that Unified Genotyper does take the consensus when reads overlap (as opposed to counting each base in the overlap individually), but I'm guessing that HC does not?

I had a variant that was called a denovo prior to refinement, but not after refinement (the reason why I am looking at the variant). When I looked at the data in IGV I saw that the site had 2 alternate alleles and 4 reference alleles so I could see why HC made the call, but when I switched to "viewed as pairs" I can see that it really is only 3 molecules because both ends overlapped and thus if you took the consensus there would only be one molecule with the variant base. So if the basic logic is that you need two reads supporting the alternate haplotype for it to be considered, then if you took into account overlapping pairs, then this would not have been considered.

I also went back to the gvcf file and saw that it was counting all 6 reads (instead of 3).

I am attaching the two plots as well (before and after viewing as pairs).

The GVCF files were created with 3.3, everything downstream of that was done with a Jan 15 nightly (CombineGVCF, GenotypeGVCF and CGP).

I realize that I am splitting hairs here, but just wanted to send this for my edification/confirm what I am thinking if nothing else...or just as a forum for my incoherent babbling.

Thanks

Kurt

Created 2015-05-11 13:57:37 | Updated | Tags: unifiedgenotyper depthofcoverage haplotypecaller mendelianviolations

Hi,

Two questions, which relate to Unified Genotyper or Haplocaller:

1. Detecting and determining the exact genotype of an individual seems to be inaccurate at read depths less than 10X. This is because there aren't enough reads to accurately say whether it could be heterozygous. Is there an option in either Unified Genotyper or Haplocaller, that excludes sites in individuals that have below 10X? Alternatively how can these sites be removed from the vcf - or set to missing ?

2. I'm sure I've read somewhere that pedigree information can be supplied to a variant caller (such as Unified Genotyper or Haplocaller), and used to improve the calling accuracy/speed/efficiency. I am calling variants on one parent and multiple offspring.

Apologies if these questions are answered in the user manual. I regularly look it but have not found much to answer my questions.

Cheers,

Blue

Created 2015-05-07 09:22:47 | Updated 2015-05-07 10:09:41 | Tags: haplotypecaller format genotypegvcfs info-field

I have a potential bug running GATK GenotypeGVCFs. It complains that there is a DP in the INFO field, but in my haplotypecaller-generated -mg.g.vcf.gz's I do not have a DP in the info, I do have DP in the FORMAT field though, but that's present in the headers as shown below the error output.

Any idea what could be the problem?

INFO  18:30:12,694 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:30:12,698 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-geee94ec, Compiled 2015/03/09 14:27:22
INFO  18:30:12,699 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  18:30:12,706 HelpFormatter - Program Args: -l INFO -T GenotypeGVCFs -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -o /dev/stdout -ploidy 2 --num_threads 32 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SeqCap/ex100/110624_MM10_exome_L2R_D02_EZ_HX1-ex100.bed --max_alternate_alleles 20 -V:3428_10_14_SRO_185_TGGCTTCA-mg,VCF 3428_10_14_SRO_185_TGGCTTCA-mg.g.vcf.gz -V:3428_11_14_SRO_186_TGGTGGTA-mg,VCF 3428_11_14_SRO_186_TGGTGGTA-mg.g.vcf.gz -V:3428_12_13_SRO_422_TTCACGCA-mg,VCF 3428_12_13_SRO_422_TTCACGCA-mg.g.vcf.gz -V:3428_13_13_SRO_492_AACTCACC-mg,VCF 3428_13_13_SRO_492_AACTCACC-mg.g.vcf.gz -V:3428_14_13_SRO_493_AAGAGATC-mg,VCF 3428_14_13_SRO_493_AAGAGATC-mg.g.vcf.gz -V:3428_15_14_SRO_209_AAGGACAC-mg,VCF 3428_15_14_SRO_209_AAGGACAC-mg.g.vcf.gz -V:3428_16_14_SRO_218_AATCCGTC-mg,VCF 3428_16_14_SRO_218_AATCCGTC-mg.g.vcf.gz -V:3428_17_14_SRO_201_AATGTTGC-mg,VCF 3428_17_14_SRO_201_AATGTTGC-mg.g.vcf.gz -V:3428_18_13_SRO_416_ACACGACC-mg,VCF 3428_18_13_SRO_416_ACACGACC-mg.g.vcf.gz -V:3428_19_14_SRO_66_ACAGATTC-mg,VCF 3428_19_14_SRO_66_ACAGATTC-mg.g.vcf.gz -V:3428_1_13_SRO_388_GTCGTAGA-mg,VCF 3428_1_13_SRO_388_GTCGTAGA-mg.g.vcf.gz -V:3428_20_14_SRO_68_AGATGTAC-mg,VCF 3428_20_14_SRO_68_AGATGTAC-mg.g.vcf.gz -V:3428_21_14_SRO_210_AGCACCTC-mg,VCF 3428_21_14_SRO_210_AGCACCTC-mg.g.vcf.gz -V:3428_22_14_SRO_256_AGCCATGC-mg,VCF 3428_22_14_SRO_256_AGCCATGC-mg.g.vcf.gz -V:3428_23_14_SRO_270_AGGCTAAC-mg,VCF 3428_23_14_SRO_270_AGGCTAAC-mg.g.vcf.gz -V:3428_24_13_SRO_452_ATAGCGAC-mg,VCF 3428_24_13_SRO_452_ATAGCGAC-mg.g.vcf.gz -V:3428_2_13_SRO_399_GTCTGTCA-mg,VCF 3428_2_13_SRO_399_GTCTGTCA-mg.g.vcf.gz -V:3428_3_13_SRO_461_GTGTTCTA-mg,VCF 3428_3_13_SRO_461_GTGTTCTA-mg.g.vcf.gz -V:3428_4_13_SRO_462_TAGGATGA-mg,VCF 3428_4_13_SRO_462_TAGGATGA-mg.g.vcf.gz -V:3428_5_13_SRO_465_TATCAGCA-mg,VCF 3428_5_13_SRO_465_TATCAGCA-mg.g.vcf.gz -V:3428_6_13_SRO_402_TCCGTCTA-mg,VCF 3428_6_13_SRO_402_TCCGTCTA-mg.g.vcf.gz -V:3428_7_13_SRO_474_TCTTCACA-mg,VCF 3428_7_13_SRO_474_TCTTCACA-mg.g.vcf.gz -V:3428_8_13_SRO_531_TGAAGAGA-mg,VCF 3428_8_13_SRO_531_TGAAGAGA-mg.g.vcf.gz -V:3428_9_14_SRO_166_TGGAACAA-mg,VCF 3428_9_14_SRO_166_TGGAACAA-mg.g.vcf.gz
INFO  18:30:12,714 HelpFormatter - Executing as roel@utonium.nki.nl on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13.
INFO  18:30:12,714 HelpFormatter - Date/Time: 2015/05/06 18:30:12
INFO  18:30:12,715 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:30:12,715 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:30:15,963 GenomeAnalysisEngine - Strictness is SILENT
INFO  18:30:16,109 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  18:30:29,705 IntervalUtils - Processing 101539431 bp from intervals
WARN  18:30:29,726 IndexDictionaryUtils - Track 3428_10_14_SRO_185_TGGCTTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_11_14_SRO_186_TGGTGGTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_12_13_SRO_422_TTCACGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_13_13_SRO_492_AACTCACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_14_13_SRO_493_AAGAGATC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_15_14_SRO_209_AAGGACAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_16_14_SRO_218_AATCCGTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_17_14_SRO_201_AATGTTGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_18_13_SRO_416_ACACGACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_19_14_SRO_66_ACAGATTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_1_13_SRO_388_GTCGTAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_20_14_SRO_68_AGATGTAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_21_14_SRO_210_AGCACCTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_22_14_SRO_256_AGCCATGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_23_14_SRO_270_AGGCTAAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_24_13_SRO_452_ATAGCGAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_2_13_SRO_399_GTCTGTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_3_13_SRO_461_GTGTTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_4_13_SRO_462_TAGGATGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_5_13_SRO_465_TATCAGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_6_13_SRO_402_TCCGTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_7_13_SRO_474_TCTTCACA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_8_13_SRO_531_TGAAGAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,735 IndexDictionaryUtils - Track 3428_9_14_SRO_166_TGGAACAA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  18:30:29,749 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine
INFO  18:30:29,878 GenomeAnalysisEngine - Preparing for traversal
INFO  18:30:29,963 GenomeAnalysisEngine - Done preparing for traversal
INFO  18:30:29,964 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  18:30:29,965 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  18:30:29,966 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  18:30:30,562 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
INFO  18:31:00,420 ProgressMeter -       1:4845033         0.0    30.0 s      50.3 w        0.0%    46.7 h      46.7 h
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IllegalStateException: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. ##### ERROR ------------------------------------------------------------------------------------------  for f in *.g.vcf.gz; do echo -e "\n--$f --"; zcat "$f" | sed -n -r "/^#.*DP/p;/^1\t4839315\t/{p;q;}"; done ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:22:0:21:0,0,432 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:20:0:20:0,0,410 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,773 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:25:2:25:0,3,790 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:33:0:33:0,0,837 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:23:31:23:0,31,765 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 0 . ClippingRankSum=-0.578;MLEAC=0,0;MLEAF=0.00,0.00 GT:DP:GQ:PL:SB 0/0:21:39:0,39,488,60,491,512:20,0,0,0 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:18:0:18:0,0,514 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,810 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:33:0:33:0,0,812 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:28:0:25:0,0,624 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 0.08 . ClippingRankSum=-0.189;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:17:20:20,0,311,62,320,382:14,0,3,0 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 6.76 . ClippingRankSum=-0.374;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:25:43:43,0,401,102,417,519:20,0,3,2 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 0 . ClippingRankSum=-1.095;MLEAC=0,0;MLEAF=0.00,0.00 GT:DP:GQ:PL:SB 0/0:23:1:0,1,395,56,406,460:19,0,0,0 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:28:0:28:0,0,626 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 5.99 . ClippingRankSum=-0.584;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:18:42:42,0,293,84,305,388:13,1,3,1 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:22:0:22:0,0,558 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G GA,<NON_REF> 6.76 . ClippingRankSum=0.850;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:19:43:43,0,262,87,274,361:12,3,4,0 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 16.82 . ClippingRankSum=-0.784;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:21:54:54,0,352,102,367,470:16,0,4,1 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:26:0:25:0,0,419 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:30:0:30:0,0,771 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:34:77:34:0,78,1136 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:26:0:20:0,0,397 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GAA G,GA,<NON_REF> 22.75 . ClippingRankSum=-2.181;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00 GT:DP:GQ:PL:SB 0/2:11:22:60,22,209,0,87,104,63,153,113,176:4,2,3,0  Created 2015-05-05 00:40:58 | Updated | Tags: haplotypecaller queue gatk Question Hey folks, We've run in to a situation where we have hundreds of queue managed GATK HaplotypeCaller.scala jobs which we'd like to run. This has led to hundreds of instances of Queue.jar in their post scatter watching state holding on to cores in our cluster. I'm curious about plans for Queue and whether or not a job dependency tree model has been considered for future versions, or whether or not this is already available. By that, I mean something along the lines of the scatter kicks off the child jobs and a check/resubmit_faileds/gather job is queued on the cluster with a qsub -W depend=after:child1jobid[:child2jobid: ...:childNjobid]. The initial scatter job would end, freeing up a core, and resources would be available to the sub jobs. Then, when the child jobs finish up, the dependencies would be met and the next job would execute (when resources are available) and pick up the successful, resubmit the failed, lather, rinse , repeat. Thanks in advance for your patience with me in the phrasing of my question, as I am approaching this as a cluster admin, not as the developer who has implemented the queue. Cheers, Scott Created 2015-04-28 08:17:21 | Updated | Tags: unifiedgenotyper haplotypecaller coverage bias Hi, I am doing joint variant calling for Illumina paired end data of 150 monkeys. Coverage varies from 3-30 X with most individuals having around 4X coverage. I was doing all the variant detection and hard-filtering (GATK Best Practices) process with both UnifiedGenotyper and Haplotype caller. My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue? In particular, consider pairwise differences across individuals: The absolute values are lower for low coverage individuals than for high coverage, for both methods, since it is more difficult to make calls for them. However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals by substracting from the total reference length all the filtered sites and sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (If both pairs consist of members of similar populations). However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population. Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would trade in more false positive if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity) which is the average cross individuals diversity and thus already at the higher-end for within individual heterozygosity. I would expect the problem to be even worse if I chose lower values. Can you give me any recommendation, should I go back using UnifiedGenotyper or is there any way to solve this problem? Many thanks in advance, Hannes Created 2015-04-24 14:29:30 | Updated 2015-04-24 15:01:33 | Tags: haplotypecaller genotype-given-allele gga Hi Team, I want to call variants for some samples using Genotype Given Allele mode(GGA) of HaplotypeCaller. I have used UnifiedGenotyper earlier in GGA mode. I am not sure about the right approach to do GGA mode in HaplotypeCaller. You make GVCFs for all BAMs using HaplotypeCaller in GGA mode and later use the GenotypeGVCFs walker to make VCF from these GVCFs. Is this the way to do this ? Or Since GVCFs has all the sites from BAMs, can GenotypeGVCF can do a variant calling using GGA mode. I know that the GenotypeGVCF doesn't have the GGA option currently. But just checking the right approach. Created 2015-04-22 21:11:25 | Updated 2015-04-22 21:14:24 | Tags: haplotypecaller Hi GATK team, I am using the GATK 3.3 HaplotypeCaller. I found if i use different -L i will have different genotype on the same location. My para is very simple: -T HaplotypeCaller -R ucsc.hg19.fasta -mbq 20 --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -I test.bam -L chr17 -o output.vcf I check the same site. If i set -L chr17:36070200-36071000, there is a reported SNV. if i set -L chr17:36000000-36140000, there is no SNV report. if i set larger: -L chr17:30000000-40000000, it just show up again. if i use the whole chr17, it gone. This is very confusing me. it is a C->G variant at chr17. The snp is looks like: GT 0/1 AB 0.84 AD 257,49,0 DP 306 GQ 99 PGT 0|1 PID 36070590_GTCACCAT_G PL 2966,0,10470,3755,10800,14555 SB 14,243,49,0 While I checked the bam file with IGV. There are two wired things: (1) there is no read supporting the non-reference allele. (2) the depth is about 600, far more deeper than the HaplotypeCaller reported. Why would this happen? Created 2015-04-20 10:57:05 | Updated 2015-04-20 11:03:28 | Tags: unifiedgenotyper combinevariants haplotypecaller genotype-given-alleles Question: Is it possible to have CV merge like bcftools does it? I get this warning, when running UG in GGA mode using an -alleles vcf generated with CV: WARN 10:17:21,394 GenotypingGivenAllelesUtils - Multiple valid VCF records detected in the alleles input file at site 20:106089, only considering the first record  I made this call with HC from 10 samples: 20 106089 . CA C  And this call with UG from 10 other samples: 20 106089 . C A  CV merges like this: 20 106089 . C A 20 106089 . CA C  bcftools merges like this: 20 106089 . CA AA,C  The UG recall from the CV generated -alleles vcf is incomplete: 20 106089 . C A  The UG recall from the bcftools generated -alleles vcf is complete: 20 106089 . CA AA,C  Is it possible to have CV merge like bcftools does it? In another thread @Geraldine_VdAuwera said: I'm really not sure. It's not a use case that UG was designed for (with UG we kept SNPs and indels separate until post-analysis), so I would recommend being cautious with it. I checked the genotypes and UG seems to handle merged MNPs and indels just fine; see below. But I will do some additional testing. Or I might just take the safe path and do the recalling separately for SNPs and indels as suggested. The reason I have UG and HC calls in the first place is because I have low and high coverage data for different cohorts. I want to create a merged dataset. Despite --interval_padding 100 helping to recall more sites with HC in GGA mode as per previous recommendation, some sites still fail to be called with HC in GGA mode. Hence I opted for UG. UG calls on samples 1-10: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 535 545 546 550 554 564 567 574 575 578 20 106089 . C A 16.19 . AC=2;AF=0.125;AN=16;BaseQRankSum=-0.854;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=1.5282;MLEAC=2;MLEAF=0.125;MQ=58.74;MQ0=0;MQRankSum=-0.560;QD=2.70;ReadPosRankSum=-1.797;SOR=0.935;VariantType=SNP GT:AD:DP:GQ:PL 0/0:3,0:3:6:0,6,76 0/0:4,2:6:9:0,9,115 0/1:3,1:4:24:24,0,80 0/0:6,0:6:12:0,12,130 0/1:1,1:2:29:30,0,29 ./. 0/0:7,0:7:15:0,15,188 0/0:3,1:4:6:0,6,74 ./. 0/0:5,0:5:12:0,12,142  HC calls on samples 11-20: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 585 590 622 625 628 640 655 668 687 693 20 106089 . CA C 47.95 . AC=5;AF=0.250;AN=20;BaseQRankSum=0.925;DP=36;FS=1.850;InbreedingCoeff=0.0646;MLEAC=5;MLEAF=0.250;MQ=59.48;MQ0=0;MQRankSum=0.175;QD=3.00;ReadPosRankSum=-1.725;SOR=0.387 GT:AD:GQ:PL 0/0:2,0:6:0,6,49 0/0:2,0:6:0,6,49 0/0:3,0:12:0,12,130 0/0:5,0:15:0,15,122 0/0:2,0:6:0,6,46 0/1:2,1:14:14,0,39 0/1:2,1:15:15,0,38 0/0:4,0:12:0,12,93 0/1:3,1:12:12,0,46 1/1:0,3:9:67,9,0  UG GGA recalls on samples 1-20: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 535 545 546 550 554 564 567 574 575 578 585 590 622 625 628 640 655 668 687 693 20 106089 . CA AA,C 110.56 . AC=0,8;AF=0.00,0.222;AN=36;DP=81;FS=0.000;InbreedingCoeff=0.5076;MLEAC=0,6;MLEAF=0.00,0.167;MQ=58.56;MQ0=0;QD=3.45;SOR=0.859;VariantType=MULTIALLELIC_MIXED GT:AD:DP:GQ:PL:SB 0/0:0,0,0:3:0:0,0,0,6,6,52:0,0,0,0 0/2:0,0,1:6:0:5,5,5,0,0,109:0,0,1,0 0/2:0,0,1:4:0:12,12,12,0,0,47:0,0,1,0 0/0:0,0,0:6:0:0,0,0,17,17,123:0,0,0,0 0/0:0,0,0:2:0:0,0,0,3,3,10:0,0,0,0 ./. 0/0:0,0,0:7:0:0,0,0,9,9,60:0,0,0,0 0/2:0,0,1:4:0:12,12,12,0,0,61:0,0,0,1 ./. 0/0:0,0,1:5:0:0,0,0,4,4,30:0,0,0,1 0/0:0,0,0:3:0:0,0,0,6,6,49:0,0,0,0 0/0:0,0,0:3:0:0,0,0,9,9,76:0,0,0,0 0/0:0,0,1:4:0:0,0,0,1,1,22:0,0,1,0 0/0:0,0,0:7:0:0,0,0,18,18,149:0,0,0,0 0/0:0,0,0:4:0:0,0,0,11,11,76:0,0,0,0 0/2:0,0,1:5:0:9,9,9,0,0,65:0,0,0,1 0/2:0,0,1:4:0:12,12,12,0,0,60:0,0,0,1 0/0:0,0,0:5:0:0,0,0,15,15,116:0,0,0,0 0/2:0,0,1:6:0:12,12,12,0,0,47:0,0,0,1 2/2:0,0,3:3:9:67,67,67,9,9,0:0,0,3,0  This thread is related to the following threads on GGA: http://gatkforums.broadinstitute.org/discussion/5249/overcalling-deletion-in-unifiedgenotyper-genotype-given-alleles-mode http://gatkforums.broadinstitute.org/discussion/5018/ug-call-combined-snp-indel-sites-in-gga-mode http://gatkforums.broadinstitute.org/discussion/4936/not-all-sites-emitted-with-genotype-given-alleles http://gatkforums.broadinstitute.org/discussion/4024/genotype-and-validate-or-haplotype-caller-gga-what-am-i-doing-wrong  P.S. I might gate crash your Cambridge party this week despite not being invited :D The course was already fully booked, when you announced it. I don't have a time machine! Created 2015-04-16 15:31:12 | Updated | Tags: haplotypecaller Hi, I am interested in calling SNPs for a set of 150 bacterial genomes (genome size ~1Mb). I'm attempting to use the HaplotypeCaller and am running into errors with the java memory: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java. There is an estimated run time of ~11 days. I have increased the memory to 20g and am limiting the max_alternate_alleles as well as shown below: java -d64 -Xmx20g -jar$EXECGATK \ -T HaplotypeCaller \ -R $REF \ -I$DATAPATH${BAMLIST} \ -stand_call_conf 20 \ -stand_emit_conf 20 \ --sample_ploidy 1 \ --maxNumHaplotypesInPopulation 198 \ --max_alternate_alleles 3 \ -L "gi|15594346|ref|NC_001318.1|" \ -o${OUTPATH}{BASE}.chr.snps.indels.vcf Is there a way to call only SNPs as my understanding is that indel calling is memory intensive and I am going to focus on SNPs for this part of my analysis? Or is there another way to make this analysis more efficient? Thank you! Created 2015-04-15 10:37:45 | Updated | Tags: haplotypecaller targeted-sequencing Dear all, we use the Broad best practice pipeline to call germline variants on Panel-seq data, i.e. our chip comprises the Omimom. Hence, it is not a small target set but significantly smaller than for instance common exome-seq. The results are good so far, the mixture model plots look decent to me. Here are my questions: 1. Is there any best practice for such panels? 2. I attached a file with the tranches plot for the SNPs, what do you think? Shall we increase the --ts_filter_level for SNPs to, say, 99.99 (it is now 99.9 as in the best practice for exome-seq)? Thank you, Chris Created 2015-04-14 12:49:42 | Updated 2015-04-14 12:53:11 | Tags: haplotypecaller Hi GATK team, I have recently performed an analysis to try and select a GQ cut-off to be applied to WES data post-VQSR (applied 99.9% to the data). The WES data was called using HaplotypeCaller GenomeAnalysisTK-3.2-2 (over ~3000) samples and VQSR was applied (using the same GATK version). To decide on a GQ threshold, I looked at the correlation (over different GQ filters applied to the WES data) of chip genotypes and the sequencing genotypes (~350 samples were both genotyped and sequenced). The genotype data has been QC'ed s normally is in GWAS. The correlation is just the r squared (r2) for each variant between 2 vectors: one with the 350 chip genotypes and the other with the 350 sequencing genotypes. I finally estimated the average r2 per GQ quality filter applied and also counted how many genotypes were being dropped (ie., no longer variant sites). The result of this is the following figure, which I think looks a bit odd and suggests that the GQ is perhaps multi-modal. Have you ever seen this or have any suggestions as to why this might be? The blue line is the correlation (left y axis) and the green is the proportion of GTs dropped (right y axis). The x axis is the GQ filters applied to the data from 0 to 50. The calling command line used was this: -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -L S04380110_Padded.interval_list Many thanks, Eva Created 2015-04-14 03:01:09 | Updated | Tags: haplotypecaller Hi, There are some confused result found in the output gvcf of HaplotypeCaller. Why the genotypes is 0/0 but called a GC->G deletion? see below for example. chr8 118915071 . GC G,<NON_REF> 0 . DP=37;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.29;MQ0=0 GT:AD:DP:GQ:PL:SB 0/0:35,0,0:35:99:0,104,1078,105,1079,1080:27,8,0,0 chr8 118915072 rs34953549 CA C,<NON_REF> 0 . BaseQRankSum=1.783;ClippingRankSum=-0.149;DB;DP=40;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.27;MQ0=0;MQRankSum=0.743;ReadPosRankSum=1.296 GT:AD:DP:GQ:PL:SB 0/0:25,3,0:28:24:0,24,331,71,340,387:18,7,0,0  Thanks a lot! Hiu Created 2015-03-25 20:33:49 | Updated 2015-03-25 20:38:23 | Tags: haplotypecaller Hi, The variant was called by MiSeq reporter and GATK2.7 UnifiedGenotyper with VAF 19% after soft-clipping the primer sequences, as also seen in BAM file got according to GATK best practice (BWA, indel realignment, base recalibration, but no dedup due to amplicon-based sequencing). But GATK 3.3 HaplotypeCaller called a single base pair deletion instead (A: 4 (1%); C: 527) : variants-HC3.3.vcf: chr13 28592642 . C . 1356.23 . AN=2;DP=469;MQ=59.95 GT:AD:DP 0/0:469:469 variants-UG_.vcf: chr13 28592642 rs121913488 C A 1785.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-7.477;DB;DP=731;Dels=0.00;FS=0.713;HaplotypeScore=62.1894;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.522;QD=2.44;ReadPosRankSum=0.174 GT:AD:DP:GQ:PL 0/1:596,134:731:99:1814,0,11941 The variant is at the end of one amplicon of lower coverage, close to another amplicon of much higher coverage (please see attached beforeHC.png, the soft-clipped part is primer sequences). If I trim 30 bp of the primer sequence instead of soft-clipping, the variant was called with strand bias under low stringency: 8 (12%: 1+, 7-), C: 59 (6+, 53-) chr13 28592642 rs121913488 C A 20.80 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.750;ClippingRankSum=0.250;DB;DP=20;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-1.950;QD=1.04;ReadPosRankSum=-1.450;SOR=1.085 GT:AD:DP:GQ:PL 0/1:15,4:19:49:49,0,354 The variant was not called at all if I trim 22, 24, 27 or 32 bp instead of 30 bp from both ends. Why the variant was not called by HaplotypeCaller after soft-clipping primer sequences? How can I adjust anything during the HaplotypeCaller to make this variant called? Thanks! Created 2015-03-08 09:52:48 | Updated | Tags: haplotypecaller dp Hello GATK team, 1. I've noticed a strange variant in the gvcf output of HC: Raw vcf: 6 32552140 rs17882663 T A, 108.18 . DB;DP=0;MLEAC=2,0;MLEAF=1.00,0.00;MQ=0.00;MQ0=0 GT:GQ:PGT:PID:PL:SB 1/1:9:0|1:32552059_G_T:135,9,0,135,9,135:0,0,0,0 After GenortpeGVCFs: 6 32552140 rs17882663 T A 107.28 . AC=2;AF=1.00;AN=2;DB;DP=0;FS=0.000;GQ_MEAN=9.00;MLEAC=2;MLEAF=1.00;MQ=0.00;MQ0=0;NCC=0;SOR=0.693 GT:AD:GQ:PGT:PID:PL 1/1:0,0:9:1|1:32552059_G_T:135,9,0 The DP and AD are 0, but there is a variant - A. What do you think? Why does it happened? 1. What is the difference between the DP in the format part and in the info part? I looked for an answer in the documentation, but couldn't find one. Bellow is an example of a big difference between the values of this two. 3 195511214 . G GACCTGTGGATGCTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGTGTC 673.77 . AC=1;AF=0.500;AN=2;DP=169;FS=0.000;GQ_MEAN=38.00;MLEAC=1;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:0:38:702,0,38 Maya I am running HC3.3-0 with the following options (e.g. GENOTYPE_GIVEN_ALLELES): java7 -Djava.io.tmpdir=tmp -Xmx3900m \
-jar $jar \ --analysis_type HaplotypeCaller \ --reference_sequence$ref \
--input_file $BAM \ --intervals$CHROM \
--dbsnp $dbSNP \ --out$out \
-stand_call_conf 0 \
-stand_emit_conf 0 \
-A Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \
-L $allelesVCF \ -L 20:60000-70000 \ --interval_set_rule INTERSECTION \ --genotyping_mode GENOTYPE_GIVEN_ALLELES \ --alleles$allelesVCF \
--emitRefConfidence NONE \
--output_mode EMIT_ALL_SITES \


The file $allelesVCF contains these neighbouring SNPs: 20 60807 . C T 118.96 . 20 60808 . G A 46.95 . 20 61270 . A C 2870.18 . 20 61271 . T A 233.60 .  I am unable to call these neighbouring SNPs; despite reads being present in the file$BAM, which shouldn't matter anyway. I also tried adding --interval_merging OVERLAPPING_ONLY to the command line, but that didn't solve the problem. What am I doing wrong? I should probably add GATK breaker/misuser to my CV...

Thank you as always.

P.S. The CommandLineGATK documentation does not say, what the default value for --interval_merging is.

P.P.S. Iterative testing a bit slow, because HC always has to do this step:

HCMappingQualityFilter - Filtering out reads with MAPQ < 20


Created 2014-12-04 19:17:24 | Updated | Tags: baserecalibrator haplotypecaller vcf convergence bootstrap

I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecallibrator, it is recommended that when lacking a reliable set of sequence variants: "You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecallibrator in a loop should be fairly strait forward. What is a good strategy for comparing VCF files and assessing convergence?

Created 2014-12-04 18:54:17 | Updated | Tags: haplotypecaller vcf rnaseq

Specifically, what does the 'start' component of this flag mean? Do the reads all have to start in exactly the same location? Alternatively, does the flag specify the total number of reads that must overlap a putative variant before that variant will be considered for calling?

Created 2014-12-02 16:32:32 | Updated | Tags: haplotypecaller bam gatk

Hi,

I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees?

The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend?

And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)?

Thanks, Alva

Created 2014-11-18 02:40:18 | Updated | Tags: haplotypecaller

Hi,

I'd like to ask you if there is a paper describing the Haplotype Caller algorithm, if you could please send me the reference. I have tried to find it, but I only found the paper on GATK which is great, but it doesn't describe in detail the Haplotype Caller algorithm.

thank you,

Created 2014-11-17 11:59:46 | Updated | Tags: haplotypecaller genotypegvcfs combinegvcfs

Hi all,

I am attempting to use the HaplotyperCaller / CombineGVCFs / GenotypeGVCFs to call variants on chrom X and Y of 769 samples (356 males, 413 females) sequenced at 12x coverage (WG sequening, but right not only calling X and Y).

I have called the samples according to the best practises using the HaplotypeCaller, using ploidy = 1 for males on X and Y and ploidy =2 for females on X, e.g.:

INFO 16:28:45,750 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T HaplotypeCaller -L X -ploidy 1 -minPruning 3 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I /target/gpfs2/gcc/groups/gonl/projects/trio-analysis/rawdata_release2/A102.human_g1k_v37.trio_realigned.bam --sample_name A102a -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/A102a.chrX.hc.g.vcf

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T ApplyRecalibration -ts_filter_level 99.0 -mode SNP -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -o 96Exomes.HC.TruSeq.snv.recal.vcf -R ~/References/hg19/hg19.fasta # HARD FILTER (RNASeq) java -Djava.io.tmpdir=$LSCRATCH -Xmx2g -jar /home/apps/Logiciels/GATK/3.1-1/GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R ~/References/hg19/hg19.fasta -V 96RNAseq.STAR.q1.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o 96RNAseq.STAR.q1.FS30.QD2.vcf

Here are some examples for RNAseq :

chr1 6711349 . G A 79.10 PASS BaseQRankSum=-1.369e+00;ClippingRankSum=1.00;DP=635;FS=1.871;MLEAC=1;MLEAF=5.495e-03;MQ=60.00;MQ0=0;MQRankSum=-4.560e-01;ReadPosRankSum=-1.187e+00 GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,280 ./.:0,0:0 0/0:9,0:9:21:0,21,248 0/0:7,0:7:21:0,21,196 0/0:7,0:7:21:0,21,226 0/0:8,0:8:21:0,21,227 0/0:8,0:8:21:0,21,253 0/0:7,0:7:21:0,21,218 0/0:9,0:9:27:0,27,282 1/1:0,0:5:15:137,15,0 0/0:2,0:2:6:0,6,47 0/0:28,0:28:78:0,78,860 0/0:7,0:7:21:0,21,252 0/0:2,0:2:6:0,6,49 0/0:5,0:5:12:0,12,152 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,126 0/0:9,0:9:21:0,21,315 0/0:7,0:7:21:0,21,256 0/0:7,0:7:21:0,21,160 0/0:8,0:8:21:0,21,298 0/0:20,0:20:60:0,60,605 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,67 0/0:2,0:2:6:0,6,71 0/0:14,0:14:20:0,20,390 0/0:7,0:7:21:0,21,223 0/0:7,0:7:21:0,21,221 0/0:4,0:4:12:0,12,134 0/0:2,0:2:6:0,6,54 ./.:0,0:0 0/0:4,0:4:9:0,9,118 0/0:8,0:8:21:0,21,243 0/0:6,0:6:15:0,15,143 0/0:8,0:8:21:0,21,244 0/0:7,0:7:21:0,21,192 0/0:2,0:2:6:0,6,54 0/0:13,0:13:27:0,27,359 0/0:8,0:8:21:0,21,245 0/0:7,0:7:21:0,21,218 0/0:12,0:12:36:0,36,354 0/0:8,0:8:21:0,21,315 0/0:7,0:7:21:0,21,215 0/0:2,0:2:6:0,6,49 0/0:10,0:10:24:0,24,301 0/0:7,0:7:21:0,21,208 0/0:7,0:7:21:0,21,199 0/0:2,0:2:6:0,6,47 0/0:3,0:3:9:0,9,87 0/0:2,0:2:6:0,6,73 0/0:7,0:7:21:0,21,210 0/0:8,0:8:22:0,22,268 0/0:7,0:7:21:0,21,184 0/0:7,0:7:21:0,21,213 0/0:5,0:5:9:0,9,135 0/0:7,0:7:21:0,21,200 0/0:4,0:4:12:0,12,118 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,217 0/0:8,0:8:21:0,21,255 0/0:9,0:9:24:0,24,314 0/0:8,0:8:21:0,21,221 0/0:9,0:9:24:0,24,276 0/0:9,0:9:21:0,21,285 0/0:3,0:3:6:0,6,90 0/0:2,0:2:6:0,6,57 0/0:13,0:13:20:0,20,385 0/0:2,0:2:6:0,6,48 0/0:11,0:11:27:0,27,317 0/0:8,0:8:21:0,21,315 0/0:9,0:9:24:0,24,284 0/0:7,0:7:21:0,21,228 0/0:14,0:14:33:0,33,446 0/0:2,0:2:6:0,6,64 0/0:2,0:2:6:0,6,72 0/0:7,0:7:21:0,21,258 0/0:10,0:10:27:0,27,348 0/0:7,0:7:21:0,21,219 0/0:9,0:9:21:0,21,289 0/0:20,0:20:57:0,57,855 0/0:4,0:4:12:0,12,146 0/0:7,0:7:21:0,21,205 0/0:12,0:14:36:0,36,1030 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,60 0/0:7,0:7:21:0,21,226 0/0:7,0:7:21:0,21,229 0/0:8,0:8:21:0,21,265 0/0:4,0:4:6:0,6,90 ./.:0,0:0 0/0:7,0:7:21:0,21,229 0/0:2,0:2:6:0,6,59 0/0:2,0:2:6:0,6,56 chr1 7992047 . T C 45.83 SnpCluster BaseQRankSum=1.03;ClippingRankSum=0.00;DP=98;FS=0.000;MLEAC=1;MLEAF=0.014;MQ=60.00;MQ0=0;MQRankSum=-1.026e+00;ReadPosRankSum=-1.026e+00 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,70 0/0:2,0:2:6:0,6,45 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 ./.:1,0:1 ./.:0,0:0 0/0:2,0:2:6:0,6,55 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,61 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,69 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,37 ./.:0,0:0 0/0:2,0:2:6:0,6,67 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:11,0:11:24:0,24,360 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 0/0:2,0:2:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,50 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:4:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:7,0:7:21:0,21,231 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,70 ./.:0,0:0 0/0:6,0:6:15:0,15,148 ./.:0,0:0 ./.:0,0:0 1/1:0,0:2:6:90,6,0 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,74 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,58 0/0:2,0:2:6:0,6,71 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49

For Exome Seq now : chr2 111878571 . C T 93.21 PASS DP=634;FS=0.000;MLEAC=1;MLEAF=5.319e-03;MQ=60.00;MQ0=0;VQSLOD=14.19;culprit=MQ GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,243 0/0:4,0:4:9:0,9,135 0/0:7,0:7:18:0,18,270 0/0:7,0:7:21:0,21,230 0/0:16,0:16:48:0,48,542 0/0:8,0:8:21:0,21,315 0/0:6,0:6:18:0,18,186 0/0:5,0:5:15:0,15,168 0/0:6,0:6:15:0,15,225 0/0:10,0:10:30:0,30,333 0/0:7,0:7:21:0,21,239 0/0:6,0:6:18:0,18,202 0/0:6,0:6:15:0,15,225 0/0:7,0:7:21:0,21,225 0/0:8,0:8:24:0,24,272 0/0:5,0:5:15:0,15,168 1/1:0,0:13:13:147,13,0 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,256 0/0:14,0:14:4:0,4,437 0/0:3,0:3:9:0,9,85 0/0:4,0:4:12:0,12,159 0/0:7,0:7:21:0,21,238 0/0:5,0:5:15:0,15,195 0/0:7,0:7:15:0,15,225 0/0:12,0:12:36:0,36,414 0/0:4,0:4:12:0,12,156 0/0:7,0:7:0:0,0,190 0/0:2,0:2:6:0,6,64 0/0:7,0:7:21:0,21,242 0/0:7,0:7:21:0,21,234 0/0:8,0:8:24:0,24,267 0/0:7,0:7:21:0,21,245 0/0:7,0:7:21:0,21,261 0/0:6,0:6:18:0,18,204 0/0:8,0:8:24:0,24,302 0/0:5,0:5:15:0,15,172 0/0:9,0:9:24:0,24,360 0/0:18,0:18:51:0,51,649 0/0:5,0:5:15:0,15,176 0/0:2,0:2:6:0,6,70 0/0:14,0:14:33:0,33,495 0/0:4,0:4:9:0,9,135 0/0:8,0:8:21:0,21,315 0/0:4,0:4:12:0,12,149 0/0:4,0:4:6:0,6,90 0/0:10,0:10:27:0,27,405 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,133 0/0:14,0:14:6:0,6,431 0/0:4,0:4:12:0,12,151 0/0:5,0:5:15:0,15,163 0/0:3,0:3:9:0,9,106 0/0:7,0:7:21:0,21,237 0/0:7,0:7:21:0,21,268 0/0:8,0:8:21:0,21,315 0/0:2,0:2:6:0,6,68 ./.:0,0:0 0/0:3,0:3:9:0,9,103 0/0:7,0:7:21:0,21,230 0/0:3,0:3:6:0,6,90 0/0:9,0:9:26:0,26,277 0/0:7,0:7:21:0,21,236 0/0:5,0:5:15:0,15,170 ./.:1,0:1 0/0:15,0:15:45:0,45,653 0/0:8,0:8:24:0,24,304 0/0:6,0:6:15:0,15,225 0/0:3,0:3:9:0,9,103 0/0:2,0:2:6:0,6,79 0/0:7,0:7:21:0,21,241 0/0:4,0:4:12:0,12,134 0/0:3,0:3:6:0,6,90 0/0:5,0:5:15:0,15,159 0/0:4,0:4:12:0,12,136 0/0:5,0:5:12:0,12,180 0/0:11,0:11:21:0,21,315 0/0:13,0:13:39:0,39,501 0/0:3,0:3:9:0,9,103 0/0:8,0:8:24:0,24,257 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,280 0/0:4,0:4:12:0,12,144 0/0:4,0:4:9:0,9,135 0/0:8,0:8:24:0,24,298 0/0:4,0:4:12:0,12,129 0/0:5,0:5:15:0,15,184 0/0:2,0:2:6:0,6,62 0/0:2,0:2:6:0,6,65 0/0:9,0:9:27:0,27,337 0/0:7,0:7:21:0,21,230 0/0:7,0:7:21:0,21,239 0/0:5,0:5:0:0,0,113 0/0:11,0:11:33:0,33,369 0/0:7,0:7:21:0,21,248 0/0:10,0:10:30:0,30,395

Created 2014-08-28 02:16:12 | Updated 2014-08-28 02:40:19 | Tags: haplotypecaller minpruning

Hi GATK developer,

I am working on a human WES project with an average sequencing depth 42X per individual. Recent version of GATK (v3.2-2) was used in my study. To investigate which "-minPruning" setting is better for my project, I ran HaplotypeCaller and tested the "-minPruning 3" and "-minPruning 2", so I generate two sets of gvcf files. I found some interesting cases when I compared these two sets of files.

For example in the same individual (we name this individual as A):

Locus "chr1 1654007" from the GVCF file for individual A (-minPruning 3) is listed as below:

chr1 1654007 . A . . END=1654007 GT:DP:GQ:MIN_DP:PL 0/0:64:0:64:0,0,1922

Locus "chr1 1654007" from the GVCF file for individual A (-minPruning 2) is listed as below:

chr1 1654007 rs199558549 A T, 155.90 . DB;DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=27.73;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,6,0:6:18:189,18,0,189,18,189:0,0,4,2

You can see it is quite different for the results in terms of the final genotyping call and the sequencing depth ("DP" tag) generated from "-minPruning 3" and "-minPruning 2" setting. I am wondering what exactly happens for this locus so I check this locus using IGV. I found at this locus, there are 6 reads supporting the ALT allele ("T") and 64 reads supporting the REF allele ("A"). So I believe the result from "-minPruning 3" should be correct.
Just wondering is it normal considering the different setting of "-minPruning 3" and "-minPruning 2"or is it just a bug? If it is just because of the different setting of "-minPruning 3" and "-minPruning 2", I would say the setting of "-minPruning" is critical for the final results so a guideline of how to set the "-minPruning" based on the sequencing depth will be important. How do you think?

Thanks,

Qiongyi

Created 2014-06-27 09:45:39 | Updated 2014-06-27 09:46:05 | Tags: depthofcoverage haplotypecaller

Hi we've been looking at results of a recent run of GATK-HC (3.1-1) using the new N1 pipeline and we've been seeing something odd in the distribution of the Depth of Coverage (Using DP from the Genotype Fields) we're seeing for the raw unfiltered variants.

All our samples are sequenced using PCR-Free libraries and have two lanes of sequence (~24x mapped depth) and looking at depth of coverage from bedtools we see a nice clean distribution (red in graph) but when we look at the data from the HaplotypeCaller sites (Blue in graph) we see a bimodal distribution with an excess of variants called at lower coverage (~16x) vs the most common coverage of around 24x. We've seen this in all the samples we've looked at so far, so it's not just a one off.

I've had a quick look at read depth from another variant caller (Platypus) and there we see no evidence of this bimodal distribution in the variants it has called.

Is this expected behaviour? If so why does it occur? If not any idea what is going on here, is it a bug in the variant caller or the depth statistics? Do you see the same thing in other datasets?

Thanks!

Created 2013-10-16 12:50:25 | Updated | Tags: unifiedgenotyper haplotypecaller dp

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8


Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH


Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" &&$4 > (41245466-100) && $4 < (41245466+100)) print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0


With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Sam

Created 2013-08-16 21:12:22 | Updated | Tags: haplotypecaller

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.

Thanks,

Elise

Created 2013-05-30 02:04:50 | Updated | Tags: haplotypecaller

Hi I have been running HaplotypeCaller on >700 monkey alignments and came across this error in some intervals:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IllegalStateException: Mismatch between the reference haplotype and the reference assembly graph path. for graph BaseGraph{kmerSize=10} graph = GGAATAACTCCAGGCAACCA
GTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTTCCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTT
CCCACAGGCACAGCCC haplotype = CCAGGCAACCAGTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTT
CCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTTCCCACAGGCACAGCCC
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:665) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:661)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version nightly-2013-05-17-g2c8b717):
##### ERROR
##### ERROR Please check the documentation guide to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: Mismatch between the reference haplotype and the reference assembly graph path. for graph BaseGraph{kmerSize=10} graph = GGAATAACTCCAGGCAACCAGTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTTCCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTTCCCACAGGCACAGCCC haplotype = CCAGGCAACCAGTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTTCCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTTCCCACAGGCACAGCCC
##### ERROR ------------------------------------------------------------------------------------------


My commandline looks like (omitting long list of bam files):

java -Xms6000m -Xmx8000m -XX:PermSize=1500m -XX:MaxPermSize=2000m -jar gatk2Jar/GenomeAnalysisTK.jar --reference_sequence reference/3280_vervet_ref_6.0.3.fasta -T HaplotypeCaller --unsafe --validation_strictness SILENT --read_filter BadCigar --num_threads 1 -L:bed folder/Scaffold84_line_1064463_1069462_bed.tsv --out NewCaller/Scaffold84_1064463_1069462.orig.vcf --heterozygosity 0.01 --minPruning 2 --downsample_to_coverage 40 --downsampling_type BY_SAMPLE -I ...

Created 2013-02-07 11:05:27 | Updated 2013-02-08 20:22:14 | Tags: haplotypecaller ploidy

I'm trying to run HaplotypeCaller on a haploid organism. Is this possible? What argument should I use for this? My first attempt produced a diploid calls.

Sorry for the silly question