Tagged with #genotypegvcfs
1 documentation article | 4 announcements | 79 forum discussions



Created 2014-04-03 20:20:08 | Updated 2014-10-22 19:22:34 | Tags: haplotypecaller genotypegvcfs combinegvcfs gvcf joint-analysis
Comments (48)

Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta

Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)

which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000117    .   C   T,<NON_REF> 612.77  .   BaseQRankSum=0.000;ClippingRankSum=-0.411;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.39;MQ0=0;MQRankSum=-2.172;ReadPosRankSum=-0.235   GT:AD:DP:GQ:PL:SB   0/1:17,21,0:38:99:641,0,456,691,519,1210:6,11,11,10
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000211    .   C   T,<NON_REF> 638.77  .   BaseQRankSum=0.894;ClippingRankSum=-1.927;DP=42;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.89;MQ0=0;MQRankSum=-1.750;ReadPosRankSum=1.549    GT:AD:DP:GQ:PL:SB   0/1:20,22,0:42:99:667,0,566,728,632,1360:9,11,12,10
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000694    .   G   A,<NON_REF> 961.77  .   BaseQRankSum=0.736;ClippingRankSum=-0.009;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=106.92;MQ0=0;MQRankSum=0.482;ReadPosRankSum=1.537 GT:AD:DP:GQ:PL:SB   0/1:21,32,0:53:99:990,0,579,1053,675,1728:9,12,10,22
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001019    .   T   G,<NON_REF> 93.77   .   BaseQRankSum=0.058;ClippingRankSum=-0.347;DP=26;MLEAC=1,0;MLEAF=0.500,0.00;MQ=29.65;MQ0=0;MQRankSum=-0.925;ReadPosRankSum=0.000 GT:AD:DP:GQ:PL:SB   0/1:19,7,0:26:99:122,0,494,179,515,694:12,7,4,3
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).


Created 2015-05-15 04:52:05 | Updated | Tags: haplotypecaller release-notes genotypegvcfs gatk3
Comments (27)

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights to be published soon.

Note that the release is in progress at time of posting -- it may take a couple of hours before the new GATK jar file is updated on the downloads page.


New tool

  • ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

HaplotypeCaller & GenotypeGVCFs

  • Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
  • Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
  • Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
  • Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
  • Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
  • Initialize annotations so that --disableDithering actually works.
  • Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
  • Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
  • Added the RGQ (Reference GenotypeQuality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic ohm-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
  • Removed GenotypeSummaries from default annotations.
  • Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
  • Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
  • Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
  • Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
  • Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
  • Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
  • Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
  • Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).

CombineGVCFs

  • Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
  • Fixed a bug that caused the tool to stop processing after the first contig.
  • Fixed a bug where the wrong REF allele was output to the combined gVCF.

VariantRecalibrator

  • Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
  • VQSR VCF header command line now contains annotations and tranche levels.

SelectVariants

  • Added -trim argument to trim (simplify) alleles to a minimal representation.
  • Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
  • Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
  • Added --keepOriginalDP argument.

VariantAnnotator

  • Improvements to the allele trimming functionalities.
  • Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.

CalculateGenotypePosteriors

  • Fixed user-reported bug featuring "trio" family with two children, one parent.
  • Added error handling for genotypes that are called but have no PLs.

Various tools

  • BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
  • DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
  • SplitSamFile: Fixed a bug that caused a NullPointerException.
  • SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
  • IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
  • CombineVCFs: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
  • VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
  • GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
  • ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
  • CatVariants: Now allows different input / output file types, and spaces in directory names.
  • VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.

Read Filters

  • Added erroneous CIGAR length to criteria for BadCigarFilter.
  • Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
  • Handle X and = CIGAR operators appropriately
  • Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).

Annotations

  • Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
  • Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
  • Changed GC Content value type from Integer to Float.
  • Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
  • Made annotators emit a warning if they can't be applied.

GATK Engine & common features

  • Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
  • Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
  • Handle X and = CIGAR operators appropriately.
  • Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
  • Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now going to stderr.
  • Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
  • Fixed BCF writing when FORMAT annotations contain arrays.

Queue

  • Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
  • Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
  • Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.

Documentation

  • Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

For developers

  • Upgraded SLF4J to allow new convenient logging syntaxes.
  • Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
  • Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
  • Collected VCF IDs and header lines into one place (GATKVCFConstants).
  • Made various changes that lead to reduced build times.

Created 2014-10-23 18:53:52 | Updated 2015-05-12 17:24:14 | Tags: Troll haplotypecaller ploidy release-notes genotypegvcfs gatk3 genotyperefinement
Comments (2)

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.


Haplotype Caller

  • Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
  • Physical phasing information is output by default in new sample-level PID and PGT tags.
  • Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
  • Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
  • Fixed IndexOutOfBounds error associated with tail merging.

Variant Recalibrator

  • New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.

GenotypeGVCFs

  • Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
  • Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
  • Changed the default GVCF GQ Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution.
  • Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.

RNAseq analysis

  • Bug fixes for working with unmapped reads.

CalculateGenotypePosteriors

  • New annotation for low- and high-confidence possible de novos (only annotates biallelics).
  • FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
  • Restricted population priors based on discovered allele count to be valid for 10 or more samples.

DepthOfCoverage

  • Fixed rare bug triggered by hash collision between sample names.

SelectVariants

  • Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.

PrintReads

  • Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
  • The performance penalty associated with filtering by read group has been essentially eliminated.

Annotations

  • StrandOddsRatio is now a standard annotation that is output by default.
  • We used to output zero for FS if there was no data available at a site, now we omit FS.
  • Extensive rewrite of the annotation documentation.

Queue

  • Fixed Queue bug with bad localhost addresses.
  • Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
  • Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton)

Documentation

  • We now generate PHP files instead of HTML.
  • We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.

Miscellaneous

  • Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
  • htsjdk updated to version 1.120.1620

Created 2014-07-15 03:54:06 | Updated 2014-10-23 17:58:36 | Tags: variantrecalibrator haplotypecaller selectvariants variantannotator release-notes catvariants genotypegvcfs gatk-3-2
Comments (13)

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.


We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.


Haplotype Caller

  • Various improvements were made to the assembly engine and likelihood calculation, which leads to more accurate genotype likelihoods (and hence better genotypes).
  • Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
  • The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
  • Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
  • Fixed small bug where indel discovery was inconsistent based on the active-region size.
  • Removed scary warning messages for "VectorPairHMM".
  • Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
  • When we subset PLs because alleles are removed during genotyping we now also subset the AD.
  • Fixed bug where reference sample depth was dropped in the DP annotation.

Variant Recalibrator

  • The -mode argument is now required.
  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

AnalyzeCovariates

  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

Variant Annotator

  • SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

Genotype GVCFs

  • Added missing arguments so that now it models more closely what's available in the Haplotype Caller.
  • Fixed recurring error about missing PLs.
  • No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
  • --includeNonVariantSites should now be working.

Select Variants

  • The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

Indel Realigner

  • Now throws a user error when it encounters reads with I operators greater than the number of read bases.
  • Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.

CalculateGenotypePosteriors

  • Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
  • Now computes genotype posteriors using likelihoods from all members of the trio.
  • Added annotations for calling potential de novo mutations.
  • Now uses PP tag instead of GP tag because posteriors are Phred-scaled.

Cat Variants

  • Can now process .list files with -V.
  • Can now handle BCF and Block-Compressed VCF files.

Validate Variants

  • Now works with gVCF files.
  • By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

FastaAlternateReferenceMaker

  • Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.

Miscellaneous

  • Refactored maven directories and java packages replacing "sting" with "gatk".
  • Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
  • Added a new read transformer that refactors NDN cigar elements to one N element.
  • Now a Tabix index is created for block-compressed output formats.
  • Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
  • Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
  • We now check that output files are specified in a writeable location.
  • We now allow blank lines in a (non-BAM) list file.
  • Added legibility improvements to the Progress Meter.
  • Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
  • Made IntervalSharder respect the IntervalMergingRule specified on the command line.
  • Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.

Created 2014-03-04 06:37:01 | Updated 2014-03-17 22:48:07 | Tags: readbackedphasing reducereads release-notes genotypegvcfs combinegvcfs splitncigarreads
Comments (2)

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.


SplitNCigarReads

  • This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

Haplotype Caller

  • Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
  • Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
  • Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
  • Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
  • Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.

CombineGVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at all at once to GenotypeGVCFs.

GenotypeGVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.

SimulateReadsForVariants

  • This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

Unified Genotyper

  • Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

Variant Recalibrator

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Reduce Reads

  • Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

Variant Annotator

  • Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
  • Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
  • In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

Combine Variants

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Select Variants

  • Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
  • Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.

CalculateGenotypePosteriors

  • Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
  • It no longer strips out the AD annotation from genotypes.
  • AC/AF/AN counts are updated after fixing genotypes.
  • Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

Indel Realigner

  • Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

Read Backed Phasing

  • Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.

Miscellaneous

  • Thanks to Phillip Dexheimer for several Queue related fixes and patches.
  • Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
  • Providing an empty BAM list no results in a user error.
  • Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
  • Logger output now goes to STDERR instead of STDOUT.
  • Picard, Tribble, and Variant jars updated to version 1.107.1683.

Created 2015-08-18 08:28:40 | Updated 2015-08-18 08:30:16 | Tags: genotypegvcfs
Comments (4)

Hi,

I tried to use GenotypeGVCFs to perform joint genotyping with 5 samples. I used -V mygvcfs.list to submit multiple gvcfs, but I got the following error message: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types.

So, I run again with -V:VCF mygvcfs.list. I got another error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file

I checked back my gvcfs, the #CHROM header line is present. I also tried to run GeotypeGVCFs by typing the input gvcfs one by one on command line, and it works well.

Any suggestion on why this happened and how can I submit the input gvcfs in list?

I am using GATK version: 3.3.0

Thanks.


Created 2015-08-12 15:42:52 | Updated | Tags: genotypegvcfs
Comments (2)

I wanted to produce the cohort vcf file from several g.vcf files made using HaplotypeCaller. I used the 'GenotypeGVCFs' and gave a list of g.vcf files. The order of samples in the final vcf file is different from the gvcf file list I provided. It looks like it sorted alphabetically with the sample name. How can I get samples in the same order as in the gvcf file list? I am using GATK 3.3.


Created 2015-08-07 22:07:06 | Updated | Tags: combinevariants genotypegvcfs reference-genome
Comments (5)

Dear GATK team,

I have a set of samples for which I have WES and WGS data. I have run the same pipeline (mostly following GATK's Best Proctices) on both datasets with the same reference sets and target sites down to VQSR. So now I have to VCFs ready to analyze. My idea is to merge these datasets and I believe my situation parallels to the third case described in this document gatkforums.broadinstitute.org/discussion/53/combining-variants-from-different-files-into-one.

Therefore, I used -T CombineVariants to merge these two files into one. Up until here everything runs ok and I obtain a working final VCF. However when I look deeper into it, there are some variants that have been kept duplicated but present different reference allele. When I look back at the separate VCF files, these differences are already there. My question is, how two sets of files that have been processed exactly in the same way, can present different reference sites? I paste couple of examples for clarification:

chr10 100167436 . T C 177.06 PASS AC=1;AF=0.011;AN=94; set=WGS chr10 100167436 . TGTCACCAGGGGTCACCAGGGATGAGGACC CGTCACCAGGGGTCACCAGGGATGAGGACC,T 56690.81 PASS AC=50,2;AF=0.024,9.533e-04;AN=2098; set=WES

chr10 101462504 . CT CTT,C 1539.20 PASS AC=34,48;AF=0.014,0.020;AN=2348; set=WES chr10 101462504 . C T,CT . PASS AC=0,0;AF=0.00,0.00;AN=114; set=WGS

chr10 102295645 . G GT 627.02 PASS AC=7;AF=0.061;AN=114; set=WGS chr10 102295645 . GT GTT,G,TT 63780.89 PASS AC=203,403,17;AF=0.088,0.174,7.328e-03;AN=2320; set=WES

Any thoughts on this behaviour? and possible solutions?

Thanks in advance

Victoria


Created 2015-08-04 23:35:08 | Updated 2015-08-04 23:36:54 | Tags: haplotypecaller vcf genotypegvcfs
Comments (1)

It appears that there are some cases in which an upstream site is heterozygous for a deletion, and a downstream site should be heterozygous for something like 1/2 (where 2 is the * allele), but GATK is still making erroneous 0/1 calls. For instance, in my dataset I have a deletion at position 773 (I'm leaving out some fields and GT info for additional samples for brevity).

scaffold_1 773 . CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA C ... 0/1:102,70:172:99:0|1:773_CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA_C:283 9,0,4099

And just 2 bases downstream, there's a SNP for the same sample that has non-* calls on both strands, which seems impossible. Shouldn't there be a * allele here, e.g (ALT==G,* ; GT==1/2)? Here's the call: scaffold_1 775 . A G ... 0/1:60,134:194:99:4507,0,1642

Could this be a bug, such that when there's a heterozygous ALT call on the strand without the deletion, the half of the GT field for deleted strand is still getting a reference (0) call when it should be getting a * ? At another site overlapping with the same deletion, at which the strand without the deletion for this sample is a REF call , I get something that's more like what I'd expect.

scaffold_1 805 . G T,* ... 0/2:235,0,33:268:99:0|1:773_CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA_C:676,1384,11251,0,9868,9762

This was done with GenotypeGVCFs from GATK 3.4-46.


Created 2015-07-28 14:07:17 | Updated | Tags: genotypegvcfs
Comments (11)

Hi.

I found that in some case, GenotypeGVCFs may output IUB code * (such as G,* )into vcf. What is the meaning of "" and how to remove the "" from ALTs ?

GATK Version 3.4.0-46 Command: java -jar GATK.jar -T GenotypeGVCFs -R ${fullref} --dbsnp ${dbsnp} -o out.vcf -V v1.g.vcf -V v2.g.vcf ... (no -nt option)

the VCF file(for test, we call var on chr22 only.)

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_100N Sample_100T Sample_101N Sample_101T Sample_104_N Sample_104_T Sample_105_N Sample_105_T Sample_106_N Sample_106_T Sample_107_N Sample_107_T Sample_108_N Sample_108_T Sample_109_N Sample_109_T

22 44862615 rs8777 A G,* 29879.7 . AC=18,2;AF=0.563,0.063;AN=32;BaseQRankSum=1.35;ClippingRankSum=0.182;DB;DP=1221;FS=1.288;InbreedingCoeff=-0.6;MLEAC=18,2;MLEAF=0.563,0.063;MQ=60;MQRankSum=0.119;QD=24.71;ReadPosRankSum=-0.194;SOR=0.855 GT:AD:DP:GQ:PGT:PID:PL 0/1:31,46,0:77:99:0|1:44862609_CTG_C:1073,0,1208,1167,1293,2460 0/1:50,48,0:98:99:0|1:44862609_CTG_C:1841,0,1843,1990,1987,3977 1/1:0,80,0:80:99:1|1:44862609_CTG_C:3770,256,0,3770,256,3770 1/1:0,80,0:80:99:1|1:44862609_CTG_C:3553,244,0,3553,244,3553 0/1:41,42,0:83:99:0|1:44862609_CTG_C:1598,0,1571,1722,1700,3422 0/1:38,50,0:88:99:0|1:44862609_CTG_C:1988,0,1404,2102,1561,3662 1/1:0,73,0:73:99:1|1:44862609_CTG_C:2961,226,0,2961,226,2961 1/1:0,60,0:60:99:1|1:44862609_CTG_C:2510,181,0,2510,181,2510 0/2:45,0,40:85:99:.:.:1502,1637,3506,0,1869,1749 0/2:32,0,33:65:99:.:.:1269,1365,2707,0,1341,1242 0/1:31,30,0:61:99:0|1:44862609_CTG_C:1166,0,1207,1260,1297,2557 0/1:26,35,0:61:99:0|1:44862609_CTG_C:1392,0,951,1470,1057,2526 0/1:45,44,0:89:99:0|1:44862609_CTG_C:1712,0,1744,1848,1877,3724 0/1:45,53,0:98:99:0|1:44862609_CTG_C:2127,0,1719,2263,1881,4144 0/1:25,19,0:44:99:0|1:44862609_CTG_C:722,0,945,797,1005,1802 0/1:46,21,0:67:99:0|1:44862609_CTG_C:767,0,1858,906,1924,2830


Created 2015-07-22 11:46:00 | Updated | Tags: combinevariants haplotypecaller best-practices vcf gatk genotypegvcfs combinegvcfs gvcf
Comments (5)

I was trying to do combine sets of vcf files for all my samples so that I have one single vcf output using this command option below java -d64 -Xmx48g -jar ${GATK}/GenomeAnalysisTK.jar \ -R ${REF} \ -T GenotypeGVCFs \ --variant A.g.vcf \ --variant B.g.vcf \ --variant C.g.vcf \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -o genotype.vcf

but I got this error message “The following invalid GT allele index was encountered in the file: END=21994810”. I have tried to locate where the problem could be coming from but I do not understand this. Could you please advise me.


Created 2015-07-17 22:01:12 | Updated 2015-07-17 22:14:12 | Tags: haplotypecaller ploidy haploid genotypegvcfs combinegvcfs chromosome-x
Comments (14)

Hi,

I'm attempting to run GenotypeGVCFs on a cohort of ~4200 human male samples with targeted sequencing. I'm following the current DNA-Seq guidelines for cohort genotyping, with GATK v3.4-0. For each sample, I ran HaplotypeCaller separately for diploid and haploid (i.e., chrX non-PAR) regions, specifying --ploidy 1 for the haploid regions, then combined the resulting two GVCFs with CombineGVCFs. I then combined the per-sample GVCFs into groups of 64 samples using CombineGVCFs. Finally, I ran GenotypeGVCFs with all samples separately for groups of ~100 small target intervals (baits). Every group of target intervals ran fine without error in about 4 hours with ~5GB of RAM, except for the non-PAR chrX regions, which were haploid for all samples.

For the haploid regions, GATK hangs on the very first base, slowly increasing memory usage, then eventually runs out of memory and exits. The estimated runtime keeps increasing without making any progress. The last run exited after 12 hours without making any progress. This happens no matter how much memory I specify (up to 128 GB).

Interestingly, a PAR region of chromosome X run with --ploidy 2 in HaplotypeCaller worked with no problem.

The inputted GVCF files to GenotypeGVCFs are uncompressed and were indexed by CombineGVCFs.

I'm using default settings for GenotypeGVCFs, except for the following:

--standard_min_confidence_threshold_for_calling 20 --standard_min_confidence_threshold_for_emitting 10 --dbsnp dbSNP142.20150416.GRCh37.for-GATK.chr1-MT.vcf.gz

I tried running GenotypeGVCFs with the latest v3.4-46 release, but the same problem occurred.

Below is example output:

INFO 10:57:51,803 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:57:51,810 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 INFO 10:57:51,811 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 10:57:51,811 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 10:57:51,818 HelpFormatter - Program Args: -T GenotypeGVCFs -R /tmp/12715944.hpc-pbs.hpcc.usc.edu/hs37m.fa --dbsnp dbSNP142.20150416.GRCh37.for-GATK.chr1-MT.vcf.gz --standard_min_confidence_threshold_for_calling 20 --standard_min_confidence_threshold_for_emitting 10 [LONG LIST OF VARIANT FILES OMITTED] --out gatk.hc.combined.genotyped.chunk117.vcf.gz -L split_117.intervals --log_to_file gatk.hc.combined.genotyped.chunk117.log INFO 10:57:51,824 HelpFormatter - Executing as cedlund@hpc1130.m10g.hpcc.usc.edu on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 10:57:51,825 HelpFormatter - Date/Time: 2015/07/17 10:57:51 INFO 10:57:51,826 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:57:51,826 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:57:56,331 GenomeAnalysisEngine - Strictness is SILENT INFO 10:57:56,671 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 10:59:03,550 IntervalUtils - Processing 154370 bp from intervals WARN 10:59:03,615 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation INFO 10:59:03,766 GenomeAnalysisEngine - Preparing for traversal INFO 10:59:03,768 GenomeAnalysisEngine - Done preparing for traversal INFO 10:59:03,768 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:59:03,769 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:59:03,770 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 10:59:04,283 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 10:59:57,328 ProgressMeter - X:51033766 216.0 53.0 s 68.9 h 0.2% 7.2 h 7.2 h INFO 11:00:27,330 ProgressMeter - X:51035366 216.0 83.0 s 4.5 d 1.2% 111.5 m 110.1 m INFO 11:00:57,353 ProgressMeter - X:51035366 216.0 113.0 s 6.1 d 1.2% 2.5 h 2.5 h INFO 11:01:27,354 ProgressMeter - X:51035366 216.0 2.4 m 7.7 d 1.2% 3.2 h 3.2 h INFO 11:01:57,356 ProgressMeter - X:51035366 216.0 2.9 m 9.3 d 1.2% 3.9 h 3.8 h INFO 11:02:27,358 ProgressMeter - X:51035366 216.0 3.4 m 10.9 d 1.2% 4.5 h 4.5 h INFO 11:02:57,882 ProgressMeter - X:51035366 216.0 3.9 m 12.5 d 1.2% 5.2 h 5.2 h INFO 11:03:27,884 ProgressMeter - X:51035366 216.0 4.4 m 14.2 d 1.2% 5.9 h 5.8 h INFO 11:03:57,885 ProgressMeter - X:51035366 216.0 4.9 m 15.8 d 1.2% 6.6 h 6.5 h INFO 11:04:27,887 ProgressMeter - X:51035366 216.0 5.4 m 17.4 d 1.2% 7.3 h 7.2 h INFO 11:04:58,976 ProgressMeter - X:51035366 216.0 5.9 m 19.0 d 1.2% 7.9 h 7.8 h INFO 11:05:28,977 ProgressMeter - X:51035366 216.0 6.4 m 2.9 w 1.2% 8.6 h 8.5 h INFO 11:05:58,979 ProgressMeter - X:51035366 216.0 6.9 m 3.2 w 1.2% 9.3 h 9.2 h INFO 11:06:28,981 ProgressMeter - X:51035366 216.0 7.4 m 3.4 w 1.2% 10.0 h 9.8 h INFO 11:06:58,982 ProgressMeter - X:51035366 216.0 7.9 m 3.6 w 1.2% 10.6 h 10.5 h INFO 11:07:28,984 ProgressMeter - X:51035366 216.0 8.4 m 3.9 w 1.2% 11.3 h 11.2 h INFO 11:07:58,986 ProgressMeter - X:51035366 216.0 8.9 m 4.1 w 1.2% 12.0 h 11.8 h INFO 11:08:30,497 ProgressMeter - X:51035366 216.0 9.4 m 4.3 w 1.2% 12.7 h 12.5 h INFO 11:09:30,568 ProgressMeter - X:51035366 216.0 10.4 m 4.8 w 1.2% 14.0 h 13.8 h INFO 11:10:32,779 ProgressMeter - X:51035366 216.0 11.5 m 5.3 w 1.2% 15.4 h 15.2 h INFO 11:11:33,479 ProgressMeter - X:51035366 216.0 12.5 m 5.7 w 1.2% 16.8 h 16.6 h INFO 11:12:35,360 ProgressMeter - X:51035366 216.0 13.5 m 6.2 w 1.2% 18.2 h 17.9 h INFO 11:13:35,445 ProgressMeter - X:51035366 216.0 14.5 m 6.7 w 1.2% 19.5 h 19.3 h INFO 11:14:39,689 ProgressMeter - X:51035366 216.0 15.6 m 7.2 w 1.2% 20.9 h 20.7 h INFO 11:15:40,505 ProgressMeter - X:51035366 216.0 16.6 m 7.6 w 1.2% 22.3 h 22.0 h INFO 11:16:41,140 ProgressMeter - X:51035366 216.0 17.6 m 8.1 w 1.2% 23.7 h 23.4 h INFO 11:17:41,956 ProgressMeter - X:51035366 216.0 18.6 m 8.6 w 1.2% 25.0 h 24.7 h INFO 11:18:41,958 ProgressMeter - X:51035366 216.0 19.6 m 9.0 w 1.2% 26.4 h 26.0 h INFO 11:19:44,493 ProgressMeter - X:51035366 216.0 20.7 m 9.5 w 1.2% 27.8 h 27.4 h INFO 11:20:49,749 ProgressMeter - X:51035366 216.0 21.8 m 10.0 w 1.2% 29.2 h 28.8 h INFO 11:21:53,414 ProgressMeter - X:51035366 216.0 22.8 m 10.5 w 1.2% 30.6 h 30.3 h INFO 11:22:58,174 ProgressMeter - X:51035366 216.0 23.9 m 11.0 w 1.2% 32.1 h 31.7 h INFO 11:24:01,211 ProgressMeter - X:51035366 216.0 25.0 m 11.5 w 1.2% 33.5 h 33.1 h INFO 11:25:05,051 ProgressMeter - X:51035366 216.0 26.0 m 12.0 w 1.2% 34.9 h 34.5 h INFO 11:26:07,782 ProgressMeter - X:51035366 216.0 27.1 m 12.4 w 1.2% 36.3 h 35.9 h INFO 11:27:10,933 ProgressMeter - X:51035366 216.0 28.1 m 12.9 w 1.2% 37.8 h 37.3 h INFO 11:28:20,854 ProgressMeter - X:51035366 216.0 29.3 m 13.5 w 1.2% 39.3 h 38.8 h INFO 11:29:28,165 ProgressMeter - X:51035366 216.0 30.4 m 14.0 w 1.2% 40.8 h 40.3 h INFO 11:30:28,575 ProgressMeter - X:51035366 216.0 31.4 m 14.4 w 1.2% 42.2 h 41.6 h INFO 11:31:36,673 ProgressMeter - X:51035366 216.0 32.5 m 14.9 w 1.2% 43.7 h 43.1 h INFO 11:32:45,497 ProgressMeter - X:51035366 216.0 33.7 m 15.5 w 1.2% 45.2 h 44.7 h INFO 11:33:49,205 ProgressMeter - X:51035366 216.0 34.8 m 16.0 w 1.2% 46.7 h 46.1 h INFO 11:34:49,226 ProgressMeter - X:51035366 216.0 35.8 m 16.4 w 1.2% 48.0 h 47.4 h INFO 11:35:54,571 ProgressMeter - X:51035366 216.0 36.8 m 16.9 w 1.2% 49.5 h 48.8 h INFO 11:36:59,402 ProgressMeter - X:51035366 216.0 37.9 m 17.4 w 1.2% 50.9 h 50.3 h INFO 11:38:03,427 ProgressMeter - X:51035366 216.0 39.0 m 17.9 w 1.2% 52.3 h 51.7 h INFO 11:39:12,036 ProgressMeter - X:51035366 216.0 40.1 m 18.4 w 1.2% 53.9 h 53.2 h INFO 11:40:15,472 ProgressMeter - X:51035366 216.0 41.2 m 18.9 w 1.2% 55.3 h 54.6 h INFO 11:41:22,184 ProgressMeter - X:51035366 216.0 42.3 m 19.4 w 1.2% 56.8 h 56.1 h INFO 11:42:24,992 ProgressMeter - X:51035366 216.0 43.4 m 19.9 w 1.2% 58.2 h 57.5 h INFO 11:43:30,745 ProgressMeter - X:51035366 216.0 44.4 m 20.4 w 1.2% 59.7 h 58.9 h INFO 11:44:41,392 ProgressMeter - X:51035366 216.0 45.6 m 21.0 w 1.2% 61.3 h 60.5 h INFO 11:45:51,136 ProgressMeter - X:51035366 216.0 46.8 m 21.5 w 1.2% 62.8 h 62.0 h INFO 11:46:59,056 ProgressMeter - X:51035366 216.0 47.9 m 22.0 w 1.2% 64.3 h 63.5 h INFO 11:48:09,266 ProgressMeter - X:51035366 216.0 49.1 m 22.5 w 1.2% 65.9 h 65.1 h INFO 11:49:16,701 ProgressMeter - X:51035366 216.0 50.2 m 23.1 w 1.2% 67.4 h 66.6 h INFO 11:50:24,150 ProgressMeter - X:51035366 216.0 51.3 m 23.6 w 1.2% 68.9 h 68.1 h INFO 11:51:31,883 ProgressMeter - X:51035366 216.0 52.5 m 24.1 w 1.2% 70.5 h 69.6 h INFO 11:52:40,234 ProgressMeter - X:51035366 216.0 53.6 m 24.6 w 1.2% 72.0 h 71.1 h INFO 11:53:46,785 ProgressMeter - X:51035366 216.0 54.7 m 25.1 w 1.2% 73.5 h 72.6 h INFO 11:54:53,194 ProgressMeter - X:51035366 216.0 55.8 m 25.6 w 1.2% 75.0 h 74.0 h

Here is an example of the GVCF for 3 samples in one of the problem haploid regions:

X 51035345 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:78:2:0,78 .:9:99:3:0,112 X 51035346 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035347 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035348 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035349 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035350 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035351 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035352 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035353 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035354 . T <NON_REF> . . END=51035355 GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035356 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:41:1:0,41 .:9:99:3:0,112 X 51035357 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:40:1:0,40 .:9:99:3:0,112 X 51035358 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:39:1:0,39 .:9:99:3:0,112 X 51035359 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:38:1:0,38 .:9:99:3:0,112 X 51035360 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:41:1:0,41 .:9:99:3:0,112 X 51035361 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:39:1:0,39 .:9:99:3:0,112 X 51035362 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:38:1:0,38 .:9:99:3:0,112 X 51035363 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:38:1:0,38 .:9:99:3:0,112 X 51035364 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:1:38:1:0,38 .:9:99:3:0,112 X 51035365 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035366 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035367 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035368 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035369 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035370 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035371 . A C,<NON_REF> . . DP=246;MQ=60.00 GT:AD:DP:GQ:MIN_DP:PL:SB .:0,4,0:4:99:.:126,0,126:0,0,1,3 .:.:2:72:2:0,72,72 .:.:9:99:3:0,112,112 X 51035372 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035373 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035374 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:72:2:0,72 .:9:99:3:0,112 X 51035375 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:72:2:0,72 .:9:99:3:0,112 X 51035376 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:54:2:0,54 .:9:99:3:0,112 X 51035377 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:71:2:0,71 .:9:99:3:0,112 X 51035378 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:71:2:0,71 .:9:99:3:0,112 X 51035379 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:81:2:0,81 .:9:99:3:0,112 X 51035380 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:78:2:0,78 .:9:99:3:0,112 X 51035381 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:45:2:0,45 .:9:99:3:0,112 X 51035382 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:45:2:0,45 .:9:99:3:0,112

Any help is greatly appreciated. Please let me know if you need any other information.

Kindest regards, Chris


Created 2015-07-14 10:11:09 | Updated 2015-07-14 10:19:20 | Tags: genotypegvcfs
Comments (2)

The GenotypeGVCFs crash for some reasons I am not clear ! it throws the message below: ... (many lines here) ... INFO 17:25:12,879 ProgressMeter - 1:58001 0.0 12.3 m 1224.3 w 0.0% 7.9 w 7.9 w INFO 17:28:21,843 ProgressMeter - 1:58001 0.0 13.5 m 1339.8 w 0.0% 8.6 w 8.6 w INFO 17:30:43,033 ProgressMeter - 1:58001 0.0 17.8 m 1770.2 w 0.0% 11.4 w 11.4 w

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:189) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:335) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:319) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:278) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:224) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:147) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createNewResource(ReferenceOrderedDataSource.java:226) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createNewResource(ReferenceOrderedDataSource.java:185) at org.broadinstitute.gatk.engine.datasources.rmd.ResourcePool.iterator(ResourcePool.java:84) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.seek(ReferenceOrderedDataSource.java:168) at org.broadinstitute.gatk.engine.datasources.providers.RodLocusView.(RodLocusView.java:82) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.getLocusView(TraverseLociNano.java:129) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:80) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedConstructorAccessor16.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:408) at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:185) ... 18 more Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded at htsjdk.tribble.index.interval.IntervalTreeIndex$ChrIndex.read(IntervalTreeIndex.java:206) at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:364) at htsjdk.tribble.index.interval.IntervalTreeIndex.(IntervalTreeIndex.java:52) ... 22 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: java.lang.reflect.InvocationTargetException
ERROR ------------------------------------------------------------------------------------------

And I restart the program GenotypeGVCFs at the SAME position and SAME Arguments. The starting ProgressMeter is far from the beginning position,like this: ... ... INFO 17:38:37,766 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:38:37,766 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 17:38:37,767 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 17:38:37,888 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 17:39:17,370 ProgressMeter - Starting 0.0 39.0 s 65.5 w 100.0% 39.0 s 0.0 s INFO 17:40:01,886 ProgressMeter - 1:3000101 0.0 84.0 s 139.1 w 0.8% 2.9 h 2.9 h ... ...

Is it a BUG or NOT ???

By the way, after a half hour, The GenotypeGVCFs crash again at another position because of java.lang.reflect.InvocationTargetException ... INFO 17:58:52,940 ProgressMeter - 1:3054001 0.0 20.3 m 2009.2 w 0.8% 41.2 h 40.9 h

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException ...

Could you give me any suggestion? THX


Created 2015-06-19 19:20:58 | Updated | Tags: haplotypecaller rnaseq genotypegvcfs gvcf
Comments (5)

Hello,

I was wondering if there is a way to output all annotations for all sites when running HaplotypeCaller with BP_RESOLUTION. Currently it outputs all annotations for only called variants. Thanks in advance.


Created 2015-06-03 10:22:44 | Updated | Tags: genotypegvcfs combinegvcfs
Comments (1)

Hi,

I would like to be sure of the difference between those 2 tools. From what I understand, GenotypeGVCFs somehow re-calculate likelihood and parameters (QUAL, DP, MQ...) for each variant positions present in at least 1 input sample. Right ? And is it not the case for CombineGVCFs?

Thank you for your help, Fabrice


Created 2015-06-01 15:12:54 | Updated | Tags: genotypegvcfs
Comments (5)

I'm running GenotypeGVCF after calling SNPs and Indels through HaplotypeCaller, and my command is as follows: java -Xmx10g -Djava.io.tmpdir=pwd/tmp -jar ./GATK/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ./hg19/ucsc.hg19.fasta \ --variant ./SF-16/concated.g.vcf \ --variant ./SF-17/concated.g.vcf \ --variant ./SF-18/concated.g.vcf \ --variant ./SF-19/concated.g.vcf \ --variant ./SF-20/concated.g.vcf \ -stand_call_conf 30 \ -stand_emit_conf 10 \ -o ./pedigree_multisample.vcf

and the ERROR information is:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Line 468810703: there aren't enough columns for line chr11 123591281 . A
   .       .       END (we expected 9 tokens, and saw 8 )

##### ERROR ------------------------------------------------------------------------------------------

when I checked the position(chr11 123591281) in the five vcf files, there is no abnormal column numbers.

./SF-16/concated.g.vcf: chr11 123591281 . A . . END=123591282 GT:DP:GQ:MIN_DP:PL 0/0:21:60:21:0,60,900

./SF-17/concated.g.vcf: chr11 123591270 . C . . END=123591292 GT:DP:GQ:MIN_DP:PL 0/0:25:75:25:0,72,787

./SF-18/concated.g.vcf: chr11 123591279 . T . . END=123591285 GT:DP:GQ:MIN_DP:PL 0/0:23:68:22:0,66,747

./SF-19/concated.g.vcf: chr11 123591274 . G . . END=123591281 GT:DP:GQ:MIN_DP:PL 0/0:28:78:27:0,72,895

./SF-20/concated.g.vcf: chr11 123591279 . T . . END=123591287 GT:DP:GQ:MIN_DP:PL 0/0:25:69:24:0,66,990

so I don't know what's wrong with my command or what's wrong with my datas, can you give me some advice ?


Created 2015-05-20 13:06:46 | Updated | Tags: haplotypecaller java genotypegvcfs
Comments (45)

Dear GATK,

I used the HaplotypeCaller with "-dcov 500 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000" to produce 60 gvcf files, that worked fine. However, GenotypeGVCFs gets stuck on a position and runs out of memory after about 24hours, even when I allocate 240Gb. Testing a short region of 60kb does not help. Here was my command line: software/jre1.7.0_25/bin/java -Xmx240g -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Reference.fasta -L chrom14:2240000-2300000 --variant 60samples_gvcf.list -o output.vcf

If I split my list of 60 gvcf files into two lists of 30 samples each, GenotypeGVCFs works fine for both batches within 15 minutes (~10Gb of memory).
I tested with 47 samples, it took 8 hours (31gb of memory) for a 60kb region. Once I use more than ~55 samples, it takes forever and crashes.

Any help will be much appreciated! Thanks,

Antoine


Created 2015-05-07 09:22:47 | Updated 2015-05-07 10:09:41 | Tags: haplotypecaller format genotypegvcfs info-field
Comments (18)

I have a potential bug running GATK GenotypeGVCFs. It complains that there is a DP in the INFO field, but in my haplotypecaller-generated -mg.g.vcf.gz's I do not have a DP in the info, I do have DP in the FORMAT field though, but that's present in the headers as shown below the error output.

Any idea what could be the problem?

INFO  18:30:12,694 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  18:30:12,698 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-geee94ec, Compiled 2015/03/09 14:27:22 
INFO  18:30:12,699 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  18:30:12,699 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  18:30:12,706 HelpFormatter - Program Args: -l INFO -T GenotypeGVCFs -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -o /dev/stdout -ploidy 2 --num_threads 32 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SeqCap/ex100/110624_MM10_exome_L2R_D02_EZ_HX1-ex100.bed --max_alternate_alleles 20 -V:3428_10_14_SRO_185_TGGCTTCA-mg,VCF 3428_10_14_SRO_185_TGGCTTCA-mg.g.vcf.gz -V:3428_11_14_SRO_186_TGGTGGTA-mg,VCF 3428_11_14_SRO_186_TGGTGGTA-mg.g.vcf.gz -V:3428_12_13_SRO_422_TTCACGCA-mg,VCF 3428_12_13_SRO_422_TTCACGCA-mg.g.vcf.gz -V:3428_13_13_SRO_492_AACTCACC-mg,VCF 3428_13_13_SRO_492_AACTCACC-mg.g.vcf.gz -V:3428_14_13_SRO_493_AAGAGATC-mg,VCF 3428_14_13_SRO_493_AAGAGATC-mg.g.vcf.gz -V:3428_15_14_SRO_209_AAGGACAC-mg,VCF 3428_15_14_SRO_209_AAGGACAC-mg.g.vcf.gz -V:3428_16_14_SRO_218_AATCCGTC-mg,VCF 3428_16_14_SRO_218_AATCCGTC-mg.g.vcf.gz -V:3428_17_14_SRO_201_AATGTTGC-mg,VCF 3428_17_14_SRO_201_AATGTTGC-mg.g.vcf.gz -V:3428_18_13_SRO_416_ACACGACC-mg,VCF 3428_18_13_SRO_416_ACACGACC-mg.g.vcf.gz -V:3428_19_14_SRO_66_ACAGATTC-mg,VCF 3428_19_14_SRO_66_ACAGATTC-mg.g.vcf.gz -V:3428_1_13_SRO_388_GTCGTAGA-mg,VCF 3428_1_13_SRO_388_GTCGTAGA-mg.g.vcf.gz -V:3428_20_14_SRO_68_AGATGTAC-mg,VCF 3428_20_14_SRO_68_AGATGTAC-mg.g.vcf.gz -V:3428_21_14_SRO_210_AGCACCTC-mg,VCF 3428_21_14_SRO_210_AGCACCTC-mg.g.vcf.gz -V:3428_22_14_SRO_256_AGCCATGC-mg,VCF 3428_22_14_SRO_256_AGCCATGC-mg.g.vcf.gz -V:3428_23_14_SRO_270_AGGCTAAC-mg,VCF 3428_23_14_SRO_270_AGGCTAAC-mg.g.vcf.gz -V:3428_24_13_SRO_452_ATAGCGAC-mg,VCF 3428_24_13_SRO_452_ATAGCGAC-mg.g.vcf.gz -V:3428_2_13_SRO_399_GTCTGTCA-mg,VCF 3428_2_13_SRO_399_GTCTGTCA-mg.g.vcf.gz -V:3428_3_13_SRO_461_GTGTTCTA-mg,VCF 3428_3_13_SRO_461_GTGTTCTA-mg.g.vcf.gz -V:3428_4_13_SRO_462_TAGGATGA-mg,VCF 3428_4_13_SRO_462_TAGGATGA-mg.g.vcf.gz -V:3428_5_13_SRO_465_TATCAGCA-mg,VCF 3428_5_13_SRO_465_TATCAGCA-mg.g.vcf.gz -V:3428_6_13_SRO_402_TCCGTCTA-mg,VCF 3428_6_13_SRO_402_TCCGTCTA-mg.g.vcf.gz -V:3428_7_13_SRO_474_TCTTCACA-mg,VCF 3428_7_13_SRO_474_TCTTCACA-mg.g.vcf.gz -V:3428_8_13_SRO_531_TGAAGAGA-mg,VCF 3428_8_13_SRO_531_TGAAGAGA-mg.g.vcf.gz -V:3428_9_14_SRO_166_TGGAACAA-mg,VCF 3428_9_14_SRO_166_TGGAACAA-mg.g.vcf.gz 
INFO  18:30:12,714 HelpFormatter - Executing as roel@utonium.nki.nl on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13. 
INFO  18:30:12,714 HelpFormatter - Date/Time: 2015/05/06 18:30:12 
INFO  18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  18:30:15,963 GenomeAnalysisEngine - Strictness is SILENT 
INFO  18:30:16,109 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  18:30:29,705 IntervalUtils - Processing 101539431 bp from intervals 
WARN  18:30:29,726 IndexDictionaryUtils - Track 3428_10_14_SRO_185_TGGCTTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_11_14_SRO_186_TGGTGGTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_12_13_SRO_422_TTCACGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_13_13_SRO_492_AACTCACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_14_13_SRO_493_AAGAGATC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_15_14_SRO_209_AAGGACAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_16_14_SRO_218_AATCCGTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_17_14_SRO_201_AATGTTGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_18_13_SRO_416_ACACGACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_19_14_SRO_66_ACAGATTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_1_13_SRO_388_GTCGTAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_20_14_SRO_68_AGATGTAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_21_14_SRO_210_AGCACCTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_22_14_SRO_256_AGCCATGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_23_14_SRO_270_AGGCTAAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_24_13_SRO_452_ATAGCGAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_2_13_SRO_399_GTCTGTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_3_13_SRO_461_GTGTTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_4_13_SRO_462_TAGGATGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_5_13_SRO_465_TATCAGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_6_13_SRO_402_TCCGTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_7_13_SRO_474_TCTTCACA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_8_13_SRO_531_TGAAGAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,735 IndexDictionaryUtils - Track 3428_9_14_SRO_166_TGGAACAA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
INFO  18:30:29,749 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine 
INFO  18:30:29,878 GenomeAnalysisEngine - Preparing for traversal 
INFO  18:30:29,963 GenomeAnalysisEngine - Done preparing for traversal 
INFO  18:30:29,964 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  18:30:29,965 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  18:30:29,966 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  18:30:30,562 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
INFO  18:31:00,420 ProgressMeter -       1:4845033         0.0    30.0 s      50.3 w        0.0%    46.7 h      46.7 h 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
java.lang.IllegalStateException: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
    at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:176)
    at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115)
    at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:221)
    at org.broadinstitute.gatk.engine.io.storage.VariantContextWriterStorage.add(VariantContextWriterStorage.java:182)
    at org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub.add(VariantContextWriterStub.java:269)
    at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:351)
    at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:119)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:279)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
##### ERROR ------------------------------------------------------------------------------------------

for f in *.g.vcf.gz; do echo -e "\n-- $f --"; zcat "$f" | sed -n -r "/^#.*DP/p;/^1\t4839315\t/{p;q;}"; done

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:21:0,0,432
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:20:0:20:0,0,410
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,773
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:25:2:25:0,3,790
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,837
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:23:31:23:0,31,765
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-0.578;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:21:39:0,39,488,60,491,512:20,0,0,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:18:0:18:0,0,514
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,810
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,812
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:25:0,0,624
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0.08    .       ClippingRankSum=-0.189;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:17:20:20,0,311,62,320,382:14,0,3,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     6.76    .       ClippingRankSum=-0.374;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:25:43:43,0,401,102,417,519:20,0,3,2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-1.095;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:23:1:0,1,395,56,406,460:19,0,0,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:28:0,0,626
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     5.99    .       ClippingRankSum=-0.584;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:18:42:42,0,293,84,305,388:13,1,3,1
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:22:0,0,558
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       GA,<NON_REF>    6.76    .       ClippingRankSum=0.850;MLEAC=1,0;MLEAF=0.500,0.00        GT:DP:GQ:PL:SB  0/1:19:43:43,0,262,87,274,361:12,3,4,0
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     16.82   .       ClippingRankSum=-0.784;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:21:54:54,0,352,102,367,470:16,0,4,1
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:26:0:25:0,0,419
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:30:0:30:0,0,771
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:34:77:34:0,78,1136
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:26:0:20:0,0,397
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GAA     G,GA,<NON_REF>  22.75   .       ClippingRankSum=-2.181;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00        GT:DP:GQ:PL:SB  0/2:11:22:60,22,209,0,87,104,63,153,113,176:4,2,3,0

Created 2015-04-09 18:55:24 | Updated | Tags: allelebalancebysample variantannotator genotypegvcfs strandbiasbysample
Comments (5)

I am struggling with adding AlleleBalanceBySample and StrandBiasBySample to my joint VCF file. The data through joint VCF were generated with GATK 3.2-2, and I have tried using the VariantAnnotator as well as repeating the GenotypeGVCF step specifying the annotations. I have also tried using the GenotypeGVCFs tool from 3.3-0, but am trying to avoid running the HC again in the interest of time. For the strand bias by sample I can see why the VariantAnnotator approach would fail, as the information is not contained in the joint VCF file, but it is there in the GVCF. For the allele balance by sample it seems the info is already in the joint VCF, so either VariantAnnotation or rerunning GenotypeGVCFs seem like they should work.

I should add that when I do try to add these annotations using GenotypeGVCFs, they are noted in the header ##FORMAT section, but not in the actual F:O:R:M:A:T.

Any clues or hints (or admonitions!) much appreciated.

-erikt


Created 2015-03-26 13:23:26 | Updated | Tags: genotypegvcfs
Comments (6)

I'm trying to use genotypeGCFs on 350 small gVCF files (from bacterial genomes) and I'm getting this "clock drift" error:

INFO 13:02:46,006 ProgressMeter - chr1:44001 0.0 3.0 h 15250.3 w 2.0% 6.3 d 6.2 d INFO 13:06:30,421 ProgressMeter - chr1:44001 0.0 3.0 h 15250.3 w 2.0% 6.4 d 6.3 d INFO 14:38:28,521 ProgressMeter - chr1:44001 0.0 3.1 h 15250.3 w 2.0% 6.6 d 6.5 d WARN 16:24:14,924 SimpleTimer - Clock drift of -1,425,286,806,691,525,163 - -1,425,286,778,358,860,637 = 28,332,664,526 nanoseconds detected, vs. max allowable drift of 5,000,000,000. Assuming checkpoint/restart event. WARN 18:23:08,307 SimpleTimer - Clock drift of -1,425,286,778,358,827,793 - -1,425,286,806,691,525,163 = 28,332,697,370 nanoseconds detected, vs. max allowable drift of 5,000,000,000. Assuming checkpoint/restart event. INFO 22:55:24,502 ProgressMeter - chr1:44001 0.0 0.0 s 0.0 s 2.0% 0.0 s 0.0 s INFO 02:46:17,060 ProgressMeter - chr1:44001 0.0 14.1 h 15250.3 w 2.0% 4.3 w 4.2 w

Any ideas what that is?


Created 2015-03-26 08:10:25 | Updated | Tags: genotypegvcfs
Comments (2)

Hi there!

i would like to run the whole pipeline on one sample for testing the pipeline. here are some wrong in the genotyping step.

commands are:

Step9. Call Variants

java -Xmx5g -jar $gatk -T HaplotypeCaller -R $ref_dir/ucsc.hg19.fasta -I $bam_dir/T-SZ-03-1.dedupped.realigned.recal.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o $other_dir/sample.raw.snps.indels.vcf

Joint Genotyping

java -Xmx5g -jar $gatk -T GenotypeGVCFs -R $ref_dir/ucsc.hg19.fasta --variant $other_dir/sample.raw.snps.indels.vcf -o $other_dir/test.raw.snps.indels.g.vcf

Error in GenotypeGVCFs :

INFO 09:25:33,478 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:25:33,480 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 09:25:33,480 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 09:25:33,481 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 09:25:33,484 HelpFormatter - Program Args: -T GenotypeGVCFs -R /media/LAB636/01.06.job/Ref/ucsc.hg19.fasta --variant /media/LAB636/01.06.job/test/data/other/sample.raw.snps.indels.vcf -o /media/LAB636/01.06.job/test/data/other/test.raw.snps.indels.g.vcf INFO 09:25:33,486 HelpFormatter - Executing as starlcx@starlcx-desktop on Linux 2.6.32-38-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13. INFO 09:25:33,486 HelpFormatter - Date/Time: 2015/03/26 09:25:33 INFO 09:25:33,487 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:25:33,487 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:25:33,823 GenomeAnalysisEngine - Strictness is SILENT INFO 09:25:33,893 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 09:25:34,936 GenomeAnalysisEngine - Preparing for traversal INFO 09:25:34,943 GenomeAnalysisEngine - Done preparing for traversal INFO 09:25:34,944 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 09:25:34,944 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 09:25:34,945 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 09:25:35,005 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE:** The list of input alleles must contain as an allele but that is not the case at position 150; please use the Haplotype Caller with gVCF output to generate appropriate records**
ERROR ------------------------------------------------------------------------------------------

there is my confusion: 1. As you know, I run the pipeline on just one sample. Whether do I need to do joint genotyping after call variants with HC ? 2. The documents here mentions : Genotypes any number of gVCF files that were produced by the Haplotype Caller into a single joint VCF file. So is this means the input here should be gVCF files? But the output of HC are vcf file. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php 3.what leads to the error message above?

Thanks! :)


Created 2015-03-21 09:14:04 | Updated | Tags: genotypegvcfs compressed-g-vcf
Comments (15)

Hi Team,
I'm getting `WARN  21:19:30,478 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation` when processing gzipped g.vcf files produced by HaplotypeCaller (via -o foo.g.vcf.gz, as suggested by @Geraldine_VdAuwera in blog post 3893) with GenotypeGVCFs.
This results in dramatic increases in run time (makes sense if GenotypeGVCFs un-compresses the files), and memory requirements (why ??) for GenotypeGVCFs compared to processing the gvcf for same bam files if HC outfiles are unzipped. Most batches that previously completed with 4x8GB RAM now produce `java.lang.OutOfMemoryError: Java heap space` errors even with 4X64GB!

Could you please advise whether this warning is expected behaviour? If yes, what exactly is missing (can't see much difference in unzipped vs gzipped vcf headers), and can this be added somehow?

Created 2015-03-20 19:25:30 | Updated | Tags: genotypegvcfs multi-allelic
Comments (3)

In the output of GenotypeGVCFs, are the ALT alleles in those multi-allelic sites ordered by descending frequencies?


Created 2015-03-19 14:50:22 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (6)

Hello!

I would like to run GenotypeGVCFs on 209 WES, called with HC (--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000).

When I run GenotypeGVCFs, with this command (computing nodes have 8 cores and 24G of memory) :

java -Xmx24g -jar $GATK_JAR \
-R Homo_sapiens.GRCh37_decoy.fa \
-T GenotypeGVCFs \
-nt 8 \
-V gvcf.all.list \
-o calls.vcf

It estimates a huge runtime and just dies hanging:

INFO  10:27:00,790 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:27:00,795 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 
INFO  10:27:00,796 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  10:27:00,796 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  10:27:00,800 HelpFormatter - Program Args: -R Homo_sapiens.GRCh37_decoy.fa -T GenotypeGVCFs -nt 8 -V gvcf.all.list -o calls.vcf 
INFO  10:27:00,810 HelpFormatter - Executing as emixaM@r107-n50 on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-ea-b07. 
INFO  10:27:00,810 HelpFormatter - Date/Time: 2015/03/19 10:27:00 
INFO  10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:27:04,719 GenomeAnalysisEngine - Strictness is SILENT 
INFO  10:27:04,882 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  10:27:41,565 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 8 processors available on this machine 
INFO  10:27:43,169 GenomeAnalysisEngine - Preparing for traversal 
INFO  10:27:43,179 GenomeAnalysisEngine - Done preparing for traversal 
INFO  10:27:43,179 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  10:27:43,180 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  10:27:43,180 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  10:27:44,216 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
INFO  10:28:15,035 ProgressMeter -       1:1000201         0.0    31.0 s      52.7 w        0.0%    27.0 h      27.0 h 
INFO  10:29:17,386 ProgressMeter -       1:1068701         0.0    94.0 s     155.8 w        0.0%    76.7 h      76.6 h 
INFO  10:30:18,055 ProgressMeter -       1:1115101         0.0     2.6 m     256.1 w        0.0%     5.0 d       5.0 d 

What did I do wrong?

Cheers!


Created 2015-03-10 18:27:56 | Updated | Tags: haplotypecaller genotypegvcfs gvcf
Comments (5)

I run the following command for "GenotypeGVCFs" for 3 VCF files output of HaplotypeCaller as below:

java data/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \ -R data/ucsc.hg19.fasta \ -T GenotypeGVCFs \ --variant data/47V_post.ERC.vcf \ --variant data/48V_post.ERC.vcf \ --variant data/49V_post.ERC.vcf \ --out data/Combined_geno_3files.vcf

but in a final VCF output there is no rsID information and all rows are "." what is the problem? I am really confused. Could you please advise how to get SNP-ID in the output VCF

Thanks


Created 2015-03-09 16:48:43 | Updated | Tags: dp gt genotypegvcfs
Comments (7)

Hi,

Sometimes when a sample has no reads, it still can have a genotype. I used GenotypeGVCFs (without CombineGVCFs) to generate a vcf with several samples. If I extract data for one sample, I sometimes see missing genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  ./.:0,0:0:.:.:.:.

which seem ok to me. Note that DP = 0.

but more rarely, I see a genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  0/0:0,0:.:3:.:.:0,3,45
GT:AD:DP:GQ:PGT:PID:PL  1/1:0,0:.:3:1|1:32486973_A_AG:45,3,0

and this is more questionnable. Note that DP = "."

Can you explain how genotype can be called without reads ? (I suspect haplotype to be involved, but without read I do not understand) any idea about the DP value ? why has it a value with missing genotype ?

thank you.


Created 2015-03-02 11:38:32 | Updated | Tags: genotypegvcfs strandallelecountsbysample
Comments (17)

Dear GATK team:

Here is just a quick follow up question on the SAC (StrandAlleleCountsBySample) annotation. I've successfully get the SAC annotation using GenotypeGVCFs with the nighty build, but there is something I notice to be a little wired. This is a line from the final vcf i got from running GenotypeGVCFs. Really glad that the SAC annotation is there, but in this example, for a biallelic site, is it right to have the SAC annotation with 3 sets of counts for "reference/first_allele/second_allele" while there is actually no second alternative allele? Or is it have something to do with the allele from GVCF format. Many thanks!

The example: 1 752894 rs3131971 T C 30876.92 . BaseQRankSum=-7.360e-01;DB;DP=1430;MLEAC=589;MLEAF=0.785;MQ=32.47;MQ0=0;MQRankSum=0.727;ReadPosRankSum=-3.580e-01;SOR=4.437 GT:AD:DP:GQ:PGT:PID:PL:SAC 0/1:2,4:.:54:.:.:118,0,54:0,2,0,4,0,0 ./.:2,0:2 0/1:2,4:.:53:.:.:119,0,53:0,2,0,4,0,0 1/1:0,2:.:6:.:.:55,6,0:0,0,0,2,0,0

BTW, if I have a follow up question on a previous one I've asked, should I just reply to the old question or open a new one like this? Thanks again!


Created 2015-03-02 09:39:51 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (1)

Hi GATK team,

In the documentation of GenotypeGVCFs it is writen: "Input - One or more Haplotype Caller gVCFs to genotype" I have 3 questions regarding this tool: 1. I wonder, what is the meaning of running it on one sample? 2. I tried to run it on one sample, and noticed that the genotype quality is different than the one in the original gvcf file from HC. What is causing this difference? I'm asking this since running the tool on one sample means that there are no other samples to consider in recalculating the quality. 3. Last question - From your experience, what is the best way to analyze one exome sample? Should I run HC with the default genotype_mode parameter and do hard filtering? Should I run HC in GVCF mode, run GenotypeGVCFs and than do hard filtering? Any other suggestion?

Thank you for the answer, Maya


Created 2015-03-01 10:46:23 | Updated | Tags: genotypegvcfs strandbiasbysample strandallelecountsbysample
Comments (3)

Hi there,

Thank you guys for developing & maintaining such a good software! I'm actually just starting to use GATK with Haplotypecaller. For me, there's a very useful annotation "StrandBiasBySample" or "StrandAlleleCountsBySample" that gives all the alignment information for a variant. It can be generated by HC, but after joint genotyping, it's just gone. There is a previous post asking about how to add this annotation to the final vcf, and Sheila mentioned she's already put in a request for adding this. So I'm just wandering, have you guys fixed this in one of the nighty build?

Many Thanks!!!


Created 2015-02-25 10:31:07 | Updated | Tags: genotypegvcfs combinegvcfs
Comments (8)

Hi,

In the documentation for CombineGVCFs it says:

CombineGVCFs is meant to be used for hierarchical merging of gVCFs that will eventually be input into GenotypeGVCFs. One would use this tool when needing to genotype too large a number of individual gVCFs; instead of passing them all in to GenotypeGVCFs, one would first use CombineGVCFs on smaller batches of samples and then pass these combined gVCFs to GenotypeGVCFs.

Do you have any guidelines for this? I am trying to use genotypeGVCFS on 12 gVCF files and it doesn't work, so can you advise how I should I "pre-merge" them? Two batches of 6? Three batches of 3?

Thanks,

Mike


Created 2015-02-19 14:29:26 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (4)

Hi Sheila and Geraldine

When I run HaplotypeCaller (v3.3-0-g37228af) in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF I get the error:

Invalid command line: Argument ERC/gt_mode has a bad value: you cannot request reference confidence output and GENOTYPE_GIVEN_ALLELES at the same time

It is however strange that GENOTYPE_GIVEN_ALLELES is mentioned in the --max_alternate_alleles section of the GenotypeGVCFs documentation.

Maybe I'm missing something?

Thanks, Gerrit


Created 2015-02-04 08:52:18 | Updated 2015-02-04 08:53:33 | Tags: haplotypecaller illumina genotypegvcfs combinegvcfs
Comments (3)

Hi GATK-ers,

I have been given ~2000 gVCFs generated by Illumina (one sample per gVCF). Though they are in standard gVCF format, they were generated by an Illumina pipeline (https://support.basespace.illumina.com/knowledgebase/articles/147078-gvcf-file if you're really curious) and not the Haplotype Caller. As a result (I think ... ), the GATK doesn't want to process them (I have tried CombineGVCFs and GenotypeGVCFs to no avail). Is there a GATK walker or some other tool that will make my gVCFs GATK-friendly? I need to be able to merge this data together to make it analyze-able because in single-sample VCF format it's pretty useless at the moment.

My only other thought has been to expand all the ref blocks of data and then merge everything together, but this seems like it will result in the creation of a massive amount of data.

Any suggestions you may have are greatly appreciated!!!

Sara


Created 2015-02-02 21:24:31 | Updated | Tags: vqsr dbsnp vqslod genotypegvcfs gvcf gq-pl
Comments (18)

From my whole-genome (human) BAM files, I want to obtain: For each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data.

Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs: CHROM POS ID REF ALT chr1 1 . A # my data chr1 1 . A T # dbSNP I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T. or T/T. The call of isn't useful to me for the reason explained below.

How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller.

My detailed reason for needing this is below:

For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2. 3. We do not have sufficient evidence to make any call for position n.

I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not.

Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs with associated GQs and PLs, and GT=./.. However, GenotypeGVCF removes all of these, meaning that whenever the call by HaplotypeCaller was ./. (due to lack of evidence for variation), it isn't carried forward for use in VQSR.

Consequently, this seems to distinguish only between these two cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown).

This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know".

Thanks in advance!

Douglas


Created 2015-01-23 16:55:57 | Updated | Tags: vqsr haplotypecaller bam gatk genotypegvcfs
Comments (2)

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Thanks very much in advance,

Alva


Created 2015-01-13 12:40:29 | Updated 2015-01-13 12:44:57 | Tags: genotypegvcfs
Comments (9)

I am seeing an unexpected heterozygous genotype in the GenotypeGVCFs output. The region is at the edge of a long homopolymer, which presumably plays a role, but it is still not clear where the unexpected heterozygote is coming from. The final VCF is below. The unexpected call is 2:152402516A>T in Sample1. The call is 0/1, but I don't see any evidence for where that alternate allele is coming from. The input VCFs are below. Any suggestions? How is GenotypeGVCFs interpreting this situation? I am running the latest GATK 3.3. Jar.

Final VCF

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1        Sample2
2       152402515       .       T       TA      86.23   .       AC=2;AF=0.500;AN=4;BaseQRankSum=-5.700e-02;ClippingRankSum=0.321;DP=131;FS=0.000;GQ_MEAN=166.50;GQ_STDDEV=44.55;MLEAC=1;MLEAF=0.250;MQ=59.89;MQ0=0;MQRankSum=2.42;NCC=0;QD=1.31;ReadPosRankSum=0.649       GT:AD:DP:GQ:PL  0/1:10,8:34:99:203,0,198        0/1:37,11:56:99:135,0,1429
2       152402516       .       A       T       380.44  .       AC=2;AF=0.500;AN=4;BaseQRankSum=-3.647e+00;ClippingRankSum=-2.597e+00;DP=130;FS=3.866;GQ_MEAN=204.50;GQ_STDDEV=136.47;MLEAC=2;MLEAF=0.500;MQ=59.73;MQ0=0;MQRankSum=0.754;NCC=0;QD=5.76;ReadPosRankSum=1.43 GT:AD:DP:GQ:PL  0/1:10,0:34:99:108,0,157        0/1:43,23:66:99:301,0,1142

Sample1 VCF

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
2       152402511       .       G       <NON_REF>       .       .       END=152402514   GT:DP:GQ:MIN_DP:PL      0/0:38:57:33:0,51,765
2       152402515       .       TA      T,TAA,TAAA,TAAAA,<NON_REF>      0.10    .       BaseQRankSum=-0.057;ClippingRankSum=0.321;DP=55;MLEAC=0,0,1,0,0;MLEAF=0.00,0.00,0.500,0.00,0.00;MQ=59.89;MQ0=0;MQRankSum=-1.002;ReadPosRankSum=-0.646      GT:AD:DP:GQ:PL:SB       0/3:10,8,8,7,1,0:34:21:224,189,625,21,87,219,0,109,141,388,143,293,250,447,708,116,220,156,201,330,273:0,10,0,7
2       152402517       .       A       <NON_REF>       .       .       END=152402517   GT:DP:GQ:MIN_DP:PL      0/0:55:0:55:0,0,1431
2       152402518       .       A       <NON_REF>       .       .       END=152402621   GT:DP:GQ:MIN_DP:PL      0/0:49:99:37:0,101,1442

Sample2 VCF

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample2
2       152402513       .       C       CT,<NON_REF>    0       .       BaseQRankSum=-0.168;ClippingRankSum=0.437;DP=48;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.772;ReadPosRankSum=0.034  GT:AD:DP:GQ:PL:SB  0/0:35,2,0:37:99:0,103,1228,143,1234,1275:6,29,0,0
2       152402514       .       T       <NON_REF>       .       .       END=152402514   GT:DP:GQ:MIN_DP:PL      0/0:63:99:63:0,114,1710
2       152402515       .       TA      T,TAA,TAAA,<NON_REF>    78.73   .       BaseQRankSum=-1.332;ClippingRankSum=-0.242;DP=76;MLEAC=0,1,0,0;MLEAF=0.00,0.500,0.00,0.00;MQ=59.74;MQ0=0;MQRankSum=2.423;ReadPosRankSum=0.649      GT:AD:DP:GQ:PL:SB       0/2:37,4,11,4,0:56:99:135,256,1876,0,804,1429,116,1061,1435,1759,157,933,805,963,890:2,35,0,11
2       152402516       rs199791504     A       T,<NON_REF>     272.77  .       BaseQRankSum=-3.647;ClippingRankSum=-2.597;DB;DP=75;MLEAC=1,0;MLEAF=0.500,0.00;MQ=59.73;MQ0=0;MQRankSum=0.754;ReadPosRankSum=1.426GT:AD:DP:GQ:PL:SB        0/1:43,23,0:66:99:301,0,1142,429,1210,1639:5,38,1,22
2       152402517       .       A       T,<NON_REF>     0       .       BaseQRankSum=-0.810;ClippingRankSum=-0.661;DP=79;MLEAC=0,0;MLEAF=0.00,0.00;MQ=59.75;MQ0=0;MQRankSum=-1.160;ReadPosRankSum=1.259 GT:AD:DP:GQ:PL:SB  0/0:67,4,0:71:99:0,122,2156,201,2168,2247:6,61,0,0
2       152402518       .       A       <NON_REF>       .       .       END=152402621   GT:DP:GQ:MIN_DP:PL      0/0:67:99:48:0,105,1710

Created 2015-01-07 21:16:08 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (16)

Hi

I am using GATK version 3.3 (latest release) and following the best practice for variant calling. Steps i am using is 1. Run haplotye caller and get the gVCF file 2. Run GenotypeGVCF across the samples to the get the VCF file

The Question is about the Genotype calling. Here is an example:

GVCF file looks like this for a particular position chr1 14677 . G . . END=14677 GT:DP:GQ:MIN_DP:PL 0/0:9:0:9:0,0,203

VCF file looks like this after genotyping with multiple samples. chr1 14677 . G A 76.56 PASS . . GT:AD:DP:GQ:PL 0/1:8,1:9:4:4,0,180

In GVCF file there was no support for the alternate allele, but VCF file shows that there are 1 read supporting the alternate call.

Thanks ! Saurabh


Created 2014-12-16 14:44:31 | Updated | Tags: java genotypegvcfs bcftools
Comments (10)

Hi,

after running 5 exomes with GATK-v3.3 and HaplotypeCaller, I encountered a very low titv ration in my samples (~2.1) as VaraintEval report indicated. I tried running varaint filtration in these samples but I didn't see any imporvement in titv ratio nor any filtering done. therefore I filtered these with bcftools, after which the titv ratio improved to 2.5. Then when I tried running GenotypeGVCFs on these samples filtered with bcftools, I encountered the following error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double at java.lang.Double.compareTo(Double.java:49) at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:290) at java.util.ComparableTimSort.sort(ComparableTimSort.java:157) at java.util.ComparableTimSort.sort(ComparableTimSort.java:146) at java.util.Arrays.sort(Arrays.java:472) at java.util.Collections.sort(Collections.java:155) at org.broadinstitute.gatk.utils.MathUtils.median(MathUtils.java:999) at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.combineAnnotationValues(ReferenceConfidenceVariantContextMerger.java:73) at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:158) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:202) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:121) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:310) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-11-17-g58cfab1):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: java.lang.Integer cannot be cast to java.lang.Double
ERROR ------------------------------------------------------------------------------------------

any advice on solving this incident will be much appreciated

Victoria


Created 2014-12-15 10:17:16 | Updated | Tags: ploidy genotypegvcfs
Comments (8)

Hello. I've run Haplotype caller for 19 samples in -ploidy 19 (19 is as high as I can get without getting an error for that, but that is a discussion opened elsewhere). Then, I've tried GenotypeGVCFs and get an ERROR (both command and ERROR below).

Just to add, I've run both HaplotypeCaller and GenotypeGVCFs with and without -maxAltAlleles 4 to try to limit the referred combination; however, the output is exactly the same ERROR message at the end of GenotypeGVCFs. Do you think there might be a bug or am I doing something wrong?

java -Djava.io.tmpdir=$mytmp -Xmx28g -jar GenomeAnalysisTK.jar -R $ref -T GenotypeGVCFs -o $gVCF.ploidy.vcf -V ploidy.list -ploidy 19 -maxAltAlleles 4

ERROR MESSAGE: the combination of ploidy (19) and number of alleles (21) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyz e this locus


Created 2014-12-12 12:53:50 | Updated | Tags: ad dp genotypegvcfs
Comments (7)

Hi,

I am running the GATK 3.2-2 pipeline and I am getting some odd results from the genotype VCF stage. It appears that incorrect AD results are being output.

E.g. on the single sample gvcf I have:

GC G, 522.73 . BaseQRankSum=-0.716;ClippingRankSum=0.219;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=66.41;MQ0=0;MQRankSum=1.681;ReadPosRankSum=-1.155 GT:AD:DP:GQ:PL:SB 0/1:20,18,0:38:99:560,0,633,620,687,1307:10,10,10,8

This agrees with what I see in IGV for this site (see attached image).

However, when I joint call this site along with other gvcfs, the output for that sample looks like:

0/1:20,0:38:99:560,0,633

i.e. It appears to not output the expected 20,18 - but it still seems to be making the correct genotype call so I am assuming it is just an output bug and not something that is affecting calling, but it would be good to know for sure. Either way it is problematic as it is affecting our downstream depth filtering. This isn't the only example of this happening in our data. I am trying to identify more by looking for occasions where the AD fields sum to less than DP, am I right in thinking this should never be the case as AD is unfiltered compared to the filtered DP?

Thanks a lot

Dan


Created 2014-12-09 21:47:52 | Updated | Tags: vcf genotypegvcfs gvcf
Comments (17)

Hi,

I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:

16050036 16050612 16050822 16050933 16051556 16051968 16051994 16052080 16052167 16052239 16052250 16052357 etc.

whereas the initial set of sites in my final vcf file is:

16050822 16050933 16051347 16051497 16051556 16051968 16051994 16052080 16052167 16052169 16052239 16052357 etc.

First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.

I also got this warning 3 times when running GenotypeGVCFs: WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).

FYI, this is the command I used for GenotypeGVCFs: java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22 (only running this for chr22)

Thank you in advance, Alva


Created 2014-12-03 16:57:35 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (2)

If I want the variants to be called only if they fit the following criteria:

1) Min. total coverage for consideration of heterozygous is 10.

2) Min. coverage of each of the two observed major basecalls to be called heterozygous is 5.

3) Min. percentage of each of the two observed major basecalls in order to be called heterozygous is 20.

4) Min. coverage in order for a position to be called homozygous is 6.

which command-line arguments in which tools (HaplotpyeCaller or GenotypeGVCFs) can I use to accomplish these? I cannot seem to find the proper arguments in the documentation. I apologize if I overlook.

Thank you


Created 2014-12-02 11:21:53 | Updated 2014-12-02 11:23:39 | Tags: genotypegvcfs
Comments (23)

Hi, I'm using GATKv3.3 to run the protocol of HC followed by CombineGVCFs in six batches of ~190 individuals each and finally GenotypeGVCFs of these six gVCFs. My failed command is below followed by the error message. In addition, I can tell you that my output file (1135g_.vcf) is empty even when it crashed after more than two hours of running time.

Kindly, Alberto

java -Djava.io.tmpdir=$mytmp -Xmx232g -jar $EBROOTGATK/GenomeAnalysisTK.jar -R $ref -T GenotypeGVCFs -nt 40 -o 1135g_.vcf -V 1135g_listof.list

ERROR ------------------------------------------------------------------------------------------

ERROR stack trace java.lang.ArrayIndexOutOfBoundsException: 10000 at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.generatePL(ReferenceConfidenceVariantContextMerger.java:357) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.mergeRefConfidenceGenotypes(ReferenceConfidenceVariantContextMerger.jav a:331) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:134) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:200) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:119) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)

ERROR ------------------------------------------------------------------------------------------ ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ERROR ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ERROR If not, please post the error message, with stack trace, to the GATK forum. ERROR Visit our website and forum for extensive documentation and answers to ERROR commonly asked questions http://www.broadinstitute.org/gatk ERROR ERROR MESSAGE: 10000 ERROR ------------------------------------------------------------------------------------------


Created 2014-11-26 22:46:02 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (1)

Hi, I just finished running HaplotypeCaller version 3.3-0 separately on 6 exome samples with the new best practices.

java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19/hg19_Ordered.fa -I K87/HG19_Analysis/K87-929_final.recalibrated_final.bam --dbsnp dbsnp_138_hg19_Ordered.vcf --pair_hmm_implementation VECTOR_LOGLESS_CACHING -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --output_mode EMIT_VARIANTS_ONLY -gt_mode DISCOVERY --pcr_indel_model CONSERVATIVE -o ./Haplotypes_929.vcf

Many Variant sites are called as homozygous alt (1/1), but none of these sites that are processed to infer haplotype are called as homozygous alt in their PGT field, they are all called as hets, PGT=0|1. for example:

GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,29,0:29:93:0|1:121483392_C_G:1331,93,0,1331,93,1331:0,0,13,16

The allelic depths agree with the phased genotype but out of all 6 exomes processed, not a single 1/1 is also phased as 1|1.

I checked all output vcfs with a simple grep combo: grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '0|1' - | wc -l = 19046 grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|0' - | wc -l = 79 grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|1' - | wc -l = 0

This seemed odd, but I continued with GenotyeGVCF:

java -Xmx32g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg19/hg19_Ordered.fa -V Haplotypes_450.vcf -V Haplotypes_452.vcf -V Haplotypes_925.vcf -V Haplotypes_926.vcf -V Haplotypes_927.vcf -V Haplotypes_929.vcf -D dbsnp_138_hg19_Ordered.vcf -ped K87/HG19_Analysis/K87_6.ped -o Haplotypes_K87_GVCFs.vcf

I'm looking at the output vcf as it's being generated and now there are homozygous alt calls but they conflict with the associated Allelic Depths:

.... GT:AD:DP:GQ:PGT:PID:PL .... 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 .....

Full Line: chr1 33957152 rs4403594 T G 3166.96 . AC=12;AF=1.00;AN=12;DB;DP=99;FS=0.000;GQ_MEAN=48.50;GQ_STDDEV=27.55;MLEAC=12;MLEAF=1.00;MQ=39.65;MQ0=0;NCC=0;QD=32.32;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,9:9:27:.:.:330,27,0 1/1:0,5:5:15:.:.:141,15,0 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 1/1:0,20:20:60:.:.:722,60,0 1/1:0,24:24:71:.:.:685,71,0 1/1:0,11:11:33:.:.:366,33,0

Can you help me interpret what seems to me as conflicting results?

Cheers,

Patrick


Created 2014-11-20 06:39:53 | Updated | Tags: combinevariants genotypegvcfs gvcftools
Comments (8)

Hi,

I am merging several genome vcf files apparently created by gvcftools [1]. I noticed that CombineVariants is incorrectly summing allele counts (AC) in some corner cases, and was wondering if the format [2] was supported?

If the format should be supported by the CombineVariants tool (gvcf is VCF 4.1 conformant), I have isolated a minimal example that should be of help for bugfixing.

[1] https://sites.google.com/site/gvcftools/home [2] https://sites.google.com/site/gvcftools/home/about-gvcf

Thanks, Paweł


Created 2014-11-17 11:59:46 | Updated | Tags: haplotypecaller genotypegvcfs combinegvcfs
Comments (10)

Hi all,

I am attempting to use the HaplotyperCaller / CombineGVCFs / GenotypeGVCFs to call variants on chrom X and Y of 769 samples (356 males, 413 females) sequenced at 12x coverage (WG sequening, but right not only calling X and Y).

I have called the samples according to the best practises using the HaplotypeCaller, using ploidy = 1 for males on X and Y and ploidy =2 for females on X, e.g.:

INFO 16:28:45,750 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T HaplotypeCaller -L X -ploidy 1 -minPruning 3 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I /target/gpfs2/gcc/groups/gonl/projects/trio-analysis/rawdata_release2/A102.human_g1k_v37.trio_realigned.bam --sample_name A102a -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/A102a.chrX.hc.g.vcf

Then I have used CombineGVCFs to combine my samples in batches of 100 samples. Now I am attempting to genotype them and I face the same issue on both X (males + females) and Y (males only): It starts running fine and then just hang on a certain position. At first it crashed asking for additional memory but now with 24Gb or memory it simply stays at a single position for 24hrs until my job eventually stops due to walltime. Here is the chrom X output: INFO 15:00:39,501 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T GenotypeGVCFs -ploidy 1 --dbsnp /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf -stand_call_conf 10 -stand_emit_conf 10 --max_alternate_alleles 4 -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.vcf -L X -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.1.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.2.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.3.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf INFO 15:00:39,507 HelpFormatter - Executing as lfrancioli@targetgcc15-mgmt on Linux 3.0.80-0.5-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 15:00:39,507 HelpFormatter - Date/Time: 2014/11/12 15:00:39 INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:40,951 GenomeAnalysisEngine - Strictness is SILENT INFO 15:00:41,153 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:57:53,416 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf.idx INFO 17:09:39,597 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf.idx INFO 18:21:00,656 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf.idx INFO 19:30:46,624 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf.idx INFO 20:22:38,368 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf.idx WARN 20:26:45,716 FSLockWithShared$LockAcquisitionTask - WARNING: Unable to lock file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx because an IOException occurred with message: No locks available. INFO 20:26:45,718 RMDTrackBuilder - Could not acquire a shared lock on index file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx, falling back to using an in-memory index for this GATK run. INFO 20:33:29,491 IntervalUtils - Processing 155270560 bp from intervals INFO 20:33:29,628 GenomeAnalysisEngine - Preparing for traversal INFO 20:33:29,635 GenomeAnalysisEngine - Done preparing for traversal INFO 20:33:29,636 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 20:33:29,637 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 20:33:29,638 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 20:33:29,948 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 20:33:59,642 ProgressMeter - X:65301 0.0 30.0 s 49.6 w 0.0% 19.8 h 19.8 h INFO 20:34:59,820 ProgressMeter - X:65301 0.0 90.0 s 149.1 w 0.0% 59.4 h 59.4 h ... INFO 20:52:01,064 ProgressMeter - X:177301 0.0 18.5 m 1837.7 w 0.1% 11.3 d 11.2 d INFO 20:53:01,066 ProgressMeter - X:177301 0.0 19.5 m 1936.9 w 0.1% 11.9 d 11.9 d ... INFO 14:58:25,243 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.0 w 95.9 w INFO 14:59:38,112 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.1 w 96.0 w INFO 15:00:47,482 ProgressMeter - X:177301 0.0 18.5 h 15250.3 w 0.1% 96.2 w 96.1 w =>> PBS: job killed: walltime 86440 exceeded limit 86400

I would really appreciate if you could give me some pointer as how to handle this situation.

Thanks! Laurent


Created 2014-11-14 03:30:08 | Updated | Tags: variantannotator genotypegvcfs
Comments (4)

I have constructed a VCF using GenotypeGVCFs. Due to downsampling and filtering imposed by HaplotypeCaller/GenotypeGVCFs, four fields of interest within the VCF do not reflect the data that went into building the file. I would like to update a these fields to reflect all read data used in the pipeline, but I have been able to do so for just 3 of the 4. I have been able to update INFO MQ0, INFO DP, and sample-level (FORMAT) AD, but I have not been able to update sample-level (FORMAT) DP.

I have run VariantAnnotator as follows.

inVCFfn=../vcf/raw.vcf
outVCFfn=../vcf/raw.annotated.vcf

gatk=/srv/gs1/software/gatk/gatk-3.3.0/GenomeAnalysisTK.jar
ref=/srv/gs1/projects/scg/Resources/GATK/b37/human_g1k_v37_decoy.fasta
dbsnp=/srv/gs1/projects/scg/Resources/GATK/b37/dbsnp_137.b37.vcf

java -Xmx10g \
  -jar $gatk \
  -T VariantAnnotator \
  -R $ref \
  -L $inVCFfn \
  --variant $inVCFfn \
  --dbsnp $dbsnp \
  --out $outVCFfn \
  --annotation Coverage \
  --annotation DepthPerAlleleBySample \
  --annotation MappingQualityZero \
  [-I fileName.1.bam … -I fileName.810.bam]

The documentation for --annotation Coverage indicates that it should update both the INFO DP and the sample-level (FORMAT) DP, but the sample-level update seems not to be occurring. For example:

$ awk '($2==7157834) { print $9, "|", $53 }' raw.vcf
GT:AD:DP:GQ:PL | 0:4,0:4:99:0,135

$ awk '($2==7157834) { print $9, "|", $53 }' raw.annotated.vcf
GT:AD:DP:GQ:PL | 0:131,0:4:99:0,135

(Aside: I'm working with a haploid chromosome, hence the single-value GT and two-value AD and PL fields). Although AD was updated from "4,0" to "131,0", the sample-level "DP" remained at "4". Of course I could approximate DP as the sum of AD values, but since AD is pre-filter and sample-level DP is post-filter, this wouldn't quite be correct. I should note that when I run HaplotypeCaller/GenotypeGVCFs on a subset of the data, the sample-level DP values are very close to the sum of AD's, so it's not a matter of most reads being filtered out, but rather that they have been downsampled. Am I missing something?

Thanks! David

P.S. The INFO field DP and MQ0 are indeed updated as expected:

$ awk '($2==7157834) { print $8 }' raw.vcf
AC=1;AF=6.993e-03;AN=143;DP=609;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=0;NCC=0;QD=30.16;SOR=1.259

$ awk '($2==7157834) { print $8 }' raw.annotated.vcf
AC=1;AF=6.993e-03;AN=143;DP=10750;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=2;NCC=0;QD=30.16;SOR=1.259

Created 2014-11-04 05:50:02 | Updated | Tags: genotypegvcfs
Comments (1)

In the past, I’ve found MQ0 ("Total Mapping Quality Zero Reads") to be a very useful INFO feature of putatively variable sites. However, in my experience, GenotypeGVCFs reports every site to have MQ0=0, even those for which UnifiedGenotyper reports MQ0 > 0 when run on the same data. Is this a bug, or am I missing something?

Thanks!


Created 2014-10-29 16:20:48 | Updated 2014-10-29 16:21:18 | Tags: haplotypecaller genotypegvcfs joint-calling exome-capture
Comments (5)

Hi,

We have some exomes processed with GATK from one exome capture platform (Nimblegen SeqCap on HiSeq), and now I am going to analyze a small batch of exomes sequenced using a different platform (Nextera on NextSeq). I was wondering if and how GenotypeGVCFs can cope with gvcfs produced on two (or more) different exome capture platforms? Is joint-calling of such "heterogeneous" samples advised, or should I rather genotype the small set of equally processed bams separately?

The problems I could foresee were:

  • differing targets (each batch of samples will have gaps in coverage due to target-constrained (-L) haplotypecalling). If this should cause problems for GenotypeGVCFs the haplotypes could be called again on superset of targets, or easier, the calling restricted to the intersection.
  • the alignment algorithm is different (BWA BWTSW vs BWA MEM) so the mapping quality could potentially differ (haven't checked). Is this something GATK is compensating for, like base qualities?
  • the sequencer and chemistry, but here I hope that the BQSR should help in removing the variation

Any thoughts and comments appreciated,

Thanks, Paweł


Created 2014-10-23 17:22:04 | Updated | Tags: genotypegvcfs
Comments (5)

Hi,

I'm trying to call genotypes at a set of given positions from ~5,000 samples. To do so, I first generate gvcf files using HaplotypeCaller for the region around a gene for all samples separately with HaplotypeCaller. Then I'm merging the single gvcf files into batches of 1000 and then I'm generating the genotypes at ~180 sites using GenotypeGVCFs and the flag --includeNonVariantSites. Everything runs without errors, but in the end I noticed that ~20 sites were missing from the final VCF file. I then looked into these sites and the single files in detail and found out that there are simply no genotypes called for some samples at some positions.

I attached a small example. When I'm running the following command, no variant is in the output file: java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L region.bed --includeNonVariantSites \ -T GenotypeGVCFs \ --variant wrong.gvcf --variant correct.gvcf \ -o out.vcf

The same command works when running it with only "correct.gvcf" as an input, but obviously not when running it with only "wrong.gvcf". I noticed that wrong.gvcf might show a variant at this exact position.

Best, Thomas


Created 2014-10-06 20:07:44 | Updated | Tags: genotypegvcfs
Comments (5)

Dear all,

I have been using a relatively recent nightly build of GATK to deal with some previous bug fix, and I now see some odd things in my VCF format, after using genotypegVCF. Precisely some rows have this INFO field: GT:AD:DP:GQ:PGT:PID:PL

with these new PGT and PID which are present only some of the time. I have no idea what this means (and it also messes up my data processing). Does anyone know what these fields are about? And, unless they are very useful, how to get rid of them?

Thank you in advance,


Created 2014-09-30 15:46:58 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (6)

Following the the recommended best practice and using GATK v3.2, the raw vcf file created by haplotyecaller/GenotypeGVCFs does not contain any of AC, AN and AF tags. Is this an expected behaviour?

This is a line of the g.vcf file generated from HaplotypeCaller:
chr21 36104982 . AT A,ATT,<NON_REF> 0.39 . BaseQRankSum=-1.693;DP=35;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=1.874;ReadPosRankSum=-0.023 GT:AD:GQ:PL:SB 0/1:23,6,2,0:27:27,0,499,67,411,529,96,459,496,525:10,13,2,4

And the raw vcf file from GenotypeGVCFs: chr21 36122885 . A AT 7871.94 . BaseQRankSum=0.151;DP=493;FS=5.433;MLEAC=24;MLEAF=0.750;MQ=60.00;MQ0=0;MQRankSum=0.129;QD=19.53;ReadPosRankSum=0.257 GT:AD:DP:GQ:PL 0/1:13,8:.:99:162,0,238 1/1:0,21:.:62:536,62,0 1/1:0,24:.:71:610,71,0 0/1:12,12:.:99:245,0,207 1/1:0,27:.:80:693,80,0 1/1:0,34:.:99:877,101,0 0/1:31,13:.:99:210,0,624 0/0:30,0:30:65:0,65,1125 0/1:6,11:.:96:225,0,96 0/1:10,5:.:87:87,0,196 0/1:4,15:.:39:340,0,39 1/1:2,35:.:66:895,66,0 1/1:0,53:.:99:1376,161,0 1/1:0,13:.:39:328,39,0 1/1:0,24:.:74:662,74,0 1/1:1,29:.:65:702,65,0

My command for HaplotypeCaller:
java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R myRef.fa -I my.bam -L chr21 --dbsnp dbSNP_chr21.vcf --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -stand_emit_conf 10 -stand_call_conf 30 -mbq 20 -nct 8 -A FisherStrand -A QualByDepth -A HaplotypeScore --annotation Coverage -o raw.g.vcf

And GenotypeGVCFs:
java -Xmx8g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R myRef.fa -V myVCF.list --dbsnp dbSNP_chr21.vcf -A FisherStrand -A QualByDepth -A HaplotypeScore -A Coverage -L chr21 -o myraw.vcf


Created 2014-09-23 13:38:59 | Updated | Tags: genotypegvcfs
Comments (2)

hi,
I did HaplotypeCaller + genotypeVCFs on 4 sample bam files (with low coverage/depth).
I saw that AC INFO field do not represent the sum of all alternate allele seen.

grep -we rs10753352 -we rs80146807 -we rs707592 trisomie.vcf
chr1    5184689 rs10753352      G       C       147.81  .       AC=4;AF=0.667;AN=6;DB;DP=7;FS=0.000;GQ_MEAN=6.00;GQ_STDDEV=6.00;MLEAC=5;MLEAF=0.833;MQ=93.02;MQ0=0;NCC=1;QD=24.64     GT:AD:DP:GQ:PL  1/1:0,4:4:12:116,12,0   ./.:0,0:0       1/1:0,2:2:6:59,6,0      0/0:1,0:1:0:0,0,8
chr1    5730409 rs80146807      G       T       142.09  .       AC=5;AF=0.625;AN=8;BaseQRankSum=0.804;ClippingRankSum=0.804;DB;DP=6;FS=0.000;GQ_MEAN=12.75;GQ_STDDEV=17.56;MLEAC=5;MLEAF=0.625;MQ=82.96;MQ0=0;MQRankSum=0.804;NCC=0;QD=26.86;ReadPosRankSum=0.804     GT:AD:DP:GQ:PL  0/0:1,0:1:3:0,3,45      1/1:0,1:1:3:45,3,0      1/1:0,2:2:6:90,6,0   0/1:1,1:2:39:39,0,62
chr1    5732382 rs707592        T       C       194.63  .       AC=5;AF=0.833;AN=6;BaseQRankSum=-7.360e-01;ClippingRankSum=0.736;DB;DP=9;FS=8.451;GQ_MEAN=13.33;GQ_STDDEV=8.08;MLEAC=5;MLEAF=0.833;MQ=26.49;MQ0=0;MQRankSum=0.736;NCC=1;QD=32.44;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL  ./.:0,0:0       1/1:0,4:4:12:149,12,0   0/1:2,1:3:22:22,0,49    1/1:0,2:2:6:53,6,0

I join the corresponding ##GATKCommandLine in case.

If I misunderstand what AC is, feel free to teach me.


Created 2014-09-19 14:34:45 | Updated | Tags: haplotypecaller malformedvcf genotypegvcfs
Comments (8)

Hi,

I've tried my best to find the problem but without luck, so my first post...

First the background:

I have used HaplotypeCaller to make GVCFs per individual for 2 cohorts of 270 and 300 individuals each.
I then used CombineGVCFs to merged all the individuals in each cohort across 10Mb chunks across the genome.

These steps appear to have worked OK! Then I have been attempting to call genotypes from pairs GVCFs (one from each cohort) per region using GenotypeGVCFs with a command such as:

java -jar -Xmx6G /gpfs0/apps/well/gatk/3.2-2/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ../hs37d5/hs37d5.fa \ --variant /well/hill/kate/exomes//haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf \ --variant /well/hill/kate/exomes//haploc/sepsis300/sepsis300_1:230000000-240000000_merged_gvcf.vcf \ -o /well/hill/kate/exomes//haploc/cap270-sepsis300/cap270-sepsis300_1:230000000-240000000_raw.vcf

This seems to start fine (and works perfectly to completion for some regions), but then throws a error for some of the files. For this example the error is:

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 519534: ./.:29:75:25:0,63,945 is not a valid start position in the VCF format, for input source: /well/hill/kate/exomes/haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf

So to try and solve this I have grepped "./.:29:75:25:0,63,945 " from the VCF file (expecting to find a truncated line), but could find nothing wrong with the lines containing this expression. I looked carefully at the first line containing this after the last position that was written in the output - nothing out of the ordinary that I can see.

Any help/advice gratefully received! Kate


Created 2014-09-19 10:45:49 | Updated 2014-09-19 10:52:00 | Tags: mappingqualityranksumtest readposranksumtest mqranksum readposranksum genotypegvcfs
Comments (22)

On 2000 samples I have run HC3.2, CGVCFs3.2, GGVCFs3.2 and VR3.2.

For the GenotypeGVCFs step I used the current default annotations:

InbreedingCoeff FisherStrand QualByDepth ChromosomeCounts GenotypeSummaries

And these non-default annotations:

MappingQualityRankSumTest ReadPosRankSumTest

When running VariantRecalibrator and plotting each of the dimensions I noticed all of the non-default annotations taking on discrete values; see bottom of this post. Is it no longer recommended to use ReadPosRankSum and MQRankSum for VR? Should I calculate these annotation with VariantAnnotator instead of GenotypeGVCFs? If I have to run VariantAnnotator, should I then run it separately for SNPs and INDELs cf. my previous question about annotations being different, when applied to BOTH and SNPs: http://gatkforums.broadinstitute.org/discussion/2620

Thank you.

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep ReadPosRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1
  41649 ReadPosRankSum=0.731
  41760 ReadPosRankSum=0.550
  46305 ReadPosRankSum=0.720
  47060 ReadPosRankSum=0.00
  87348 ReadPosRankSum=0.406
 105254 ReadPosRankSum=0.736
 116426 ReadPosRankSum=0.727
 164855 ReadPosRankSum=0.358

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep "MQ=" | sort | uniq -c | awk '$1>5000' | sort -k1n,1
   5802 MQ=57.05
   8382 MQ=29.00
   8525 MQ=56.62
  10069 MQ=51.77
  10574 MQ=53.95
  10682 MQ=47.12
  10818 MQ=56.04
  11553 MQ=55.21
 802603 MQ=60.00

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep MQRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1
  21511 MQRankSum=-7.360e-01
  27222 MQRankSum=0.322
  33699 MQRankSum=0.550
  34481 MQRankSum=0.731
  37603 MQRankSum=0.720
  60729 MQRankSum=0.00
  76031 MQRankSum=0.406
  85812 MQRankSum=0.736
  98519 MQRankSum=0.727
 186092 MQRankSum=0.358

Created 2014-09-11 15:52:18 | Updated | Tags: vqsr haplotypecaller qualbydepth genotypegvcfs
Comments (21)

Hey there,

How can it be possible that some of my snps or indels calls miss the QD tag? I'm doing the recommended workflow and I've tested if for both RNAseq (hard filters complains, that's how I saw those tags were missing) and Exome sequencing (VQSR). How can a hard filter applied on QD on a line without actually that tag can be considered to pass that threshold too? I'm seeing a lot more INDELs in RNAseq where this kind of scenario is happening as well.

Here's the command lines that I used :

Forming gvcf files like this (for RNAseq in that case):

java -Djava.io.tmpdir=$LSCRATCH -Xmx46g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -pairHMM VECTOR_LOGLESS_CACHING -T HaplotypeCaller -I sample.bam -R ~/References/hg19/hg19.fasta --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o sample.g.vcf --dbsnp ~/References/hg19/dbSNP/dbsnp_138.b37.excluding_sites_after_129.CEU_converted.vcf -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 0.0 -mmq 1 -U ALLOW_N_CIGAR_READS

GenotypeGVCF after :

java -Djava.io.tmpdir=$LSCRATCH -Xmx22g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T GenotypeGVCFs --dbsnp ~/References/hg19/dbSNP/dbsnp_138.b37.excluding_sites_after_129.CEU_converted.vcf -A QualByDepth -A FisherStrand -o Joint_file/96RNAseq.10092014.STAR.q1.1kb.vcf -R ~/References/hg19/hg19.fasta

VQSR (separated for indels and snvs) or Hard filter at the end :

VQSR

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T VariantRecalibrator -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode SNP -resource:1000G,known=false,training=true,truth=false,prior=10.0 ~/References/hg19/VQSR/1000G_phase1.snps.high_confidence.b37.noGL.converted.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~/References/hg19/VQSR/hapmap_3.3.b37.noGL.nochrM.converted.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 ~/References/hg19/VQSR/1000G_omni2.5.b37.noGL.nochrM.converted.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 ~/References/hg19/VQSR/dbsnp_138.b37.excluding_sites_after_129.noGL.nochrM.converted.vcf -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -rscriptFile 96Exomes.HC.TruSeq.snv.plots.R -R ~/References/hg19/hg19.fasta --maxGaussians 4

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T ApplyRecalibration -ts_filter_level 99.0 -mode SNP -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -o 96Exomes.HC.TruSeq.snv.recal.vcf -R ~/References/hg19/hg19.fasta

HARD FILTER (RNASeq)

java -Djava.io.tmpdir=$LSCRATCH -Xmx2g -jar /home/apps/Logiciels/GATK/3.1-1/GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R ~/References/hg19/hg19.fasta -V 96RNAseq.STAR.q1.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o 96RNAseq.STAR.q1.FS30.QD2.vcf

Here are some examples for RNAseq :

chr1 6711349 . G A 79.10 PASS BaseQRankSum=-1.369e+00;ClippingRankSum=1.00;DP=635;FS=1.871;MLEAC=1;MLEAF=5.495e-03;MQ=60.00;MQ0=0;MQRankSum=-4.560e-01;ReadPosRankSum=-1.187e+00 GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,280 ./.:0,0:0 0/0:9,0:9:21:0,21,248 0/0:7,0:7:21:0,21,196 0/0:7,0:7:21:0,21,226 0/0:8,0:8:21:0,21,227 0/0:8,0:8:21:0,21,253 0/0:7,0:7:21:0,21,218 0/0:9,0:9:27:0,27,282 1/1:0,0:5:15:137,15,0 0/0:2,0:2:6:0,6,47 0/0:28,0:28:78:0,78,860 0/0:7,0:7:21:0,21,252 0/0:2,0:2:6:0,6,49 0/0:5,0:5:12:0,12,152 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,126 0/0:9,0:9:21:0,21,315 0/0:7,0:7:21:0,21,256 0/0:7,0:7:21:0,21,160 0/0:8,0:8:21:0,21,298 0/0:20,0:20:60:0,60,605 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,67 0/0:2,0:2:6:0,6,71 0/0:14,0:14:20:0,20,390 0/0:7,0:7:21:0,21,223 0/0:7,0:7:21:0,21,221 0/0:4,0:4:12:0,12,134 0/0:2,0:2:6:0,6,54 ./.:0,0:0 0/0:4,0:4:9:0,9,118 0/0:8,0:8:21:0,21,243 0/0:6,0:6:15:0,15,143 0/0:8,0:8:21:0,21,244 0/0:7,0:7:21:0,21,192 0/0:2,0:2:6:0,6,54 0/0:13,0:13:27:0,27,359 0/0:8,0:8:21:0,21,245 0/0:7,0:7:21:0,21,218 0/0:12,0:12:36:0,36,354 0/0:8,0:8:21:0,21,315 0/0:7,0:7:21:0,21,215 0/0:2,0:2:6:0,6,49 0/0:10,0:10:24:0,24,301 0/0:7,0:7:21:0,21,208 0/0:7,0:7:21:0,21,199 0/0:2,0:2:6:0,6,47 0/0:3,0:3:9:0,9,87 0/0:2,0:2:6:0,6,73 0/0:7,0:7:21:0,21,210 0/0:8,0:8:22:0,22,268 0/0:7,0:7:21:0,21,184 0/0:7,0:7:21:0,21,213 0/0:5,0:5:9:0,9,135 0/0:7,0:7:21:0,21,200 0/0:4,0:4:12:0,12,118 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,217 0/0:8,0:8:21:0,21,255 0/0:9,0:9:24:0,24,314 0/0:8,0:8:21:0,21,221 0/0:9,0:9:24:0,24,276 0/0:9,0:9:21:0,21,285 0/0:3,0:3:6:0,6,90 0/0:2,0:2:6:0,6,57 0/0:13,0:13:20:0,20,385 0/0:2,0:2:6:0,6,48 0/0:11,0:11:27:0,27,317 0/0:8,0:8:21:0,21,315 0/0:9,0:9:24:0,24,284 0/0:7,0:7:21:0,21,228 0/0:14,0:14:33:0,33,446 0/0:2,0:2:6:0,6,64 0/0:2,0:2:6:0,6,72 0/0:7,0:7:21:0,21,258 0/0:10,0:10:27:0,27,348 0/0:7,0:7:21:0,21,219 0/0:9,0:9:21:0,21,289 0/0:20,0:20:57:0,57,855 0/0:4,0:4:12:0,12,146 0/0:7,0:7:21:0,21,205 0/0:12,0:14:36:0,36,1030 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,60 0/0:7,0:7:21:0,21,226 0/0:7,0:7:21:0,21,229 0/0:8,0:8:21:0,21,265 0/0:4,0:4:6:0,6,90 ./.:0,0:0 0/0:7,0:7:21:0,21,229 0/0:2,0:2:6:0,6,59 0/0:2,0:2:6:0,6,56 chr1 7992047 . T C 45.83 SnpCluster BaseQRankSum=1.03;ClippingRankSum=0.00;DP=98;FS=0.000;MLEAC=1;MLEAF=0.014;MQ=60.00;MQ0=0;MQRankSum=-1.026e+00;ReadPosRankSum=-1.026e+00 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,70 0/0:2,0:2:6:0,6,45 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 ./.:1,0:1 ./.:0,0:0 0/0:2,0:2:6:0,6,55 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,61 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,69 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,37 ./.:0,0:0 0/0:2,0:2:6:0,6,67 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:11,0:11:24:0,24,360 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 0/0:2,0:2:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,50 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:4:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:7,0:7:21:0,21,231 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,70 ./.:0,0:0 0/0:6,0:6:15:0,15,148 ./.:0,0:0 ./.:0,0:0 1/1:0,0:2:6:90,6,0 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,74 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,58 0/0:2,0:2:6:0,6,71 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49

For Exome Seq now : chr2 111878571 . C T 93.21 PASS DP=634;FS=0.000;MLEAC=1;MLEAF=5.319e-03;MQ=60.00;MQ0=0;VQSLOD=14.19;culprit=MQ GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,243 0/0:4,0:4:9:0,9,135 0/0:7,0:7:18:0,18,270 0/0:7,0:7:21:0,21,230 0/0:16,0:16:48:0,48,542 0/0:8,0:8:21:0,21,315 0/0:6,0:6:18:0,18,186 0/0:5,0:5:15:0,15,168 0/0:6,0:6:15:0,15,225 0/0:10,0:10:30:0,30,333 0/0:7,0:7:21:0,21,239 0/0:6,0:6:18:0,18,202 0/0:6,0:6:15:0,15,225 0/0:7,0:7:21:0,21,225 0/0:8,0:8:24:0,24,272 0/0:5,0:5:15:0,15,168 1/1:0,0:13:13:147,13,0 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,256 0/0:14,0:14:4:0,4,437 0/0:3,0:3:9:0,9,85 0/0:4,0:4:12:0,12,159 0/0:7,0:7:21:0,21,238 0/0:5,0:5:15:0,15,195 0/0:7,0:7:15:0,15,225 0/0:12,0:12:36:0,36,414 0/0:4,0:4:12:0,12,156 0/0:7,0:7:0:0,0,190 0/0:2,0:2:6:0,6,64 0/0:7,0:7:21:0,21,242 0/0:7,0:7:21:0,21,234 0/0:8,0:8:24:0,24,267 0/0:7,0:7:21:0,21,245 0/0:7,0:7:21:0,21,261 0/0:6,0:6:18:0,18,204 0/0:8,0:8:24:0,24,302 0/0:5,0:5:15:0,15,172 0/0:9,0:9:24:0,24,360 0/0:18,0:18:51:0,51,649 0/0:5,0:5:15:0,15,176 0/0:2,0:2:6:0,6,70 0/0:14,0:14:33:0,33,495 0/0:4,0:4:9:0,9,135 0/0:8,0:8:21:0,21,315 0/0:4,0:4:12:0,12,149 0/0:4,0:4:6:0,6,90 0/0:10,0:10:27:0,27,405 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,133 0/0:14,0:14:6:0,6,431 0/0:4,0:4:12:0,12,151 0/0:5,0:5:15:0,15,163 0/0:3,0:3:9:0,9,106 0/0:7,0:7:21:0,21,237 0/0:7,0:7:21:0,21,268 0/0:8,0:8:21:0,21,315 0/0:2,0:2:6:0,6,68 ./.:0,0:0 0/0:3,0:3:9:0,9,103 0/0:7,0:7:21:0,21,230 0/0:3,0:3:6:0,6,90 0/0:9,0:9:26:0,26,277 0/0:7,0:7:21:0,21,236 0/0:5,0:5:15:0,15,170 ./.:1,0:1 0/0:15,0:15:45:0,45,653 0/0:8,0:8:24:0,24,304 0/0:6,0:6:15:0,15,225 0/0:3,0:3:9:0,9,103 0/0:2,0:2:6:0,6,79 0/0:7,0:7:21:0,21,241 0/0:4,0:4:12:0,12,134 0/0:3,0:3:6:0,6,90 0/0:5,0:5:15:0,15,159 0/0:4,0:4:12:0,12,136 0/0:5,0:5:12:0,12,180 0/0:11,0:11:21:0,21,315 0/0:13,0:13:39:0,39,501 0/0:3,0:3:9:0,9,103 0/0:8,0:8:24:0,24,257 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,280 0/0:4,0:4:12:0,12,144 0/0:4,0:4:9:0,9,135 0/0:8,0:8:24:0,24,298 0/0:4,0:4:12:0,12,129 0/0:5,0:5:15:0,15,184 0/0:2,0:2:6:0,6,62 0/0:2,0:2:6:0,6,65 0/0:9,0:9:27:0,27,337 0/0:7,0:7:21:0,21,230 0/0:7,0:7:21:0,21,239 0/0:5,0:5:0:0,0,113 0/0:11,0:11:33:0,33,369 0/0:7,0:7:21:0,21,248 0/0:10,0:10:30:0,30,395

Thanks for your help.


Created 2014-09-10 03:04:06 | Updated | Tags: genotypegvcfs gatk3
Comments (1)

Dear GATK team, http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.html “GenotypeGVCFs merges gVCF records that were produced as part of the reference model-based variant discovery pipeline (see documentation for more details) using the '-ERC GVCF' or '-ERC BP_RESOLUTION' mode of the HaplotypeCaller. This tool performs the multi-sample joint aggregation step and merges the records together in a sophisticated manner. At all positions of the target, this tool will combine all spanning records, produce correct genotype likelihoods, re-genotype the newly merged record, and then re-annotate it. Note that this tool cannot work with just any gVCF files - they must have been produced with the HaplotypeCaller, which uses a sophisticated reference model to produce accurate genotype likelihoods for every position in the target.”

The final sentence says that "accurate genotype likelihoods", how do you define "accurate"? May I ask if it is fully evaluated? If so, could you please tell me how you do the evaluation?

Thank you in advance.


Created 2014-08-28 16:46:10 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (2)

Hi,

I am running HaplotypeCaller -ERC GVCF for a cohort in parallel per sample per chromosome and combined the data using CombineGVCFs across the cohort on per chromosome basis. Can I run the GenotypeGVCFs to call variants across the cohort in parallel on all samples but per chromosome basis? follow by combining the vcf files using CatVariants or CombineVariants?

Thanks

Kin


Created 2014-08-07 09:02:10 | Updated | Tags: genotypegvcfs
Comments (6)

Hi,

I run HC for several samples, then I run GenotypeGVCFs to merge all gVCF files. (I'm using version 3.1.1) I've noticed that for some variations, although there are only 2 alleles, the AD values for all samples contain information for multiple alleles.

For example: '0,83,0,0,0' for homozygous or '11,40,0,0,0' for heterozygous.

Below is the complete line: chr15 82637079 . C T 91046.53 . AC=26;AF=0.867;AN=30;BaseQRankSum=-1.146e+00;ClippingRankSum=2.25;DP=2139;FS=0.000;InbreedingCoeff=-0.1588;MLEAC=26;MLEAF=0.867;MQ=39.29;MQ0=0;MQRankSum=2.18;QD=30.88;ReadPosRankSum=0.174 GT:AD:DP:GQ:PL 1/1:0,23,0,0,0:23:77:1081,77,0 0/1:11,40,0,0,0:51:99:1688,0,473 1/1:1,320,0,0,0:321:99:14836,972,0 1/1:2,168,0,0,0:170:99:7466,471,0 0/1:23,205,0,0,0:228:99:8646,0,649 ./.:.:3 1/1:0,3,0,0,0:3:9:135,9,0 1/1:0,139,0,0,0:139:99:6349,427,0 1/1:0,83,0,0,0:83:99:4251,307,0 0/1:7,60,0,0,0:67:99:2748,0,310 0/1:6,38,0,0,0:44:99:1921,0,124 1/1:0,294,0,0,0:294:99:15437,1085,0 1/1:0,67,0,0,0:67:99:3557,253,0 1/1:0,105,0,0,0:105:99:5289,362,0 1/1:0,243,0,0,0:243:99:11587,810,0 1/1:0,257,0,0,0:257:99:11929,821,0

Any suggestion how should I interpret this values?

Thank you for your help, Lily


Created 2014-07-23 18:57:39 | Updated | Tags: genotypegvcfs
Comments (4)

Hi there,

I'm trying to follow best practices and also use the latest version of GATK/Queue. I've created 2 GVCF files, and I've run GenotypeGVCFs on them to create a VCF file. These are the commands:

java -jar -Xmx8g /home/sgfriede/gene_apps/gatk-3.2-2/GenomeAnalysisTK.jar -T GenotypeGVCFs \ -R /gpfs_share/sgfriede/ref_data/canFam3.fa \ --variant /gpfs_share/sgfriede/poodles_addisons/cocoa_wilt/gvcf/cocoa_wilt.sorted.dedup.haplotypecaller.g.vcf \ --variant /gpfs_share/sgfriede/poodles_addisons/ellie_traska/gvcf/ellie_traska.sorted.dedup.haplotypecaller.g.vcf \ -o /gpfs_share/sgfriede/poodles_addisons/affected_poodles/20140722/vcf/affected_poodles.vcf

From this, I get the output file no problem.

However, when I try to run VQSR on this VCF (affected_poodles.vcf) file I get the following error message:

ERROR MESSAGE: Line 5082: there aren't enough columns for line 0,3:3:9:71,9,0 0/0:2,0:2:6:0,6,61 (we expected 9 tokens, and saw 2 ), for input source: /gpfs_share/sgfriede/poodles_addisons/affected_poodles/20140722/vcf/affected_poodles.vcf

I think (but I could be wrong!!) I've narrowed this down to something with the output format from GenotypeGVCFs in GATK 3.2-2, because when I run VQSR on the same file but created using GenotypeGVCFs with GATK 3.1-1, VQSR runs just fine (and I'm calling the VQSR walker using 3.2-2).

It is also possible I'm doing something wrong here, but I thought I'd see if any of the experts might have an idea.

Appreciate your help as always! Steve


Created 2014-07-16 22:58:28 | Updated | Tags: runtime genotypegvcfs gatk3-2
Comments (4)

I have noticed when using the GenotypeGVCFs walker (version 3.2), that the remaining runtime estimate is very poor. The estimate from CombineGVCFs and other walkers on the other hand is very accurate. This is not a critical bug. It is rather a feature enhancement. I just noticed that the "completed" percentage is also incorrect. It does not start at 0%. In fact it has stayed constant after walking over 2 million base pairs of 2000 samples in 2 hours. I am not using multi threading. Not that important to me, but I thought I would let you know.


Created 2014-07-15 03:01:03 | Updated | Tags: alleledepth genotypegvcfs
Comments (1)

Hello,

I'm using GATK version 3.1-1 and following Best Practices to call SNPs and INDELs for my exome sequencing data. I noticed that the AD field didn't update properly after I merged per-sample GVCF files into the per-cohort VCF file.

As an example, one SNP on mitochondrion at location 185 has following entries in per-sample GVCF files: chrM 185 . G T, 1254.77 . DP=23;MLEAC=2,0;MLEAF=1.00,0.00;MQ=52.36;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,23,0:23:93:1288,93,0,1288,93,1288:0,0,0,0 chrM 185 . G T, 410.77 . DP=7;MLEAC=2,0;MLEAF=1.00,0.00;MQ=34.70;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,7,0:7:33:444,33,0,444,33,444:0,0,0,0 chrM 185 . G A, 259.84 . DP=5;MLEAC=2,0;MLEAF=1.00,0.00;MQ=37.32;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,5,0:5:21:293,21,0,293,21,293:0,0,0,0 chrM 185 . G A, 205.90 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:18:239,18,0,239,18,239:0,0,0,0 chrM 185 . G T, 1316.77 . DP=21;MLEAC=2,0;MLEAF=1.00,0.00;MQ=35.14;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,21,0:21:99:1350,99,0,1350,99,1350:0,0,0,0 chrM 185 . G T, 171.03 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:15:204,15,0,204,15,204:0,0,0,0 All other samples didn't have evidence for this SNP.

After merging with GenotypeGVCFs, the same SNP has the entry in the per-cohort VCF file: chrM 185 . G T,A 3561.78 . AC=8,4;AF=0.098,0.049;AN=82;DP=565;FS=0.000;InbreedingCoeff=0.8981;MLEAC=8,3;MLEAF=0.098,0.037;MQ=52.36;MQ0=0;QD=32.42 GT:AD:DP:GQ:PL 0/0:.:13:30:0,30,450,30,450,450 0/0:.:16:39:0,39,585,39,585,585 1/1:0,23,0:23:93:1288,93,0,1288,93,1288 0/0:.:8:21:0,21,315,21,315,315 0/0:.:14:27:0,27,405,27,405,405 0/0:.:11:24:0,24,360,24,360,360 0/0:.:2:6:0,6,75,6,75,75 0/0:.:9:18:0,18,270,18,270,270 0/0:.:8:21:0,21,269,21,269,269 0/0:.:18:36:0,36,540,36,540,540 0/0:.:10:21:0,21,315,21,315,315 0/0:.:23:60:0,60,900,60,900,900 0/0:.:45:99:0,102,1530,102,1530,1530 0/0:.:50:99:0,120,1800,120,1800,1800 0/0:.:8:21:0,21,266,21,266,266 0/0:.:16:36:0,36,540,36,540,540 1/1:0,7,0:7:33:444,33,0,444,33,444 0/0:.:5:6:0,6,90,6,90,90 0/0:.:4:9:0,9,135,9,135,135 0/0:.:7:21:0,21,249,21,249,249 0/0:.:12:27:0,27,405,27,405,405 0/0:.:23:60:0,60,900,60,900,900 2/2:0,5,0:5:21:293,293,293,21,21,0 0/0:.:8:21:0,21,278,21,278,278 0/0:.:11:24:0,24,360,24,360,360 0/0:.:8:24:0,24,292,24,292,292 2/2:0,3,0:3:18:239,239,239,18,18,0 0/0:.:12:27:0,27,405,27,405,405 0/0:.:12:21:0,21,315,21,315,315 1/1:0,21,0:21:99:1350,99,0,1350,99,1350 0/0:.:13:27:0,27,405,27,405,405 0/0:.:3:9:0,9,112,9,112,112 1/1:0,3,0:3:15:204,15,0,204,15,204 0/0:.:7:21:0,21,209,21,209,209 0/0:.:13:27:0,27,405,27,405,405 0/0:.:5:6:0,6,90,6,90,90 0/0:.:7:21:0,21,260,21,260,260 0/0:.:10:24:0,24,360,24,360,360 0/0:.:40:93:0,93,1395,93,1395,1395 0/0:.:38:84:0,84,1260,84,1260,1260 0/0:.:14:33:0,33,495,33,495,495

Both alternative alleles were included in the ALT field and the genotype field (GT) was updated correspondingly. For genotype 2/2, I think the correct AD field should be 0,0,5 and 0,0,3 in my aforementioned example. However, these fields remained the same as in the original per-sample GVCF files (i.e., 0,5,0 and 0,3,0). I'm wondering whether my understanding about the AD field is correct or not. And also, if it is an erroneous reporting, whether it'll influence the downstream VQSR and other filtering/annotations or not.

Thanks, Leo


Created 2014-07-01 14:41:58 | Updated | Tags: haplotypecaller genotypegvcfs gvcf
Comments (3)

I'm running HaplotypeCaller in GVCF mode, followed by GenotypeGVCF. For some individuals, I have more than one aligned bam file. Do I need to combine the individual's bam files in HaplotypeCaller to produce one GVCF, or can I produce multiple GVCFs per individual and combine them in GenotypeGVCF?

Thanks.


Created 2014-07-01 05:47:05 | Updated | Tags: genotypegvcfs gatk-3-0-haplotypecaller
Comments (6)

I'm wondering if it's ok to remove the bd/bi tags after haplotypecaller creates the gvcf, the increase bam size is a problematic, the computional cost of recreating them in a 'just in case' scenario isn't as bad as the on disk cost of them so i'm wondering if they are not used after the first gvcf is created from individual samples.


Created 2014-06-30 08:33:39 | Updated 2014-06-30 08:34:34 | Tags: vcf genotypegvcfs n-1
Comments (2)

Hi

I've just noticed this in some records, in some bi-allelic sites the AD field for some samples can have far more than 2 columns and the values can be separated oddly. For example in the record at the bottom of the 0/1:21,11,0,0,0:32:99:398,0,677 the site is biallelic so I'd expect the AD to be 21,11 with a DP of 32, but instead we have 21,11,0,0,0.

Then at another position here we have 0/1:12,0,9,0,0:21:99:220,0,792 with the AD 12,0,9,0,0 DP 21 when I would have expected to see AD 12,9 DP 21. The 0 between the ref and alt alleles is particularly troubling.

 chr1    110326664       rs209200395     A       C       20726.45        .       AC=65;AF=0.192;AN=338;BaseQRankSum=1.05;ClippingRankSum=-6.610e-01;DB;DP=3547;FS=0.000;InbreedingCoeff=0.0055;MLEAC=66;MLEAF=0.195;MQ=60.00;MQ0=0;MQRankSum=-3.900e-02;QD=16.71;ReadPosRankSum=0.638    GT:AD:DP:GQ:PL  0/0:.:5:15:0,15,156     0/0:.:2:6:0,6,69        0/0:.:23:60:0,60,900    0/0:.:9:27:0,27,275     0/0:.:26:63:0,63,945    0/0:.:41:99:0,102,1433  0/0:.:23:60:0,60,774    0/0:.:23:60:0,60,900    0/0:.:6:9:0,9,135       0/0:.:3:9:0,9,86        0/0:.:6:12:0,12,180     0/0:.:9:24:0,24,288     0/0:.:3:6:0,6,90        0/1:21,11,0,0,0:32:99:398,0,677 0/0:.:29:61:0,61,939    0/1:9,11,0,0,0:20:99:350,0,241  0/0:.:26:43:0,43,832    0/0:.:22:63:0,63,945    0/0:.:30:68:0,68,916    0/0:.:21:57:0,57,855    0/0:.:23:62:0,62,719    0/1:10,14,0,0,0:24:99:509,0,270 1/1:0,23,0,0,0:23:85:1001,85,0  0/1:18,13,0,0,0:31:99:426,0,485 0/0:.:25:68:0,68,812    0/1:17,10,0,0,0:27:99:357,0,518 0/0:.:38:72:0,72,1223   0/0:.:25:64:0,64,823    0/0:.:22:60:0,60,900    0/0:.:21:60:0,60,707    0/1:3,6,0,0,0:9:86:184,0,86     0/0:.:25:60:0,60,900    0/0:.:26:72:0,72,1080   0/0:.:22:60:0,60,900    0/0:.:37:81:0,81,1257   0/0:.:24:65:0,65,838    0/0:.:32:65:0,65,1055   0/0:.:32:73:0,73,1069   0/1:8,0,12,0,0:20:99:467,0,494  0/0:.:22:60:0,60,900    0/1:3,3,0,0,0:6:81:197,0,81     0/1:8,12,0,0,0:20:99:369,0,228  0/0:.:19:26:0,26,615    0/0:.:21:60:0,60,761    1/1:0,24,0,0,0:24:60:984,60,0   0/0:.:24:60:0,60,900    0/0:.:21:60:0,60,747    0/0:.:34:87:0,87,1166   0/0:.:21:60:0,60,701    0/0:.:19:41:0,41,646    0/0:.:21:63:0,63,720    0/0:.:18:29:0,29,609    0/0:.:26:69:0,69,1035   0/1:8,3,0,0,0:11:99:104,0,229   0/0:.:23:60:0,60,900    0/0:.:20:60:0,60,681    0/1:7,0,8,0,0:15:99:327,0,517   0/0:.:2:6:0,6,68        ./.:.:0 0/0:.:31:66:0,66,1067   0/0:.:2:6:0,6,44        0/0:.:3:9:0,9,87        0/0:.:2:6:0,6,66        0/0:.:1:3:0,3,30        0/0:.:2:6:0,6,68        0/0:.:18:26:0,26,583    0/0:.:2:6:0,6,64        0/1:3,5,0,0,0:8:79:150,0,79     0/0:.:2:6:0,6,65        0/0:.:5:15:0,15,150     0/1:11,13,0,0,0:24:99:445,0,323 0/0:.:49:75:0,75,1662   0/1:7,11,0,0,0:18:99:328,0,172  0/0:.:22:63:0,63,753    0/1:11,11,0,0,0:22:99:400,0,313 0/0:.:35:37:0,37,1051   0/0:.:24:66:0,66,990    0/1:10,0,16,0,0:26:99:542,0,538 0/0:.:23:63:0,63,945    0/0:.:21:60:0,60,900    0/1:7,0,5,0,0:12:99:149,0,507   0/0:.:29:74:0,74,990    0/0:.:2:0:0,0,4 0/1:5,12,0,0,0:17:99:397,0,148  0/1:8,0,11,0,0:19:99:364,0,489  0/0:.:11:27:0,27,405    0/1:10,16,0,0,0:26:99:508,0,259 0/1:9,0,6,0,0:15:99:253,0,579   0/0:.:28:54:0,54,894    0/1:16,6,0,0,0:22:99:198,0,460  0/0:.:26:60:0,60,886    0/1:9,8,0,0,0:17:99:270,0,238   0/0:.:30:78:0,78,1170   0/1:9,14,0,0,0:23:99:442,0,246  0/0:.:17:21:0,21,531    0/0:.:25:64:0,64,832    0/0:.:30:65:0,65,1002   0/0:.:28:65:0,65,926    ./.:.:0 0/0:.:2:6:0,6,64        0/1:8,9,0,0,0:17:99:320,0,216   0/0:.:2:6:0,6,56        0/1:12,14,1,0,0:27:99:419,0,351 0/0:.:29:81:0,81,1215   0/1:12,0,12,0,0:24:99:307,0,782 0/0:.:24:69:0,69,809    0/0:.:23:60:0,60,900    0/0:.:28:62:0,62,990    0/1:9,0,8,0,0:17:99:357,0,776   1/1:0,10,0,0,0:10:32:355,32,0   1/1:0,12,0,0,0:12:35:435,35,0   0/1:15,13,0,0,0:28:99:418,0,442 0/1:10,0,9,0,0:19:99:326,0,845  0/0:.:29:61:0,61,970    0/0:.:27:60:0,60,900    0/0:.:33:61:0,61,1049   0/1:12,13,0,0,0:25:99:416,0,361 0/0:.:25:69:0,69,1035   0/1:10,13,0,0,0:23:99:451,0,269 0/1:21,14,0,0,0:35:99:441,0,599 0/0:.:12:22:0,22,376    0/1:10,14,0,0,0:24:99:457,0,273 0/1:10,0,14,0,0:24:99:525,0,646 0/1:15,12,0,0,0:27:99:319,0,416 0/1:5,8,0,0,0:13:99:297,0,123   0/0:.:14:7:0,7,411      0/0:.:18:19:0,19,515    0/0:.:21:63:0,63,745    0/0:.:27:72:0,72,1080   0/1:17,8,0,0,0:25:99:358,0,478  0/1:11,15,0,0,0:26:99:627,0,290 0/0:.:16:13:0,13,506    0/0:.:21:60:0,60,900    0/1:10,0,19,0,0:29:99:579,0,793 0/0:.:33:63:0,63,1077   0/0:.:13:39:0,39,432    0/0:.:8:18:0,18,270     1/1:0,15,0,0,0:15:54:630,54,0   0/0:.:9:27:0,27,297     0/0:.:31:62:0,62,1030   1/1:0,17,0,0,0:17:50:601,50,0   0/0:.:31:59:0,59,977    0/1:18,16,0,0,0:34:99:550,0,526 0/0:.:24:61:0,61,796    0/0:.:9:27:0,27,296     0/1:5,5,0,0,0:10:99:206,0,103   0/1:17,11,0,0,0:28:99:406,0,511 0/1:8,9,0,0,0:17:99:241,0,224   0/0:.:13:33:0,33,495    0/0:.:44:84:0,84,1358   0/0:.:13:20:0,20,495    0/0:.:15:21:0,21,387    0/1:17,6,0,0,0:23:99:117,0,498  0/1:18,13,0,0,0:31:99:451,0,519 1/1:0,34,0,0,0:34:99:1375,116,0 0/0:.:18:32:0,32,564    0/0:.:16:20:0,20,535    0/0:.:29:62:0,62,995    0/1:13,0,4,0,0:17:92:92,0,740   0/0:.:17:21:0,21,559    0/0:.:30:57:0,57,1006   0/0:.:23:60:0,60,900    0/1:12,0,9,0,0:21:99:220,0,792  0/0:.:30:56:0,56,967    0/0:.:23:63:0,63,945    0/0:.:29:69:0,69,999    0/1:8,12,0,0,0:20:99:412,0,237  0/0:.:26:60:0,60,980    0/0:.:27:62:0,62,872    0/1:19,9,0,0,0:28:99:285,0,587  0/1:15,14,0,0,0:29:99:422,0,430

Created 2014-06-25 15:26:23 | Updated | Tags: haplotypecaller genotypegvcfs emit-all-confident-sites
Comments (5)

Hi,

I want to call both confident variants and confident homREF genotypes using gatk 3.x style (pretty much like EMIT_ALL_CONFIDENT_SITES in UnifiedGenotyper). I first use HaplotypeCaller with -ERC GVCF, then use GenotypeGVCFs -inv to emit all confident sites. However, in the resulting vcf file, there is no site with a 0/0 genotype for all samples in the jointly called vcf, which contains only variant sites and sites with ./. genotype for all samples. I do think all sites are in the vcf file. Why can't I find any site with 0/0 genotype for all samples in the vcf from GenotypeGVCFs? Or do I need to run HaplotypeCaller with -ERC BP_RESOLUTION first? Thanks!

aaron


Created 2014-06-17 15:47:14 | Updated | Tags: genotypegvcfs combinegvcfs gatk3
Comments (6)

Hello all,

I have a quick question about the results of an CombineGVCFs file while creating large background files. Prior to combining the files a region of the .gvcf file from HC looks like this.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473 13 19408518 . A <NON_REF> . . END=19409266 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 13 19409267 . T <NON_REF> . . END=19409323 GT:DP:GQ:MIN_DP:PL 0/0:4:12:2:0,6,60 13 19409324 . C <NON_REF> . . END=19409400 GT:DP:GQ:MIN_DP:PL 0/0:15:42:8:0,24,299

If you noticed the first individual looks like this SRR070473, with a 0/0:0:0:0:0,0,0 recorded, but after combining the file in batches of 200, the same information will be recorded as no call.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473 SRR070477 SRR070505 SRR070516 SRR070517 SRR070772 SRR070779 SRR070796 SRR0 13 19408518 . A <NON_REF> . . END=19408519 GT:DP:GQ:MIN_DP:PL ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.

The issue I'm seeing is when you attempt to use GenotypeGVCFs I get the following an error similar to this if using the file containing no call notation.

##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample ERR031932 does not have likelihoods at position 1:10929


Created 2014-06-16 13:08:44 | Updated | Tags: bug genotypegvcfs bug-report-or-wrong-command
Comments (6)

Gidday,

Not a question, more of a bug report - I'm using the new v3.1 best practices pipeline, so I'd successfully produced my per-sample (n=23 in total) gVCFs with no worries.

Then I used GenotypeGVCFs to combine them as follows, including a dbSNP ROD (the Sanger Mouse Genome Project's SNP calls for 17 samples, in vcf format... not gvcf!):

java -Djava.io.tmpdir=/tmp -Xmx28g -jar ./tmp/GenomeAnalysisTK_3.1-1/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 8 -R ./mm10.fa --dbsnp ./tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.snps.rsIDdbSNPv137.vcf.ordinalsorted.vcf -V GenotypeGVCFs.run1.sample.list -o ./CombinedGenotyping.run1.vcf -A InbreedingCoeff -A FisherStrand -A QualByDepth -A ChromosomeCounts

So I was very surprised to see that my output CombinedGenotyping vcf has 40 samples in it, not 23 - and of course, 23 + 17 = 40. Checking the VCF headers itself confirms that the genotype calls from the 17 Sanger strains have been included in the output vcf, not just the rsIDs as intended(?). I'm guessing that this combining of .g.vcfs and extra ROD isn't the expected behaviour...!

Cheers!


Created 2014-06-13 19:31:06 | Updated | Tags: build genotypegvcfs nightly
Comments (0)

Hi,

I was wondering if it is possible to obtain the source code that accompanies the nightly builds; looking at the protected github repository, it seems that the nightly builds are changing faster than the repository there.

My current issue is that I would like the "AD" annotation for homozygous referent samples coming from the GenotypeGVCFs tool. Reading the forum, I believe that this is currently fixed in the nightly builds. However, I have a couple of customizations (pending review) that I don't think are in the nightly builds just yet, so I would need the source to apply my changes and then build.

I realize that I'm asking for something VERY dangerous and completely unsupported; I'm just too darn impatient to wait for the next release :).

-John Wallace


Created 2014-06-12 13:23:05 | Updated | Tags: genotypegvcfs combinegvcfs genotypegvcf-error-gatk3-1
Comments (16)

I have run HaplotypeCaller (Version=3.1-1-g07a4bf8) on 99 WGS samples and created my GVCFs. I did this in 60+ intervals to enable faster turn around time. I then merged the GVCFs with CombineGVCFs (Version=3.1-1-g07a4bf8) However, when I try to call variants with java -jar $GATK -T GenotypeGVCFs -V WGS.gvcf.gz -o VCF/WGS.vcf.gz -R $GENOME -nt 8 it errors out complaining about PL values. The stack trace is here:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalStateException: The original PLs do not have enough values; accessing index 10011 but size is 10000 at org.broadinstitute.sting.utils.variant.GATKVariantContextUtils.generatePLs(GATKVariantContextUtils.java:1593) at org.broadinstitute.sting.utils.variant.GATKVariantContextUtils.mergeRefConfidenceGenotypes(GATKVariantContextUtils.java:1530) at org.broadinstitute.sting.utils.variant.GATKVariantContextUtils.referenceConfidenceMerge(GATKVariantContextUtils.java:1095) at org.broadinstitute.sting.gatk.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:183) at org.broadinstitute.sting.gatk.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:110) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: The original PLs do not have enough values; accessing index 10011 but size is 10000
ERROR ------------------------------------------------------------------------------------------

In the short term, is there a hidden setting I can use to skip over these, rather than completely throwing up?


Created 2014-06-04 14:19:59 | Updated | Tags: genotypegvcfs
Comments (1)

Is it possible to get banded GVCFs from GenotypeGVCFs or only "bp" resolution when using "includeNonVariants? It looks GenotypeGVCFs expands out the banded entries significantly increasing the size of the VCF. Should I be using CombineGVCFs in this role instead?


Created 2014-06-03 16:25:34 | Updated | Tags: allelebalance allelebalancebysample haplotypecaller bug genotypegvcfs
Comments (10)

Using GATK 3.1-1, I seem to be unable to get the "AB" (AlleleBalance) annotation for the calls using the HaplotypeCaller -> GenotypeGVCFs pipeline, and I'm not sure how to get it. Our current pipeline (GATK 2.7-4, UnifiedGenotyper) requires this field to perform filtering, so this annotation is essential for us to upgrade to GATK 3.1.

My current pipeline of commands is as follows:

$ GenomeAnalysisTK-3.1-1 -T HaplotypeCaller -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -I -L targets.GRCh37.bed -stand_emit_conf 10 -mbq 20 --downsample_to_coverage 300 -ERC GVCF -pairHMM VECTOR_LOGLESS_CACHING -variant_index_type LINEAR -variant_index_parameter 128000 -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -A AlleleBalance -A QualByDepth -o sample_gvcf.vcf.gz

$ GenomeAnalysisTK-3.1-1 -T GenotypeGVCFs -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -V -L targets.GRCh37.bed -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -o joint_vcf.vcf.gz

Above, note that if "-A AlleleBalance" is given to GenotypeGVCFs, GATK crashes with a NullPointerException (AlleleBalance.java, line 66).

The command above is heavily adapted from the current pipeline; do you know what I might be doing wrong with the new and improved HaplotypeCaller?

Thanks so much for your help, and if you need any further information, please let me know.


Created 2014-05-30 16:18:45 | Updated | Tags: haplotypecaller genotypegvcfs gvcf targeted-sequencing
Comments (2)

Hi

For targeted re-sequenced data, can we do variant calling using HaplotypeCaller in gVCF mode followed by using GenotypeGVCFs for joint genotyping ?

Would the following best practices work for targeted re-sequencing as well , except for VQSR as for targted resequencing GATK advices another way of filtering. https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq

Thanks, Tinu


Created 2014-05-27 10:53:30 | Updated 2014-05-27 10:55:27 | Tags: vcf multi-sample vcf-format genotypegvcfs
Comments (3)

Hi,

I'm doing a variant analysis of genomic DNA from 2 related samples. I followed the up-to-date Best practices using HaplotypeCaller in GVCF mode for both samples followed by GenotypeGVCF to compute a common vcf of variant loci. I'm looking at variants that would be sample2-specific (present in sample2 but not in sample1)

Here is a line of this file:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 chrIII 91124 . A AATAAGAGGAATTAGGCT 1132.42 . AC=2;AF=0.500;AN=4;DP=47;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=58.85;MQ0=0;QD=7.99 GT:AD:DP:GQ:PL 1/1:0,25:25:55:1167,55,0 0/0:.:22:33:0,33,495

In the Genotype Field, sample2.AD is a . (dot) meaning that no reads passed the Quality filters. However, sample2.DP=22 meaning that 22 reads covered this position. This line suggest that this variation is specific to sample1 (genotype HomVar 1/1) and is not present in sample2 (HomRef 0/0). But given the biological relationship between sample1 and 2 (the way they were generated), I doubt that this variation is true: it is very likely to be present in sample2 as well. It's a false

I have 416 loci like this. For the vast majority of them, sample1 and 2 likely share the same variation. But since it is not impossible that a very few of them are really sample1=HomVar and sample2=HomRef, could you suggest me a way to detect those guys? What about comparing sample1.PL(1/1) and sample2.PL(0/0) ? For example could you suggest a rule of thumb to determine their ratio ?


Created 2014-05-12 15:51:42 | Updated | Tags: genotypegvcfs
Comments (4)

Hi, I am currently unsure of how to interpret the output of GenotypeGVCFs when typing my ~600 samples.

Genotype format at a given locus with 2 alternate alleles: 0/2:16,17,0:33:99:453,501,972,0,471,420 --This is what I am expecting for each sample

Genotype format at another given locus with 2 alternate alleles: 0/2:4,1,5,3,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0:15:20:85,46,133,0,111,358,82,64,23,190,20,60,49,41,45,20,60,49,41,45,45 --I really don't understand this

I have written several down-stream scripts which rely on this format being consistent -- why are there so many fields in some of these lines? I apologize if this is very obvious and documented somewhere, but I have tried searching with no results.

Thanks, -Briana


Created 2014-05-08 23:30:18 | Updated | Tags: combinevariants vcf genotypegvcfs
Comments (2)

I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCF can generate files where some samples have fewer fields than the format column.

For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCF: 1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 ..............

The second to last sample entry is ./.:.:1 (3 fields), while the format entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one of the software that I use will not parse VCF files when 1 < # sample fields < # format fields. If sample entries were extended by ":." for every empty FORMAT field (unless only . or ./. was present in sample column), it would make the file parsable.

It's not too hard of a manual fix, but it might be nice to add the functionality into the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields.


Created 2014-04-29 23:32:59 | Updated | Tags: genotypegvcfs
Comments (36)

hello,

I am doing a population genetic analysis and am consequently very interested in obtaining high confidence monomorphic sites.

I used GATK 3.1 HaplotypeCaller in the new incremental variant discovery pipeline and then ran GenotypeGVCFs using the -inv option so as to print non variant sites after combining.

A monomorphic record looks like this:

scaffold_1 202986 . G . . . . GT:AD:DP:PL ./.:81:81:0 ./.:6:12:0 ./.:0:0:0 .....etc

I had a couple of questions:

  1. It seems GT, GQ, and PL is no longer reported after combining for monomorphic sites? is there a reason why? I'm asking because I'd like to be able to distinguish high qual (first genotype) / low qual (second genotype) monomorphic sites so I can change the low qual sites to missing (third genotype).

  2. If getting high confidence monomorphic sites is my goal, would you recommend that I do my own parsing starting with the individual gvcfs rather than using the GenotypeGVCFs walker?

thanks much for your help!

best, Young Wha Lee


Created 2014-04-24 12:45:30 | Updated | Tags: genotypegvcfs
Comments (15)

Hi Geraldine,

We have used the 3.x workflow to process targeted resequencing for a large cohort. At the genotypeGVCFs stage, we get the following warning in the output:

ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr:pos has NN alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

Since our cohort is very large, it is extremely unlikely, but not impossible that more than 6 indel alleles exist at a given locus. Accordingly, allowing genotypeGVCFs to consider more possible alleles seems warranted.

While --max_alternate_alleles is an option for HaplotypeCaller, it is not recognized by genotypeGVCFs. Is this parameter fixed for genotypeGVCFs? If not, is there a way to change this parameter at the genotypeGVCFs stage? If not, can this parameter be modified by altering the combined GVCF header or rerunning HaplotypeCaller with a higher value for --max_alternate_alleles?

Thanks for all your hard work in this forum!


Created 2014-04-18 22:38:18 | Updated | Tags: genotypegvcfs
Comments (15)

Trying to do variantcalling using GATKs new pipeline using HaplotypeCaller and GenotypeGVCF.

Made gVCFs using HC and now trying to do joint genotyping using GenotypeGVCF

java -Xmx5G -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hs37d5.fa -V:VCF GVCF.list -L SureSelect_AllExon_V5_wo_chr.bed --dbsnp dbsnp_137.hg19.vcf -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A Coverage -A AlleleBalance -A AlleleBalanceBySample -A BaseCounts -A BaseQualityRankSumTest -A ChromosomeCounts -A ClippingRankSumTest -A Coverage -A DepthPerAlleleBySample -A FisherStrand -A GCContent -A HaplotypeScore -A HardyWeinberg -A HomopolymerRun -A InbreedingCoeff -A LowMQ -A MappingQualityRankSumTest -A MappingQualityZero -A MappingQualityZeroBySample -A NBaseCount -A QualByDepth -A RMSMappingQuality -A SpanningDeletions -A TandemRepeatAnnotator -A VariantType -o JointGenotyped.vcf

Error

ERROR A GATK RUNTIME ERROR has occurred (version 3.0-0-g6bad1c6):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)

Not sure why this is happening.

Thanks, Tinu


Created 2014-04-17 21:28:34 | Updated | Tags: genotypegvcfs combinegvcfs
Comments (25)

Hi there:

I'm trying to figure out if I need to run CombineGVCFs before GenotypeGVCFs. The documentation said "One would use this tool when needing to genotype too large a number of individual gVCFs". Is there a ballpark number for "too large a number of individuals"? For example, if I have 1000 individuals, should I use CombineGVCFs first? If so, is there a recommended chuck size? It seems CombineGVCFs does not take nt/nct option and when I tried to combine 600 files it's estimated to take a week.

Thanks.


Created 2014-04-14 21:55:57 | Updated | Tags: genotypegvcfs multiple-gvcfs gatk-3-0-haplotypecaller
Comments (7)

I am using GATK 3.0 Generated GVCFs using HaplotypeCaller and now trying to do Joint Genotyping using GenotypeGVCFs.

I have 70+ GVCFs with me, which I want to Joint Genotype. Is there a way to give all of these GVCFs together as input (--variant) to GATK-GenotypeGVCF. Something like '-I' option for BAM list. I know it is possible to use --variant VCF1 --variant VCF2 etc. But that would be tedious to give inputs like that when you have these many or more GVCFs

Thanks, Tinu


Created 2014-04-14 16:09:57 | Updated 2014-04-14 16:26:40 | Tags: genotypegvcfs
Comments (3)

Hello all!

I've been using GATK v3.1-1 for the last week or so. I've come across an error when I try and perform the Joint Genotyping on 5 different samples (family members). All 5 samples were processed through individually (following the Best Practices Guide) and the HaplotypeCaller was run in the gVCF mode. When I run GenotypeGVCFs I receive this error:

ERROR MESSAGE: Invalid argument value ' ' at position 22.

I'm not quite sure what to make of it because other projects with the same type of family size don't throw this error. Do you all know what it is referring to? I've even tried the nightly build (GenomeAnalysisTK-nightly-2014-04-14-g744780d) to see if that fixed the problem and it didn't. Here is my script:

java \
-Xmx22g \
    -Djava.io.tmpdir=$TEMP \
-jar /home/dkcrossm/tools/$GATK_DIR/GenomeAnalysisTK.jar \
-R $REF_FASTA.fa \
-T GenotypeGVCFs \
-ped $PEDIGREE \
-nt 4 \
--variant $SAMPLE1.hc.gvcf.vcf \
--variant $SAMPLE2.hc.gvcf.vcf \
--variant $SAMPLE3.hc.gvcf.vcf \
--variant $SAMPLE4.hc.gvcf.vcf \
--variant $SAMPLE5.hc.gvcf.vcf \
-o $SAMPLE.raw.snps.indels.vcf \
--dbsnp $DBSNP \    

I am also seeing this in a 16 sample run as well, but the position is 44 in this instance instead of 22. Any suggestions as to what it could be referring to? Let me know if you need anything else!

Thanks, David


Created 2014-04-04 15:34:10 | Updated | Tags: bug genotypegvcfs combinegvcfs
Comments (10)

I've run into the following bug while running GenotypeGVCFs:

##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample <ID redacted> does not have likelihoods at position 1:1115551

The input file in question is a gVCF produced by merging a large number of smaller gVCFs using CombineGVCFs (all tasks were run using version 3.1). What's happening is that the position 1115551 doesn't exist in that particular sample:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  <Sample_ID>
1       1115550 .       AC      A,<NON_REF>     118.73  .       BaseQRankSum=-0.377;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.72;MQ0=0;MQRankSum=-1.093;ReadPosRankSum=-0.811      GT:AD:DP:GQ:PL:SB       0/1:7,6,0:13:99:156,0,188,177,207,384:4,3,3,3
1       1115552 .       C       <NON_REF>       .       .       END=1115552     GT:DP:GQ:MIN_DP:PL      0/0:15:0:15:0,0,31

But when the sample is combined with other samples, that position gets filled in with a simple "0/0", without any PLs (or any of the other fields, including AD, DP, GQ, etc.), which causes the GenotypeGVCFs to choke.

I can imagine there might be other scenarios that will result in a "0/0" genotype field, so perhaps the easiest way to fix this would be to make sure that any "0/0" actually gets output as "./.:.:.:.:.".

Thanks,

Grace


Created 2014-04-02 14:47:06 | Updated 2014-04-02 14:47:46 | Tags: genotypegvcfs
Comments (8)

commands are:

java -Xmx10g -jar $gatk -T HaplotypeCaller \
 -R $refGenome \
 --dbsnp $dbSNP \
 -o s1.raw.var.g.vcf.gz \
 -I s1.bam \
 -pairHMM VECTOR_LOGLESS_CACHING \
 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 &&

java -Xmx10g -jar $gatk -T GenotypeGVCFs  \
 -R $refGenome \
 --variant s1.raw.var.g.vcf.gz \
 -o s1.raw.var.vcf.gz

Error in GenotypeGVCFs :

ERROR MESSAGE: Line 115: there aren't enough columns for line �������uN����ޫrf�ebh�����NH(����ٻat� �E�;J1����X��ʩá���Ç�����>��@��,0��' (we

expected 9 tokens, and saw 2 ), for input source: s1.raw.var.g.vcf.gz


Created 2014-03-27 19:32:44 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (1)

So I've just run GenotypeGVCFs and get some odd looking things. The FORMAT field is GT:AD:DP:GQ:PL and on most occasions, the values are 1/1:0,1,0:1:3:10,3,0, i.e. 5 fields. However, sometimes I get ./.:.:317, only 3 fields.

I use queue to run this, so my case class is this:

case class genotypeGVCF(gvcfs:List[File], outvcf:File) extends GenotypeGVCFs with ExternalCommonArgs with FourteenCoreJob with NineDayJob {
    @Input(doc="list of gvcfs to merge") val inGVCFs = gvcfs
    @Output(doc="SNP calls in VCF format") val outVCF = outvcf
    this.variant = inGVCFs
    this.out = outVCF
    this.R = refGenome
    this.dbsnp = dbSNP
    this.jobName = "genotypeGVCF " + outvcf.getName()
    this.analysisName = this.jobName
    this.isIntermediate = false
    this.nt = 14
    this.intervals = Seq(targetIntervals)
    this.memoryLimit = 112
  }