# Tagged with #genotypegvcfs 1 documentation article | 4 announcements | 68 forum discussions

Created 2014-04-03 20:20:08 | Updated 2014-10-22 19:22:34 | Tags: haplotypecaller genotypegvcfs combinegvcfs gvcf joint-analysis

### Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

### Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

### General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

### The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

### Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta


Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)


which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

#### Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0


Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Created 2015-05-15 04:52:05 | Updated | Tags: haplotypecaller release-notes genotypegvcfs gatk3

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights to be published soon.

Note that the release is in progress at time of posting -- it may take a couple of hours before the new GATK jar file is updated on the downloads page.

### New tool

• ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

### HaplotypeCaller & GenotypeGVCFs

• Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
• Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
• Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
• Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
• Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
• Initialize annotations so that --disableDithering actually works.
• Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
• Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
• Added the RGQ (Reference GenotypeQuality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic ohm-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
• Removed GenotypeSummaries from default annotations.
• Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
• Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
• Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
• Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
• Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
• Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
• Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
• Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).

### CombineGVCFs

• Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
• Fixed a bug that caused the tool to stop processing after the first contig.
• Fixed a bug where the wrong REF allele was output to the combined gVCF.

### VariantRecalibrator

• Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
• VQSR VCF header command line now contains annotations and tranche levels.

### SelectVariants

• Added -trim argument to trim (simplify) alleles to a minimal representation.
• Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
• Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
• Added --keepOriginalDP argument.

### VariantAnnotator

• Improvements to the allele trimming functionalities.
• Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.

### CalculateGenotypePosteriors

• Fixed user-reported bug featuring "trio" family with two children, one parent.
• Added error handling for genotypes that are called but have no PLs.

### Various tools

• BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
• DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
• SplitSamFile: Fixed a bug that caused a NullPointerException.
• SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
• IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
• CombineVCFs: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
• VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
• GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
• ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
• CatVariants: Now allows different input / output file types, and spaces in directory names.
• VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.

• Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
• Handle X and = CIGAR operators appropriately
• Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).

### Annotations

• Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
• Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
• Changed GC Content value type from Integer to Float.
• Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
• Made annotators emit a warning if they can't be applied.

### GATK Engine & common features

• Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
• Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
• Handle X and = CIGAR operators appropriately.
• Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
• Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now going to stderr.
• Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
• Fixed BCF writing when FORMAT annotations contain arrays.

### Queue

• Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
• Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
• Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.

### Documentation

• Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

### For developers

• Upgraded SLF4J to allow new convenient logging syntaxes.
• Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
• Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
• Collected VCF IDs and header lines into one place (GATKVCFConstants).

Created 2014-10-23 18:53:52 | Updated 2015-05-12 17:24:14 | Tags: Troll haplotypecaller ploidy release-notes genotypegvcfs gatk3 genotyperefinement

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

## Haplotype Caller

• Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
• Physical phasing information is output by default in new sample-level PID and PGT tags.
• Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Fixed IndexOutOfBounds error associated with tail merging.

## Variant Recalibrator

• New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.

## GenotypeGVCFs

• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
• Changed the default GVCF GQ Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution.
• Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.

## RNAseq analysis

• Bug fixes for working with unmapped reads.

## CalculateGenotypePosteriors

• New annotation for low- and high-confidence possible de novos (only annotates biallelics).
• FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
• Restricted population priors based on discovered allele count to be valid for 10 or more samples.

## DepthOfCoverage

• Fixed rare bug triggered by hash collision between sample names.

## SelectVariants

• Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.

• Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
• The performance penalty associated with filtering by read group has been essentially eliminated.

## Annotations

• StrandOddsRatio is now a standard annotation that is output by default.
• We used to output zero for FS if there was no data available at a site, now we omit FS.
• Extensive rewrite of the annotation documentation.

## Queue

• Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
• Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton)

## Documentation

• We now generate PHP files instead of HTML.
• We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.

## Miscellaneous

• Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
• htsjdk updated to version 1.120.1620

Created 2014-07-15 03:54:06 | Updated 2014-10-23 17:58:36 | Tags: variantrecalibrator haplotypecaller selectvariants variantannotator release-notes catvariants genotypegvcfs gatk-3-2

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.

## Haplotype Caller

• Various improvements were made to the assembly engine and likelihood calculation, which leads to more accurate genotype likelihoods (and hence better genotypes).
• Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
• The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
• Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
• Fixed small bug where indel discovery was inconsistent based on the active-region size.
• Removed scary warning messages for "VectorPairHMM".
• Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
• When we subset PLs because alleles are removed during genotyping we now also subset the AD.
• Fixed bug where reference sample depth was dropped in the DP annotation.

## Variant Recalibrator

• The -mode argument is now required.
• The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

## AnalyzeCovariates

• The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

## Variant Annotator

• SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

## Genotype GVCFs

• Added missing arguments so that now it models more closely what's available in the Haplotype Caller.
• Fixed recurring error about missing PLs.
• No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
• --includeNonVariantSites should now be working.

## Select Variants

• The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

## Indel Realigner

• Now throws a user error when it encounters reads with I operators greater than the number of read bases.
• Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.

## CalculateGenotypePosteriors

• Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
• Now computes genotype posteriors using likelihoods from all members of the trio.
• Added annotations for calling potential de novo mutations.
• Now uses PP tag instead of GP tag because posteriors are Phred-scaled.

## Cat Variants

• Can now process .list files with -V.
• Can now handle BCF and Block-Compressed VCF files.

## Validate Variants

• Now works with gVCF files.
• By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

## FastaAlternateReferenceMaker

• Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.

## Miscellaneous

• Refactored maven directories and java packages replacing "sting" with "gatk".
• Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
• Added a new read transformer that refactors NDN cigar elements to one N element.
• Now a Tabix index is created for block-compressed output formats.
• Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
• Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
• We now check that output files are specified in a writeable location.
• We now allow blank lines in a (non-BAM) list file.
• Added legibility improvements to the Progress Meter.
• Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
• Made IntervalSharder respect the IntervalMergingRule specified on the command line.
• Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.

• This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

## Haplotype Caller

• Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
• Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
• Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
• Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
• Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.

## CombineGVCFs

• This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at all at once to GenotypeGVCFs.

## GenotypeGVCFs

• This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.

• This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

## Unified Genotyper

• Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

## Variant Recalibrator

• Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

• Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

## Variant Annotator

• Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
• Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
• In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

## Combine Variants

• Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

## Select Variants

• Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
• Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.

## CalculateGenotypePosteriors

• Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
• It no longer strips out the AD annotation from genotypes.
• AC/AF/AN counts are updated after fixing genotypes.
• Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

## Indel Realigner

• Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

• Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.

## Miscellaneous

• Thanks to Phillip Dexheimer for several Queue related fixes and patches.
• Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
• Providing an empty BAM list no results in a user error.
• Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
• Logger output now goes to STDERR instead of STDOUT.
• Picard, Tribble, and Variant jars updated to version 1.107.1683.

Created 2015-05-20 13:06:46 | Updated | Tags: haplotypecaller java genotypegvcfs

Dear GATK,

I used the HaplotypeCaller with "-dcov 500 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000" to produce 60 gvcf files, that worked fine. However, GenotypeGVCFs gets stuck on a position and runs out of memory after about 24hours, even when I allocate 240Gb. Testing a short region of 60kb does not help. Here was my command line: software/jre1.7.0_25/bin/java -Xmx240g -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Reference.fasta -L chrom14:2240000-2300000 --variant 60samples_gvcf.list -o output.vcf

If I split my list of 60 gvcf files into two lists of 30 samples each, GenotypeGVCFs works fine for both batches within 15 minutes (~10Gb of memory).
I tested with 47 samples, it took 8 hours (31gb of memory) for a 60kb region. Once I use more than ~55 samples, it takes forever and crashes.

Any help will be much appreciated! Thanks,

Antoine

Created 2015-05-07 09:22:47 | Updated 2015-05-07 10:09:41 | Tags: haplotypecaller format genotypegvcfs info-field

I have a potential bug running GATK GenotypeGVCFs. It complains that there is a DP in the INFO field, but in my haplotypecaller-generated -mg.g.vcf.gz's I do not have a DP in the info, I do have DP in the FORMAT field though, but that's present in the headers as shown below the error output.

Any idea what could be the problem?

INFO  18:30:12,694 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:30:12,698 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-geee94ec, Compiled 2015/03/09 14:27:22
INFO  18:30:12,699 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  18:30:12,706 HelpFormatter - Program Args: -l INFO -T GenotypeGVCFs -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -o /dev/stdout -ploidy 2 --num_threads 32 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SeqCap/ex100/110624_MM10_exome_L2R_D02_EZ_HX1-ex100.bed --max_alternate_alleles 20 -V:3428_10_14_SRO_185_TGGCTTCA-mg,VCF 3428_10_14_SRO_185_TGGCTTCA-mg.g.vcf.gz -V:3428_11_14_SRO_186_TGGTGGTA-mg,VCF 3428_11_14_SRO_186_TGGTGGTA-mg.g.vcf.gz -V:3428_12_13_SRO_422_TTCACGCA-mg,VCF 3428_12_13_SRO_422_TTCACGCA-mg.g.vcf.gz -V:3428_13_13_SRO_492_AACTCACC-mg,VCF 3428_13_13_SRO_492_AACTCACC-mg.g.vcf.gz -V:3428_14_13_SRO_493_AAGAGATC-mg,VCF 3428_14_13_SRO_493_AAGAGATC-mg.g.vcf.gz -V:3428_15_14_SRO_209_AAGGACAC-mg,VCF 3428_15_14_SRO_209_AAGGACAC-mg.g.vcf.gz -V:3428_16_14_SRO_218_AATCCGTC-mg,VCF 3428_16_14_SRO_218_AATCCGTC-mg.g.vcf.gz -V:3428_17_14_SRO_201_AATGTTGC-mg,VCF 3428_17_14_SRO_201_AATGTTGC-mg.g.vcf.gz -V:3428_18_13_SRO_416_ACACGACC-mg,VCF 3428_18_13_SRO_416_ACACGACC-mg.g.vcf.gz -V:3428_19_14_SRO_66_ACAGATTC-mg,VCF 3428_19_14_SRO_66_ACAGATTC-mg.g.vcf.gz -V:3428_1_13_SRO_388_GTCGTAGA-mg,VCF 3428_1_13_SRO_388_GTCGTAGA-mg.g.vcf.gz -V:3428_20_14_SRO_68_AGATGTAC-mg,VCF 3428_20_14_SRO_68_AGATGTAC-mg.g.vcf.gz -V:3428_21_14_SRO_210_AGCACCTC-mg,VCF 3428_21_14_SRO_210_AGCACCTC-mg.g.vcf.gz -V:3428_22_14_SRO_256_AGCCATGC-mg,VCF 3428_22_14_SRO_256_AGCCATGC-mg.g.vcf.gz -V:3428_23_14_SRO_270_AGGCTAAC-mg,VCF 3428_23_14_SRO_270_AGGCTAAC-mg.g.vcf.gz -V:3428_24_13_SRO_452_ATAGCGAC-mg,VCF 3428_24_13_SRO_452_ATAGCGAC-mg.g.vcf.gz -V:3428_2_13_SRO_399_GTCTGTCA-mg,VCF 3428_2_13_SRO_399_GTCTGTCA-mg.g.vcf.gz -V:3428_3_13_SRO_461_GTGTTCTA-mg,VCF 3428_3_13_SRO_461_GTGTTCTA-mg.g.vcf.gz -V:3428_4_13_SRO_462_TAGGATGA-mg,VCF 3428_4_13_SRO_462_TAGGATGA-mg.g.vcf.gz -V:3428_5_13_SRO_465_TATCAGCA-mg,VCF 3428_5_13_SRO_465_TATCAGCA-mg.g.vcf.gz -V:3428_6_13_SRO_402_TCCGTCTA-mg,VCF 3428_6_13_SRO_402_TCCGTCTA-mg.g.vcf.gz -V:3428_7_13_SRO_474_TCTTCACA-mg,VCF 3428_7_13_SRO_474_TCTTCACA-mg.g.vcf.gz -V:3428_8_13_SRO_531_TGAAGAGA-mg,VCF 3428_8_13_SRO_531_TGAAGAGA-mg.g.vcf.gz -V:3428_9_14_SRO_166_TGGAACAA-mg,VCF 3428_9_14_SRO_166_TGGAACAA-mg.g.vcf.gz
INFO  18:30:12,714 HelpFormatter - Executing as roel@utonium.nki.nl on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13.
INFO  18:30:12,714 HelpFormatter - Date/Time: 2015/05/06 18:30:12
INFO  18:30:12,715 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:30:12,715 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:30:15,963 GenomeAnalysisEngine - Strictness is SILENT
INFO  18:30:16,109 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  18:30:29,705 IntervalUtils - Processing 101539431 bp from intervals
WARN  18:30:29,726 IndexDictionaryUtils - Track 3428_10_14_SRO_185_TGGCTTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_11_14_SRO_186_TGGTGGTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_12_13_SRO_422_TTCACGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_13_13_SRO_492_AACTCACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_14_13_SRO_493_AAGAGATC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_15_14_SRO_209_AAGGACAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_16_14_SRO_218_AATCCGTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_17_14_SRO_201_AATGTTGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_18_13_SRO_416_ACACGACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_19_14_SRO_66_ACAGATTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_1_13_SRO_388_GTCGTAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_20_14_SRO_68_AGATGTAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_21_14_SRO_210_AGCACCTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_22_14_SRO_256_AGCCATGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_23_14_SRO_270_AGGCTAAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_24_13_SRO_452_ATAGCGAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_2_13_SRO_399_GTCTGTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_3_13_SRO_461_GTGTTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_4_13_SRO_462_TAGGATGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_5_13_SRO_465_TATCAGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_6_13_SRO_402_TCCGTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_7_13_SRO_474_TCTTCACA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_8_13_SRO_531_TGAAGAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
WARN  18:30:29,735 IndexDictionaryUtils - Track 3428_9_14_SRO_166_TGGAACAA-mg doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  18:30:29,749 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine
INFO  18:30:29,878 GenomeAnalysisEngine - Preparing for traversal
INFO  18:30:29,963 GenomeAnalysisEngine - Done preparing for traversal
INFO  18:30:29,964 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  18:30:29,965 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  18:30:29,966 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  18:30:30,562 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
INFO  18:31:00,420 ProgressMeter -       1:4845033         0.0    30.0 s      50.3 w        0.0%    46.7 h      46.7 h
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IllegalStateException: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. ##### ERROR ------------------------------------------------------------------------------------------  for f in *.g.vcf.gz; do echo -e "\n--$f --"; zcat "$f" | sed -n -r "/^#.*DP/p;/^1\t4839315\t/{p;q;}"; done ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:22:0:21:0,0,432 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:20:0:20:0,0,410 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,773 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:25:2:25:0,3,790 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:33:0:33:0,0,837 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:23:31:23:0,31,765 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 0 . ClippingRankSum=-0.578;MLEAC=0,0;MLEAF=0.00,0.00 GT:DP:GQ:PL:SB 0/0:21:39:0,39,488,60,491,512:20,0,0,0 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:18:0:18:0,0,514 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,810 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:33:0:33:0,0,812 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:28:0:25:0,0,624 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 0.08 . ClippingRankSum=-0.189;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:17:20:20,0,311,62,320,382:14,0,3,0 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 6.76 . ClippingRankSum=-0.374;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:25:43:43,0,401,102,417,519:20,0,3,2 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 0 . ClippingRankSum=-1.095;MLEAC=0,0;MLEAF=0.00,0.00 GT:DP:GQ:PL:SB 0/0:23:1:0,1,395,56,406,460:19,0,0,0 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:28:0:28:0,0,626 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 5.99 . ClippingRankSum=-0.584;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:18:42:42,0,293,84,305,388:13,1,3,1 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:22:0:22:0,0,558 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G GA,<NON_REF> 6.76 . ClippingRankSum=0.850;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:19:43:43,0,262,87,274,361:12,3,4,0 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GA G,<NON_REF> 16.82 . ClippingRankSum=-0.784;MLEAC=1,0;MLEAF=0.500,0.00 GT:DP:GQ:PL:SB 0/1:21:54:54,0,352,102,367,470:16,0,4,1 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839317 GT:DP:GQ:MIN_DP:PL 0/0:26:0:25:0,0,419 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:30:0:30:0,0,771 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839315 GT:DP:GQ:MIN_DP:PL 0/0:34:77:34:0,78,1136 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . G <NON_REF> . . END=4839316 GT:DP:GQ:MIN_DP:PL 0/0:26:0:20:0,0,397 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> 1 4839315 . GAA G,GA,<NON_REF> 22.75 . ClippingRankSum=-2.181;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00 GT:DP:GQ:PL:SB 0/2:11:22:60,22,209,0,87,104,63,153,113,176:4,2,3,0  Created 2015-04-09 18:55:24 | Updated | Tags: allelebalancebysample variantannotator genotypegvcfs strandbiasbysample I am struggling with adding AlleleBalanceBySample and StrandBiasBySample to my joint VCF file. The data through joint VCF were generated with GATK 3.2-2, and I have tried using the VariantAnnotator as well as repeating the GenotypeGVCF step specifying the annotations. I have also tried using the GenotypeGVCFs tool from 3.3-0, but am trying to avoid running the HC again in the interest of time. For the strand bias by sample I can see why the VariantAnnotator approach would fail, as the information is not contained in the joint VCF file, but it is there in the GVCF. For the allele balance by sample it seems the info is already in the joint VCF, so either VariantAnnotation or rerunning GenotypeGVCFs seem like they should work. I should add that when I do try to add these annotations using GenotypeGVCFs, they are noted in the header ##FORMAT section, but not in the actual F:O:R:M:A:T. Any clues or hints (or admonitions!) much appreciated. -erikt Created 2015-03-26 13:23:26 | Updated | Tags: genotypegvcfs I'm trying to use genotypeGCFs on 350 small gVCF files (from bacterial genomes) and I'm getting this "clock drift" error: INFO 13:02:46,006 ProgressMeter - chr1:44001 0.0 3.0 h 15250.3 w 2.0% 6.3 d 6.2 d INFO 13:06:30,421 ProgressMeter - chr1:44001 0.0 3.0 h 15250.3 w 2.0% 6.4 d 6.3 d INFO 14:38:28,521 ProgressMeter - chr1:44001 0.0 3.1 h 15250.3 w 2.0% 6.6 d 6.5 d WARN 16:24:14,924 SimpleTimer - Clock drift of -1,425,286,806,691,525,163 - -1,425,286,778,358,860,637 = 28,332,664,526 nanoseconds detected, vs. max allowable drift of 5,000,000,000. Assuming checkpoint/restart event. WARN 18:23:08,307 SimpleTimer - Clock drift of -1,425,286,778,358,827,793 - -1,425,286,806,691,525,163 = 28,332,697,370 nanoseconds detected, vs. max allowable drift of 5,000,000,000. Assuming checkpoint/restart event. INFO 22:55:24,502 ProgressMeter - chr1:44001 0.0 0.0 s 0.0 s 2.0% 0.0 s 0.0 s INFO 02:46:17,060 ProgressMeter - chr1:44001 0.0 14.1 h 15250.3 w 2.0% 4.3 w 4.2 w Any ideas what that is? Created 2015-03-26 08:10:25 | Updated | Tags: genotypegvcfs Hi there! i would like to run the whole pipeline on one sample for testing the pipeline. here are some wrong in the genotyping step. commands are: # Step9. Call Variants java -Xmx5g -jar$gatk -T HaplotypeCaller -R $ref_dir/ucsc.hg19.fasta -I$bam_dir/T-SZ-03-1.dedupped.realigned.recal.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o $other_dir/sample.raw.snps.indels.vcf # Joint Genotyping java -Xmx5g -jar$gatk -T GenotypeGVCFs -R $ref_dir/ucsc.hg19.fasta --variant$other_dir/sample.raw.snps.indels.vcf -o $other_dir/test.raw.snps.indels.g.vcf Error in GenotypeGVCFs : INFO 09:25:33,478 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:25:33,480 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 09:25:33,480 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 09:25:33,481 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 09:25:33,484 HelpFormatter - Program Args: -T GenotypeGVCFs -R /media/LAB636/01.06.job/Ref/ucsc.hg19.fasta --variant /media/LAB636/01.06.job/test/data/other/sample.raw.snps.indels.vcf -o /media/LAB636/01.06.job/test/data/other/test.raw.snps.indels.g.vcf INFO 09:25:33,486 HelpFormatter - Executing as starlcx@starlcx-desktop on Linux 2.6.32-38-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13. INFO 09:25:33,486 HelpFormatter - Date/Time: 2015/03/26 09:25:33 INFO 09:25:33,487 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:25:33,487 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:25:33,823 GenomeAnalysisEngine - Strictness is SILENT INFO 09:25:33,893 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 09:25:34,936 GenomeAnalysisEngine - Preparing for traversal INFO 09:25:34,943 GenomeAnalysisEngine - Done preparing for traversal INFO 09:25:34,944 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 09:25:34,944 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 09:25:34,945 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 09:25:35,005 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE:** The list of input alleles must contain as an allele but that is not the case at position 150; please use the Haplotype Caller with gVCF output to generate appropriate records** ##### ERROR ------------------------------------------------------------------------------------------ there is my confusion： 1. As you know, I run the pipeline on just one sample. Whether do I need to do joint genotyping after call variants with HC ? 2. The documents here mentions : Genotypes any number of gVCF files that were produced by the Haplotype Caller into a single joint VCF file. So is this means the input here should be gVCF files? But the output of HC are vcf file. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php 3.what leads to the error message above? Thanks! :) Created 2015-03-21 09:14:04 | Updated | Tags: genotypegvcfs compressed-g-vcf Hi Team, I'm getting WARN 21:19:30,478 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation when processing gzipped g.vcf files produced by HaplotypeCaller (via -o foo.g.vcf.gz, as suggested by @Geraldine_VdAuwera in blog post 3893) with GenotypeGVCFs. This results in dramatic increases in run time (makes sense if GenotypeGVCFs un-compresses the files), and memory requirements (why ??) for GenotypeGVCFs compared to processing the gvcf for same bam files if HC outfiles are unzipped. Most batches that previously completed with 4x8GB RAM now produce java.lang.OutOfMemoryError: Java heap space errors even with 4X64GB! Could you please advise whether this warning is expected behaviour? If yes, what exactly is missing (can't see much difference in unzipped vs gzipped vcf headers), and can this be added somehow?  Created 2015-03-20 19:25:30 | Updated | Tags: genotypegvcfs multi-allelic In the output of GenotypeGVCFs, are the ALT alleles in those multi-allelic sites ordered by descending frequencies? Created 2015-03-19 14:50:22 | Updated | Tags: haplotypecaller genotypegvcfs Hello! I would like to run GenotypeGVCFs on 209 WES, called with HC (--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000). When I run GenotypeGVCFs, with this command (computing nodes have 8 cores and 24G of memory) : java -Xmx24g -jar$GATK_JAR \
-R Homo_sapiens.GRCh37_decoy.fa \
-T GenotypeGVCFs \
-nt 8 \
-V gvcf.all.list \
-o calls.vcf


It estimates a huge runtime and just dies hanging:

INFO  10:27:00,790 HelpFormatter - --------------------------------------------------------------------------------
INFO  10:27:00,795 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO  10:27:00,796 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  10:27:00,800 HelpFormatter - Program Args: -R Homo_sapiens.GRCh37_decoy.fa -T GenotypeGVCFs -nt 8 -V gvcf.all.list -o calls.vcf
INFO  10:27:00,810 HelpFormatter - Executing as emixaM@r107-n50 on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-ea-b07.
INFO  10:27:00,810 HelpFormatter - Date/Time: 2015/03/19 10:27:00
INFO  10:27:00,811 HelpFormatter - --------------------------------------------------------------------------------
INFO  10:27:00,811 HelpFormatter - --------------------------------------------------------------------------------
INFO  10:27:04,719 GenomeAnalysisEngine - Strictness is SILENT
INFO  10:27:04,882 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  10:27:41,565 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 8 processors available on this machine
INFO  10:27:43,169 GenomeAnalysisEngine - Preparing for traversal
INFO  10:27:43,179 GenomeAnalysisEngine - Done preparing for traversal
INFO  10:27:43,179 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  10:27:43,180 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  10:27:43,180 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  10:27:44,216 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
INFO  10:28:15,035 ProgressMeter -       1:1000201         0.0    31.0 s      52.7 w        0.0%    27.0 h      27.0 h
INFO  10:29:17,386 ProgressMeter -       1:1068701         0.0    94.0 s     155.8 w        0.0%    76.7 h      76.6 h
INFO  10:30:18,055 ProgressMeter -       1:1115101         0.0     2.6 m     256.1 w        0.0%     5.0 d       5.0 d


What did I do wrong?

Cheers!

Created 2015-03-10 18:27:56 | Updated | Tags: haplotypecaller genotypegvcfs gvcf

I run the following command for "GenotypeGVCFs" for 3 VCF files output of HaplotypeCaller as below:

java data/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \ -R data/ucsc.hg19.fasta \ -T GenotypeGVCFs \ --variant data/47V_post.ERC.vcf \ --variant data/48V_post.ERC.vcf \ --variant data/49V_post.ERC.vcf \ --out data/Combined_geno_3files.vcf

but in a final VCF output there is no rsID information and all rows are "." what is the problem? I am really confused. Could you please advise how to get SNP-ID in the output VCF

Thanks

Created 2015-03-09 16:48:43 | Updated | Tags: dp gt genotypegvcfs

Hi,

Sometimes when a sample has no reads, it still can have a genotype. I used GenotypeGVCFs (without CombineGVCFs) to generate a vcf with several samples. If I extract data for one sample, I sometimes see missing genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  ./.:0,0:0:.:.:.:.


which seem ok to me. Note that DP = 0.

but more rarely, I see a genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  0/0:0,0:.:3:.:.:0,3,45


and this is more questionnable. Note that DP = "."

Can you explain how genotype can be called without reads ? (I suspect haplotype to be involved, but without read I do not understand) any idea about the DP value ? why has it a value with missing genotype ?

thank you.

Created 2015-03-02 11:38:32 | Updated | Tags: genotypegvcfs strandallelecountsbysample

Dear GATK team:

Here is just a quick follow up question on the SAC (StrandAlleleCountsBySample) annotation. I've successfully get the SAC annotation using GenotypeGVCFs with the nighty build, but there is something I notice to be a little wired. This is a line from the final vcf i got from running GenotypeGVCFs. Really glad that the SAC annotation is there, but in this example, for a biallelic site, is it right to have the SAC annotation with 3 sets of counts for "reference/first_allele/second_allele" while there is actually no second alternative allele? Or is it have something to do with the allele from GVCF format. Many thanks!

The example: 1 752894 rs3131971 T C 30876.92 . BaseQRankSum=-7.360e-01;DB;DP=1430;MLEAC=589;MLEAF=0.785;MQ=32.47;MQ0=0;MQRankSum=0.727;ReadPosRankSum=-3.580e-01;SOR=4.437 GT:AD:DP:GQ:PGT:PID:PL:SAC 0/1:2,4:.:54:.:.:118,0,54:0,2,0,4,0,0 ./.:2,0:2 0/1:2,4:.:53:.:.:119,0,53:0,2,0,4,0,0 1/1:0,2:.:6:.:.:55,6,0:0,0,0,2,0,0

BTW, if I have a follow up question on a previous one I've asked, should I just reply to the old question or open a new one like this? Thanks again!

Created 2015-03-02 09:39:51 | Updated | Tags: haplotypecaller genotypegvcfs

Hi GATK team,

In the documentation of GenotypeGVCFs it is writen: "Input - One or more Haplotype Caller gVCFs to genotype" I have 3 questions regarding this tool: 1. I wonder, what is the meaning of running it on one sample? 2. I tried to run it on one sample, and noticed that the genotype quality is different than the one in the original gvcf file from HC. What is causing this difference? I'm asking this since running the tool on one sample means that there are no other samples to consider in recalculating the quality. 3. Last question - From your experience, what is the best way to analyze one exome sample? Should I run HC with the default genotype_mode parameter and do hard filtering? Should I run HC in GVCF mode, run GenotypeGVCFs and than do hard filtering? Any other suggestion?

Thank you for the answer, Maya

Created 2015-03-01 10:46:23 | Updated | Tags: genotypegvcfs strandbiasbysample strandallelecountsbysample

Hi there,

Thank you guys for developing & maintaining such a good software! I'm actually just starting to use GATK with Haplotypecaller. For me, there's a very useful annotation "StrandBiasBySample" or "StrandAlleleCountsBySample" that gives all the alignment information for a variant. It can be generated by HC, but after joint genotyping, it's just gone. There is a previous post asking about how to add this annotation to the final vcf, and Sheila mentioned she's already put in a request for adding this. So I'm just wandering, have you guys fixed this in one of the nighty build?

Many Thanks!!!

Created 2015-02-25 10:31:07 | Updated | Tags: genotypegvcfs combinegvcfs

Hi,

In the documentation for CombineGVCFs it says:

CombineGVCFs is meant to be used for hierarchical merging of gVCFs that will eventually be input into GenotypeGVCFs. One would use this tool when needing to genotype too large a number of individual gVCFs; instead of passing them all in to GenotypeGVCFs, one would first use CombineGVCFs on smaller batches of samples and then pass these combined gVCFs to GenotypeGVCFs.

Do you have any guidelines for this? I am trying to use genotypeGVCFS on 12 gVCF files and it doesn't work, so can you advise how I should I "pre-merge" them? Two batches of 6? Three batches of 3?

Thanks,

Mike

Created 2015-02-19 14:29:26 | Updated | Tags: haplotypecaller genotypegvcfs

Hi Sheila and Geraldine

When I run HaplotypeCaller (v3.3-0-g37228af) in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF I get the error:

Invalid command line: Argument ERC/gt_mode has a bad value: you cannot request reference confidence output and GENOTYPE_GIVEN_ALLELES at the same time

It is however strange that GENOTYPE_GIVEN_ALLELES is mentioned in the --max_alternate_alleles section of the GenotypeGVCFs documentation.

Maybe I'm missing something?

Thanks, Gerrit

Created 2015-02-04 08:52:18 | Updated 2015-02-04 08:53:33 | Tags: haplotypecaller illumina genotypegvcfs combinegvcfs

Hi GATK-ers,

I have been given ~2000 gVCFs generated by Illumina (one sample per gVCF). Though they are in standard gVCF format, they were generated by an Illumina pipeline (https://support.basespace.illumina.com/knowledgebase/articles/147078-gvcf-file if you're really curious) and not the Haplotype Caller. As a result (I think ... ), the GATK doesn't want to process them (I have tried CombineGVCFs and GenotypeGVCFs to no avail). Is there a GATK walker or some other tool that will make my gVCFs GATK-friendly? I need to be able to merge this data together to make it analyze-able because in single-sample VCF format it's pretty useless at the moment.

My only other thought has been to expand all the ref blocks of data and then merge everything together, but this seems like it will result in the creation of a massive amount of data.

Any suggestions you may have are greatly appreciated!!!

Sara

Created 2015-02-02 21:24:31 | Updated | Tags: vqsr dbsnp vqslod genotypegvcfs gvcf gq-pl

From my whole-genome (human) BAM files, I want to obtain: For each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data.

Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs: CHROM POS ID REF ALT chr1 1 . A # my data chr1 1 . A T # dbSNP I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T. or T/T. The call of isn't useful to me for the reason explained below.

How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller.

My detailed reason for needing this is below:

For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2. 3. We do not have sufficient evidence to make any call for position n.

I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not.

Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs with associated GQs and PLs, and GT=./.. However, GenotypeGVCF removes all of these, meaning that whenever the call by HaplotypeCaller was ./. (due to lack of evidence for variation), it isn't carried forward for use in VQSR.

Consequently, this seems to distinguish only between these two cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown).

This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know".

Douglas

Created 2015-01-23 16:55:57 | Updated | Tags: vqsr haplotypecaller bam gatk genotypegvcfs

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Alva

Created 2015-01-13 12:40:29 | Updated 2015-01-13 12:44:57 | Tags: genotypegvcfs

I am seeing an unexpected heterozygous genotype in the GenotypeGVCFs output. The region is at the edge of a long homopolymer, which presumably plays a role, but it is still not clear where the unexpected heterozygote is coming from. The final VCF is below. The unexpected call is 2:152402516A>T in Sample1. The call is 0/1, but I don't see any evidence for where that alternate allele is coming from. The input VCFs are below. Any suggestions? How is GenotypeGVCFs interpreting this situation? I am running the latest GATK 3.3. Jar.

Final VCF

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1        Sample2


Sample1 VCF

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
2       152402511       .       G       <NON_REF>       .       .       END=152402514   GT:DP:GQ:MIN_DP:PL      0/0:38:57:33:0,51,765
2       152402517       .       A       <NON_REF>       .       .       END=152402517   GT:DP:GQ:MIN_DP:PL      0/0:55:0:55:0,0,1431
2       152402518       .       A       <NON_REF>       .       .       END=152402621   GT:DP:GQ:MIN_DP:PL      0/0:49:99:37:0,101,1442


Sample2 VCF

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample2
2       152402514       .       T       <NON_REF>       .       .       END=152402514   GT:DP:GQ:MIN_DP:PL      0/0:63:99:63:0,114,1710
2       152402518       .       A       <NON_REF>       .       .       END=152402621   GT:DP:GQ:MIN_DP:PL      0/0:67:99:48:0,105,1710


Created 2015-01-07 21:16:08 | Updated | Tags: haplotypecaller genotypegvcfs

Hi

I am using GATK version 3.3 (latest release) and following the best practice for variant calling. Steps i am using is 1. Run haplotye caller and get the gVCF file 2. Run GenotypeGVCF across the samples to the get the VCF file

The Question is about the Genotype calling. Here is an example:

GVCF file looks like this for a particular position chr1 14677 . G . . END=14677 GT:DP:GQ:MIN_DP:PL 0/0:9:0:9:0,0,203

VCF file looks like this after genotyping with multiple samples. chr1 14677 . G A 76.56 PASS . . GT:AD:DP:GQ:PL 0/1:8,1:9:4:4,0,180

In GVCF file there was no support for the alternate allele, but VCF file shows that there are 1 read supporting the alternate call.

Thanks ! Saurabh

Created 2014-12-16 14:44:31 | Updated | Tags: java genotypegvcfs bcftools

Hi,

after running 5 exomes with GATK-v3.3 and HaplotypeCaller, I encountered a very low titv ration in my samples (~2.1) as VaraintEval report indicated. I tried running varaint filtration in these samples but I didn't see any imporvement in titv ratio nor any filtering done. therefore I filtered these with bcftools, after which the titv ratio improved to 2.5. Then when I tried running GenotypeGVCFs on these samples filtered with bcftools, I encountered the following error:

##### ERROR stack trace

java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double at java.lang.Double.compareTo(Double.java:49) at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:290) at java.util.ComparableTimSort.sort(ComparableTimSort.java:157) at java.util.ComparableTimSort.sort(ComparableTimSort.java:146) at java.util.Arrays.sort(Arrays.java:472) at java.util.Collections.sort(Collections.java:155) at org.broadinstitute.gatk.utils.MathUtils.median(MathUtils.java:999) at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.combineAnnotationValues(ReferenceConfidenceVariantContextMerger.java:73) at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:158) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:202) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:121) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:310) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

##### ERROR ------------------------------------------------------------------------------------------

any advice on solving this incident will be much appreciated

Victoria

Created 2014-12-15 10:17:16 | Updated | Tags: ploidy genotypegvcfs

Hello. I've run Haplotype caller for 19 samples in -ploidy 19 (19 is as high as I can get without getting an error for that, but that is a discussion opened elsewhere). Then, I've tried GenotypeGVCFs and get an ERROR (both command and ERROR below).

Just to add, I've run both HaplotypeCaller and GenotypeGVCFs with and without -maxAltAlleles 4 to try to limit the referred combination; however, the output is exactly the same ERROR message at the end of GenotypeGVCFs. Do you think there might be a bug or am I doing something wrong?

java -Djava.io.tmpdir=$mytmp -Xmx28g -jar GenomeAnalysisTK.jar -R$ref -T GenotypeGVCFs -o $gVCF.ploidy.vcf -V ploidy.list -ploidy 19 -maxAltAlleles 4 ERROR MESSAGE: the combination of ploidy (19) and number of alleles (21) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyz e this locus Created 2014-12-12 12:53:50 | Updated | Tags: ad dp genotypegvcfs Hi, I am running the GATK 3.2-2 pipeline and I am getting some odd results from the genotype VCF stage. It appears that incorrect AD results are being output. E.g. on the single sample gvcf I have: GC G, 522.73 . BaseQRankSum=-0.716;ClippingRankSum=0.219;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=66.41;MQ0=0;MQRankSum=1.681;ReadPosRankSum=-1.155 GT:AD:DP:GQ:PL:SB 0/1:20,18,0:38:99:560,0,633,620,687,1307:10,10,10,8 This agrees with what I see in IGV for this site (see attached image). However, when I joint call this site along with other gvcfs, the output for that sample looks like: 0/1:20,0:38:99:560,0,633 i.e. It appears to not output the expected 20,18 - but it still seems to be making the correct genotype call so I am assuming it is just an output bug and not something that is affecting calling, but it would be good to know for sure. Either way it is problematic as it is affecting our downstream depth filtering. This isn't the only example of this happening in our data. I am trying to identify more by looking for occasions where the AD fields sum to less than DP, am I right in thinking this should never be the case as AD is unfiltered compared to the filtered DP? Thanks a lot Dan Created 2014-12-09 21:47:52 | Updated | Tags: vcf genotypegvcfs gvcf Hi, I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is: 16050036 16050612 16050822 16050933 16051556 16051968 16051994 16052080 16052167 16052239 16052250 16052357 etc. whereas the initial set of sites in my final vcf file is: 16050822 16050933 16051347 16051497 16051556 16051968 16051994 16052080 16052167 16052169 16052239 16052357 etc. First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring. I also got this warning 3 times when running GenotypeGVCFs: WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3). FYI, this is the command I used for GenotypeGVCFs: java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22 (only running this for chr22) Thank you in advance, Alva Created 2014-12-03 16:57:35 | Updated | Tags: haplotypecaller genotypegvcfs If I want the variants to be called only if they fit the following criteria: 1) Min. total coverage for consideration of heterozygous is 10. 2) Min. coverage of each of the two observed major basecalls to be called heterozygous is 5. 3) Min. percentage of each of the two observed major basecalls in order to be called heterozygous is 20. 4) Min. coverage in order for a position to be called homozygous is 6. which command-line arguments in which tools (HaplotpyeCaller or GenotypeGVCFs) can I use to accomplish these? I cannot seem to find the proper arguments in the documentation. I apologize if I overlook. Thank you Created 2014-12-02 11:21:53 | Updated 2014-12-02 11:23:39 | Tags: genotypegvcfs Hi, I'm using GATKv3.3 to run the protocol of HC followed by CombineGVCFs in six batches of ~190 individuals each and finally GenotypeGVCFs of these six gVCFs. My failed command is below followed by the error message. In addition, I can tell you that my output file (1135g_.vcf) is empty even when it crashed after more than two hours of running time. Kindly, Alberto java -Djava.io.tmpdir=$mytmp -Xmx232g -jar $EBROOTGATK/GenomeAnalysisTK.jar -R$ref -T GenotypeGVCFs -nt 40 -o 1135g_.vcf -V 1135g_listof.list

##### ERROR ------------------------------------------------------------------------------------------

ERROR stack trace java.lang.ArrayIndexOutOfBoundsException: 10000 at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.generatePL(ReferenceConfidenceVariantContextMerger.java:357) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.mergeRefConfidenceGenotypes(ReferenceConfidenceVariantContextMerger.jav a:331) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:134) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:200) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:119) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)

ERROR ------------------------------------------------------------------------------------------ ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ERROR ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ERROR If not, please post the error message, with stack trace, to the GATK forum. ERROR Visit our website and forum for extensive documentation and answers to ERROR commonly asked questions http://www.broadinstitute.org/gatk ERROR ERROR MESSAGE: 10000 ERROR ------------------------------------------------------------------------------------------

Created 2014-11-26 22:46:02 | Updated | Tags: haplotypecaller genotypegvcfs

Hi, I just finished running HaplotypeCaller version 3.3-0 separately on 6 exome samples with the new best practices.

java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19/hg19_Ordered.fa -I K87/HG19_Analysis/K87-929_final.recalibrated_final.bam --dbsnp dbsnp_138_hg19_Ordered.vcf --pair_hmm_implementation VECTOR_LOGLESS_CACHING -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --output_mode EMIT_VARIANTS_ONLY -gt_mode DISCOVERY --pcr_indel_model CONSERVATIVE -o ./Haplotypes_929.vcf

Many Variant sites are called as homozygous alt (1/1), but none of these sites that are processed to infer haplotype are called as homozygous alt in their PGT field, they are all called as hets, PGT=0|1. for example:

The allelic depths agree with the phased genotype but out of all 6 exomes processed, not a single 1/1 is also phased as 1|1.

I checked all output vcfs with a simple grep combo: grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '0|1' - | wc -l = 19046 grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|0' - | wc -l = 79 grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|1' - | wc -l = 0

This seemed odd, but I continued with GenotyeGVCF:

java -Xmx32g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg19/hg19_Ordered.fa -V Haplotypes_450.vcf -V Haplotypes_452.vcf -V Haplotypes_925.vcf -V Haplotypes_926.vcf -V Haplotypes_927.vcf -V Haplotypes_929.vcf -D dbsnp_138_hg19_Ordered.vcf -ped K87/HG19_Analysis/K87_6.ped -o Haplotypes_K87_GVCFs.vcf

I'm looking at the output vcf as it's being generated and now there are homozygous alt calls but they conflict with the associated Allelic Depths:

Full Line: chr1 33957152 rs4403594 T G 3166.96 . AC=12;AF=1.00;AN=12;DB;DP=99;FS=0.000;GQ_MEAN=48.50;GQ_STDDEV=27.55;MLEAC=12;MLEAF=1.00;MQ=39.65;MQ0=0;NCC=0;QD=32.32;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,9:9:27:.:.:330,27,0 1/1:0,5:5:15:.:.:141,15,0 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 1/1:0,20:20:60:.:.:722,60,0 1/1:0,24:24:71:.:.:685,71,0 1/1:0,11:11:33:.:.:366,33,0

Can you help me interpret what seems to me as conflicting results?

Cheers,

Patrick

Created 2014-11-20 06:39:53 | Updated | Tags: combinevariants genotypegvcfs gvcftools

Hi,

I am merging several genome vcf files apparently created by gvcftools [1]. I noticed that CombineVariants is incorrectly summing allele counts (AC) in some corner cases, and was wondering if the format [2] was supported?

If the format should be supported by the CombineVariants tool (gvcf is VCF 4.1 conformant), I have isolated a minimal example that should be of help for bugfixing.

Thanks, Paweł

Created 2014-11-17 11:59:46 | Updated | Tags: haplotypecaller genotypegvcfs combinegvcfs

Hi all,

I am attempting to use the HaplotyperCaller / CombineGVCFs / GenotypeGVCFs to call variants on chrom X and Y of 769 samples (356 males, 413 females) sequenced at 12x coverage (WG sequening, but right not only calling X and Y).

I have called the samples according to the best practises using the HaplotypeCaller, using ploidy = 1 for males on X and Y and ploidy =2 for females on X, e.g.:

INFO 16:28:45,750 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T HaplotypeCaller -L X -ploidy 1 -minPruning 3 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I /target/gpfs2/gcc/groups/gonl/projects/trio-analysis/rawdata_release2/A102.human_g1k_v37.trio_realigned.bam --sample_name A102a -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/A102a.chrX.hc.g.vcf

Then I have used CombineGVCFs to combine my samples in batches of 100 samples. Now I am attempting to genotype them and I face the same issue on both X (males + females) and Y (males only): It starts running fine and then just hang on a certain position. At first it crashed asking for additional memory but now with 24Gb or memory it simply stays at a single position for 24hrs until my job eventually stops due to walltime. Here is the chrom X output: INFO 15:00:39,501 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T GenotypeGVCFs -ploidy 1 --dbsnp /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf -stand_call_conf 10 -stand_emit_conf 10 --max_alternate_alleles 4 -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.vcf -L X -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.1.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.2.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.3.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf INFO 15:00:39,507 HelpFormatter - Executing as lfrancioli@targetgcc15-mgmt on Linux 3.0.80-0.5-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 15:00:39,507 HelpFormatter - Date/Time: 2014/11/12 15:00:39 INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:40,951 GenomeAnalysisEngine - Strictness is SILENT INFO 15:00:41,153 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:57:53,416 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf.idx INFO 17:09:39,597 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf.idx INFO 18:21:00,656 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf.idx INFO 19:30:46,624 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf.idx INFO 20:22:38,368 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf.idx WARN 20:26:45,716 FSLockWithShared$LockAcquisitionTask - WARNING: Unable to lock file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx because an IOException occurred with message: No locks available. INFO 20:26:45,718 RMDTrackBuilder - Could not acquire a shared lock on index file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx, falling back to using an in-memory index for this GATK run. INFO 20:33:29,491 IntervalUtils - Processing 155270560 bp from intervals INFO 20:33:29,628 GenomeAnalysisEngine - Preparing for traversal INFO 20:33:29,635 GenomeAnalysisEngine - Done preparing for traversal INFO 20:33:29,636 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 20:33:29,637 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 20:33:29,638 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 20:33:29,948 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 20:33:59,642 ProgressMeter - X:65301 0.0 30.0 s 49.6 w 0.0% 19.8 h 19.8 h INFO 20:34:59,820 ProgressMeter - X:65301 0.0 90.0 s 149.1 w 0.0% 59.4 h 59.4 h ... INFO 20:52:01,064 ProgressMeter - X:177301 0.0 18.5 m 1837.7 w 0.1% 11.3 d 11.2 d INFO 20:53:01,066 ProgressMeter - X:177301 0.0 19.5 m 1936.9 w 0.1% 11.9 d 11.9 d ... INFO 14:58:25,243 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.0 w 95.9 w INFO 14:59:38,112 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.1 w 96.0 w INFO 15:00:47,482 ProgressMeter - X:177301 0.0 18.5 h 15250.3 w 0.1% 96.2 w 96.1 w =>> PBS: job killed: walltime 86440 exceeded limit 86400 I would really appreciate if you could give me some pointer as how to handle this situation. Thanks! Laurent Created 2014-11-14 03:30:08 | Updated | Tags: variantannotator genotypegvcfs I have constructed a VCF using GenotypeGVCFs. Due to downsampling and filtering imposed by HaplotypeCaller/GenotypeGVCFs, four fields of interest within the VCF do not reflect the data that went into building the file. I would like to update a these fields to reflect all read data used in the pipeline, but I have been able to do so for just 3 of the 4. I have been able to update INFO MQ0, INFO DP, and sample-level (FORMAT) AD, but I have not been able to update sample-level (FORMAT) DP. I have run VariantAnnotator as follows. inVCFfn=../vcf/raw.vcf outVCFfn=../vcf/raw.annotated.vcf gatk=/srv/gs1/software/gatk/gatk-3.3.0/GenomeAnalysisTK.jar ref=/srv/gs1/projects/scg/Resources/GATK/b37/human_g1k_v37_decoy.fasta dbsnp=/srv/gs1/projects/scg/Resources/GATK/b37/dbsnp_137.b37.vcf java -Xmx10g \ -jar$gatk \
-T VariantAnnotator \
-R $ref \ -L$inVCFfn \
--variant $inVCFfn \ --dbsnp$dbsnp \
--out $outVCFfn \ --annotation Coverage \ --annotation DepthPerAlleleBySample \ --annotation MappingQualityZero \ [-I fileName.1.bam … -I fileName.810.bam]  The documentation for --annotation Coverage indicates that it should update both the INFO DP and the sample-level (FORMAT) DP, but the sample-level update seems not to be occurring. For example: $ awk '($2==7157834) { print$9, "|", $53 }' raw.vcf GT:AD:DP:GQ:PL | 0:4,0:4:99:0,135$ awk '($2==7157834) { print$9, "|", $53 }' raw.annotated.vcf GT:AD:DP:GQ:PL | 0:131,0:4:99:0,135  (Aside: I'm working with a haploid chromosome, hence the single-value GT and two-value AD and PL fields). Although AD was updated from "4,0" to "131,0", the sample-level "DP" remained at "4". Of course I could approximate DP as the sum of AD values, but since AD is pre-filter and sample-level DP is post-filter, this wouldn't quite be correct. I should note that when I run HaplotypeCaller/GenotypeGVCFs on a subset of the data, the sample-level DP values are very close to the sum of AD's, so it's not a matter of most reads being filtered out, but rather that they have been downsampled. Am I missing something? Thanks! David P.S. The INFO field DP and MQ0 are indeed updated as expected: $ awk '($2==7157834) { print$8 }' raw.vcf
AC=1;AF=6.993e-03;AN=143;DP=609;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=0;NCC=0;QD=30.16;SOR=1.259

$awk '($2==7157834) { print 8 }' raw.annotated.vcf AC=1;AF=6.993e-03;AN=143;DP=10750;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=2;NCC=0;QD=30.16;SOR=1.259  Created 2014-11-04 05:50:02 | Updated | Tags: genotypegvcfs In the past, I’ve found MQ0 ("Total Mapping Quality Zero Reads") to be a very useful INFO feature of putatively variable sites. However, in my experience, GenotypeGVCFs reports every site to have MQ0=0, even those for which UnifiedGenotyper reports MQ0 > 0 when run on the same data. Is this a bug, or am I missing something? Thanks! Created 2014-10-29 16:20:48 | Updated 2014-10-29 16:21:18 | Tags: haplotypecaller genotypegvcfs joint-calling exome-capture Hi, We have some exomes processed with GATK from one exome capture platform (Nimblegen SeqCap on HiSeq), and now I am going to analyze a small batch of exomes sequenced using a different platform (Nextera on NextSeq). I was wondering if and how GenotypeGVCFs can cope with gvcfs produced on two (or more) different exome capture platforms? Is joint-calling of such "heterogeneous" samples advised, or should I rather genotype the small set of equally processed bams separately? The problems I could foresee were: • differing targets (each batch of samples will have gaps in coverage due to target-constrained (-L) haplotypecalling). If this should cause problems for GenotypeGVCFs the haplotypes could be called again on superset of targets, or easier, the calling restricted to the intersection. • the alignment algorithm is different (BWA BWTSW vs BWA MEM) so the mapping quality could potentially differ (haven't checked). Is this something GATK is compensating for, like base qualities? • the sequencer and chemistry, but here I hope that the BQSR should help in removing the variation Any thoughts and comments appreciated, Thanks, Paweł Created 2014-10-23 17:22:04 | Updated | Tags: genotypegvcfs Hi, I'm trying to call genotypes at a set of given positions from ~5,000 samples. To do so, I first generate gvcf files using HaplotypeCaller for the region around a gene for all samples separately with HaplotypeCaller. Then I'm merging the single gvcf files into batches of 1000 and then I'm generating the genotypes at ~180 sites using GenotypeGVCFs and the flag --includeNonVariantSites. Everything runs without errors, but in the end I noticed that ~20 sites were missing from the final VCF file. I then looked into these sites and the single files in detail and found out that there are simply no genotypes called for some samples at some positions. I attached a small example. When I'm running the following command, no variant is in the output file: java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L region.bed --includeNonVariantSites \ -T GenotypeGVCFs \ --variant wrong.gvcf --variant correct.gvcf \ -o out.vcf The same command works when running it with only "correct.gvcf" as an input, but obviously not when running it with only "wrong.gvcf". I noticed that wrong.gvcf might show a variant at this exact position. Best, Thomas Created 2014-10-06 20:07:44 | Updated | Tags: genotypegvcfs Dear all, I have been using a relatively recent nightly build of GATK to deal with some previous bug fix, and I now see some odd things in my VCF format, after using genotypegVCF. Precisely some rows have this INFO field: GT:AD:DP:GQ:PGT:PID:PL with these new PGT and PID which are present only some of the time. I have no idea what this means (and it also messes up my data processing). Does anyone know what these fields are about? And, unless they are very useful, how to get rid of them? Thank you in advance, Created 2014-09-30 15:46:58 | Updated | Tags: haplotypecaller genotypegvcfs Following the the recommended best practice and using GATK v3.2, the raw vcf file created by haplotyecaller/GenotypeGVCFs does not contain any of AC, AN and AF tags. Is this an expected behaviour? This is a line of the g.vcf file generated from HaplotypeCaller: chr21 36104982 . AT A,ATT,<NON_REF> 0.39 . BaseQRankSum=-1.693;DP=35;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=1.874;ReadPosRankSum=-0.023 GT:AD:GQ:PL:SB 0/1:23,6,2,0:27:27,0,499,67,411,529,96,459,496,525:10,13,2,4 And the raw vcf file from GenotypeGVCFs: chr21 36122885 . A AT 7871.94 . BaseQRankSum=0.151;DP=493;FS=5.433;MLEAC=24;MLEAF=0.750;MQ=60.00;MQ0=0;MQRankSum=0.129;QD=19.53;ReadPosRankSum=0.257 GT:AD:DP:GQ:PL 0/1:13,8:.:99:162,0,238 1/1:0,21:.:62:536,62,0 1/1:0,24:.:71:610,71,0 0/1:12,12:.:99:245,0,207 1/1:0,27:.:80:693,80,0 1/1:0,34:.:99:877,101,0 0/1:31,13:.:99:210,0,624 0/0:30,0:30:65:0,65,1125 0/1:6,11:.:96:225,0,96 0/1:10,5:.:87:87,0,196 0/1:4,15:.:39:340,0,39 1/1:2,35:.:66:895,66,0 1/1:0,53:.:99:1376,161,0 1/1:0,13:.:39:328,39,0 1/1:0,24:.:74:662,74,0 1/1:1,29:.:65:702,65,0 My command for HaplotypeCaller: java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R myRef.fa -I my.bam -L chr21 --dbsnp dbSNP_chr21.vcf --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -stand_emit_conf 10 -stand_call_conf 30 -mbq 20 -nct 8 -A FisherStrand -A QualByDepth -A HaplotypeScore --annotation Coverage -o raw.g.vcf And GenotypeGVCFs: java -Xmx8g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R myRef.fa -V myVCF.list --dbsnp dbSNP_chr21.vcf -A FisherStrand -A QualByDepth -A HaplotypeScore -A Coverage -L chr21 -o myraw.vcf Created 2014-09-23 13:38:59 | Updated | Tags: genotypegvcfs hi, I did HaplotypeCaller + genotypeVCFs on 4 sample bam files (with low coverage/depth). I saw that AC INFO field do not represent the sum of all alternate allele seen. grep -we rs10753352 -we rs80146807 -we rs707592 trisomie.vcf chr1 5184689 rs10753352 G C 147.81 . AC=4;AF=0.667;AN=6;DB;DP=7;FS=0.000;GQ_MEAN=6.00;GQ_STDDEV=6.00;MLEAC=5;MLEAF=0.833;MQ=93.02;MQ0=0;NCC=1;QD=24.64 GT:AD:DP:GQ:PL 1/1:0,4:4:12:116,12,0 ./.:0,0:0 1/1:0,2:2:6:59,6,0 0/0:1,0:1:0:0,0,8 chr1 5730409 rs80146807 G T 142.09 . AC=5;AF=0.625;AN=8;BaseQRankSum=0.804;ClippingRankSum=0.804;DB;DP=6;FS=0.000;GQ_MEAN=12.75;GQ_STDDEV=17.56;MLEAC=5;MLEAF=0.625;MQ=82.96;MQ0=0;MQRankSum=0.804;NCC=0;QD=26.86;ReadPosRankSum=0.804 GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,45 1/1:0,1:1:3:45,3,0 1/1:0,2:2:6:90,6,0 0/1:1,1:2:39:39,0,62 chr1 5732382 rs707592 T C 194.63 . AC=5;AF=0.833;AN=6;BaseQRankSum=-7.360e-01;ClippingRankSum=0.736;DB;DP=9;FS=8.451;GQ_MEAN=13.33;GQ_STDDEV=8.08;MLEAC=5;MLEAF=0.833;MQ=26.49;MQ0=0;MQRankSum=0.736;NCC=1;QD=32.44;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL ./.:0,0:0 1/1:0,4:4:12:149,12,0 0/1:2,1:3:22:22,0,49 1/1:0,2:2:6:53,6,0  I join the corresponding ##GATKCommandLine in case. If I misunderstand what AC is, feel free to teach me. Created 2014-09-19 14:34:45 | Updated | Tags: haplotypecaller malformedvcf genotypegvcfs Hi, I've tried my best to find the problem but without luck, so my first post... First the background: I have used HaplotypeCaller to make GVCFs per individual for 2 cohorts of 270 and 300 individuals each. I then used CombineGVCFs to merged all the individuals in each cohort across 10Mb chunks across the genome. These steps appear to have worked OK! Then I have been attempting to call genotypes from pairs GVCFs (one from each cohort) per region using GenotypeGVCFs with a command such as: java -jar -Xmx6G /gpfs0/apps/well/gatk/3.2-2/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ../hs37d5/hs37d5.fa \ --variant /well/hill/kate/exomes//haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf \ --variant /well/hill/kate/exomes//haploc/sepsis300/sepsis300_1:230000000-240000000_merged_gvcf.vcf \ -o /well/hill/kate/exomes//haploc/cap270-sepsis300/cap270-sepsis300_1:230000000-240000000_raw.vcf This seems to start fine (and works perfectly to completion for some regions), but then throws a error for some of the files. For this example the error is: ##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 519534: ./.:29:75:25:0,63,945 is not a valid start position in the VCF format, for input source: /well/hill/kate/exomes/haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf So to try and solve this I have grepped "./.:29:75:25:0,63,945 " from the VCF file (expecting to find a truncated line), but could find nothing wrong with the lines containing this expression. I looked carefully at the first line containing this after the last position that was written in the output - nothing out of the ordinary that I can see. Any help/advice gratefully received! Kate Created 2014-09-19 10:45:49 | Updated 2014-09-19 10:52:00 | Tags: mappingqualityranksumtest readposranksumtest mqranksum readposranksum genotypegvcfs On 2000 samples I have run HC3.2, CGVCFs3.2, GGVCFs3.2 and VR3.2. For the GenotypeGVCFs step I used the current default annotations: InbreedingCoeff FisherStrand QualByDepth ChromosomeCounts GenotypeSummaries  And these non-default annotations: MappingQualityRankSumTest ReadPosRankSumTest  When running VariantRecalibrator and plotting each of the dimensions I noticed all of the non-default annotations taking on discrete values; see bottom of this post. Is it no longer recommended to use ReadPosRankSum and MQRankSum for VR? Should I calculate these annotation with VariantAnnotator instead of GenotypeGVCFs? If I have to run VariantAnnotator, should I then run it separately for SNPs and INDELs cf. my previous question about annotations being different, when applied to BOTH and SNPs: http://gatkforums.broadinstitute.org/discussion/2620 Thank you. zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep ReadPosRankSum | sort | uniq -c | awk '1>20000' | sort -k1n,1

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep "MQ=" | sort | uniq -c | awk '$1>5000' | sort -k1n,1 5802 MQ=57.05 8382 MQ=29.00 8525 MQ=56.62 10069 MQ=51.77 10574 MQ=53.95 10682 MQ=47.12 10818 MQ=56.04 11553 MQ=55.21 802603 MQ=60.00 zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep MQRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1
21511 MQRankSum=-7.360e-01
27222 MQRankSum=0.322
33699 MQRankSum=0.550
34481 MQRankSum=0.731
37603 MQRankSum=0.720
60729 MQRankSum=0.00
76031 MQRankSum=0.406
85812 MQRankSum=0.736
98519 MQRankSum=0.727
186092 MQRankSum=0.358


Created 2014-09-11 15:52:18 | Updated | Tags: vqsr haplotypecaller qualbydepth genotypegvcfs

Hey there,

How can it be possible that some of my snps or indels calls miss the QD tag? I'm doing the recommended workflow and I've tested if for both RNAseq (hard filters complains, that's how I saw those tags were missing) and Exome sequencing (VQSR). How can a hard filter applied on QD on a line without actually that tag can be considered to pass that threshold too? I'm seeing a lot more INDELs in RNAseq where this kind of scenario is happening as well.

Here's the command lines that I used :

# VQSR

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T VariantRecalibrator -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode SNP -resource:1000G,known=false,training=true,truth=false,prior=10.0 ~/References/hg19/VQSR/1000G_phase1.snps.high_confidence.b37.noGL.converted.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~/References/hg19/VQSR/hapmap_3.3.b37.noGL.nochrM.converted.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 ~/References/hg19/VQSR/1000G_omni2.5.b37.noGL.nochrM.converted.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 ~/References/hg19/VQSR/dbsnp_138.b37.excluding_sites_after_129.noGL.nochrM.converted.vcf -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -rscriptFile 96Exomes.HC.TruSeq.snv.plots.R -R ~/References/hg19/hg19.fasta --maxGaussians 4 java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T ApplyRecalibration -ts_filter_level 99.0 -mode SNP -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -o 96Exomes.HC.TruSeq.snv.recal.vcf -R ~/References/hg19/hg19.fasta

# HARD FILTER (RNASeq)

$GenomeAnalysisTK-3.1-1 -T GenotypeGVCFs -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -V -L targets.GRCh37.bed -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -o joint_vcf.vcf.gz Above, note that if "-A AlleleBalance" is given to GenotypeGVCFs, GATK crashes with a NullPointerException (AlleleBalance.java, line 66). The command above is heavily adapted from the current pipeline; do you know what I might be doing wrong with the new and improved HaplotypeCaller? Thanks so much for your help, and if you need any further information, please let me know. Created 2014-05-30 16:18:45 | Updated | Tags: haplotypecaller genotypegvcfs gvcf targeted-sequencing Hi For targeted re-sequenced data, can we do variant calling using HaplotypeCaller in gVCF mode followed by using GenotypeGVCFs for joint genotyping ? Would the following best practices work for targeted re-sequencing as well , except for VQSR as for targted resequencing GATK advices another way of filtering. https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq Thanks, Tinu Created 2014-05-27 10:53:30 | Updated 2014-05-27 10:55:27 | Tags: vcf multi-sample vcf-format genotypegvcfs Hi, I'm doing a variant analysis of genomic DNA from 2 related samples. I followed the up-to-date Best practices using HaplotypeCaller in GVCF mode for both samples followed by GenotypeGVCF to compute a common vcf of variant loci. I'm looking at variants that would be sample2-specific (present in sample2 but not in sample1) Here is a line of this file: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 chrIII 91124 . A AATAAGAGGAATTAGGCT 1132.42 . AC=2;AF=0.500;AN=4;DP=47;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=58.85;MQ0=0;QD=7.99 GT:AD:DP:GQ:PL 1/1:0,25:25:55:1167,55,0 0/0:.:22:33:0,33,495 In the Genotype Field, sample2.AD is a . (dot) meaning that no reads passed the Quality filters. However, sample2.DP=22 meaning that 22 reads covered this position. This line suggest that this variation is specific to sample1 (genotype HomVar 1/1) and is not present in sample2 (HomRef 0/0). But given the biological relationship between sample1 and 2 (the way they were generated), I doubt that this variation is true: it is very likely to be present in sample2 as well. It's a false I have 416 loci like this. For the vast majority of them, sample1 and 2 likely share the same variation. But since it is not impossible that a very few of them are really sample1=HomVar and sample2=HomRef, could you suggest me a way to detect those guys? What about comparing sample1.PL(1/1) and sample2.PL(0/0) ? For example could you suggest a rule of thumb to determine their ratio ? Created 2014-05-12 15:51:42 | Updated | Tags: genotypegvcfs Hi, I am currently unsure of how to interpret the output of GenotypeGVCFs when typing my ~600 samples. Genotype format at a given locus with 2 alternate alleles: 0/2:16,17,0:33:99:453,501,972,0,471,420 --This is what I am expecting for each sample Genotype format at another given locus with 2 alternate alleles: 0/2:4,1,5,3,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0:15:20:85,46,133,0,111,358,82,64,23,190,20,60,49,41,45,20,60,49,41,45,45 --I really don't understand this I have written several down-stream scripts which rely on this format being consistent -- why are there so many fields in some of these lines? I apologize if this is very obvious and documented somewhere, but I have tried searching with no results. Thanks, -Briana Created 2014-05-08 23:30:18 | Updated | Tags: combinevariants vcf genotypegvcfs I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCF can generate files where some samples have fewer fields than the format column. For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCF: 1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 .............. The second to last sample entry is ./.:.:1 (3 fields), while the format entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one of the software that I use will not parse VCF files when 1 < # sample fields < # format fields. If sample entries were extended by ":." for every empty FORMAT field (unless only . or ./. was present in sample column), it would make the file parsable. It's not too hard of a manual fix, but it might be nice to add the functionality into the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields. Created 2014-04-29 23:32:59 | Updated | Tags: genotypegvcfs hello, I am doing a population genetic analysis and am consequently very interested in obtaining high confidence monomorphic sites. I used GATK 3.1 HaplotypeCaller in the new incremental variant discovery pipeline and then ran GenotypeGVCFs using the -inv option so as to print non variant sites after combining. A monomorphic record looks like this: scaffold_1 202986 . G . . . . GT:AD:DP:PL ./.:81:81:0 ./.:6:12:0 ./.:0:0:0 .....etc I had a couple of questions: 1. It seems GT, GQ, and PL is no longer reported after combining for monomorphic sites? is there a reason why? I'm asking because I'd like to be able to distinguish high qual (first genotype) / low qual (second genotype) monomorphic sites so I can change the low qual sites to missing (third genotype). 2. If getting high confidence monomorphic sites is my goal, would you recommend that I do my own parsing starting with the individual gvcfs rather than using the GenotypeGVCFs walker? thanks much for your help! best, Young Wha Lee Created 2014-04-24 12:45:30 | Updated | Tags: genotypegvcfs Hi Geraldine, We have used the 3.x workflow to process targeted resequencing for a large cohort. At the genotypeGVCFs stage, we get the following warning in the output: ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr:pos has NN alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument Since our cohort is very large, it is extremely unlikely, but not impossible that more than 6 indel alleles exist at a given locus. Accordingly, allowing genotypeGVCFs to consider more possible alleles seems warranted. While --max_alternate_alleles is an option for HaplotypeCaller, it is not recognized by genotypeGVCFs. Is this parameter fixed for genotypeGVCFs? If not, is there a way to change this parameter at the genotypeGVCFs stage? If not, can this parameter be modified by altering the combined GVCF header or rerunning HaplotypeCaller with a higher value for --max_alternate_alleles? Thanks for all your hard work in this forum! Created 2014-04-18 22:38:18 | Updated | Tags: genotypegvcfs Trying to do variantcalling using GATKs new pipeline using HaplotypeCaller and GenotypeGVCF. Made gVCFs using HC and now trying to do joint genotyping using GenotypeGVCF java -Xmx5G -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hs37d5.fa -V:VCF GVCF.list -L SureSelect_AllExon_V5_wo_chr.bed --dbsnp dbsnp_137.hg19.vcf -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A Coverage -A AlleleBalance -A AlleleBalanceBySample -A BaseCounts -A BaseQualityRankSumTest -A ChromosomeCounts -A ClippingRankSumTest -A Coverage -A DepthPerAlleleBySample -A FisherStrand -A GCContent -A HaplotypeScore -A HardyWeinberg -A HomopolymerRun -A InbreedingCoeff -A LowMQ -A MappingQualityRankSumTest -A MappingQualityZero -A MappingQualityZeroBySample -A NBaseCount -A QualByDepth -A RMSMappingQuality -A SpanningDeletions -A TandemRepeatAnnotator -A VariantType -o JointGenotyped.vcf Error ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.0-0-g6bad1c6): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Code exception (see stack trace for error itself) Not sure why this is happening. Thanks, Tinu Created 2014-04-17 21:28:34 | Updated | Tags: genotypegvcfs combinegvcfs Hi there: I'm trying to figure out if I need to run CombineGVCFs before GenotypeGVCFs. The documentation said "One would use this tool when needing to genotype too large a number of individual gVCFs". Is there a ballpark number for "too large a number of individuals"? For example, if I have 1000 individuals, should I use CombineGVCFs first? If so, is there a recommended chuck size? It seems CombineGVCFs does not take nt/nct option and when I tried to combine 600 files it's estimated to take a week. Thanks. Created 2014-04-14 21:55:57 | Updated | Tags: genotypegvcfs multiple-gvcfs gatk-3-0-haplotypecaller I am using GATK 3.0 Generated GVCFs using HaplotypeCaller and now trying to do Joint Genotyping using GenotypeGVCFs. I have 70+ GVCFs with me, which I want to Joint Genotype. Is there a way to give all of these GVCFs together as input (--variant) to GATK-GenotypeGVCF. Something like '-I' option for BAM list. I know it is possible to use --variant VCF1 --variant VCF2 etc. But that would be tedious to give inputs like that when you have these many or more GVCFs Thanks, Tinu Created 2014-04-14 16:09:57 | Updated 2014-04-14 16:26:40 | Tags: genotypegvcfs Hello all! I've been using GATK v3.1-1 for the last week or so. I've come across an error when I try and perform the Joint Genotyping on 5 different samples (family members). All 5 samples were processed through individually (following the Best Practices Guide) and the HaplotypeCaller was run in the gVCF mode. When I run GenotypeGVCFs I receive this error: ##### ERROR MESSAGE: Invalid argument value ' ' at position 22. I'm not quite sure what to make of it because other projects with the same type of family size don't throw this error. Do you all know what it is referring to? I've even tried the nightly build (GenomeAnalysisTK-nightly-2014-04-14-g744780d) to see if that fixed the problem and it didn't. Here is my script: java \ -Xmx22g \ -Djava.io.tmpdir=$TEMP \
-jar /home/dkcrossm/tools/$GATK_DIR/GenomeAnalysisTK.jar \ -R$REF_FASTA.fa \
-T GenotypeGVCFs \
-ped $PEDIGREE \ -nt 4 \ --variant$SAMPLE1.hc.gvcf.vcf \
--variant $SAMPLE2.hc.gvcf.vcf \ --variant$SAMPLE3.hc.gvcf.vcf \
--variant $SAMPLE4.hc.gvcf.vcf \ --variant$SAMPLE5.hc.gvcf.vcf \
-o $SAMPLE.raw.snps.indels.vcf \ --dbsnp$DBSNP \


I am also seeing this in a 16 sample run as well, but the position is 44 in this instance instead of 22. Any suggestions as to what it could be referring to? Let me know if you need anything else!

Thanks, David

Created 2014-04-04 15:34:10 | Updated | Tags: bug genotypegvcfs combinegvcfs

I've run into the following bug while running GenotypeGVCFs:

##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample <ID redacted> does not have likelihoods at position 1:1115551

The input file in question is a gVCF produced by merging a large number of smaller gVCFs using CombineGVCFs (all tasks were run using version 3.1). What's happening is that the position 1115551 doesn't exist in that particular sample:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  <Sample_ID>
1       1115552 .       C       <NON_REF>       .       .       END=1115552     GT:DP:GQ:MIN_DP:PL      0/0:15:0:15:0,0,31


But when the sample is combined with other samples, that position gets filled in with a simple "0/0", without any PLs (or any of the other fields, including AD, DP, GQ, etc.), which causes the GenotypeGVCFs to choke.

I can imagine there might be other scenarios that will result in a "0/0" genotype field, so perhaps the easiest way to fix this would be to make sure that any "0/0" actually gets output as "./.:.:.:.:.".

Thanks,

Grace

Created 2014-04-02 14:47:06 | Updated 2014-04-02 14:47:46 | Tags: genotypegvcfs

commands are:

java -Xmx10g -jar $gatk -T HaplotypeCaller \ -R$refGenome \
--dbsnp $dbSNP \ -o s1.raw.var.g.vcf.gz \ -I s1.bam \ -pairHMM VECTOR_LOGLESS_CACHING \ --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 && java -Xmx10g -jar$gatk -T GenotypeGVCFs  \
-R \$refGenome \
--variant s1.raw.var.g.vcf.gz \
-o s1.raw.var.vcf.gz


Error in GenotypeGVCFs :

##### ERROR MESSAGE: Line 115: there aren't enough columns for line ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½uNï¿½ï¿½ï¿½ï¿½Þ«rfï¿½ebhï¿½ï¿½ï¿½ï¿½ï¿½NH(ï¿½ï¿½ï¿½ï¿½Ù»atï¿½ ï¿½Eï¿½;J1ï¿½ï¿½ï¿½ï¿½Xï¿½ï¿½Ê©áï¿½ï¿½ï¿½Çï¿½ï¿½ï¿½ï¿½ï¿½>ï¿½ï¿½@ï¿½ï¿½,0ï¿½ï¿½' (we

expected 9 tokens, and saw 2 ), for input source: s1.raw.var.g.vcf.gz

Created 2014-03-27 19:32:44 | Updated | Tags: haplotypecaller genotypegvcfs

So I've just run GenotypeGVCFs and get some odd looking things. The FORMAT field is GT:AD:DP:GQ:PL and on most occasions, the values are 1/1:0,1,0:1:3:10,3,0, i.e. 5 fields. However, sometimes I get ./.:.:317, only 3 fields.

I use queue to run this, so my case class is this:

case class genotypeGVCF(gvcfs:List[File], outvcf:File) extends GenotypeGVCFs with ExternalCommonArgs with FourteenCoreJob with NineDayJob {
@Input(doc="list of gvcfs to merge") val inGVCFs = gvcfs
@Output(doc="SNP calls in VCF format") val outVCF = outvcf
this.variant = inGVCFs
this.out = outVCF
this.R = refGenome
this.dbsnp = dbSNP
this.jobName = "genotypeGVCF " + outvcf.getName()
this.analysisName = this.jobName
this.isIntermediate = false
this.nt = 14
this.intervals = Seq(targetIntervals)
this.memoryLimit = 112
}