Tagged with #genotypegvcfs
1 documentation article | 3 announcements | 58 forum discussions


Comments (27)

Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta

Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)

which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000117    .   C   T,<NON_REF> 612.77  .   BaseQRankSum=0.000;ClippingRankSum=-0.411;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.39;MQ0=0;MQRankSum=-2.172;ReadPosRankSum=-0.235   GT:AD:DP:GQ:PL:SB   0/1:17,21,0:38:99:641,0,456,691,519,1210:6,11,11,10
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000211    .   C   T,<NON_REF> 638.77  .   BaseQRankSum=0.894;ClippingRankSum=-1.927;DP=42;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.89;MQ0=0;MQRankSum=-1.750;ReadPosRankSum=1.549    GT:AD:DP:GQ:PL:SB   0/1:20,22,0:42:99:667,0,566,728,632,1360:9,11,12,10
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000694    .   G   A,<NON_REF> 961.77  .   BaseQRankSum=0.736;ClippingRankSum=-0.009;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=106.92;MQ0=0;MQRankSum=0.482;ReadPosRankSum=1.537 GT:AD:DP:GQ:PL:SB   0/1:21,32,0:53:99:990,0,579,1053,675,1728:9,12,10,22
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001019    .   T   G,<NON_REF> 93.77   .   BaseQRankSum=0.058;ClippingRankSum=-0.347;DP=26;MLEAC=1,0;MLEAF=0.500,0.00;MQ=29.65;MQ0=0;MQRankSum=-0.925;ReadPosRankSum=0.000 GT:AD:DP:GQ:PL:SB   0/1:19,7,0:26:99:122,0,494,179,515,694:12,7,4,3
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Comments (2)

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.


Haplotype Caller

  • Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
  • Physical phasing information is output by default in new sample-level PID and PGT tags.
  • Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
  • Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
  • Fixed IndexOutOfBounds error associated with tail merging.

Variant Recalibrator

  • New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.

GenotypeGVCFs

  • Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
  • Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
  • Changed the default GVCF GQ Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution.
  • Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.

RNAseq analysis

  • Bug fixes for working with unmapped reads.

CalculateGenotypePosteriors

  • New annotation for low- and high-confidence possible de novos (only annotates biallelics).
  • FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
  • Restricted population priors based on discovered allele count to be valid for 10 or more samples.

DepthOfCoverage

  • Fixed rare bug triggered by hash collision between sample names.

SelectVariants

  • Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.

PrintReads

  • Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
  • The performance penalty associated with filtering by read group has been essentially eliminated.

Annotations

  • StrandOddsRatio is now a standard annotation that is output by default.
  • We used to output zero for FS if there was no data available at a site, now we omit FS.
  • Extensive rewrite of the annotation documentation.

Queue

  • Fixed Queue bug with bad localhost addresses.
  • Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
  • Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton)

Documentation

  • We now generate PHP files instead of HTML.
  • We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.

Miscellaneous

  • Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
  • htsjdk updated to version 1.120.1620
Comments (13)

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.


We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.


Haplotype Caller

  • Various improvements were made to the assembly engine and likelihood calculation, which leads to more accurate genotype likelihoods (and hence better genotypes).
  • Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
  • The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
  • Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
  • Fixed small bug where indel discovery was inconsistent based on the active-region size.
  • Removed scary warning messages for "VectorPairHMM".
  • Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
  • When we subset PLs because alleles are removed during genotyping we now also subset the AD.
  • Fixed bug where reference sample depth was dropped in the DP annotation.

Variant Recalibrator

  • The -mode argument is now required.
  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

AnalyzeCovariates

  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

Variant Annotator

  • SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

Genotype GVCFs

  • Added missing arguments so that now it models more closely what's available in the Haplotype Caller.
  • Fixed recurring error about missing PLs.
  • No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
  • --includeNonVariantSites should now be working.

Select Variants

  • The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

Indel Realigner

  • Now throws a user error when it encounters reads with I operators greater than the number of read bases.
  • Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.

CalculateGenotypePosteriors

  • Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
  • Now computes genotype posteriors using likelihoods from all members of the trio.
  • Added annotations for calling potential de novo mutations.
  • Now uses PP tag instead of GP tag because posteriors are Phred-scaled.

Cat Variants

  • Can now process .list files with -V.
  • Can now handle BCF and Block-Compressed VCF files.

Validate Variants

  • Now works with gVCF files.
  • By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

FastaAlternateReferenceMaker

  • Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.

Miscellaneous

  • Refactored maven directories and java packages replacing "sting" with "gatk".
  • Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
  • Added a new read transformer that refactors NDN cigar elements to one N element.
  • Now a Tabix index is created for block-compressed output formats.
  • Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
  • Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
  • We now check that output files are specified in a writeable location.
  • We now allow blank lines in a (non-BAM) list file.
  • Added legibility improvements to the Progress Meter.
  • Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
  • Made IntervalSharder respect the IntervalMergingRule specified on the command line.
  • Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.
Comments (2)

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.


SplitNCigarReads

  • This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

Haplotype Caller

  • Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
  • Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
  • Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
  • Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
  • Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.

CombineGVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at all at once to GenotypeGVCFs.

GenotypeGVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.

SimulateReadsForVariants

  • This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

Unified Genotyper

  • Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

Variant Recalibrator

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Reduce Reads

  • Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

Variant Annotator

  • Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
  • Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
  • In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

Combine Variants

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Select Variants

  • Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
  • Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.

CalculateGenotypePosteriors

  • Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
  • It no longer strips out the AD annotation from genotypes.
  • AC/AF/AN counts are updated after fixing genotypes.
  • Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

Indel Realigner

  • Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

Read Backed Phasing

  • Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.

Miscellaneous

  • Thanks to Phillip Dexheimer for several Queue related fixes and patches.
  • Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
  • Providing an empty BAM list no results in a user error.
  • Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
  • Logger output now goes to STDERR instead of STDOUT.
  • Picard, Tribble, and Variant jars updated to version 1.107.1683.
Comments (12)

Dear GATK team:

Here is just a quick follow up question on the SAC (StrandAlleleCountsBySample) annotation. I've successfully get the SAC annotation using GenotypeGVCFs with the nighty build, but there is something I notice to be a little wired. This is a line from the final vcf i got from running GenotypeGVCFs. Really glad that the SAC annotation is there, but in this example, for a biallelic site, is it right to have the SAC annotation with 3 sets of counts for "reference/first_allele/second_allele" while there is actually no second alternative allele? Or is it have something to do with the allele from GVCF format. Many thanks!

The example: 1 752894 rs3131971 T C 30876.92 . BaseQRankSum=-7.360e-01;DB;DP=1430;MLEAC=589;MLEAF=0.785;MQ=32.47;MQ0=0;MQRankSum=0.727;ReadPosRankSum=-3.580e-01;SOR=4.437 GT:AD:DP:GQ:PGT:PID:PL:SAC 0/1:2,4:.:54:.:.:118,0,54:0,2,0,4,0,0 ./.:2,0:2 0/1:2,4:.:53:.:.:119,0,53:0,2,0,4,0,0 1/1:0,2:.:6:.:.:55,6,0:0,0,0,2,0,0

BTW, if I have a follow up question on a previous one I've asked, should I just reply to the old question or open a new one like this? Thanks again!

Comments (1)

Hi GATK team,

In the documentation of GenotypeGVCFs it is writen: "Input - One or more Haplotype Caller gVCFs to genotype" I have 3 questions regarding this tool: 1. I wonder, what is the meaning of running it on one sample? 2. I tried to run it on one sample, and noticed that the genotype quality is different than the one in the original gvcf file from HC. What is causing this difference? I'm asking this since running the tool on one sample means that there are no other samples to consider in recalculating the quality. 3. Last question - From your experience, what is the best way to analyze one exome sample? Should I run HC with the default genotype_mode parameter and do hard filtering? Should I run HC in GVCF mode, run GenotypeGVCFs and than do hard filtering? Any other suggestion?

Thank you for the answer, Maya

Comments (3)

Hi there,

Thank you guys for developing & maintaining such a good software! I'm actually just starting to use GATK with Haplotypecaller. For me, there's a very useful annotation "StrandBiasBySample" or "StrandAlleleCountsBySample" that gives all the alignment information for a variant. It can be generated by HC, but after joint genotyping, it's just gone. There is a previous post asking about how to add this annotation to the final vcf, and Sheila mentioned she's already put in a request for adding this. So I'm just wandering, have you guys fixed this in one of the nighty build?

Many Thanks!!!

Comments (8)

Hi,

In the documentation for CombineGVCFs it says:

CombineGVCFs is meant to be used for hierarchical merging of gVCFs that will eventually be input into GenotypeGVCFs. One would use this tool when needing to genotype too large a number of individual gVCFs; instead of passing them all in to GenotypeGVCFs, one would first use CombineGVCFs on smaller batches of samples and then pass these combined gVCFs to GenotypeGVCFs.

Do you have any guidelines for this? I am trying to use genotypeGVCFS on 12 gVCF files and it doesn't work, so can you advise how I should I "pre-merge" them? Two batches of 6? Three batches of 3?

Thanks,

Mike

Comments (4)

Hi Sheila and Geraldine

When I run HaplotypeCaller (v3.3-0-g37228af) in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF I get the error:

Invalid command line: Argument ERC/gt_mode has a bad value: you cannot request reference confidence output and GENOTYPE_GIVEN_ALLELES at the same time

It is however strange that GENOTYPE_GIVEN_ALLELES is mentioned in the --max_alternate_alleles section of the GenotypeGVCFs documentation.

Maybe I'm missing something?

Thanks, Gerrit

Comments (3)

Hi GATK-ers,

I have been given ~2000 gVCFs generated by Illumina (one sample per gVCF). Though they are in standard gVCF format, they were generated by an Illumina pipeline (https://support.basespace.illumina.com/knowledgebase/articles/147078-gvcf-file if you're really curious) and not the Haplotype Caller. As a result (I think ... ), the GATK doesn't want to process them (I have tried CombineGVCFs and GenotypeGVCFs to no avail). Is there a GATK walker or some other tool that will make my gVCFs GATK-friendly? I need to be able to merge this data together to make it analyze-able because in single-sample VCF format it's pretty useless at the moment.

My only other thought has been to expand all the ref blocks of data and then merge everything together, but this seems like it will result in the creation of a massive amount of data.

Any suggestions you may have are greatly appreciated!!!

Sara

Comments (9)

From my whole-genome (human) BAM files, I want to obtain: For each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data.

Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs: CHROM POS ID REF ALT chr1 1 . A # my data chr1 1 . A T # dbSNP I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T. or T/T. The call of isn't useful to me for the reason explained below.

How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller.

My detailed reason for needing this is below:

For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2. 3. We do not have sufficient evidence to make any call for position n.

I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not.

Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs with associated GQs and PLs, and GT=./.. However, GenotypeGVCF removes all of these, meaning that whenever the call by HaplotypeCaller was ./. (due to lack of evidence for variation), it isn't carried forward for use in VQSR.

Consequently, this seems to distinguish only between these two cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown).

This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know".

Thanks in advance!

Douglas

Comments (2)

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Thanks very much in advance,

Alva

Comments (7)

I am seeing an unexpected heterozygous genotype in the GenotypeGVCFs output. The region is at the edge of a long homopolymer, which presumably plays a role, but it is still not clear where the unexpected heterozygote is coming from. The final VCF is below. The unexpected call is 2:152402516A>T in Sample1. The call is 0/1, but I don't see any evidence for where that alternate allele is coming from. The input VCFs are below. Any suggestions? How is GenotypeGVCFs interpreting this situation? I am running the latest GATK 3.3. Jar.

Final VCF

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1        Sample2
2       152402515       .       T       TA      86.23   .       AC=2;AF=0.500;AN=4;BaseQRankSum=-5.700e-02;ClippingRankSum=0.321;DP=131;FS=0.000;GQ_MEAN=166.50;GQ_STDDEV=44.55;MLEAC=1;MLEAF=0.250;MQ=59.89;MQ0=0;MQRankSum=2.42;NCC=0;QD=1.31;ReadPosRankSum=0.649       GT:AD:DP:GQ:PL  0/1:10,8:34:99:203,0,198        0/1:37,11:56:99:135,0,1429
2       152402516       .       A       T       380.44  .       AC=2;AF=0.500;AN=4;BaseQRankSum=-3.647e+00;ClippingRankSum=-2.597e+00;DP=130;FS=3.866;GQ_MEAN=204.50;GQ_STDDEV=136.47;MLEAC=2;MLEAF=0.500;MQ=59.73;MQ0=0;MQRankSum=0.754;NCC=0;QD=5.76;ReadPosRankSum=1.43 GT:AD:DP:GQ:PL  0/1:10,0:34:99:108,0,157        0/1:43,23:66:99:301,0,1142

Sample1 VCF

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
2       152402511       .       G       <NON_REF>       .       .       END=152402514   GT:DP:GQ:MIN_DP:PL      0/0:38:57:33:0,51,765
2       152402515       .       TA      T,TAA,TAAA,TAAAA,<NON_REF>      0.10    .       BaseQRankSum=-0.057;ClippingRankSum=0.321;DP=55;MLEAC=0,0,1,0,0;MLEAF=0.00,0.00,0.500,0.00,0.00;MQ=59.89;MQ0=0;MQRankSum=-1.002;ReadPosRankSum=-0.646      GT:AD:DP:GQ:PL:SB       0/3:10,8,8,7,1,0:34:21:224,189,625,21,87,219,0,109,141,388,143,293,250,447,708,116,220,156,201,330,273:0,10,0,7
2       152402517       .       A       <NON_REF>       .       .       END=152402517   GT:DP:GQ:MIN_DP:PL      0/0:55:0:55:0,0,1431
2       152402518       .       A       <NON_REF>       .       .       END=152402621   GT:DP:GQ:MIN_DP:PL      0/0:49:99:37:0,101,1442

Sample2 VCF

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample2
2       152402513       .       C       CT,<NON_REF>    0       .       BaseQRankSum=-0.168;ClippingRankSum=0.437;DP=48;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.772;ReadPosRankSum=0.034  GT:AD:DP:GQ:PL:SB  0/0:35,2,0:37:99:0,103,1228,143,1234,1275:6,29,0,0
2       152402514       .       T       <NON_REF>       .       .       END=152402514   GT:DP:GQ:MIN_DP:PL      0/0:63:99:63:0,114,1710
2       152402515       .       TA      T,TAA,TAAA,<NON_REF>    78.73   .       BaseQRankSum=-1.332;ClippingRankSum=-0.242;DP=76;MLEAC=0,1,0,0;MLEAF=0.00,0.500,0.00,0.00;MQ=59.74;MQ0=0;MQRankSum=2.423;ReadPosRankSum=0.649      GT:AD:DP:GQ:PL:SB       0/2:37,4,11,4,0:56:99:135,256,1876,0,804,1429,116,1061,1435,1759,157,933,805,963,890:2,35,0,11
2       152402516       rs199791504     A       T,<NON_REF>     272.77  .       BaseQRankSum=-3.647;ClippingRankSum=-2.597;DB;DP=75;MLEAC=1,0;MLEAF=0.500,0.00;MQ=59.73;MQ0=0;MQRankSum=0.754;ReadPosRankSum=1.426GT:AD:DP:GQ:PL:SB        0/1:43,23,0:66:99:301,0,1142,429,1210,1639:5,38,1,22
2       152402517       .       A       T,<NON_REF>     0       .       BaseQRankSum=-0.810;ClippingRankSum=-0.661;DP=79;MLEAC=0,0;MLEAF=0.00,0.00;MQ=59.75;MQ0=0;MQRankSum=-1.160;ReadPosRankSum=1.259 GT:AD:DP:GQ:PL:SB  0/0:67,4,0:71:99:0,122,2156,201,2168,2247:6,61,0,0
2       152402518       .       A       <NON_REF>       .       .       END=152402621   GT:DP:GQ:MIN_DP:PL      0/0:67:99:48:0,105,1710
Comments (16)

Hi

I am using GATK version 3.3 (latest release) and following the best practice for variant calling. Steps i am using is 1. Run haplotye caller and get the gVCF file 2. Run GenotypeGVCF across the samples to the get the VCF file

The Question is about the Genotype calling. Here is an example:

GVCF file looks like this for a particular position chr1 14677 . G . . END=14677 GT:DP:GQ:MIN_DP:PL 0/0:9:0:9:0,0,203

VCF file looks like this after genotyping with multiple samples. chr1 14677 . G A 76.56 PASS . . GT:AD:DP:GQ:PL 0/1:8,1:9:4:4,0,180

In GVCF file there was no support for the alternate allele, but VCF file shows that there are 1 read supporting the alternate call.

Thanks ! Saurabh

Comments (10)

Hi,

after running 5 exomes with GATK-v3.3 and HaplotypeCaller, I encountered a very low titv ration in my samples (~2.1) as VaraintEval report indicated. I tried running varaint filtration in these samples but I didn't see any imporvement in titv ratio nor any filtering done. therefore I filtered these with bcftools, after which the titv ratio improved to 2.5. Then when I tried running GenotypeGVCFs on these samples filtered with bcftools, I encountered the following error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double at java.lang.Double.compareTo(Double.java:49) at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:290) at java.util.ComparableTimSort.sort(ComparableTimSort.java:157) at java.util.ComparableTimSort.sort(ComparableTimSort.java:146) at java.util.Arrays.sort(Arrays.java:472) at java.util.Collections.sort(Collections.java:155) at org.broadinstitute.gatk.utils.MathUtils.median(MathUtils.java:999) at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.combineAnnotationValues(ReferenceConfidenceVariantContextMerger.java:73) at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:158) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:202) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:121) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:310) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-11-17-g58cfab1):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: java.lang.Integer cannot be cast to java.lang.Double
ERROR ------------------------------------------------------------------------------------------

any advice on solving this incident will be much appreciated

Victoria

Comments (8)

Hello. I've run Haplotype caller for 19 samples in -ploidy 19 (19 is as high as I can get without getting an error for that, but that is a discussion opened elsewhere). Then, I've tried GenotypeGVCFs and get an ERROR (both command and ERROR below).

Just to add, I've run both HaplotypeCaller and GenotypeGVCFs with and without -maxAltAlleles 4 to try to limit the referred combination; however, the output is exactly the same ERROR message at the end of GenotypeGVCFs. Do you think there might be a bug or am I doing something wrong?

java -Djava.io.tmpdir=$mytmp -Xmx28g -jar GenomeAnalysisTK.jar -R $ref -T GenotypeGVCFs -o $gVCF.ploidy.vcf -V ploidy.list -ploidy 19 -maxAltAlleles 4

ERROR MESSAGE: the combination of ploidy (19) and number of alleles (21) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyz e this locus

Comments (7)

Hi,

I am running the GATK 3.2-2 pipeline and I am getting some odd results from the genotype VCF stage. It appears that incorrect AD results are being output.

E.g. on the single sample gvcf I have:

GC G, 522.73 . BaseQRankSum=-0.716;ClippingRankSum=0.219;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=66.41;MQ0=0;MQRankSum=1.681;ReadPosRankSum=-1.155 GT:AD:DP:GQ:PL:SB 0/1:20,18,0:38:99:560,0,633,620,687,1307:10,10,10,8

This agrees with what I see in IGV for this site (see attached image).

However, when I joint call this site along with other gvcfs, the output for that sample looks like:

0/1:20,0:38:99:560,0,633

i.e. It appears to not output the expected 20,18 - but it still seems to be making the correct genotype call so I am assuming it is just an output bug and not something that is affecting calling, but it would be good to know for sure. Either way it is problematic as it is affecting our downstream depth filtering. This isn't the only example of this happening in our data. I am trying to identify more by looking for occasions where the AD fields sum to less than DP, am I right in thinking this should never be the case as AD is unfiltered compared to the filtered DP?

Thanks a lot

Dan

Comments (17)

Hi,

I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:

16050036 16050612 16050822 16050933 16051556 16051968 16051994 16052080 16052167 16052239 16052250 16052357 etc.

whereas the initial set of sites in my final vcf file is:

16050822 16050933 16051347 16051497 16051556 16051968 16051994 16052080 16052167 16052169 16052239 16052357 etc.

First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.

I also got this warning 3 times when running GenotypeGVCFs: WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).

FYI, this is the command I used for GenotypeGVCFs: java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22 (only running this for chr22)

Thank you in advance, Alva

Comments (2)

If I want the variants to be called only if they fit the following criteria:

1) Min. total coverage for consideration of heterozygous is 10.

2) Min. coverage of each of the two observed major basecalls to be called heterozygous is 5.

3) Min. percentage of each of the two observed major basecalls in order to be called heterozygous is 20.

4) Min. coverage in order for a position to be called homozygous is 6.

which command-line arguments in which tools (HaplotpyeCaller or GenotypeGVCFs) can I use to accomplish these? I cannot seem to find the proper arguments in the documentation. I apologize if I overlook.

Thank you

Comments (14)

Hi, I'm using GATKv3.3 to run the protocol of HC followed by CombineGVCFs in six batches of ~190 individuals each and finally GenotypeGVCFs of these six gVCFs. My failed command is below followed by the error message. In addition, I can tell you that my output file (1135g_.vcf) is empty even when it crashed after more than two hours of running time.

Kindly, Alberto

java -Djava.io.tmpdir=$mytmp -Xmx232g -jar $EBROOTGATK/GenomeAnalysisTK.jar -R $ref -T GenotypeGVCFs -nt 40 -o 1135g_.vcf -V 1135g_listof.list

ERROR ------------------------------------------------------------------------------------------

ERROR stack trace java.lang.ArrayIndexOutOfBoundsException: 10000 at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.generatePL(ReferenceConfidenceVariantContextMerger.java:357) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.mergeRefConfidenceGenotypes(ReferenceConfidenceVariantContextMerger.jav a:331) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:134) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:200) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:119) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)

ERROR ------------------------------------------------------------------------------------------ ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ERROR ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ERROR If not, please post the error message, with stack trace, to the GATK forum. ERROR Visit our website and forum for extensive documentation and answers to ERROR commonly asked questions http://www.broadinstitute.org/gatk ERROR ERROR MESSAGE: 10000 ERROR ------------------------------------------------------------------------------------------

Comments (1)

Hi, I just finished running HaplotypeCaller version 3.3-0 separately on 6 exome samples with the new best practices.

java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19/hg19_Ordered.fa -I K87/HG19_Analysis/K87-929_final.recalibrated_final.bam --dbsnp dbsnp_138_hg19_Ordered.vcf --pair_hmm_implementation VECTOR_LOGLESS_CACHING -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --output_mode EMIT_VARIANTS_ONLY -gt_mode DISCOVERY --pcr_indel_model CONSERVATIVE -o ./Haplotypes_929.vcf

Many Variant sites are called as homozygous alt (1/1), but none of these sites that are processed to infer haplotype are called as homozygous alt in their PGT field, they are all called as hets, PGT=0|1. for example:

GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,29,0:29:93:0|1:121483392_C_G:1331,93,0,1331,93,1331:0,0,13,16

The allelic depths agree with the phased genotype but out of all 6 exomes processed, not a single 1/1 is also phased as 1|1.

I checked all output vcfs with a simple grep combo: grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '0|1' - | wc -l = 19046 grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|0' - | wc -l = 79 grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|1' - | wc -l = 0

This seemed odd, but I continued with GenotyeGVCF:

java -Xmx32g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg19/hg19_Ordered.fa -V Haplotypes_450.vcf -V Haplotypes_452.vcf -V Haplotypes_925.vcf -V Haplotypes_926.vcf -V Haplotypes_927.vcf -V Haplotypes_929.vcf -D dbsnp_138_hg19_Ordered.vcf -ped K87/HG19_Analysis/K87_6.ped -o Haplotypes_K87_GVCFs.vcf

I'm looking at the output vcf as it's being generated and now there are homozygous alt calls but they conflict with the associated Allelic Depths:

.... GT:AD:DP:GQ:PGT:PID:PL .... 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 .....

Full Line: chr1 33957152 rs4403594 T G 3166.96 . AC=12;AF=1.00;AN=12;DB;DP=99;FS=0.000;GQ_MEAN=48.50;GQ_STDDEV=27.55;MLEAC=12;MLEAF=1.00;MQ=39.65;MQ0=0;NCC=0;QD=32.32;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,9:9:27:.:.:330,27,0 1/1:0,5:5:15:.:.:141,15,0 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 1/1:0,20:20:60:.:.:722,60,0 1/1:0,24:24:71:.:.:685,71,0 1/1:0,11:11:33:.:.:366,33,0

Can you help me interpret what seems to me as conflicting results?

Cheers,

Patrick

Comments (8)

Hi,

I am merging several genome vcf files apparently created by gvcftools [1]. I noticed that CombineVariants is incorrectly summing allele counts (AC) in some corner cases, and was wondering if the format [2] was supported?

If the format should be supported by the CombineVariants tool (gvcf is VCF 4.1 conformant), I have isolated a minimal example that should be of help for bugfixing.

[1] https://sites.google.com/site/gvcftools/home [2] https://sites.google.com/site/gvcftools/home/about-gvcf

Thanks, Paweł

Comments (8)

Hi all,

I am attempting to use the HaplotyperCaller / CombineGVCFs / GenotypeGVCFs to call variants on chrom X and Y of 769 samples (356 males, 413 females) sequenced at 12x coverage (WG sequening, but right not only calling X and Y).

I have called the samples according to the best practises using the HaplotypeCaller, using ploidy = 1 for males on X and Y and ploidy =2 for females on X, e.g.:

INFO 16:28:45,750 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T HaplotypeCaller -L X -ploidy 1 -minPruning 3 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I /target/gpfs2/gcc/groups/gonl/projects/trio-analysis/rawdata_release2/A102.human_g1k_v37.trio_realigned.bam --sample_name A102a -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/A102a.chrX.hc.g.vcf

Then I have used CombineGVCFs to combine my samples in batches of 100 samples. Now I am attempting to genotype them and I face the same issue on both X (males + females) and Y (males only): It starts running fine and then just hang on a certain position. At first it crashed asking for additional memory but now with 24Gb or memory it simply stays at a single position for 24hrs until my job eventually stops due to walltime. Here is the chrom X output: INFO 15:00:39,501 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T GenotypeGVCFs -ploidy 1 --dbsnp /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf -stand_call_conf 10 -stand_emit_conf 10 --max_alternate_alleles 4 -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.vcf -L X -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.1.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.2.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.3.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf INFO 15:00:39,507 HelpFormatter - Executing as lfrancioli@targetgcc15-mgmt on Linux 3.0.80-0.5-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 15:00:39,507 HelpFormatter - Date/Time: 2014/11/12 15:00:39 INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:40,951 GenomeAnalysisEngine - Strictness is SILENT INFO 15:00:41,153 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:57:53,416 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf.idx INFO 17:09:39,597 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf.idx INFO 18:21:00,656 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf.idx INFO 19:30:46,624 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf.idx INFO 20:22:38,368 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf.idx WARN 20:26:45,716 FSLockWithShared$LockAcquisitionTask - WARNING: Unable to lock file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx because an IOException occurred with message: No locks available. INFO 20:26:45,718 RMDTrackBuilder - Could not acquire a shared lock on index file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx, falling back to using an in-memory index for this GATK run. INFO 20:33:29,491 IntervalUtils - Processing 155270560 bp from intervals INFO 20:33:29,628 GenomeAnalysisEngine - Preparing for traversal INFO 20:33:29,635 GenomeAnalysisEngine - Done preparing for traversal INFO 20:33:29,636 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 20:33:29,637 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 20:33:29,638 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 20:33:29,948 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 20:33:59,642 ProgressMeter - X:65301 0.0 30.0 s 49.6 w 0.0% 19.8 h 19.8 h INFO 20:34:59,820 ProgressMeter - X:65301 0.0 90.0 s 149.1 w 0.0% 59.4 h 59.4 h ... INFO 20:52:01,064 ProgressMeter - X:177301 0.0 18.5 m 1837.7 w 0.1% 11.3 d 11.2 d INFO 20:53:01,066 ProgressMeter - X:177301 0.0 19.5 m 1936.9 w 0.1% 11.9 d 11.9 d ... INFO 14:58:25,243 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.0 w 95.9 w INFO 14:59:38,112 ProgressMeter - X:177301 0.0 18.4 h 15250.3 w 0.1% 96.1 w 96.0 w INFO 15:00:47,482 ProgressMeter - X:177301 0.0 18.5 h 15250.3 w 0.1% 96.2 w 96.1 w =>> PBS: job killed: walltime 86440 exceeded limit 86400

I would really appreciate if you could give me some pointer as how to handle this situation.

Thanks! Laurent

Comments (4)

I have constructed a VCF using GenotypeGVCFs. Due to downsampling and filtering imposed by HaplotypeCaller/GenotypeGVCFs, four fields of interest within the VCF do not reflect the data that went into building the file. I would like to update a these fields to reflect all read data used in the pipeline, but I have been able to do so for just 3 of the 4. I have been able to update INFO MQ0, INFO DP, and sample-level (FORMAT) AD, but I have not been able to update sample-level (FORMAT) DP.

I have run VariantAnnotator as follows.

inVCFfn=../vcf/raw.vcf
outVCFfn=../vcf/raw.annotated.vcf

gatk=/srv/gs1/software/gatk/gatk-3.3.0/GenomeAnalysisTK.jar
ref=/srv/gs1/projects/scg/Resources/GATK/b37/human_g1k_v37_decoy.fasta
dbsnp=/srv/gs1/projects/scg/Resources/GATK/b37/dbsnp_137.b37.vcf

java -Xmx10g \
  -jar $gatk \
  -T VariantAnnotator \
  -R $ref \
  -L $inVCFfn \
  --variant $inVCFfn \
  --dbsnp $dbsnp \
  --out $outVCFfn \
  --annotation Coverage \
  --annotation DepthPerAlleleBySample \
  --annotation MappingQualityZero \
  [-I fileName.1.bam … -I fileName.810.bam]

The documentation for --annotation Coverage indicates that it should update both the INFO DP and the sample-level (FORMAT) DP, but the sample-level update seems not to be occurring. For example:

$ awk '($2==7157834) { print $9, "|", $53 }' raw.vcf
GT:AD:DP:GQ:PL | 0:4,0:4:99:0,135

$ awk '($2==7157834) { print $9, "|", $53 }' raw.annotated.vcf
GT:AD:DP:GQ:PL | 0:131,0:4:99:0,135

(Aside: I'm working with a haploid chromosome, hence the single-value GT and two-value AD and PL fields). Although AD was updated from "4,0" to "131,0", the sample-level "DP" remained at "4". Of course I could approximate DP as the sum of AD values, but since AD is pre-filter and sample-level DP is post-filter, this wouldn't quite be correct. I should note that when I run HaplotypeCaller/GenotypeGVCFs on a subset of the data, the sample-level DP values are very close to the sum of AD's, so it's not a matter of most reads being filtered out, but rather that they have been downsampled. Am I missing something?

Thanks! David

P.S. The INFO field DP and MQ0 are indeed updated as expected:

$ awk '($2==7157834) { print $8 }' raw.vcf
AC=1;AF=6.993e-03;AN=143;DP=609;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=0;NCC=0;QD=30.16;SOR=1.259

$ awk '($2==7157834) { print $8 }' raw.annotated.vcf
AC=1;AF=6.993e-03;AN=143;DP=10750;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=2;NCC=0;QD=30.16;SOR=1.259
Comments (1)

In the past, I’ve found MQ0 ("Total Mapping Quality Zero Reads") to be a very useful INFO feature of putatively variable sites. However, in my experience, GenotypeGVCFs reports every site to have MQ0=0, even those for which UnifiedGenotyper reports MQ0 > 0 when run on the same data. Is this a bug, or am I missing something?

Thanks!

Comments (5)

Hi,

We have some exomes processed with GATK from one exome capture platform (Nimblegen SeqCap on HiSeq), and now I am going to analyze a small batch of exomes sequenced using a different platform (Nextera on NextSeq). I was wondering if and how GenotypeGVCFs can cope with gvcfs produced on two (or more) different exome capture platforms? Is joint-calling of such "heterogeneous" samples advised, or should I rather genotype the small set of equally processed bams separately?

The problems I could foresee were:

  • differing targets (each batch of samples will have gaps in coverage due to target-constrained (-L) haplotypecalling). If this should cause problems for GenotypeGVCFs the haplotypes could be called again on superset of targets, or easier, the calling restricted to the intersection.
  • the alignment algorithm is different (BWA BWTSW vs BWA MEM) so the mapping quality could potentially differ (haven't checked). Is this something GATK is compensating for, like base qualities?
  • the sequencer and chemistry, but here I hope that the BQSR should help in removing the variation

Any thoughts and comments appreciated,

Thanks, Paweł

Comments (5)

Hi,

I'm trying to call genotypes at a set of given positions from ~5,000 samples. To do so, I first generate gvcf files using HaplotypeCaller for the region around a gene for all samples separately with HaplotypeCaller. Then I'm merging the single gvcf files into batches of 1000 and then I'm generating the genotypes at ~180 sites using GenotypeGVCFs and the flag --includeNonVariantSites. Everything runs without errors, but in the end I noticed that ~20 sites were missing from the final VCF file. I then looked into these sites and the single files in detail and found out that there are simply no genotypes called for some samples at some positions.

I attached a small example. When I'm running the following command, no variant is in the output file: java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L region.bed --includeNonVariantSites \ -T GenotypeGVCFs \ --variant wrong.gvcf --variant correct.gvcf \ -o out.vcf

The same command works when running it with only "correct.gvcf" as an input, but obviously not when running it with only "wrong.gvcf". I noticed that wrong.gvcf might show a variant at this exact position.

Best, Thomas

Comments (5)

Dear all,

I have been using a relatively recent nightly build of GATK to deal with some previous bug fix, and I now see some odd things in my VCF format, after using genotypegVCF. Precisely some rows have this INFO field: GT:AD:DP:GQ:PGT:PID:PL

with these new PGT and PID which are present only some of the time. I have no idea what this means (and it also messes up my data processing). Does anyone know what these fields are about? And, unless they are very useful, how to get rid of them?

Thank you in advance,

Comments (6)

Following the the recommended best practice and using GATK v3.2, the raw vcf file created by haplotyecaller/GenotypeGVCFs does not contain any of AC, AN and AF tags. Is this an expected behaviour?

This is a line of the g.vcf file generated from HaplotypeCaller:
chr21 36104982 . AT A,ATT,<NON_REF> 0.39 . BaseQRankSum=-1.693;DP=35;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=1.874;ReadPosRankSum=-0.023 GT:AD:GQ:PL:SB 0/1:23,6,2,0:27:27,0,499,67,411,529,96,459,496,525:10,13,2,4

And the raw vcf file from GenotypeGVCFs: chr21 36122885 . A AT 7871.94 . BaseQRankSum=0.151;DP=493;FS=5.433;MLEAC=24;MLEAF=0.750;MQ=60.00;MQ0=0;MQRankSum=0.129;QD=19.53;ReadPosRankSum=0.257 GT:AD:DP:GQ:PL 0/1:13,8:.:99:162,0,238 1/1:0,21:.:62:536,62,0 1/1:0,24:.:71:610,71,0 0/1:12,12:.:99:245,0,207 1/1:0,27:.:80:693,80,0 1/1:0,34:.:99:877,101,0 0/1:31,13:.:99:210,0,624 0/0:30,0:30:65:0,65,1125 0/1:6,11:.:96:225,0,96 0/1:10,5:.:87:87,0,196 0/1:4,15:.:39:340,0,39 1/1:2,35:.:66:895,66,0 1/1:0,53:.:99:1376,161,0 1/1:0,13:.:39:328,39,0 1/1:0,24:.:74:662,74,0 1/1:1,29:.:65:702,65,0

My command for HaplotypeCaller:
java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R myRef.fa -I my.bam -L chr21 --dbsnp dbSNP_chr21.vcf --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -stand_emit_conf 10 -stand_call_conf 30 -mbq 20 -nct 8 -A FisherStrand -A QualByDepth -A HaplotypeScore --annotation Coverage -o raw.g.vcf

And GenotypeGVCFs:
java -Xmx8g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R myRef.fa -V myVCF.list --dbsnp dbSNP_chr21.vcf -A FisherStrand -A QualByDepth -A HaplotypeScore -A Coverage -L chr21 -o myraw.vcf

Comments (2)

hi,
I did HaplotypeCaller + genotypeVCFs on 4 sample bam files (with low coverage/depth).
I saw that AC INFO field do not represent the sum of all alternate allele seen.

grep -we rs10753352 -we rs80146807 -we rs707592 trisomie.vcf
chr1    5184689 rs10753352      G       C       147.81  .       AC=4;AF=0.667;AN=6;DB;DP=7;FS=0.000;GQ_MEAN=6.00;GQ_STDDEV=6.00;MLEAC=5;MLEAF=0.833;MQ=93.02;MQ0=0;NCC=1;QD=24.64     GT:AD:DP:GQ:PL  1/1:0,4:4:12:116,12,0   ./.:0,0:0       1/1:0,2:2:6:59,6,0      0/0:1,0:1:0:0,0,8
chr1    5730409 rs80146807      G       T       142.09  .       AC=5;AF=0.625;AN=8;BaseQRankSum=0.804;ClippingRankSum=0.804;DB;DP=6;FS=0.000;GQ_MEAN=12.75;GQ_STDDEV=17.56;MLEAC=5;MLEAF=0.625;MQ=82.96;MQ0=0;MQRankSum=0.804;NCC=0;QD=26.86;ReadPosRankSum=0.804     GT:AD:DP:GQ:PL  0/0:1,0:1:3:0,3,45      1/1:0,1:1:3:45,3,0      1/1:0,2:2:6:90,6,0   0/1:1,1:2:39:39,0,62
chr1    5732382 rs707592        T       C       194.63  .       AC=5;AF=0.833;AN=6;BaseQRankSum=-7.360e-01;ClippingRankSum=0.736;DB;DP=9;FS=8.451;GQ_MEAN=13.33;GQ_STDDEV=8.08;MLEAC=5;MLEAF=0.833;MQ=26.49;MQ0=0;MQRankSum=0.736;NCC=1;QD=32.44;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL  ./.:0,0:0       1/1:0,4:4:12:149,12,0   0/1:2,1:3:22:22,0,49    1/1:0,2:2:6:53,6,0

I join the corresponding ##GATKCommandLine in case.

If I misunderstand what AC is, feel free to teach me.

Comments (8)

Hi,

I've tried my best to find the problem but without luck, so my first post...

First the background:

I have used HaplotypeCaller to make GVCFs per individual for 2 cohorts of 270 and 300 individuals each.
I then used CombineGVCFs to merged all the individuals in each cohort across 10Mb chunks across the genome.

These steps appear to have worked OK! Then I have been attempting to call genotypes from pairs GVCFs (one from each cohort) per region using GenotypeGVCFs with a command such as:

java -jar -Xmx6G /gpfs0/apps/well/gatk/3.2-2/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ../hs37d5/hs37d5.fa \ --variant /well/hill/kate/exomes//haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf \ --variant /well/hill/kate/exomes//haploc/sepsis300/sepsis300_1:230000000-240000000_merged_gvcf.vcf \ -o /well/hill/kate/exomes//haploc/cap270-sepsis300/cap270-sepsis300_1:230000000-240000000_raw.vcf

This seems to start fine (and works perfectly to completion for some regions), but then throws a error for some of the files. For this example the error is:

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 519534: ./.:29:75:25:0,63,945 is not a valid start position in the VCF format, for input source: /well/hill/kate/exomes/haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf

So to try and solve this I have grepped "./.:29:75:25:0,63,945 " from the VCF file (expecting to find a truncated line), but could find nothing wrong with the lines containing this expression. I looked carefully at the first line containing this after the last position that was written in the output - nothing out of the ordinary that I can see.

Any help/advice gratefully received! Kate

Comments (22)

On 2000 samples I have run HC3.2, CGVCFs3.2, GGVCFs3.2 and VR3.2.

For the GenotypeGVCFs step I used the current default annotations:

InbreedingCoeff FisherStrand QualByDepth ChromosomeCounts GenotypeSummaries

And these non-default annotations:

MappingQualityRankSumTest ReadPosRankSumTest

When running VariantRecalibrator and plotting each of the dimensions I noticed all of the non-default annotations taking on discrete values; see bottom of this post. Is it no longer recommended to use ReadPosRankSum and MQRankSum for VR? Should I calculate these annotation with VariantAnnotator instead of GenotypeGVCFs? If I have to run VariantAnnotator, should I then run it separately for SNPs and INDELs cf. my previous question about annotations being different, when applied to BOTH and SNPs: http://gatkforums.broadinstitute.org/discussion/2620

Thank you.

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep ReadPosRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1
  41649 ReadPosRankSum=0.731
  41760 ReadPosRankSum=0.550
  46305 ReadPosRankSum=0.720
  47060 ReadPosRankSum=0.00
  87348 ReadPosRankSum=0.406
 105254 ReadPosRankSum=0.736
 116426 ReadPosRankSum=0.727
 164855 ReadPosRankSum=0.358

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep "MQ=" | sort | uniq -c | awk '$1>5000' | sort -k1n,1
   5802 MQ=57.05
   8382 MQ=29.00
   8525 MQ=56.62
  10069 MQ=51.77
  10574 MQ=53.95
  10682 MQ=47.12
  10818 MQ=56.04
  11553 MQ=55.21
 802603 MQ=60.00

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep MQRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1
  21511 MQRankSum=-7.360e-01
  27222 MQRankSum=0.322
  33699 MQRankSum=0.550
  34481 MQRankSum=0.731
  37603 MQRankSum=0.720
  60729 MQRankSum=0.00
  76031 MQRankSum=0.406
  85812 MQRankSum=0.736
  98519 MQRankSum=0.727
 186092 MQRankSum=0.358
Comments (14)

Hey there,

How can it be possible that some of my snps or indels calls miss the QD tag? I'm doing the recommended workflow and I've tested if for both RNAseq (hard filters complains, that's how I saw those tags were missing) and Exome sequencing (VQSR). How can a hard filter applied on QD on a line without actually that tag can be considered to pass that threshold too? I'm seeing a lot more INDELs in RNAseq where this kind of scenario is happening as well.

Here's the command lines that I used :

Forming gvcf files like this (for RNAseq in that case):

java -Djava.io.tmpdir=$LSCRATCH -Xmx46g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -pairHMM VECTOR_LOGLESS_CACHING -T HaplotypeCaller -I sample.bam -R ~/References/hg19/hg19.fasta --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o sample.g.vcf --dbsnp ~/References/hg19/dbSNP/dbsnp_138.b37.excluding_sites_after_129.CEU_converted.vcf -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 0.0 -mmq 1 -U ALLOW_N_CIGAR_READS

GenotypeGVCF after :

java -Djava.io.tmpdir=$LSCRATCH -Xmx22g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T GenotypeGVCFs --dbsnp ~/References/hg19/dbSNP/dbsnp_138.b37.excluding_sites_after_129.CEU_converted.vcf -A QualByDepth -A FisherStrand -o Joint_file/96RNAseq.10092014.STAR.q1.1kb.vcf -R ~/References/hg19/hg19.fasta

VQSR (separated for indels and snvs) or Hard filter at the end :

VQSR

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T VariantRecalibrator -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -mode SNP -resource:1000G,known=false,training=true,truth=false,prior=10.0 ~/References/hg19/VQSR/1000G_phase1.snps.high_confidence.b37.noGL.converted.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~/References/hg19/VQSR/hapmap_3.3.b37.noGL.nochrM.converted.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 ~/References/hg19/VQSR/1000G_omni2.5.b37.noGL.nochrM.converted.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 ~/References/hg19/VQSR/dbsnp_138.b37.excluding_sites_after_129.noGL.nochrM.converted.vcf -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -rscriptFile 96Exomes.HC.TruSeq.snv.plots.R -R ~/References/hg19/hg19.fasta --maxGaussians 4

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T ApplyRecalibration -ts_filter_level 99.0 -mode SNP -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -o 96Exomes.HC.TruSeq.snv.recal.vcf -R ~/References/hg19/hg19.fasta

HARD FILTER (RNASeq)

java -Djava.io.tmpdir=$LSCRATCH -Xmx2g -jar /home/apps/Logiciels/GATK/3.1-1/GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R ~/References/hg19/hg19.fasta -V 96RNAseq.STAR.q1.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o 96RNAseq.STAR.q1.FS30.QD2.vcf

Here are some examples for RNAseq :

chr1 6711349 . G A 79.10 PASS BaseQRankSum=-1.369e+00;ClippingRankSum=1.00;DP=635;FS=1.871;MLEAC=1;MLEAF=5.495e-03;MQ=60.00;MQ0=0;MQRankSum=-4.560e-01;ReadPosRankSum=-1.187e+00 GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,280 ./.:0,0:0 0/0:9,0:9:21:0,21,248 0/0:7,0:7:21:0,21,196 0/0:7,0:7:21:0,21,226 0/0:8,0:8:21:0,21,227 0/0:8,0:8:21:0,21,253 0/0:7,0:7:21:0,21,218 0/0:9,0:9:27:0,27,282 1/1:0,0:5:15:137,15,0 0/0:2,0:2:6:0,6,47 0/0:28,0:28:78:0,78,860 0/0:7,0:7:21:0,21,252 0/0:2,0:2:6:0,6,49 0/0:5,0:5:12:0,12,152 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,126 0/0:9,0:9:21:0,21,315 0/0:7,0:7:21:0,21,256 0/0:7,0:7:21:0,21,160 0/0:8,0:8:21:0,21,298 0/0:20,0:20:60:0,60,605 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,67 0/0:2,0:2:6:0,6,71 0/0:14,0:14:20:0,20,390 0/0:7,0:7:21:0,21,223 0/0:7,0:7:21:0,21,221 0/0:4,0:4:12:0,12,134 0/0:2,0:2:6:0,6,54 ./.:0,0:0 0/0:4,0:4:9:0,9,118 0/0:8,0:8:21:0,21,243 0/0:6,0:6:15:0,15,143 0/0:8,0:8:21:0,21,244 0/0:7,0:7:21:0,21,192 0/0:2,0:2:6:0,6,54 0/0:13,0:13:27:0,27,359 0/0:8,0:8:21:0,21,245 0/0:7,0:7:21:0,21,218 0/0:12,0:12:36:0,36,354 0/0:8,0:8:21:0,21,315 0/0:7,0:7:21:0,21,215 0/0:2,0:2:6:0,6,49 0/0:10,0:10:24:0,24,301 0/0:7,0:7:21:0,21,208 0/0:7,0:7:21:0,21,199 0/0:2,0:2:6:0,6,47 0/0:3,0:3:9:0,9,87 0/0:2,0:2:6:0,6,73 0/0:7,0:7:21:0,21,210 0/0:8,0:8:22:0,22,268 0/0:7,0:7:21:0,21,184 0/0:7,0:7:21:0,21,213 0/0:5,0:5:9:0,9,135 0/0:7,0:7:21:0,21,200 0/0:4,0:4:12:0,12,118 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,217 0/0:8,0:8:21:0,21,255 0/0:9,0:9:24:0,24,314 0/0:8,0:8:21:0,21,221 0/0:9,0:9:24:0,24,276 0/0:9,0:9:21:0,21,285 0/0:3,0:3:6:0,6,90 0/0:2,0:2:6:0,6,57 0/0:13,0:13:20:0,20,385 0/0:2,0:2:6:0,6,48 0/0:11,0:11:27:0,27,317 0/0:8,0:8:21:0,21,315 0/0:9,0:9:24:0,24,284 0/0:7,0:7:21:0,21,228 0/0:14,0:14:33:0,33,446 0/0:2,0:2:6:0,6,64 0/0:2,0:2:6:0,6,72 0/0:7,0:7:21:0,21,258 0/0:10,0:10:27:0,27,348 0/0:7,0:7:21:0,21,219 0/0:9,0:9:21:0,21,289 0/0:20,0:20:57:0,57,855 0/0:4,0:4:12:0,12,146 0/0:7,0:7:21:0,21,205 0/0:12,0:14:36:0,36,1030 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,60 0/0:7,0:7:21:0,21,226 0/0:7,0:7:21:0,21,229 0/0:8,0:8:21:0,21,265 0/0:4,0:4:6:0,6,90 ./.:0,0:0 0/0:7,0:7:21:0,21,229 0/0:2,0:2:6:0,6,59 0/0:2,0:2:6:0,6,56 chr1 7992047 . T C 45.83 SnpCluster BaseQRankSum=1.03;ClippingRankSum=0.00;DP=98;FS=0.000;MLEAC=1;MLEAF=0.014;MQ=60.00;MQ0=0;MQRankSum=-1.026e+00;ReadPosRankSum=-1.026e+00 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,70 0/0:2,0:2:6:0,6,45 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 ./.:1,0:1 ./.:0,0:0 0/0:2,0:2:6:0,6,55 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,61 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,69 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,37 ./.:0,0:0 0/0:2,0:2:6:0,6,67 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:11,0:11:24:0,24,360 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 0/0:2,0:2:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,50 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:4:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:7,0:7:21:0,21,231 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,70 ./.:0,0:0 0/0:6,0:6:15:0,15,148 ./.:0,0:0 ./.:0,0:0 1/1:0,0:2:6:90,6,0 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,74 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,58 0/0:2,0:2:6:0,6,71 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49

For Exome Seq now : chr2 111878571 . C T 93.21 PASS DP=634;FS=0.000;MLEAC=1;MLEAF=5.319e-03;MQ=60.00;MQ0=0;VQSLOD=14.19;culprit=MQ GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,243 0/0:4,0:4:9:0,9,135 0/0:7,0:7:18:0,18,270 0/0:7,0:7:21:0,21,230 0/0:16,0:16:48:0,48,542 0/0:8,0:8:21:0,21,315 0/0:6,0:6:18:0,18,186 0/0:5,0:5:15:0,15,168 0/0:6,0:6:15:0,15,225 0/0:10,0:10:30:0,30,333 0/0:7,0:7:21:0,21,239 0/0:6,0:6:18:0,18,202 0/0:6,0:6:15:0,15,225 0/0:7,0:7:21:0,21,225 0/0:8,0:8:24:0,24,272 0/0:5,0:5:15:0,15,168 1/1:0,0:13:13:147,13,0 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,256 0/0:14,0:14:4:0,4,437 0/0:3,0:3:9:0,9,85 0/0:4,0:4:12:0,12,159 0/0:7,0:7:21:0,21,238 0/0:5,0:5:15:0,15,195 0/0:7,0:7:15:0,15,225 0/0:12,0:12:36:0,36,414 0/0:4,0:4:12:0,12,156 0/0:7,0:7:0:0,0,190 0/0:2,0:2:6:0,6,64 0/0:7,0:7:21:0,21,242 0/0:7,0:7:21:0,21,234 0/0:8,0:8:24:0,24,267 0/0:7,0:7:21:0,21,245 0/0:7,0:7:21:0,21,261 0/0:6,0:6:18:0,18,204 0/0:8,0:8:24:0,24,302 0/0:5,0:5:15:0,15,172 0/0:9,0:9:24:0,24,360 0/0:18,0:18:51:0,51,649 0/0:5,0:5:15:0,15,176 0/0:2,0:2:6:0,6,70 0/0:14,0:14:33:0,33,495 0/0:4,0:4:9:0,9,135 0/0:8,0:8:21:0,21,315 0/0:4,0:4:12:0,12,149 0/0:4,0:4:6:0,6,90 0/0:10,0:10:27:0,27,405 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,133 0/0:14,0:14:6:0,6,431 0/0:4,0:4:12:0,12,151 0/0:5,0:5:15:0,15,163 0/0:3,0:3:9:0,9,106 0/0:7,0:7:21:0,21,237 0/0:7,0:7:21:0,21,268 0/0:8,0:8:21:0,21,315 0/0:2,0:2:6:0,6,68 ./.:0,0:0 0/0:3,0:3:9:0,9,103 0/0:7,0:7:21:0,21,230 0/0:3,0:3:6:0,6,90 0/0:9,0:9:26:0,26,277 0/0:7,0:7:21:0,21,236 0/0:5,0:5:15:0,15,170 ./.:1,0:1 0/0:15,0:15:45:0,45,653 0/0:8,0:8:24:0,24,304 0/0:6,0:6:15:0,15,225 0/0:3,0:3:9:0,9,103 0/0:2,0:2:6:0,6,79 0/0:7,0:7:21:0,21,241 0/0:4,0:4:12:0,12,134 0/0:3,0:3:6:0,6,90 0/0:5,0:5:15:0,15,159 0/0:4,0:4:12:0,12,136 0/0:5,0:5:12:0,12,180 0/0:11,0:11:21:0,21,315 0/0:13,0:13:39:0,39,501 0/0:3,0:3:9:0,9,103 0/0:8,0:8:24:0,24,257 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,280 0/0:4,0:4:12:0,12,144 0/0:4,0:4:9:0,9,135 0/0:8,0:8:24:0,24,298 0/0:4,0:4:12:0,12,129 0/0:5,0:5:15:0,15,184 0/0:2,0:2:6:0,6,62 0/0:2,0:2:6:0,6,65 0/0:9,0:9:27:0,27,337 0/0:7,0:7:21:0,21,230 0/0:7,0:7:21:0,21,239 0/0:5,0:5:0:0,0,113 0/0:11,0:11:33:0,33,369 0/0:7,0:7:21:0,21,248 0/0:10,0:10:30:0,30,395

Thanks for your help.

Comments (1)

Dear GATK team, http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.html “GenotypeGVCFs merges gVCF records that were produced as part of the reference model-based variant discovery pipeline (see documentation for more details) using the '-ERC GVCF' or '-ERC BP_RESOLUTION' mode of the HaplotypeCaller. This tool performs the multi-sample joint aggregation step and merges the records together in a sophisticated manner. At all positions of the target, this tool will combine all spanning records, produce correct genotype likelihoods, re-genotype the newly merged record, and then re-annotate it. Note that this tool cannot work with just any gVCF files - they must have been produced with the HaplotypeCaller, which uses a sophisticated reference model to produce accurate genotype likelihoods for every position in the target.”

The final sentence says that "accurate genotype likelihoods", how do you define "accurate"? May I ask if it is fully evaluated? If so, could you please tell me how you do the evaluation?

Thank you in advance.

Comments (2)

Hi,

I am running HaplotypeCaller -ERC GVCF for a cohort in parallel per sample per chromosome and combined the data using CombineGVCFs across the cohort on per chromosome basis. Can I run the GenotypeGVCFs to call variants across the cohort in parallel on all samples but per chromosome basis? follow by combining the vcf files using CatVariants or CombineVariants?

Thanks

Kin

Comments (6)

Hi,

I run HC for several samples, then I run GenotypeGVCFs to merge all gVCF files. (I'm using version 3.1.1) I've noticed that for some variations, although there are only 2 alleles, the AD values for all samples contain information for multiple alleles.

For example: '0,83,0,0,0' for homozygous or '11,40,0,0,0' for heterozygous.

Below is the complete line: chr15 82637079 . C T 91046.53 . AC=26;AF=0.867;AN=30;BaseQRankSum=-1.146e+00;ClippingRankSum=2.25;DP=2139;FS=0.000;InbreedingCoeff=-0.1588;MLEAC=26;MLEAF=0.867;MQ=39.29;MQ0=0;MQRankSum=2.18;QD=30.88;ReadPosRankSum=0.174 GT:AD:DP:GQ:PL 1/1:0,23,0,0,0:23:77:1081,77,0 0/1:11,40,0,0,0:51:99:1688,0,473 1/1:1,320,0,0,0:321:99:14836,972,0 1/1:2,168,0,0,0:170:99:7466,471,0 0/1:23,205,0,0,0:228:99:8646,0,649 ./.:.:3 1/1:0,3,0,0,0:3:9:135,9,0 1/1:0,139,0,0,0:139:99:6349,427,0 1/1:0,83,0,0,0:83:99:4251,307,0 0/1:7,60,0,0,0:67:99:2748,0,310 0/1:6,38,0,0,0:44:99:1921,0,124 1/1:0,294,0,0,0:294:99:15437,1085,0 1/1:0,67,0,0,0:67:99:3557,253,0 1/1:0,105,0,0,0:105:99:5289,362,0 1/1:0,243,0,0,0:243:99:11587,810,0 1/1:0,257,0,0,0:257:99:11929,821,0

Any suggestion how should I interpret this values?

Thank you for your help, Lily

Comments (4)

Hi there,

I'm trying to follow best practices and also use the latest version of GATK/Queue. I've created 2 GVCF files, and I've run GenotypeGVCFs on them to create a VCF file. These are the commands:

java -jar -Xmx8g /home/sgfriede/gene_apps/gatk-3.2-2/GenomeAnalysisTK.jar -T GenotypeGVCFs \ -R /gpfs_share/sgfriede/ref_data/canFam3.fa \ --variant /gpfs_share/sgfriede/poodles_addisons/cocoa_wilt/gvcf/cocoa_wilt.sorted.dedup.haplotypecaller.g.vcf \ --variant /gpfs_share/sgfriede/poodles_addisons/ellie_traska/gvcf/ellie_traska.sorted.dedup.haplotypecaller.g.vcf \ -o /gpfs_share/sgfriede/poodles_addisons/affected_poodles/20140722/vcf/affected_poodles.vcf

From this, I get the output file no problem.

However, when I try to run VQSR on this VCF (affected_poodles.vcf) file I get the following error message:

ERROR MESSAGE: Line 5082: there aren't enough columns for line 0,3:3:9:71,9,0 0/0:2,0:2:6:0,6,61 (we expected 9 tokens, and saw 2 ), for input source: /gpfs_share/sgfriede/poodles_addisons/affected_poodles/20140722/vcf/affected_poodles.vcf

I think (but I could be wrong!!) I've narrowed this down to something with the output format from GenotypeGVCFs in GATK 3.2-2, because when I run VQSR on the same file but created using GenotypeGVCFs with GATK 3.1-1, VQSR runs just fine (and I'm calling the VQSR walker using 3.2-2).

It is also possible I'm doing something wrong here, but I thought I'd see if any of the experts might have an idea.

Appreciate your help as always! Steve

Comments (4)

I have noticed when using the GenotypeGVCFs walker (version 3.2), that the remaining runtime estimate is very poor. The estimate from CombineGVCFs and other walkers on the other hand is very accurate. This is not a critical bug. It is rather a feature enhancement. I just noticed that the "completed" percentage is also incorrect. It does not start at 0%. In fact it has stayed constant after walking over 2 million base pairs of 2000 samples in 2 hours. I am not using multi threading. Not that important to me, but I thought I would let you know.

Comments (1)

Hello,

I'm using GATK version 3.1-1 and following Best Practices to call SNPs and INDELs for my exome sequencing data. I noticed that the AD field didn't update properly after I merged per-sample GVCF files into the per-cohort VCF file.

As an example, one SNP on mitochondrion at location 185 has following entries in per-sample GVCF files: chrM 185 . G T, 1254.77 . DP=23;MLEAC=2,0;MLEAF=1.00,0.00;MQ=52.36;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,23,0:23:93:1288,93,0,1288,93,1288:0,0,0,0 chrM 185 . G T, 410.77 . DP=7;MLEAC=2,0;MLEAF=1.00,0.00;MQ=34.70;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,7,0:7:33:444,33,0,444,33,444:0,0,0,0 chrM 185 . G A, 259.84 . DP=5;MLEAC=2,0;MLEAF=1.00,0.00;MQ=37.32;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,5,0:5:21:293,21,0,293,21,293:0,0,0,0 chrM 185 . G A, 205.90 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:18:239,18,0,239,18,239:0,0,0,0 chrM 185 . G T, 1316.77 . DP=21;MLEAC=2,0;MLEAF=1.00,0.00;MQ=35.14;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,21,0:21:99:1350,99,0,1350,99,1350:0,0,0,0 chrM 185 . G T, 171.03 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,3,0:3:15:204,15,0,204,15,204:0,0,0,0 All other samples didn't have evidence for this SNP.

After merging with GenotypeGVCFs, the same SNP has the entry in the per-cohort VCF file: chrM 185 . G T,A 3561.78 . AC=8,4;AF=0.098,0.049;AN=82;DP=565;FS=0.000;InbreedingCoeff=0.8981;MLEAC=8,3;MLEAF=0.098,0.037;MQ=52.36;MQ0=0;QD=32.42 GT:AD:DP:GQ:PL 0/0:.:13:30:0,30,450,30,450,450 0/0:.:16:39:0,39,585,39,585,585 1/1:0,23,0:23:93:1288,93,0,1288,93,1288 0/0:.:8:21:0,21,315,21,315,315 0/0:.:14:27:0,27,405,27,405,405 0/0:.:11:24:0,24,360,24,360,360 0/0:.:2:6:0,6,75,6,75,75 0/0:.:9:18:0,18,270,18,270,270 0/0:.:8:21:0,21,269,21,269,269 0/0:.:18:36:0,36,540,36,540,540 0/0:.:10:21:0,21,315,21,315,315 0/0:.:23:60:0,60,900,60,900,900 0/0:.:45:99:0,102,1530,102,1530,1530 0/0:.:50:99:0,120,1800,120,1800,1800 0/0:.:8:21:0,21,266,21,266,266 0/0:.:16:36:0,36,540,36,540,540 1/1:0,7,0:7:33:444,33,0,444,33,444 0/0:.:5:6:0,6,90,6,90,90 0/0:.:4:9:0,9,135,9,135,135 0/0:.:7:21:0,21,249,21,249,249 0/0:.:12:27:0,27,405,27,405,405 0/0:.:23:60:0,60,900,60,900,900 2/2:0,5,0:5:21:293,293,293,21,21,0 0/0:.:8:21:0,21,278,21,278,278 0/0:.:11:24:0,24,360,24,360,360 0/0:.:8:24:0,24,292,24,292,292 2/2:0,3,0:3:18:239,239,239,18,18,0 0/0:.:12:27:0,27,405,27,405,405 0/0:.:12:21:0,21,315,21,315,315 1/1:0,21,0:21:99:1350,99,0,1350,99,1350 0/0:.:13:27:0,27,405,27,405,405 0/0:.:3:9:0,9,112,9,112,112 1/1:0,3,0:3:15:204,15,0,204,15,204 0/0:.:7:21:0,21,209,21,209,209 0/0:.:13:27:0,27,405,27,405,405 0/0:.:5:6:0,6,90,6,90,90 0/0:.:7:21:0,21,260,21,260,260 0/0:.:10:24:0,24,360,24,360,360 0/0:.:40:93:0,93,1395,93,1395,1395 0/0:.:38:84:0,84,1260,84,1260,1260 0/0:.:14:33:0,33,495,33,495,495

Both alternative alleles were included in the ALT field and the genotype field (GT) was updated correspondingly. For genotype 2/2, I think the correct AD field should be 0,0,5 and 0,0,3 in my aforementioned example. However, these fields remained the same as in the original per-sample GVCF files (i.e., 0,5,0 and 0,3,0). I'm wondering whether my understanding about the AD field is correct or not. And also, if it is an erroneous reporting, whether it'll influence the downstream VQSR and other filtering/annotations or not.

Thanks, Leo

Comments (3)

I'm running HaplotypeCaller in GVCF mode, followed by GenotypeGVCF. For some individuals, I have more than one aligned bam file. Do I need to combine the individual's bam files in HaplotypeCaller to produce one GVCF, or can I produce multiple GVCFs per individual and combine them in GenotypeGVCF?

Thanks.

Comments (6)

I'm wondering if it's ok to remove the bd/bi tags after haplotypecaller creates the gvcf, the increase bam size is a problematic, the computional cost of recreating them in a 'just in case' scenario isn't as bad as the on disk cost of them so i'm wondering if they are not used after the first gvcf is created from individual samples.

Comments (2)

Hi

I've just noticed this in some records, in some bi-allelic sites the AD field for some samples can have far more than 2 columns and the values can be separated oddly. For example in the record at the bottom of the 0/1:21,11,0,0,0:32:99:398,0,677 the site is biallelic so I'd expect the AD to be 21,11 with a DP of 32, but instead we have 21,11,0,0,0.

Then at another position here we have 0/1:12,0,9,0,0:21:99:220,0,792 with the AD 12,0,9,0,0 DP 21 when I would have expected to see AD 12,9 DP 21. The 0 between the ref and alt alleles is particularly troubling.

 chr1    110326664       rs209200395     A       C       20726.45        .       AC=65;AF=0.192;AN=338;BaseQRankSum=1.05;ClippingRankSum=-6.610e-01;DB;DP=3547;FS=0.000;InbreedingCoeff=0.0055;MLEAC=66;MLEAF=0.195;MQ=60.00;MQ0=0;MQRankSum=-3.900e-02;QD=16.71;ReadPosRankSum=0.638    GT:AD:DP:GQ:PL  0/0:.:5:15:0,15,156     0/0:.:2:6:0,6,69        0/0:.:23:60:0,60,900    0/0:.:9:27:0,27,275     0/0:.:26:63:0,63,945    0/0:.:41:99:0,102,1433  0/0:.:23:60:0,60,774    0/0:.:23:60:0,60,900    0/0:.:6:9:0,9,135       0/0:.:3:9:0,9,86        0/0:.:6:12:0,12,180     0/0:.:9:24:0,24,288     0/0:.:3:6:0,6,90        0/1:21,11,0,0,0:32:99:398,0,677 0/0:.:29:61:0,61,939    0/1:9,11,0,0,0:20:99:350,0,241  0/0:.:26:43:0,43,832    0/0:.:22:63:0,63,945    0/0:.:30:68:0,68,916    0/0:.:21:57:0,57,855    0/0:.:23:62:0,62,719    0/1:10,14,0,0,0:24:99:509,0,270 1/1:0,23,0,0,0:23:85:1001,85,0  0/1:18,13,0,0,0:31:99:426,0,485 0/0:.:25:68:0,68,812    0/1:17,10,0,0,0:27:99:357,0,518 0/0:.:38:72:0,72,1223   0/0:.:25:64:0,64,823    0/0:.:22:60:0,60,900    0/0:.:21:60:0,60,707    0/1:3,6,0,0,0:9:86:184,0,86     0/0:.:25:60:0,60,900    0/0:.:26:72:0,72,1080   0/0:.:22:60:0,60,900    0/0:.:37:81:0,81,1257   0/0:.:24:65:0,65,838    0/0:.:32:65:0,65,1055   0/0:.:32:73:0,73,1069   0/1:8,0,12,0,0:20:99:467,0,494  0/0:.:22:60:0,60,900    0/1:3,3,0,0,0:6:81:197,0,81     0/1:8,12,0,0,0:20:99:369,0,228  0/0:.:19:26:0,26,615    0/0:.:21:60:0,60,761    1/1:0,24,0,0,0:24:60:984,60,0   0/0:.:24:60:0,60,900    0/0:.:21:60:0,60,747    0/0:.:34:87:0,87,1166   0/0:.:21:60:0,60,701    0/0:.:19:41:0,41,646    0/0:.:21:63:0,63,720    0/0:.:18:29:0,29,609    0/0:.:26:69:0,69,1035   0/1:8,3,0,0,0:11:99:104,0,229   0/0:.:23:60:0,60,900    0/0:.:20:60:0,60,681    0/1:7,0,8,0,0:15:99:327,0,517   0/0:.:2:6:0,6,68        ./.:.:0 0/0:.:31:66:0,66,1067   0/0:.:2:6:0,6,44        0/0:.:3:9:0,9,87        0/0:.:2:6:0,6,66        0/0:.:1:3:0,3,30        0/0:.:2:6:0,6,68        0/0:.:18:26:0,26,583    0/0:.:2:6:0,6,64        0/1:3,5,0,0,0:8:79:150,0,79     0/0:.:2:6:0,6,65        0/0:.:5:15:0,15,150     0/1:11,13,0,0,0:24:99:445,0,323 0/0:.:49:75:0,75,1662   0/1:7,11,0,0,0:18:99:328,0,172  0/0:.:22:63:0,63,753    0/1:11,11,0,0,0:22:99:400,0,313 0/0:.:35:37:0,37,1051   0/0:.:24:66:0,66,990    0/1:10,0,16,0,0:26:99:542,0,538 0/0:.:23:63:0,63,945    0/0:.:21:60:0,60,900    0/1:7,0,5,0,0:12:99:149,0,507   0/0:.:29:74:0,74,990    0/0:.:2:0:0,0,4 0/1:5,12,0,0,0:17:99:397,0,148  0/1:8,0,11,0,0:19:99:364,0,489  0/0:.:11:27:0,27,405    0/1:10,16,0,0,0:26:99:508,0,259 0/1:9,0,6,0,0:15:99:253,0,579   0/0:.:28:54:0,54,894    0/1:16,6,0,0,0:22:99:198,0,460  0/0:.:26:60:0,60,886    0/1:9,8,0,0,0:17:99:270,0,238   0/0:.:30:78:0,78,1170   0/1:9,14,0,0,0:23:99:442,0,246  0/0:.:17:21:0,21,531    0/0:.:25:64:0,64,832    0/0:.:30:65:0,65,1002   0/0:.:28:65:0,65,926    ./.:.:0 0/0:.:2:6:0,6,64        0/1:8,9,0,0,0:17:99:320,0,216   0/0:.:2:6:0,6,56        0/1:12,14,1,0,0:27:99:419,0,351 0/0:.:29:81:0,81,1215   0/1:12,0,12,0,0:24:99:307,0,782 0/0:.:24:69:0,69,809    0/0:.:23:60:0,60,900    0/0:.:28:62:0,62,990    0/1:9,0,8,0,0:17:99:357,0,776   1/1:0,10,0,0,0:10:32:355,32,0   1/1:0,12,0,0,0:12:35:435,35,0   0/1:15,13,0,0,0:28:99:418,0,442 0/1:10,0,9,0,0:19:99:326,0,845  0/0:.:29:61:0,61,970    0/0:.:27:60:0,60,900    0/0:.:33:61:0,61,1049   0/1:12,13,0,0,0:25:99:416,0,361 0/0:.:25:69:0,69,1035   0/1:10,13,0,0,0:23:99:451,0,269 0/1:21,14,0,0,0:35:99:441,0,599 0/0:.:12:22:0,22,376    0/1:10,14,0,0,0:24:99:457,0,273 0/1:10,0,14,0,0:24:99:525,0,646 0/1:15,12,0,0,0:27:99:319,0,416 0/1:5,8,0,0,0:13:99:297,0,123   0/0:.:14:7:0,7,411      0/0:.:18:19:0,19,515    0/0:.:21:63:0,63,745    0/0:.:27:72:0,72,1080   0/1:17,8,0,0,0:25:99:358,0,478  0/1:11,15,0,0,0:26:99:627,0,290 0/0:.:16:13:0,13,506    0/0:.:21:60:0,60,900    0/1:10,0,19,0,0:29:99:579,0,793 0/0:.:33:63:0,63,1077   0/0:.:13:39:0,39,432    0/0:.:8:18:0,18,270     1/1:0,15,0,0,0:15:54:630,54,0   0/0:.:9:27:0,27,297     0/0:.:31:62:0,62,1030   1/1:0,17,0,0,0:17:50:601,50,0   0/0:.:31:59:0,59,977    0/1:18,16,0,0,0:34:99:550,0,526 0/0:.:24:61:0,61,796    0/0:.:9:27:0,27,296     0/1:5,5,0,0,0:10:99:206,0,103   0/1:17,11,0,0,0:28:99:406,0,511 0/1:8,9,0,0,0:17:99:241,0,224   0/0:.:13:33:0,33,495    0/0:.:44:84:0,84,1358   0/0:.:13:20:0,20,495    0/0:.:15:21:0,21,387    0/1:17,6,0,0,0:23:99:117,0,498  0/1:18,13,0,0,0:31:99:451,0,519 1/1:0,34,0,0,0:34:99:1375,116,0 0/0:.:18:32:0,32,564    0/0:.:16:20:0,20,535    0/0:.:29:62:0,62,995    0/1:13,0,4,0,0:17:92:92,0,740   0/0:.:17:21:0,21,559    0/0:.:30:57:0,57,1006   0/0:.:23:60:0,60,900    0/1:12,0,9,0,0:21:99:220,0,792  0/0:.:30:56:0,56,967    0/0:.:23:63:0,63,945    0/0:.:29:69:0,69,999    0/1:8,12,0,0,0:20:99:412,0,237  0/0:.:26:60:0,60,980    0/0:.:27:62:0,62,872    0/1:19,9,0,0,0:28:99:285,0,587  0/1:15,14,0,0,0:29:99:422,0,430
Comments (5)

Hi,

I want to call both confident variants and confident homREF genotypes using gatk 3.x style (pretty much like EMIT_ALL_CONFIDENT_SITES in UnifiedGenotyper). I first use HaplotypeCaller with -ERC GVCF, then use GenotypeGVCFs -inv to emit all confident sites. However, in the resulting vcf file, there is no site with a 0/0 genotype for all samples in the jointly called vcf, which contains only variant sites and sites with ./. genotype for all samples. I do think all sites are in the vcf file. Why can't I find any site with 0/0 genotype for all samples in the vcf from GenotypeGVCFs? Or do I need to run HaplotypeCaller with -ERC BP_RESOLUTION first? Thanks!

aaron

Comments (6)

Hello all,

I have a quick question about the results of an CombineGVCFs file while creating large background files. Prior to combining the files a region of the .gvcf file from HC looks like this.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473 13 19408518 . A <NON_REF> . . END=19409266 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 13 19409267 . T <NON_REF> . . END=19409323 GT:DP:GQ:MIN_DP:PL 0/0:4:12:2:0,6,60 13 19409324 . C <NON_REF> . . END=19409400 GT:DP:GQ:MIN_DP:PL 0/0:15:42:8:0,24,299

If you noticed the first individual looks like this SRR070473, with a 0/0:0:0:0:0,0,0 recorded, but after combining the file in batches of 200, the same information will be recorded as no call.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR070473 SRR070477 SRR070505 SRR070516 SRR070517 SRR070772 SRR070779 SRR070796 SRR0 13 19408518 . A <NON_REF> . . END=19408519 GT:DP:GQ:MIN_DP:PL ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.

The issue I'm seeing is when you attempt to use GenotypeGVCFs I get the following an error similar to this if using the file containing no call notation.

##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample ERR031932 does not have likelihoods at position 1:10929

Comments (6)

Gidday,

Not a question, more of a bug report - I'm using the new v3.1 best practices pipeline, so I'd successfully produced my per-sample (n=23 in total) gVCFs with no worries.

Then I used GenotypeGVCFs to combine them as follows, including a dbSNP ROD (the Sanger Mouse Genome Project's SNP calls for 17 samples, in vcf format... not gvcf!):

java -Djava.io.tmpdir=/tmp -Xmx28g -jar ./tmp/GenomeAnalysisTK_3.1-1/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 8 -R ./mm10.fa --dbsnp ./tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.snps.rsIDdbSNPv137.vcf.ordinalsorted.vcf -V GenotypeGVCFs.run1.sample.list -o ./CombinedGenotyping.run1.vcf -A InbreedingCoeff -A FisherStrand -A QualByDepth -A ChromosomeCounts

So I was very surprised to see that my output CombinedGenotyping vcf has 40 samples in it, not 23 - and of course, 23 + 17 = 40. Checking the VCF headers itself confirms that the genotype calls from the 17 Sanger strains have been included in the output vcf, not just the rsIDs as intended(?). I'm guessing that this combining of .g.vcfs and extra ROD isn't the expected behaviour...!

Cheers!

Comments (0)

Hi,

I was wondering if it is possible to obtain the source code that accompanies the nightly builds; looking at the protected github repository, it seems that the nightly builds are changing faster than the repository there.

My current issue is that I would like the "AD" annotation for homozygous referent samples coming from the GenotypeGVCFs tool. Reading the forum, I believe that this is currently fixed in the nightly builds. However, I have a couple of customizations (pending review) that I don't think are in the nightly builds just yet, so I would need the source to apply my changes and then build.

I realize that I'm asking for something VERY dangerous and completely unsupported; I'm just too darn impatient to wait for the next release :).

-John Wallace

Comments (16)

I have run HaplotypeCaller (Version=3.1-1-g07a4bf8) on 99 WGS samples and created my GVCFs. I did this in 60+ intervals to enable faster turn around time. I then merged the GVCFs with CombineGVCFs (Version=3.1-1-g07a4bf8) However, when I try to call variants with java -jar $GATK -T GenotypeGVCFs -V WGS.gvcf.gz -o VCF/WGS.vcf.gz -R $GENOME -nt 8 it errors out complaining about PL values. The stack trace is here:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalStateException: The original PLs do not have enough values; accessing index 10011 but size is 10000 at org.broadinstitute.sting.utils.variant.GATKVariantContextUtils.generatePLs(GATKVariantContextUtils.java:1593) at org.broadinstitute.sting.utils.variant.GATKVariantContextUtils.mergeRefConfidenceGenotypes(GATKVariantContextUtils.java:1530) at org.broadinstitute.sting.utils.variant.GATKVariantContextUtils.referenceConfidenceMerge(GATKVariantContextUtils.java:1095) at org.broadinstitute.sting.gatk.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:183) at org.broadinstitute.sting.gatk.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:110) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: The original PLs do not have enough values; accessing index 10011 but size is 10000
ERROR ------------------------------------------------------------------------------------------

In the short term, is there a hidden setting I can use to skip over these, rather than completely throwing up?

Comments (1)

Is it possible to get banded GVCFs from GenotypeGVCFs or only "bp" resolution when using "includeNonVariants? It looks GenotypeGVCFs expands out the banded entries significantly increasing the size of the VCF. Should I be using CombineGVCFs in this role instead?

Comments (10)

Using GATK 3.1-1, I seem to be unable to get the "AB" (AlleleBalance) annotation for the calls using the HaplotypeCaller -> GenotypeGVCFs pipeline, and I'm not sure how to get it. Our current pipeline (GATK 2.7-4, UnifiedGenotyper) requires this field to perform filtering, so this annotation is essential for us to upgrade to GATK 3.1.

My current pipeline of commands is as follows:

$ GenomeAnalysisTK-3.1-1 -T HaplotypeCaller -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -I -L targets.GRCh37.bed -stand_emit_conf 10 -mbq 20 --downsample_to_coverage 300 -ERC GVCF -pairHMM VECTOR_LOGLESS_CACHING -variant_index_type LINEAR -variant_index_parameter 128000 -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -A AlleleBalance -A QualByDepth -o sample_gvcf.vcf.gz

$ GenomeAnalysisTK-3.1-1 -T GenotypeGVCFs -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -V -L targets.GRCh37.bed -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -o joint_vcf.vcf.gz

Above, note that if "-A AlleleBalance" is given to GenotypeGVCFs, GATK crashes with a NullPointerException (AlleleBalance.java, line 66).

The command above is heavily adapted from the current pipeline; do you know what I might be doing wrong with the new and improved HaplotypeCaller?

Thanks so much for your help, and if you need any further information, please let me know.

Comments (2)

Hi

For targeted re-sequenced data, can we do variant calling using HaplotypeCaller in gVCF mode followed by using GenotypeGVCFs for joint genotyping ?

Would the following best practices work for targeted re-sequencing as well , except for VQSR as for targted resequencing GATK advices another way of filtering. https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq

Thanks, Tinu

Comments (3)

Hi,

I'm doing a variant analysis of genomic DNA from 2 related samples. I followed the up-to-date Best practices using HaplotypeCaller in GVCF mode for both samples followed by GenotypeGVCF to compute a common vcf of variant loci. I'm looking at variants that would be sample2-specific (present in sample2 but not in sample1)

Here is a line of this file:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 chrIII 91124 . A AATAAGAGGAATTAGGCT 1132.42 . AC=2;AF=0.500;AN=4;DP=47;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=58.85;MQ0=0;QD=7.99 GT:AD:DP:GQ:PL 1/1:0,25:25:55:1167,55,0 0/0:.:22:33:0,33,495

In the Genotype Field, sample2.AD is a . (dot) meaning that no reads passed the Quality filters. However, sample2.DP=22 meaning that 22 reads covered this position. This line suggest that this variation is specific to sample1 (genotype HomVar 1/1) and is not present in sample2 (HomRef 0/0). But given the biological relationship between sample1 and 2 (the way they were generated), I doubt that this variation is true: it is very likely to be present in sample2 as well. It's a false

I have 416 loci like this. For the vast majority of them, sample1 and 2 likely share the same variation. But since it is not impossible that a very few of them are really sample1=HomVar and sample2=HomRef, could you suggest me a way to detect those guys? What about comparing sample1.PL(1/1) and sample2.PL(0/0) ? For example could you suggest a rule of thumb to determine their ratio ?

Comments (4)

Hi, I am currently unsure of how to interpret the output of GenotypeGVCFs when typing my ~600 samples.

Genotype format at a given locus with 2 alternate alleles: 0/2:16,17,0:33:99:453,501,972,0,471,420 --This is what I am expecting for each sample

Genotype format at another given locus with 2 alternate alleles: 0/2:4,1,5,3,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0:15:20:85,46,133,0,111,358,82,64,23,190,20,60,49,41,45,20,60,49,41,45,45 --I really don't understand this

I have written several down-stream scripts which rely on this format being consistent -- why are there so many fields in some of these lines? I apologize if this is very obvious and documented somewhere, but I have tried searching with no results.

Thanks, -Briana

Comments (2)

I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCF can generate files where some samples have fewer fields than the format column.

For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCF: 1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 ..............

The second to last sample entry is ./.:.:1 (3 fields), while the format entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one of the software that I use will not parse VCF files when 1 < # sample fields < # format fields. If sample entries were extended by ":." for every empty FORMAT field (unless only . or ./. was present in sample column), it would make the file parsable.

It's not too hard of a manual fix, but it might be nice to add the functionality into the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields.

Comments (36)

hello,

I am doing a population genetic analysis and am consequently very interested in obtaining high confidence monomorphic sites.

I used GATK 3.1 HaplotypeCaller in the new incremental variant discovery pipeline and then ran GenotypeGVCFs using the -inv option so as to print non variant sites after combining.

A monomorphic record looks like this:

scaffold_1 202986 . G . . . . GT:AD:DP:PL ./.:81:81:0 ./.:6:12:0 ./.:0:0:0 .....etc

I had a couple of questions:

  1. It seems GT, GQ, and PL is no longer reported after combining for monomorphic sites? is there a reason why? I'm asking because I'd like to be able to distinguish high qual (first genotype) / low qual (second genotype) monomorphic sites so I can change the low qual sites to missing (third genotype).

  2. If getting high confidence monomorphic sites is my goal, would you recommend that I do my own parsing starting with the individual gvcfs rather than using the GenotypeGVCFs walker?

thanks much for your help!

best, Young Wha Lee

Comments (15)

Hi Geraldine,

We have used the 3.x workflow to process targeted resequencing for a large cohort. At the genotypeGVCFs stage, we get the following warning in the output:

ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr:pos has NN alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

Since our cohort is very large, it is extremely unlikely, but not impossible that more than 6 indel alleles exist at a given locus. Accordingly, allowing genotypeGVCFs to consider more possible alleles seems warranted.

While --max_alternate_alleles is an option for HaplotypeCaller, it is not recognized by genotypeGVCFs. Is this parameter fixed for genotypeGVCFs? If not, is there a way to change this parameter at the genotypeGVCFs stage? If not, can this parameter be modified by altering the combined GVCF header or rerunning HaplotypeCaller with a higher value for --max_alternate_alleles?

Thanks for all your hard work in this forum!

Comments (15)

Trying to do variantcalling using GATKs new pipeline using HaplotypeCaller and GenotypeGVCF.

Made gVCFs using HC and now trying to do joint genotyping using GenotypeGVCF

java -Xmx5G -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hs37d5.fa -V:VCF GVCF.list -L SureSelect_AllExon_V5_wo_chr.bed --dbsnp dbsnp_137.hg19.vcf -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A Coverage -A AlleleBalance -A AlleleBalanceBySample -A BaseCounts -A BaseQualityRankSumTest -A ChromosomeCounts -A ClippingRankSumTest -A Coverage -A DepthPerAlleleBySample -A FisherStrand -A GCContent -A HaplotypeScore -A HardyWeinberg -A HomopolymerRun -A InbreedingCoeff -A LowMQ -A MappingQualityRankSumTest -A MappingQualityZero -A MappingQualityZeroBySample -A NBaseCount -A QualByDepth -A RMSMappingQuality -A SpanningDeletions -A TandemRepeatAnnotator -A VariantType -o JointGenotyped.vcf

Error

ERROR A GATK RUNTIME ERROR has occurred (version 3.0-0-g6bad1c6):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)

Not sure why this is happening.

Thanks, Tinu

Comments (25)

Hi there:

I'm trying to figure out if I need to run CombineGVCFs before GenotypeGVCFs. The documentation said "One would use this tool when needing to genotype too large a number of individual gVCFs". Is there a ballpark number for "too large a number of individuals"? For example, if I have 1000 individuals, should I use CombineGVCFs first? If so, is there a recommended chuck size? It seems CombineGVCFs does not take nt/nct option and when I tried to combine 600 files it's estimated to take a week.

Thanks.

Comments (6)

I am using GATK 3.0 Generated GVCFs using HaplotypeCaller and now trying to do Joint Genotyping using GenotypeGVCFs.

I have 70+ GVCFs with me, which I want to Joint Genotype. Is there a way to give all of these GVCFs together as input (--variant) to GATK-GenotypeGVCF. Something like '-I' option for BAM list. I know it is possible to use --variant VCF1 --variant VCF2 etc. But that would be tedious to give inputs like that when you have these many or more GVCFs

Thanks, Tinu

Comments (3)

Hello all!

I've been using GATK v3.1-1 for the last week or so. I've come across an error when I try and perform the Joint Genotyping on 5 different samples (family members). All 5 samples were processed through individually (following the Best Practices Guide) and the HaplotypeCaller was run in the gVCF mode. When I run GenotypeGVCFs I receive this error:

ERROR MESSAGE: Invalid argument value ' ' at position 22.

I'm not quite sure what to make of it because other projects with the same type of family size don't throw this error. Do you all know what it is referring to? I've even tried the nightly build (GenomeAnalysisTK-nightly-2014-04-14-g744780d) to see if that fixed the problem and it didn't. Here is my script:

java \
-Xmx22g \
    -Djava.io.tmpdir=$TEMP \
-jar /home/dkcrossm/tools/$GATK_DIR/GenomeAnalysisTK.jar \
-R $REF_FASTA.fa \
-T GenotypeGVCFs \
-ped $PEDIGREE \
-nt 4 \
--variant $SAMPLE1.hc.gvcf.vcf \
--variant $SAMPLE2.hc.gvcf.vcf \
--variant $SAMPLE3.hc.gvcf.vcf \
--variant $SAMPLE4.hc.gvcf.vcf \
--variant $SAMPLE5.hc.gvcf.vcf \
-o $SAMPLE.raw.snps.indels.vcf \
--dbsnp $DBSNP \    

I am also seeing this in a 16 sample run as well, but the position is 44 in this instance instead of 22. Any suggestions as to what it could be referring to? Let me know if you need anything else!

Thanks, David

Comments (10)

I've run into the following bug while running GenotypeGVCFs:

##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample <ID redacted> does not have likelihoods at position 1:1115551

The input file in question is a gVCF produced by merging a large number of smaller gVCFs using CombineGVCFs (all tasks were run using version 3.1). What's happening is that the position 1115551 doesn't exist in that particular sample:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  <Sample_ID>
1       1115550 .       AC      A,<NON_REF>     118.73  .       BaseQRankSum=-0.377;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.72;MQ0=0;MQRankSum=-1.093;ReadPosRankSum=-0.811      GT:AD:DP:GQ:PL:SB       0/1:7,6,0:13:99:156,0,188,177,207,384:4,3,3,3
1       1115552 .       C       <NON_REF>       .       .       END=1115552     GT:DP:GQ:MIN_DP:PL      0/0:15:0:15:0,0,31

But when the sample is combined with other samples, that position gets filled in with a simple "0/0", without any PLs (or any of the other fields, including AD, DP, GQ, etc.), which causes the GenotypeGVCFs to choke.

I can imagine there might be other scenarios that will result in a "0/0" genotype field, so perhaps the easiest way to fix this would be to make sure that any "0/0" actually gets output as "./.:.:.:.:.".

Thanks,

Grace

Comments (6)

commands are:

java -Xmx10g -jar $gatk -T HaplotypeCaller \
 -R $refGenome \
 --dbsnp $dbSNP \
 -o s1.raw.var.g.vcf.gz \
 -I s1.bam \
 -pairHMM VECTOR_LOGLESS_CACHING \
 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 &&

java -Xmx10g -jar $gatk -T GenotypeGVCFs  \
 -R $refGenome \
 --variant s1.raw.var.g.vcf.gz \
 -o s1.raw.var.vcf.gz

Error in GenotypeGVCFs :

ERROR MESSAGE: Line 115: there aren't enough columns for line �������uN����ޫrf�ebh�����NH(����ٻat� �E�;J1����X��ʩá���Ç�����>��@��,0��' (we

expected 9 tokens, and saw 2 ), for input source: s1.raw.var.g.vcf.gz

Comments (1)

So I've just run GenotypeGVCFs and get some odd looking things. The FORMAT field is GT:AD:DP:GQ:PL and on most occasions, the values are 1/1:0,1,0:1:3:10,3,0, i.e. 5 fields. However, sometimes I get ./.:.:317, only 3 fields.

I use queue to run this, so my case class is this:

case class genotypeGVCF(gvcfs:List[File], outvcf:File) extends GenotypeGVCFs with ExternalCommonArgs with FourteenCoreJob with NineDayJob {
    @Input(doc="list of gvcfs to merge") val inGVCFs = gvcfs
    @Output(doc="SNP calls in VCF format") val outVCF = outvcf
    this.variant = inGVCFs
    this.out = outVCF
    this.R = refGenome
    this.dbsnp = dbSNP
    this.jobName = "genotypeGVCF " + outvcf.getName()
    this.analysisName = this.jobName
    this.isIntermediate = false
    this.nt = 14
    this.intervals = Seq(targetIntervals)
    this.memoryLimit = 112
  }