# Tagged with #genotypegvcfs 1 documentation article | 2 announcements | 97 forum discussions

Created 2014-04-03 20:20:08 | Updated 2014-10-22 19:22:34 | Tags: haplotypecaller genotypegvcfs combinegvcfs gvcf joint-analysis

### Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

### Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

### General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

### The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

### Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta

Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)

which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

#### Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Created 2015-05-15 04:52:05 | Updated 2015-11-25 07:08:50 | Tags: haplotypecaller release-notes genotypegvcfs gatk3

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.

### New tool

• ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

### HaplotypeCaller & GenotypeGVCFs

• Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
• Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
• Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
• Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
• Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
• Initialize annotations so that --disableDithering actually works.
• Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
• Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
• Added the RGQ (Reference GenotypeQuality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic ohm-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
• Removed GenotypeSummaries from default annotations.
• Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
• Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
• Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
• Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
• Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
• Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
• Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
• Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).

### CombineGVCFs

• Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
• Fixed a bug that caused the tool to stop processing after the first contig.
• Fixed a bug where the wrong REF allele was output to the combined gVCF.

### VariantRecalibrator

• Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
• VQSR VCF header command line now contains annotations and tranche levels.

### SelectVariants

• Added -trim argument to trim (simplify) alleles to a minimal representation.
• Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
• Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
• Added --keepOriginalDP argument.

### VariantAnnotator

• Improvements to the allele trimming functionalities.
• Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.

### CalculateGenotypePosteriors

• Fixed user-reported bug featuring "trio" family with two children, one parent.
• Added error handling for genotypes that are called but have no PLs.

### Various tools

• BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
• DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
• SplitSamFile: Fixed a bug that caused a NullPointerException.
• SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
• IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
• CombineVCFs: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
• VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
• GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
• ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
• CatVariants: Now allows different input / output file types, and spaces in directory names.
• VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.

• Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
• Handle X and = CIGAR operators appropriately
• Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).

### Annotations

• Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
• Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
• Changed GC Content value type from Integer to Float.
• Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
• Made annotators emit a warning if they can't be applied.

### GATK Engine & common features

• Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
• Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
• Handle X and = CIGAR operators appropriately.
• Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
• Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now going to stderr.
• Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
• Fixed BCF writing when FORMAT annotations contain arrays.

### Queue

• Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
• Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
• Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.

### Documentation

• Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

### For developers

• Upgraded SLF4J to allow new convenient logging syntaxes.
• Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
• Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
• Collected VCF IDs and header lines into one place (GATKVCFConstants).

Created 2014-10-23 18:53:52 | Updated 2015-05-12 17:24:14 | Tags: Troll haplotypecaller ploidy release-notes genotype-refinement genotypegvcfs gatk3

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

## Haplotype Caller

• Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
• Physical phasing information is output by default in new sample-level PID and PGT tags.
• Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Fixed IndexOutOfBounds error associated with tail merging.

## Variant Recalibrator

• New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.

## GenotypeGVCFs

• Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
• Bug fix for the case when we assumed ADs were in the same order if the number of alleles matched.
• Changed the default GVCF GQ Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution.
• Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.

## RNAseq analysis

• Bug fixes for working with unmapped reads.

## CalculateGenotypePosteriors

• New annotation for low- and high-confidence possible de novos (only annotates biallelics).
• FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations.
• Restricted population priors based on discovered allele count to be valid for 10 or more samples.

## DepthOfCoverage

• Fixed rare bug triggered by hash collision between sample names.

## SelectVariants

• Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.

• Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
• The performance penalty associated with filtering by read group has been essentially eliminated.

## Annotations

• StrandOddsRatio is now a standard annotation that is output by default.
• We used to output zero for FS if there was no data available at a site, now we omit FS.
• Extensive rewrite of the annotation documentation.

## Queue

• Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
• Improved scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton)

## Documentation

• We now generate PHP files instead of HTML.
• We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.

## Miscellaneous

• Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
• htsjdk updated to version 1.120.1620

Created 2016-04-28 10:05:58 | Updated | Tags: genotypegvcfs progress warning

Hi GATK Team,

I have a few quick queries about my GenotypeGVCFs output, or rather the progress report generated. I'm concerned that it suggests all is not well! I ran GenotypeGVCFs across 8 threads for a 5-scaffold interval of a cichlid genome. The 233 input genomes had been grouped into 10 sets using CombineGVCFs. My concerns are:

1) The first measure of completeness in my progress report is 14.9%. From there, progress was mostly steady (0.1 - 0.5% between reports) but it did occasionally skip 4-5% at a time between lines. The generated vcf file contains lines for positions prior to the first one reported, so does this just mean that the program had periods of relative super-efficiency or does it suggest trouble?

2) 500 positions were met with the warning: 'GenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location ...'. Having looked at 2 of these positions, it appears that they are recorded as repeats. As this is 500 sites in just 1/16th of a genome, I thought it best to check that this is just the genome rather than a problem with the program?

3) Finally, such repeats' warnings were reported out of order in the progress report. That is, their genomic position was not inbetween the entries above and below them. Is this cause for concern?

Thank you for your time (and to Geraldine for advising me on how to get GenotypeGVCFs running for gzipped files in the first place!).

Best wishes,

Ian

Created 2016-04-27 16:23:36 | Updated | Tags: ploidy genotypegvcfs combinegvcfs sex-chromosomes sample-ploidy

Hi,

I am setting up a pipeline for HaplotypeCaller and there is something I do not understand. In the manual pages of CombineGVCFs and GenotypeGVCFs it is said: "This tool is able to handle any ploidy (or mix of ploidies) intelligently; there is no need to specify ploidy for non-diploid organisms".

But then there is the description of the "--sample_ploidy" argument. Do I have to set the number of samples?

And if I want to genotype the X chromosome for a set of samples (for which, depending on the sex, HC was run with ploidy 1 or ploidy 2), do I have to do something special?

Many thanks, Federico

Created 2016-04-27 12:04:17 | Updated | Tags: gatk genotypegvcfs combinegvcfs concatvariants

Hi there,

I'm attempting to use GenotypeGVCFs with 223 gzipped gvcf files containing WGS data. The files were gzipped by GATK programs (I don't have the space to deal with uncompressed files). As has been discussed in other threads, GenotypeGVCFs crashes with the gzipped files. I believe this is because HaplotypeCaller doesn't index the files, and GATK programs automatically use tabix to index any unindexed gzipped files they receive, yes?

As such, I'm looking for the best workaround. So far, I have used HaplotypeCaller to call sites on intervals in each genome, then used ConcatVariants to concatenate each set of intervals. I am now attempting to run CombineGVCFs on the concatenated GVCF files and split the files into 10 groups to run on GenotypeGVCFs. This will take approx 48 hours with no guarantee that GenotypeGVCFs will like the output. Will it make any difference to runtime or output (and, indeed, does it make sense) to run CombineGVCFs on the groups of intervals before giving them to ConcatVariants then GenotypeGVCFs?

Apologies if this is a daft question but I've already lost a fair amount of time on this so I just want to make sure what I'm doing makes sense :)

Thanks,

Ian

Created 2016-04-21 13:06:39 | Updated 2016-04-21 13:17:48 | Tags: multithreading genotypegvcfs

Dear GATK Devs,

While working on a new version of our human exome pipeline I ran into the problem that the output VCF files were not the same, even though the commands and samples run through it were exactly the same. Some of the quality scores like MQranksum or QD are slightly different, but this influences the end result quite a bit. Initially I thought this might be our older version of GATK (3.3), but I noticed the same effect when using 3.5.

I couldn't find anything about this on the forum or in the documentation. Is this a known bug by chance?

Below is the result of a diff command on the two resulting VCF files. Paths and sample IDs are censored, but are identical for both.

14c14
---
145,148c145,148
chr1  888639  rs3748596       T       C       5306.35 .       AC=8;AF=1.00;AN=8;DB;DP=150;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=50.20;MQ0=0;QD=29.09;SOR=1.061      GT:AD:DP:GQ:PL  1/1:0,41:41:99:1498,123,0       1/1:0,49:49:99:1714,147,0       1/1:0,26:26:78:930,78,0 1/1:0,34:34:99:1190,102,0
chr1  888659  rs3748597       T       C       4601.35 .       AC=8;AF=1.00;AN=8;DB;DP=132;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=52.14;MQ0=0;QD=32.93;SOR=1.420      GT:AD:DP:GQ:PL  1/1:0,32:32:96:1173,96,0        1/1:0,43:43:99:1522,129,0       1/1:0,28:28:84:896,84,0 1/1:0,28:28:84:1036,84,0
chr1  889158  rs56262069      G       C       4026.35 .       AC=8;AF=1.00;AN=8;DB;DP=89;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=57.77;MQ0=0;QD=30.83;SOR=1.408       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,22:22:69:1|1:889158_G_C:1024,69,0 1/1:0,26:26:78:1|1:889158_G_C:1170,78,0 1/1:0,18:18:57:1|1:889158_G_C:823,57,0  1/1:0,23:23:69:1|1:889158_G_C:1035,69,0
chr1  889159  rs13302945      A       C       4026.35 .       AC=8;AF=1.00;AN=8;DB;DP=91;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=57.77;MQ0=0;QD=29.67;SOR=1.389       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,23:23:69:1|1:889158_G_C:1024,69,0 1/1:0,26:26:78:1|1:889158_G_C:1170,78,0 1/1:0,19:19:57:1|1:889158_G_C:823,57,0  1/1:0,23:23:69:1|1:889158_G_C:1035,69,0
---
chr1  888639  rs3748596       T       C       5306.35 .       AC=8;AF=1.00;AN=8;DB;DP=150;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=50.20;MQ0=0;QD=30.83;SOR=1.061      GT:AD:DP:GQ:PL  1/1:0,41:41:99:1498,123,0       1/1:0,49:49:99:1714,147,0       1/1:0,26:26:78:930,78,0 1/1:0,34:34:99:1190,102,0
chr1  888659  rs3748597       T       C       4601.35 .       AC=8;AF=1.00;AN=8;DB;DP=132;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=52.14;MQ0=0;QD=29.67;SOR=1.420      GT:AD:DP:GQ:PL  1/1:0,32:32:96:1173,96,0        1/1:0,43:43:99:1522,129,0       1/1:0,28:28:84:896,84,0 1/1:0,28:28:84:1036,84,0
chr1  889158  rs56262069      G       C       4026.35 .       AC=8;AF=1.00;AN=8;DB;DP=89;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=57.77;MQ0=0;QD=29.09;SOR=1.408       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,22:22:69:1|1:889158_G_C:1024,69,0 1/1:0,26:26:78:1|1:889158_G_C:1170,78,0 1/1:0,18:18:57:1|1:889158_G_C:823,57,0  1/1:0,23:23:69:1|1:889158_G_C:1035,69,0
chr1  889159  rs13302945      A       C       4026.35 .       AC=8;AF=1.00;AN=8;DB;DP=91;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=57.77;MQ0=0;QD=32.93;SOR=1.389       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,23:23:69:1|1:889158_G_C:1024,69,0 1/1:0,26:26:78:1|1:889158_G_C:1170,78,0 1/1:0,19:19:57:1|1:889158_G_C:823,57,0  1/1:0,23:23:69:1|1:889158_G_C:1035,69,0

Thanks in forward!

Created 2016-04-19 17:43:33 | Updated | Tags: genotypegvcfs

I'm working on the last step of our lab's well established variant calling pipeline, running GATK GenotypeGVCFs on 4392 whole exome sequenced individuals. In the past I haven't had any problems with this sort of thing, but on this last run the job would be killed on the supercomputer cluster for using too much memory. Now it appears that even with allocating 16 threads and 64 GB of memory the log file predicts nearly 8 weeks of runtime remaining! I am using GATK 3.3 with the following arguments:

-T GenotypeGVCFs -R /projects/resources/Homo_sapiens_assembly19.fasta --variant /projects/combinedgvcfs/combined_gvcfs.list --dbsnp /projects/resources/gatk_bundle/dbsnp_138.b37.vcf -o /04_15_2016/genotype_gvcfs/04_15_2016_raw.vcf -log /04_15_2016/genotype_gvcfs/04_15_2016_raw.log -L /projects/resources/bed.and.interval.files/b37_refseqplus50_clean.bed -nt 16 --max_alternate_alleles 6

If anyone has any ideas please let me know because in the past this would take no more than 72 hours to run to completion. Let me know if I can provide any additional information to help.

Created 2016-04-15 22:01:44 | Updated | Tags: commandlinegatk haplotypecaller genotypegvcfs joint-calling

I have been following the best practices outline for calling SNPs on our samples, but I'm a little confused as to what to do with the VCF file produced following the joint genotyping/genotypeGVCFs step.

I understand the principle of gVCF calling for the most part, but my confusion is what are we to do with the VCF file once we do the joint genotyping step? We are looking at a F1 mapping population of a non-model organism, so does this VCF file have individual progeny (bam file names) indicated within it? I think not since I can't find any of the sample names while scrolling through it.

Can this VCF file be used to construct a pedigree file to use during genotype refinement? Should it be somehow fed back into Haplotypecaller to inform on likely calls during a second round of variant calling? Do you use it to go back to the individual gVCF files to extract the high confidence variants?

There seems to be a good amount of literature on the Broad websites about what a gVCF file is and how to perform joint genotyping, but not much direction about what to do with the joint genotyped VCF file once it is produced.

Any advice or referral to other walkthroughs/guides would be very appreciated.

Michael

[extra project information: My project involves calling SNPs across a mapping population for a non-model organism with the intent of mapping a trait. The goal is to produce robust SNP calls for each individual progeny (of which we have 30 currently, and >60 in the near future) and the two parents. We only have halfway-decent sequencing coverage of ~10-20x for each sample, which is thus why doing gVCF calling and joint genotyping sounds attractive to us. Since we work on a non-model, we also lack previously produced "gold standard" SNP sets or other resources allowing us to refine genotypes.]

Created 2016-04-12 06:35:52 | Updated 2016-04-12 06:36:21 | Tags: genotypegvcfs

HI: GenotypeGVCFs only can output the variants which were mutation, However, did it have any parameter that can print all kind of variants, even they were not mutation?

Thanks a lot!

Wendy

Created 2016-04-07 16:31:11 | Updated | Tags: haplotypecaller dp vcf-format genotypegvcfs

Hi,

I've run into the (already reported http://gatkforums.broadinstitute.org/dsde/discussion/5598/missing-depth-dp-after-haplotypecaller ) bug of the missing DP format field in my callings.

I've run the following (relevant) commands:

Haplotype Caller -> Generate GVCF:

    java -Xmx${xmx}${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar${gatkpath}/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ${ref} \ -I${NEWTMPDIR}/{prefix}.realigned.fixed.recal.bam \ -L{reg} \
-ERC GVCF \
-nct ${nct} \ --genotyping_mode DISCOVERY \ -stand_emit_conf 10 \ -stand_call_conf 30 \ -o${prefix}.raw_variants.annotated.g.vcf \
-A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage

That generates GVCF files that DO HAVE the DP field for all reference positions, but DO NOT HAVE the DP format field for any called variant (but still keep the DP in the INFO field):

18      11255   .       T       <NON_REF>       .       .       END=11256       GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
18      11258   .       G       <NON_REF>       .       .       END=11260       GT:DP:GQ:MIN_DP:PL      0/0:17:48:16:0,48,530

Later, I ran Genotype GVCF joining all the samples with the following command:

java -Xmx${xmx}${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar${gatkpath}/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R ${ref} \ -L${pos} \
-o ${prefix}.raw_variants.annotated.vcf \ --variant${variant} [...]

This generated vcf files where the DP field is present in the format description, it IS present in the Homozygous REF samples, but IS MISSING in any Heterozygous or HomoALT samples.

22  17280388    .   T   C   18459.8 PASS    AC=34;AF=0.340;AN=100;BaseQRankSum=-2.179e+00;DP=1593;FS=2.526;InbreedingCoeff=0.0196;MLEAC=34;MLEAF=0.340;MQ=60.00;MQRankSum=0.196;QD=19.76;ReadPosRankSum=-9.400e-02;SOR=0.523    GT:AD:DP:GQ:PL  0/0:29,0:29:81:0,81,1118    0/1:20,22:.:99:688,0,682    1/1:0,27:.:81:1018,81,0 0/0:22,0:22:60:0,60,869 0/1:20,10:.:99:286,0,664    0/1:11,17:.:99:532,0,330    0/1:14,14:.:99:431,0,458    0/0:28,0:28:81:0,81,1092    0/0:35,0:35:99:0,99,1326    0/1:14,20:.:99:631,0,453    0/1:13,16:.:99:511,0,423    0/1:38,29:.:99:845,0,1231   0/1:20,10:.:99:282,0,671    0/0:22,0:22:63:0,63,837 0/1:8,15:.:99:497,0,248 0/0:32,0:32:90:0,90,1350    0/1:12,12:.:99:378,0,391    0/1:14,26:.:99:865,0,433    0/0:37,0:37:99:0,105,1406   0/0:44,0:44:99:0,120,1800   0/0:24,0:24:72:0,72,877 0/0:30,0:30:84:0,84,1250    0/0:31,0:31:90:0,90,1350    0/1:15,25:.:99:827,0,462    0/0:35,0:35:99:0,99,1445    0/0:29,0:29:72:0,72,1089    1/1:0,32:.:96:1164,96,0 0/0:21,0:21:63:0,63,809 0/1:21,15:.:99:450,0,718    1/1:0,40:.:99:1539,120,0    0/0:20,0:20:60:0,60,765 0/1:11,9:.:99:293,0,381 1/1:0,35:.:99:1306,105,0    0/1:18,14:.:99:428,0,606    0/0:32,0:32:90:0,90,1158    0/1:24,22:.:99:652,0,816    0/0:20,0:20:60:0,60,740 1/1:0,30:.:90:1120,90,0 0/1:15,13:.:99:415,0,501    0/0:31,0:31:90:0,90,1350    0/1:15,18:.:99:570,0,480    0/1:22,13:.:99:384,0,742    0/1:19,11:.:99:318,0,632    0/0:28,0:28:75:0,75,1125    0/0:20,0:20:60:0,60,785 1/1:0,27:.:81:1030,81,0 0/0:30,0:30:90:0,90,1108    0/1:16,16:.:99:479,0,493    0/1:14,22:.:99:745,0,439    0/0:31,0:31:90:0,90,1252
22  17280822    .   G   A   5491.56 PASS    AC=8;AF=0.080;AN=100;BaseQRankSum=1.21;DP=1651;FS=0.000;InbreedingCoeff=-0.0870;MLEAC=8;MLEAF=0.080;MQ=60.00;MQRankSum=0.453;QD=17.89;ReadPosRankSum=-1.380e-01;SOR=0.695   GT:AD:DP:GQ:PL  0/0:27,0:27:72:0,72,1080    0/0:34,0:34:90:0,90,1350    0/1:15,16:.:99:528,0,491    0/0:27,0:27:60:0,60,900 0/1:15,22:.:99:699,0,453    0/0:32,0:32:90:0,90,1350    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:87:0,87,1305    0/0:40,0:40:99:0,108,1620   0/1:20,9:.:99:258,0,652 0/0:26,0:26:72:0,72,954 0/1:16,29:.:99:943,0,476    0/0:27,0:27:69:0,69,1035    0/0:19,0:19:48:0,48,720 0/0:32,0:32:81:0,81,1215    0/0:36,0:36:99:0,99,1435    0/0:34,0:34:99:0,99,1299    0/0:35,0:35:99:0,102,1339   0/0:38,0:38:99:0,102,1520   0/0:36,0:36:99:0,99,1476    0/0:31,0:31:81:0,81,1215    0/0:31,0:31:75:0,75,1125    0/0:35,0:35:99:0,99,1485    0/0:37,0:37:99:0,99,1485    0/0:35,0:35:90:0,90,1350    0/0:20,0:20:28:0,28,708 0/1:16,22:.:99:733,0,474    0/0:32,0:32:90:0,90,1350    0/0:35,0:35:99:0,99,1467    0/1:27,36:.:99:1169,0,831   0/0:28,0:28:75:0,75,1125    0/0:36,0:36:81:0,81,1215    0/0:35,0:35:90:0,90,1350    0/0:28,0:28:72:0,72,1080    0/0:31,0:31:81:0,81,1215    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:84:0,84,1260    0/0:39,0:39:99:0,101,1575   0/0:37,0:37:96:0,96,1440    0/0:34,0:34:99:0,99,1269    0/0:30,0:30:81:0,81,1215    0/0:36,0:36:99:0,99,1485    0/1:17,17:.:99:567,0,530    0/0:26,0:26:72:0,72,1008    0/0:18,0:18:45:0,45,675 0/0:33,0:33:84:0,84,1260    0/0:25,0:25:61:0,61,877 0/1:9,21:.:99:706,0,243 0/0:35,0:35:81:0,81,1215    0/0:35,0:35:99:0,99,1485

I've just discovered this issue, and I need to run an analysis trying on the differential depth of coverage in different regions, and if there is a DP bias between called/not-called samples.

I have thousands of files and I've spent almost 1 year generating all these callings, so redoing the callings is not an option.

What would be the best/fastest strategy to either fix my final vcfs with the DP data present in all intermediate gvcf files (preferably) or, at least, extracting this data for all snps and samples?

Txema

PS: Recalling the individual samples from bamfiles is not an option. Fixing the individual gvcfs and redoing the joint GenotypeGVCFs could be.

Created 2016-04-05 09:19:53 | Updated 2016-04-05 09:24:24 | Tags: genotypegvcfs

Dear GATK team,

1) I am running GenotypeGVCFs on list of 90 individuals, the reference consists of 48 000 contigs.

   java  -jar /bin/GATK3.5/GenomeAnalysisTK.jar -T  GenotypeGVCFs -R ref.fa -V gvcfs.list -o all.gvcf -nt 8

I get warnings for each of 90 GVCFs from the list:

"Could not acquire a shared lock on index file file.gvcf.idx, falling back to using an in-memory index for this GATK run."

After he processes all that for 90 GVCFs (about 8 hours) it estimates 200 weeks of computation and is still increasing.

   java  -jar /bin/GATK3.5/GenomeAnalysisTK.jar -T  GenotypeGVCFs -R ../references/ref.fa -V gvcfs.list -o all.gvcf

Still the same warnings and problems

3) I run it on each contig seprately:

   while read l;do java  -jar /bin/GATK3.5/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fa -V gvcfs.list -o ${l}_flax.gvcf -L${l}; done < contigs.list

Still the same warnings
He computes genotyped GVCFs quite fast but for each of 48 k contigs he takes 8 hours to solve warnings:
"Could not acquire a shared lock on index file file.gvcf.idx, falling back to using an in-memory index for this GATK run."

   Warning is not appearing but time to process each contig is dramaticallly long even before he starts genotyping.

Many thanks, Rafal

Created 2016-03-19 21:06:07 | Updated | Tags: genotypegvcfs

Can someone help me fix this error? Thanks in advance

INFO 21:07:43,512 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Network is unreachable INFO 21:07:43,513 HttpMethodDirector - Retrying request INFO 21:07:43,516 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Network is unreachable INFO 21:07:43,516 HttpMethodDirector - Retrying request INFO 21:07:43,519 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Network is unreachable INFO 21:07:43,519 HttpMethodDirector - Retrying request INFO 21:07:43,521 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Network is unreachable INFO 21:07:43,521 HttpMethodDirector - Retrying request INFO 21:07:43,523 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Network is unreachable INFO 21:07:43,524 HttpMethodDirector - Retrying request

java.lang.ArrayIndexOutOfBoundsException: 10296 at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.generatePL(ReferenceConfidenceVariantContextMerger.java:357) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.mergeRefConfidenceGenotypes(ReferenceConfidenceVariantContextMerger.java:331) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:134) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:200) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:119) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutorWorker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: 10296 ##### ERROR ------------------------------------------------------------------------------------------ Created 2016-03-15 15:44:05 | Updated 2016-03-15 15:45:35 | Tags: genotypegvcfs Greetings, I've run best practices for HC. Merged my populations into seven GVCFs and genotyped. After trying to run BEAGLE I see that there are some genotypes that have too many PL values. Has anyone run into this? Doesn't this break VCF spec? Exception in thread "main" java.lang.IllegalArgumentException: unexpected number of PL subfields: 0,21,315,21,315,315 1 840113 . T C 1098.14 PASS AC=7;AF=0.013;AN=528;BaseQRankSum=-7.400e-01;ClippingRankSum=-4.030e-01;DP=7058;ExcessHet=7.7159;FS=11.599;InbreedingCoeff=-0.0435;MLEAC=7;MLEAF=0.013;MQ=58.26;MQ0=0;MQRankSum=-6.300e-01;QD=4.24;ReadPosRankSum=-1.517e+00;SOR=0.991;VQSLOD=-2.012e+00;culprit=QD;NEGATIVE_TRAIN_SITE GT:AB:AD:DP:GQ:PGT:PID:PL Created 2016-03-10 17:10:07 | Updated 2016-03-10 17:11:23 | Tags: genotypegvcfs hangs Hi team, thanks for your work. I'm currently running GenotypeGVCF on 78 WG individuals using the -L option to run each chromosome separately. java -jar -Xmx20g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/scratch/local/ GenomeAnalysisTK.jar -T GenotypeGVCFs -R references/human_g1k_v37_decoy.fasta --disable_auto_index_creation_and_locking_when_reading_rods --num_threads 10 --variant [78 WG individuals] --variant ... -L chr10_region.interval I started running v3.0, but also ran v3.3, with the same issue. ### Issue: Each time it runs it gets to the same point and then hangs for >18 hours, before I kill the job. INFO 09:57:47,938 HelpFormatter - Executing as me@kingspeak20 on Linux 2.6.32-504.30.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_85-mockbuild_2015_07_25_13_10-b00. INFO 09:57:47,939 HelpFormatter - Date/Time: 2016/03/10 09:57:47 INFO 09:57:47,939 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:57:47,940 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:57:48,441 GenomeAnalysisEngine - Strictness is SILENT INFO 09:57:48,661 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 If I'm correct 78 WG individuals should not be an issue. Any ideas as to what is causing this issue? --Thanks Created 2016-03-03 19:55:38 | Updated 2016-03-03 19:57:50 | Tags: nct nt genotypegvcfs I'm trying to get GenotypeGVCFs to use only a few CPUs (shared cluster). I've tried running GenotypeGVCFs with INFO 11:50:37,171 HelpFormatter - Program Args: -nt 1 -T GenotypeGVCFs -V vcffiles.list and INFO 11:50:37,171 HelpFormatter - Program Args: -T GenotypeGVCFs -V vcffiles.list It uses as many CPUs as it pleases, causing angry admins and lab mates... Suggestions? Created 2016-02-29 00:11:59 | Updated | Tags: selectvariants variantfiltration phasing genotypegvcfs Hi, I'd be grateful for your help in understanding how GATK add phase information (PID field) in GenotypeGVCFs. Here's an example of two lines from a VCF I'm getting for running GenotypeGVCFs on a set of samples: chr1 8864083 . T C 462.27 . AC=4;AF=0.500;AN=8;BaseQRankSum=-4.950e-01;ClippingRankSum=0.406;DP=74;FS=0.000;MLEAC=4;MLEAF=0.500;MQ=60.00;MQRankSum=0.306;QD=6.42;ReadPosRankSum=1.54;SOR=0.674 GT:AD:DP:GQ:PL 0/1:22,12:34:99:205,0,518 0/1:10,4:14:60:60,0,220 0/1:3,3:6:71:71,0,72 0/1:10,8:18:99:158,0,238 ./.:2,0:2 chr1 8864426 . C T 5302.72 . AC=5;AF=0.500;AN=10;BaseQRankSum=0.815;ClippingRankSum=-6.270e-01;DP=316;FS=7.664;MLEAC=5;MLEAF=0.500;MQ=60.00;MQRankSum=-6.130e-01;QD=17.05;ReadPosRankSum=-1.240e-01;SOR=0.777 GT:AD:DP:GQ:PGT:PID:PL 0/1:24,66:90:99:0|1:8864426_C_T:1596,0,649 0/1:20,62:82:99:.:.:1502,0,536 0/1:24,37:61:99:0|1:8864426_C_T:921,0,647 0/1:22,41:63:99:0|1:8864426_C_T:1014,0,717 0/1:2,13:15:74:0|1:8864426_C_T:304,0,74 What's the interpretation for the phase information of the last sample in chr1:8864426? in position chr1:8864083 its genotype is not reported but in the proceeding position chr1:8864426 it is phased 0|1 WRT position chr1:8864083. If I subsequently run SelectVariants for that sample, with --excludeNonVariants this phase information is retained but clearly looses context. So another question is if there is a way to change the PID field when running SelectVariants or VariantFiltration in case the position it is phased with is filtered? Thanks a lot Created 2016-02-27 23:29:35 | Updated | Tags: multiallelic genotypegvcfs Hi, I ran the GenotypeGVCFs (on RNA-seq data) on several samples and I'm getting several positions I'm having a hard time interpreting. They look like this example: chr6 135674438 . C A,* 10623.61 PASS AC=1,1;AF=0.167,0.167;AN=6;DP=860;FS=2.692;MLEAC=1,1;MLEAF=0.167,0.167;MQ=60.00;QD=26.76;SOR=0.390 GT:AD:DP:GQ:PL 1/2:2,0,223:401:99:10634,6990,5967,1315,0,4631 0/0:23,0,0:23:69:0,69,672,69,672,672 0/0:5,0,0:5:0:0,0,98,0,98,98 ./.:0,0,0:0 ./.:0,0,0:0 ./.:0,0,0:0 ./.:0,0,0:0 ./.:0,0,0:0 What I don't understand is where does the "A" ALT allele come from? All the samples leave the first (the one with GT = 1/2) are either homozygous for the REF allele or do not cover this site. The first sample is indicated to have 2 reads with the REF allele, 0 reads with the "A" allele, and 223 reads with the "*" allele - which I guess is all other possible alleles? So my question is why is the "A" allele even indicated if it is not supported by any of the samples? Thanks a lot Created 2016-02-10 04:42:23 | Updated | Tags: best-practices genotypegvcfs gvcf I am using the GATK pipeline to call variants by aligning reads to a draft quality reference genome that is ~367000 scaffolds. I split the scaffolds up into 50 intervals and successfully (and pretty quickly) generated GVCFs for 25 individuals using the -L option. However, I am having the worst of times with GenotypeGVCFs. After running for nearly 2 days on the first interval list, GenotypeGVCFs has not even output a file. Based on another post in the forum, I removed the scaffolds that are NOT in the interval from the GVCF header, and that sped up the process slightly - I have a combined VCF file with just the header generated after about 18 hours. Not sure how much longer the process has as the progress meter doesn't seem to be making any sense. Is there any known way(s) to optimize this process? Currently using the following command: java -Djava.io.tmpdir=/data/lwwvd/genoGVCF.tmp -XX:ParallelGCThreads=4 -Xmx15g -jar /usr/local/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -nt 16 -T GenotypeGVCFs -R ../ref_genomes/bbu_ref_UMD_CASPUR_WB_2.0.fa -L interval_lists/bbub.refctgs.49.interval_list -V ./1095/1095.49.g.vcf.gz -V ./189/189.49.g.vcf.gz -V ./190/190.49.g.vcf.gz -V ./196/196.49.g.vcf.gz -V ./246/246.49.g.vcf.gz -V ./337/337.49.g.vcf.gz -V ./581/581.49.g.vcf.gz -V ./583/583.49.g.vcf.gz -V ./662/662.49.g.vcf.gz -V ./701/701.49.g.vcf.gz -V ./850/850.49.g.vcf.gz -V ./92764/92764.49.g.vcf.gz -V ./92765/92765.49.g.vcf.gz -V ./92766/92766.49.g.vcf.gz -V ./92767/92767.49.g.vcf.gz -V ./92768/92768.49.g.vcf.gz -V ./92769/92769.49.g.vcf.gz -V ./92770/92770.49.g.vcf.gz -V ./92771/92771.49.g.vcf.gz -V ./92774/92774.49.g.vcf.gz -V ./92775/92775.49.g.vcf.gz -V ./92776/92776.49.g.vcf.gz -V ./92777/92777.49.g.vcf.gz -V ./92778/92778.49.g.vcf.gz -V ./92795/92795.49.g.vcf.gz -o BBUB.combined.49.vcf Created 2016-01-28 01:54:09 | Updated 2016-01-28 02:02:57 | Tags: genotypegvcfs pgt Hi, In the output VCF of GenotypeGVCFs, there are 0|1, 1|0, 1|1 and . for PGT, and no 0|0. When seeing a . for PGT, I feel uncertain, as I don't know whether it is actually a 0|0 or missing data (e.g. due to zero useful coverage at the position). So it will be great if GenotypeGVCFs can also output the 0|0. Tested GATK version: V3.5 Thanks! Frank Created 2016-01-24 09:53:26 | Updated | Tags: genotypegvcfs hi there, I am trying to use GenotypeGVCFs on a sample of just five maize individuals.(version 3.4) The programs stalls here and stays there for hours without exiting or starting: INFO 03:10:45,593 HelpFormatter - --------------------------------------------------------------------------------- INFO 03:10:45,596 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 INFO 03:10:45,596 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 03:10:45,596 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 03:10:45,600 HelpFormatter - Program Args: -T GenotypeGVCFs -R /home/ywlee1/scratch_yw/reference_genomes/GATK_Zea.fasta -L Zm_B73_CR10 -V 01DKD2_CR10.g.vcf -V 80IDM2_CR10.g.vcf -V B73_CR10.g.vcf -V COJO528_CR10.g.vcf -V FBLL_CR10.g.vcf --disable_auto_index_creation_and_locking_when_reading_rods -o test_new_ref.vcf INFO 03:10:45,605 HelpFormatter - Executing as ywlee1@stluhpcsubprd02 on Linux 2.6.32-504.16.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13. INFO 03:10:45,605 HelpFormatter - Date/Time: 2016/01/24 03:10:45 INFO 03:10:45,605 HelpFormatter - --------------------------------------------------------------------------------- INFO 03:10:45,606 HelpFormatter - --------------------------------------------------------------------------------- INFO 03:10:47,653 GenomeAnalysisEngine - Strictness is SILENT INFO 03:10:48,454 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 I'm not sure what's happening? I tried the same command line on a sample of four soy lines and it ran in 20 minutes. The command line is below: java -Xmx10g -jar /home/ywlee1/bin/apps/GATK3/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /home/ywlee1/scratch_yw/reference_genomes/X.fasta -V E.g.vcf -V D.g.vcf -V C.g.vcf -V B.g.vcf -V A.g.vcf -allSites -o Zm.CR10.GATK3.vcf Because there aren't any error messages I'm not sure where to start troubleshooting. Any advice would be greatly appreciated! thanks, Young Wha Created 2016-01-02 01:21:30 | Updated | Tags: genotypegvcfs Hi, can I combine zipped gvcf files using GenotypeGVCFs? I believe it makes an error, but I cannot use unzipped files due to space limit. Are there any other method I can do? I made a vcf file per sample and tried to merge those vcf files using vcftools, but failed. Created 2015-12-30 19:55:46 | Updated | Tags: genotypegvcfs When we run GenotypeGVCFs without the combine step, we are getting wrong genotypes for positions spanning deletions. We are also missing the AD values for the deletion allele. For example see these calls: 1 6529182 rs375111412 TTCCTCC T,TTCC 7506.41 . AC=1,1;AF=0.009615,0.221;AN=4;BaseQRankSum=0.096;ClippingRankSum=0.292;DB;DP=3437;ExcessHet=18.9343;FS=0;InbreedingCoeff=-0.2993;MLEAC=1,23;MLEAF=0.009615,0.221;MQ=60.04;MQRankSum=0.042;QD=4.27;ReadPosRankSum=0.585;SOR=0.674 GT:AD:DP:GQ:PGT:PID:PL 0/2:165,4,22:191:99:.:.:408,769,11860,0,7242,6888 0/1:24,38,2:64:99:.:.:1513,0,1384,1504,993,2521 1 6529183 . T . . . AN=4;DP=3786 GT:AD:DP:PGT:PID:RGQ 0/0:165:191:.:.:99 0/0:24:64:.:.:99 We are using GATK 3.5. We had 52 samples in this VCF and sliced it with bcftools for easier reading. This is the same output if we do run CombineGVCFs on the individual gVCF files first. 1 6529182 rs375111412 TTCCTCC TTCC,T 7506.41 . AC=1,1;AF=0.221,0.009615;AN=4;BaseQRankSum=0.096;ClippingRankSum=0.292;DB;DP=3437;ExcessHet=18.9343;FS=0;InbreedingCoeff=-0.2993;MLEAC=23,1;MLEAF=0.221,0.009615;MQ=60.04;MQRankSum=0.042;QD=4.27;ReadPosRankSum=0.585;SOR=0.674 GT:AD:DP:GQ:PGT:PID:PL 0/1:165,22,4:191:99:.:.:408,0,6888,769,7242,11860 0/2:24,2,38:64:99:.:.:1513,1504,2521,0,993,1384 1 6529183 . T * . . AC=0;AF=0;AN=4;DP=3786 GT:AD:DP:PGT:PID:RGQ 0/0:165,22:191:.:.:99 0/0:24,38:64:.:.:99 Should GenotypeGVCFs properly genotype positions spanning deletions? Or is running CombineGVCFs a required step? Thanks, Carlos Created 2015-12-29 05:30:09 | Updated | Tags: genotypegvcfs gvcf ask the joint genotyping tool, GenotypeGVCFs Created 2015-12-21 17:55:25 | Updated | Tags: haplotypecaller genotypegvcfs combinegvcfs Hi all, I'm very new to GATK. I'm trying to map an EMS mutation in Arabidopsis. I have fastq files of a wt M3 bulk and a mut M3 bullk (both offspring of the same parent). The strategy is to call for SNPs->GenotypeGVCFs to a single file. That was done succesfully (I think). Next step is to look for SNPs that are homozygous (1/1) for the mut reads and het (1/0 or 0/0) or ref in the wt bulk; I used this command for this: grep -v '^##'line.genotype10.vcf | awk 'BEGIN{FS=" "; OFS=" "} $10~/^1\/1/ && ($11~/^1\/0/ || $11~/^0\/0/) {$3=$7=""; print$0}' | sed 's/ */ /g' >file.taxt

Tha also worked pretty well. I noticed that I have ~150,000 records (SNPs or indels) using the HC but after merging the files using the GenotypeGVCFs I'm left w/ only a few thousands records. The same happens if I use CombineGVCFs (which keep ~150,000 records) and then go for GenotypeGVCFs.

The problem is that with such low # of reads it doesn't recognize a genomic region that fulfil that hom requirement for the mut bulk and het/ref for the wt one. My question are:

1. Why does GenotypeGVCFs reduces the read #.
2. If anyone has other suggestions that would be great.

Thanks a lot, Guy

Created 2015-12-18 06:03:50 | Updated | Tags: genotypegvcfs

Recently I got some problem during running GenotypeGVCFs on my 360 WGS samples.

My samples are human 15X whole-genome sequencing data.

I made 360 gvcf files by running HaplotypeCaller for each sample.

But when I run GenotypeGVCFs with the default parameters, it showed the estimated time as 1000 weeks.

I tried to combine gvcfs by using CombineGVCFs and tried to run GenotypeGVCFs on these combined vcfs, it failed due to the memory shortage, I allocated 400G for java heap though.

Is there any guideline for running GenotypeGVCFs efficiently to many samples?

Junehawk

Created 2015-12-17 07:35:45 | Updated | Tags: genotypegvcfs

I'm running GenotypeGVCFs on about 50 samples, split by chromosome across several cluster nodes. I see that each sample, regardless of chromosome stalls at position 999901. This continues at 1999901, 2999901, and so on. More out of curiosity than anything, is there a straightforward explanation to this behavior?

Thanks!

Created 2015-12-10 18:43:04 | Updated | Tags: haplotypecaller heterozygosity genotypegvcfs

Hi GATK team,

I was hoping I could get some insight on determining rate of heterozygosity from a gvcf file. We have three diploid lizard samples. Each was run through our GATK pipeline using HC in GVCF mode with -ERC GVCF followed by joint genotyping using all three samples.

I want to determine the rate of heterozygosity in each lizard by counting the number of heterozygous sites and dividing by the number of callable sites (i.e not './.') for every position in the genome (whether or not it is a variant).

However after reading a response to a question on the forum http://gatkforums.broadinstitute.org/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf, "Short answer is that you shouldn't be looking at the genotype calls emitted by HC in GVCF mode. Longer answer, the gVCF is meant to be only an intermediate and the genotype calls are not final"

I am note sure this is the correct way to count heterozygous sites.

From the HC gvcf file, could I extract the number of callable sites from the GCVFBlocks and variant entries in the file to get the total number of callable sites (excluding './.' entries)? Then count the number of heterozygouse genotypes from the joint genotyping gvcf output?

HC command:

java -Xmx3g -jar /home/dut/bin/GATK/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ --variant_index_type LINEAR \ --variant_index_parameter 128000 \ -ERC GVCF \ -R reference.fa \ -I sample001.bam \ -stand_call_conf 30 \ -stand_emit_conf 30 \ -mbq 17 \ -o sample001.rawVAR.vcf)

JointGenotyping command: java -Xmx3g -jar /home/dut/bin/GATK/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ --includeNonVariantSites \ -ploidy 2 \ -R reference.fa \ --variant sample8450.rawVAR.vcf \ --variant sample003.rawVAR.vcf \ --variant sample001.rawVAR.vcf \ -o jg_sample001_sample003_sample8450.vcf

We are using GATK version 3.3.0

Any suggestions are appreciated! Thank you for your time.

Best, Morgan

Created 2015-12-07 22:04:39 | Updated | Tags: genotypegvcfs gvcf

We would like to reduce the size of the output file when using --includeNonVariantSites. Can GenotypeGVCFs output in gVCF format when using --includeNonVariantSites?

Thanks, Carlos

Created 2015-11-28 21:23:12 | Updated | Tags: knownsites genotypegvcfs

I have what seems like a simple question, but I'm a little confused on the proper steps to take:

I want to run GenotypeGVCFs, but force a call at all known sites from a previous curated VCF records file, even if the site is no-called (./.).

I've search through the Forum boards, but have not found a clear way to do this. Any help would be appreciated.

Thanks!

Created 2015-11-25 17:35:43 | Updated | Tags: genotypegvcfs mergedgvcf

Basically what I would like to do is give a selection of sample names to use to GenotypeGvcf so that only a subset of the samples in a merged gVCF file is used. So if I would have a merged gVCF file with sample A, B and C I would like to be able to do

 java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fasta \
--variant sample1.g.vcf \
--variant sample2.g.vcf \
-o output.vcf \
**--subset A, B**

so that only sample A and B are used. I could not find such a parameter existing, but I may have missed it. Is this possible?

Thanks! Niek

Created 2015-11-23 23:42:52 | Updated 2015-11-24 00:02:00 | Tags: combinevariants beagle genotypegvcfs rgq

Hi all,

I am trying to combine two VCF files. The first VCF contains data from genotyping-by-sequencing (low coverage) and variants were called with another software (TASSEL). This first VCF has FORMAT fields GT:AD:DP:GQ:PL.

The second VCF was generated by running GenotypeGVCF using the first VCF as the --intervals file. Basically, I am calling the sites from my first VCF out of the whole-genomes contained in the GVCF. I then run CombineVariants on the two VCFs to get the union of both sets of samples.

My problem is that the GenotypeGVCF contains GT:AD:DP:RGQ. There is no more PL field. My guess is that this happened since I used the --includeNonVariantSites option, which was necessary since many sites in my VCF are not variant in the GVCF.

When I CombineVariants I get the union of FORMAT fields. So the samples from the first VCF have PL, those from the GVCF do not. I now want to do imputation (probably with Beagle) and take advantage of the information contained in the genotype likelihoods. Beagle uses the PL field, so the samples without PL would force me to use the GT fields. Using the GT is not okay because Beagle would use that field even for the low coverage data where you might call a homozygote with just one read.

So how can I get PL out of GenotypeGVCF in my situation OR how can I calculate PL, or re-calculate all PL's from the read depth information in the combined VCF?

Created 2015-11-17 15:29:39 | Updated | Tags: haplotypecaller genotypegvcfs

Is there a way to use gzipped gVCFs files when using HaplotypeCaller and GenotypeGVCFs.

If so how do you make the index files? I can't seem to get it to work.

(I have searched the forum and can't seem to find a definitive answer. Sorry)

Created 2015-11-10 17:03:00 | Updated 2015-11-10 17:05:10 | Tags: genotypegvcfs gatk3-3

I am trying to run GenotypeGVCF using the following command on a cluster and am getting the following exception. How can I fix this? -- Thanks in advance!

java -Xmx30G -jar /opt/ngstools/variantcallers/gatk/GATK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /opt/ngstools/Human_Decoy_REF/hs37d5.fa -V /home/user/gvcfs.list -L /home/user/chr_1.bed -D /opt/ngstools/dbsnp_138.b37.vcf --out /home/user/chr_1.vcf

INFO 11:23:53,718 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 11:23:53,718 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 11:23:56,923 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files

Where $filesToJoin is a string with -V individual.gvcf 1. The total number of individuals is 400, and I combined every 100 together. Can I do the calibration with several -input arguments or must I merge all merged files into one? Created 2015-10-26 15:46:12 | Updated | Tags: ad genotypegvcfs I have found several sites in my VCF files where GenotypeGVCFs appears to output the AD records in the wrong order, since the AD values do not match either the call or the order of values in the PL field. The vast majority of sites are correct, but sometimes they look like this (INFO column removed for readability): POS ID REF ALT QUAL FILTER FORMAT Sample1 Sample2 349 . G A 210669.47 . GT:AD:DP:GQ:PL 1:3,0,238:241:99:9052,0 1:0,0,59:59:99:2699,0 910 . T A,G 177554.11 . GT:AD:DP:GQ:PL 2:0,0,0,76:76:99:3600,3600,0 1:0,90,0,0:90:99:4321,0,4321 1948 . T G,A,C 209510.27 . GT:AD:DP:GQ:PL 1:0,234,0,0,0:234:99:10934,0,10934,10934 2:0,0,0,57,0:57:99:2745,2745,0,2745 Some observations: • In all cases I've found so far, there's more AD entries than possible haplotypes or PL entries (including the extreme case of having 5 AD values at site 1948 above) • Not all sites are affected, and not even all samples at an affected site (see site 910 and 1948 above, although this could just be due to where exactly the 'extra' AD value gets added) • There does not seem to be a pattern as to where the extra entry gets added - at several sites there are still too many AD fields, but the order is not wrong, so the extra entry must be at the end • I did check the input VCF files (produced using HaplotypeCaller with --emitRefConfidence BP_RESOLUTION), and the order/number of AD fields is fine there I haven't been able to find anything special about the affected sites, so it's not at all clear what is happing. The command line arguments used were: java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -V Sample1_raw.g.vcf -V Sample2_raw.g.vcf -R TanzaniaDog_RV2772_KF155002.1.fasta -o AllSamples_raw.gvcf --annotateNDA --includeNonVariantSites -stand_emit_conf 10 -stand_call_conf 30 I did try specifying ploidy in the above command (--sample_ploidy 1), but this does not solve the problem. Created 2015-10-26 12:17:38 | Updated | Tags: selectvariants genotypegvcfs gvcf GATK Admins, We are currently working on a project, and we have found that some of our samples were contaminated after the gVCF merging phase. Is it possible to remove samples from a merged gVCF (likely using SelectVariants), or would we need to re-merge only the good gVCFs into a new merged gVCF? (Note that we're actually working with a double-merged gVCF file containing ~5,000 samples, so re-merging would be potentially costly). Thanks, John Wallace Created 2015-10-21 13:46:12 | Updated 2015-10-21 13:56:18 | Tags: haplotypecaller genotypegvcfs gvcf gatk3-4 Hi all, I'm currently confused about the snips called as shown below. If I am not mistaken, the first row shows gatk called an 34 bp insertion in sample 001 at position 3229753. It didn't call anything for sample 001 on position 3229753, but then for position 3229756, it calls another 15bp insertion for sample 001, which overlaps completely with the first insertion. I have three questions about this. 1) Is my interpretation of the data shown below correct 2) If this is correct, is this expected behaviour for gatk? What kind of circumstances are expected to generate these results? 3) How can I interpret these conflicting snips, should I just pick the call with the highest confidence and ignore the other? What about if a lower-confidence call is a substring of a previous call in another sample? #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001 002 003 004 gi|ref| 3229753 0 A AACTTGCCTGCCACGCTTTTCTTTATACTTAACCC 9635.2 0 AC=3;AF=1.00;AN=3;DP=304;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=59.86;QD=29.65;SOR=0.779 GT:AD:DP:GQ:PL 1:0,48:48:99:2153,0 1:0,84:84:99:3696,0 .:0,0 1:0,85:85:99:3813,0 gi|ref| 3229754 0 A ACTTGCCTGCCACGCTTTTCTTTATACTTAACCCAGGCGCTAATTCATCTGCAACG 3012.2 0 AC=1;AF=1.00;AN=1;DP=291;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.91;QD=28.35;SOR=0.910 GT:AD:DP:GQ:PL .:0,0 .:0,0 1:0,69:69:99:3039,0 .:0,0 gi|ref| 3229756 0 G GCGCTAATTCATCTGC 3654.2 0 AC=3;AF=1.00;AN=3;DP=74;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=60.00;QD=28.36;SOR=0.747 GT:AD:DP:GQ:PL 1:0,17:17:99:854,0 1:0,25:25:99:1213,0 .:0,0 1:0,32:32:99:1614,0 Created 2015-10-13 08:56:55 | Updated | Tags: genotypegvcfs chloroplast Dear GATK specialists, I would like to ask you for advice; I am genotyping combined gvcf for about 1000 samples. For each of them I have chloroplast genome of about 150kb. Problem is that coverage for those chloroplasts reach 500-1000x. When I try to genotype them altogether - I wait 16h and not even a single position is genotyped for the dataset: My command: java -Xmx200G -jar /ebio/abt6/rgutaker/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fa -V combined.gvcf.gz -o genotyped.gvcf -nt 5 Would you have any idea how should I genotype that dataset whilst still being able to include all the information for genotyping? Many thanks. Rafal Created 2015-10-09 09:31:41 | Updated | Tags: downsampling genotypegvcfs The documentation for GenotypeGVCFs states that it is employing downsampling: Downsampling settings This tool applies the following downsampling settings by default. Mode: BY_SAMPLE To coverage: 1,000 I also see this in my output :"INFO 14:51:47,705 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000" Specifically what is it downsampling, the genotype likelihoods per locus or active region? Is this behaviour changeable via -dcov, or would changing it be inadvisable? Created 2015-10-08 08:59:25 | Updated | Tags: multi-sample genotypegvcfs Hi, I have generated 9 .g.vcf files from my .bam files, and want to do joint genotype calling with all these gvcf files using GenotypeGVCFs tool. I have the following command: java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fasta -V:VCF gvcflist.list -o output.vcf In my output, I expected to find 9 sample columns for the genotype (ie 0/1:195,32:250:99:534,0,6705), however I can only find one sample column in the vcf file, and it seems that all 9 samples were pooled together. May I know if I have done something wrong? I would like to know individual genotype for each sample. I have tried to use: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I sam1.bam -I sam2.bam -I sam3.bam .... and also only obtain one column instead of multiple-sample column for the genotype. Thank you. Created 2015-09-15 21:04:45 | Updated | Tags: genotypegvcfs Dear GATK Team, I have a quick question. Can I use a .vcf file generated by Varscan as input for GenotypeGVCFs? These VCF files have all variant and non variant positions. Created 2015-08-18 08:28:40 | Updated 2015-08-18 08:30:16 | Tags: genotypegvcfs Hi, I tried to use GenotypeGVCFs to perform joint genotyping with 5 samples. I used -V mygvcfs.list to submit multiple gvcfs, but I got the following error message: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types. So, I run again with -V:VCF mygvcfs.list. I got another error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file I checked back my gvcfs, the #CHROM header line is present. I also tried to run GeotypeGVCFs by typing the input gvcfs one by one on command line, and it works well. Any suggestion on why this happened and how can I submit the input gvcfs in list? I am using GATK version: 3.3.0 Thanks. Created 2015-08-12 15:42:52 | Updated | Tags: genotypegvcfs I wanted to produce the cohort vcf file from several g.vcf files made using HaplotypeCaller. I used the 'GenotypeGVCFs' and gave a list of g.vcf files. The order of samples in the final vcf file is different from the gvcf file list I provided. It looks like it sorted alphabetically with the sample name. How can I get samples in the same order as in the gvcf file list? I am using GATK 3.3. Created 2015-08-07 22:07:06 | Updated | Tags: combinevariants genotypegvcfs Dear GATK team, I have a set of samples for which I have WES and WGS data. I have run the same pipeline (mostly following GATK's Best Proctices) on both datasets with the same reference sets and target sites down to VQSR. So now I have to VCFs ready to analyze. My idea is to merge these datasets and I believe my situation parallels to the third case described in this document gatkforums.broadinstitute.org/discussion/53/combining-variants-from-different-files-into-one. Therefore, I used -T CombineVariants to merge these two files into one. Up until here everything runs ok and I obtain a working final VCF. However when I look deeper into it, there are some variants that have been kept duplicated but present different reference allele. When I look back at the separate VCF files, these differences are already there. My question is, how two sets of files that have been processed exactly in the same way, can present different reference sites? I paste couple of examples for clarification: chr10 100167436 . T C 177.06 PASS AC=1;AF=0.011;AN=94; set=WGS chr10 100167436 . TGTCACCAGGGGTCACCAGGGATGAGGACC CGTCACCAGGGGTCACCAGGGATGAGGACC,T 56690.81 PASS AC=50,2;AF=0.024,9.533e-04;AN=2098; set=WES chr10 101462504 . CT CTT,C 1539.20 PASS AC=34,48;AF=0.014,0.020;AN=2348; set=WES chr10 101462504 . C T,CT . PASS AC=0,0;AF=0.00,0.00;AN=114; set=WGS chr10 102295645 . G GT 627.02 PASS AC=7;AF=0.061;AN=114; set=WGS chr10 102295645 . GT GTT,G,TT 63780.89 PASS AC=203,403,17;AF=0.088,0.174,7.328e-03;AN=2320; set=WES Any thoughts on this behaviour? and possible solutions? Thanks in advance Victoria Created 2015-08-04 23:35:08 | Updated 2015-08-04 23:36:54 | Tags: haplotypecaller vcf genotypegvcfs It appears that there are some cases in which an upstream site is heterozygous for a deletion, and a downstream site should be heterozygous for something like 1/2 (where 2 is the * allele), but GATK is still making erroneous 0/1 calls. For instance, in my dataset I have a deletion at position 773 (I'm leaving out some fields and GT info for additional samples for brevity). scaffold_1 773 . CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA C ... 0/1:102,70:172:99:0|1:773_CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA_C:283 9,0,4099 And just 2 bases downstream, there's a SNP for the same sample that has non- calls on both strands, which seems impossible. Shouldn't there be a allele here, e.g (ALT==G,* ; GT==1/2)? Here's the call: scaffold_1 775 . A G ... 0/1:60,134:194:99:4507,0,1642 Could this be a bug, such that when there's a heterozygous ALT call on the strand without the deletion, the half of the GT field for deleted strand is still getting a reference (0) call when it should be getting a * ? At another site overlapping with the same deletion, at which the strand without the deletion for this sample is a REF call , I get something that's more like what I'd expect. scaffold_1 805 . G T,* ... 0/2:235,0,33:268:99:0|1:773_CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA_C:676,1384,11251,0,9868,9762 This was done with GenotypeGVCFs from GATK 3.4-46. Created 2015-07-28 14:07:17 | Updated | Tags: genotypegvcfs Hi. I found that in some case, GenotypeGVCFs may output IUB code (such as G, )into vcf. What is the meaning of "" and how to remove the "" from ALTs ? GATK Version 3.4.0-46 Command: java -jar GATK.jar -T GenotypeGVCFs -R${fullref} --dbsnp ${dbsnp} -o out.vcf -V v1.g.vcf -V v2.g.vcf ... (no -nt option) the VCF file(for test, we call var on chr22 only.) # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_100N Sample_100T Sample_101N Sample_101T Sample_104_N Sample_104_T Sample_105_N Sample_105_T Sample_106_N Sample_106_T Sample_107_N Sample_107_T Sample_108_N Sample_108_T Sample_109_N Sample_109_T 22 44862615 rs8777 A G,* 29879.7 . AC=18,2;AF=0.563,0.063;AN=32;BaseQRankSum=1.35;ClippingRankSum=0.182;DB;DP=1221;FS=1.288;InbreedingCoeff=-0.6;MLEAC=18,2;MLEAF=0.563,0.063;MQ=60;MQRankSum=0.119;QD=24.71;ReadPosRankSum=-0.194;SOR=0.855 GT:AD:DP:GQ:PGT:PID:PL 0/1:31,46,0:77:99:0|1:44862609_CTG_C:1073,0,1208,1167,1293,2460 0/1:50,48,0:98:99:0|1:44862609_CTG_C:1841,0,1843,1990,1987,3977 1/1:0,80,0:80:99:1|1:44862609_CTG_C:3770,256,0,3770,256,3770 1/1:0,80,0:80:99:1|1:44862609_CTG_C:3553,244,0,3553,244,3553 0/1:41,42,0:83:99:0|1:44862609_CTG_C:1598,0,1571,1722,1700,3422 0/1:38,50,0:88:99:0|1:44862609_CTG_C:1988,0,1404,2102,1561,3662 1/1:0,73,0:73:99:1|1:44862609_CTG_C:2961,226,0,2961,226,2961 1/1:0,60,0:60:99:1|1:44862609_CTG_C:2510,181,0,2510,181,2510 0/2:45,0,40:85:99:.:.:1502,1637,3506,0,1869,1749 0/2:32,0,33:65:99:.:.:1269,1365,2707,0,1341,1242 0/1:31,30,0:61:99:0|1:44862609_CTG_C:1166,0,1207,1260,1297,2557 0/1:26,35,0:61:99:0|1:44862609_CTG_C:1392,0,951,1470,1057,2526 0/1:45,44,0:89:99:0|1:44862609_CTG_C:1712,0,1744,1848,1877,3724 0/1:45,53,0:98:99:0|1:44862609_CTG_C:2127,0,1719,2263,1881,4144 0/1:25,19,0:44:99:0|1:44862609_CTG_C:722,0,945,797,1005,1802 0/1:46,21,0:67:99:0|1:44862609_CTG_C:767,0,1858,906,1924,2830 Created 2015-07-22 11:46:00 | Updated | Tags: combinevariants haplotypecaller best-practices vcf gatk genotypegvcfs combinegvcfs gvcf I was trying to do combine sets of vcf files for all my samples so that I have one single vcf output using this command option below java -d64 -Xmx48g -jar${GATK}/GenomeAnalysisTK.jar \ -R ${REF} \ -T GenotypeGVCFs \ --variant A.g.vcf \ --variant B.g.vcf \ --variant C.g.vcf \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -o genotype.vcf but I got this error message “The following invalid GT allele index was encountered in the file: END=21994810”. I have tried to locate where the problem could be coming from but I do not understand this. Could you please advise me. Created 2015-07-17 22:01:12 | Updated 2015-07-17 22:14:12 | Tags: haplotypecaller ploidy haploid genotypegvcfs combinegvcfs Hi, I'm attempting to run GenotypeGVCFs on a cohort of ~4200 human male samples with targeted sequencing. I'm following the current DNA-Seq guidelines for cohort genotyping, with GATK v3.4-0. For each sample, I ran HaplotypeCaller separately for diploid and haploid (i.e., chrX non-PAR) regions, specifying --ploidy 1 for the haploid regions, then combined the resulting two GVCFs with CombineGVCFs. I then combined the per-sample GVCFs into groups of 64 samples using CombineGVCFs. Finally, I ran GenotypeGVCFs with all samples separately for groups of ~100 small target intervals (baits). Every group of target intervals ran fine without error in about 4 hours with ~5GB of RAM, except for the non-PAR chrX regions, which were haploid for all samples. For the haploid regions, GATK hangs on the very first base, slowly increasing memory usage, then eventually runs out of memory and exits. The estimated runtime keeps increasing without making any progress. The last run exited after 12 hours without making any progress. This happens no matter how much memory I specify (up to 128 GB). Interestingly, a PAR region of chromosome X run with --ploidy 2 in HaplotypeCaller worked with no problem. The inputted GVCF files to GenotypeGVCFs are uncompressed and were indexed by CombineGVCFs. I'm using default settings for GenotypeGVCFs, except for the following:  --standard_min_confidence_threshold_for_calling 20 --standard_min_confidence_threshold_for_emitting 10 --dbsnp dbSNP142.20150416.GRCh37.for-GATK.chr1-MT.vcf.gz  I tried running GenotypeGVCFs with the latest v3.4-46 release, but the same problem occurred. Below is example output: INFO 10:57:51,803 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:57:51,810 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 INFO 10:57:51,811 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 10:57:51,811 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 10:57:51,818 HelpFormatter - Program Args: -T GenotypeGVCFs -R /tmp/12715944.hpc-pbs.hpcc.usc.edu/hs37m.fa --dbsnp dbSNP142.20150416.GRCh37.for-GATK.chr1-MT.vcf.gz --standard_min_confidence_threshold_for_calling 20 --standard_min_confidence_threshold_for_emitting 10 [LONG LIST OF VARIANT FILES OMITTED] --out gatk.hc.combined.genotyped.chunk117.vcf.gz -L split_117.intervals --log_to_file gatk.hc.combined.genotyped.chunk117.log INFO 10:57:51,824 HelpFormatter - Executing as cedlund@hpc1130.m10g.hpcc.usc.edu on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 10:57:51,825 HelpFormatter - Date/Time: 2015/07/17 10:57:51 INFO 10:57:51,826 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:57:51,826 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:57:56,331 GenomeAnalysisEngine - Strictness is SILENT INFO 10:57:56,671 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 10:59:03,550 IntervalUtils - Processing 154370 bp from intervals WARN 10:59:03,615 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation INFO 10:59:03,766 GenomeAnalysisEngine - Preparing for traversal INFO 10:59:03,768 GenomeAnalysisEngine - Done preparing for traversal INFO 10:59:03,768 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:59:03,769 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:59:03,770 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 10:59:04,283 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 10:59:57,328 ProgressMeter - X:51033766 216.0 53.0 s 68.9 h 0.2% 7.2 h 7.2 h INFO 11:00:27,330 ProgressMeter - X:51035366 216.0 83.0 s 4.5 d 1.2% 111.5 m 110.1 m INFO 11:00:57,353 ProgressMeter - X:51035366 216.0 113.0 s 6.1 d 1.2% 2.5 h 2.5 h INFO 11:01:27,354 ProgressMeter - X:51035366 216.0 2.4 m 7.7 d 1.2% 3.2 h 3.2 h INFO 11:01:57,356 ProgressMeter - X:51035366 216.0 2.9 m 9.3 d 1.2% 3.9 h 3.8 h INFO 11:02:27,358 ProgressMeter - X:51035366 216.0 3.4 m 10.9 d 1.2% 4.5 h 4.5 h INFO 11:02:57,882 ProgressMeter - X:51035366 216.0 3.9 m 12.5 d 1.2% 5.2 h 5.2 h INFO 11:03:27,884 ProgressMeter - X:51035366 216.0 4.4 m 14.2 d 1.2% 5.9 h 5.8 h INFO 11:03:57,885 ProgressMeter - X:51035366 216.0 4.9 m 15.8 d 1.2% 6.6 h 6.5 h INFO 11:04:27,887 ProgressMeter - X:51035366 216.0 5.4 m 17.4 d 1.2% 7.3 h 7.2 h INFO 11:04:58,976 ProgressMeter - X:51035366 216.0 5.9 m 19.0 d 1.2% 7.9 h 7.8 h INFO 11:05:28,977 ProgressMeter - X:51035366 216.0 6.4 m 2.9 w 1.2% 8.6 h 8.5 h INFO 11:05:58,979 ProgressMeter - X:51035366 216.0 6.9 m 3.2 w 1.2% 9.3 h 9.2 h INFO 11:06:28,981 ProgressMeter - X:51035366 216.0 7.4 m 3.4 w 1.2% 10.0 h 9.8 h INFO 11:06:58,982 ProgressMeter - X:51035366 216.0 7.9 m 3.6 w 1.2% 10.6 h 10.5 h INFO 11:07:28,984 ProgressMeter - X:51035366 216.0 8.4 m 3.9 w 1.2% 11.3 h 11.2 h INFO 11:07:58,986 ProgressMeter - X:51035366 216.0 8.9 m 4.1 w 1.2% 12.0 h 11.8 h INFO 11:08:30,497 ProgressMeter - X:51035366 216.0 9.4 m 4.3 w 1.2% 12.7 h 12.5 h INFO 11:09:30,568 ProgressMeter - X:51035366 216.0 10.4 m 4.8 w 1.2% 14.0 h 13.8 h INFO 11:10:32,779 ProgressMeter - X:51035366 216.0 11.5 m 5.3 w 1.2% 15.4 h 15.2 h INFO 11:11:33,479 ProgressMeter - X:51035366 216.0 12.5 m 5.7 w 1.2% 16.8 h 16.6 h INFO 11:12:35,360 ProgressMeter - X:51035366 216.0 13.5 m 6.2 w 1.2% 18.2 h 17.9 h INFO 11:13:35,445 ProgressMeter - X:51035366 216.0 14.5 m 6.7 w 1.2% 19.5 h 19.3 h INFO 11:14:39,689 ProgressMeter - X:51035366 216.0 15.6 m 7.2 w 1.2% 20.9 h 20.7 h INFO 11:15:40,505 ProgressMeter - X:51035366 216.0 16.6 m 7.6 w 1.2% 22.3 h 22.0 h INFO 11:16:41,140 ProgressMeter - X:51035366 216.0 17.6 m 8.1 w 1.2% 23.7 h 23.4 h INFO 11:17:41,956 ProgressMeter - X:51035366 216.0 18.6 m 8.6 w 1.2% 25.0 h 24.7 h INFO 11:18:41,958 ProgressMeter - X:51035366 216.0 19.6 m 9.0 w 1.2% 26.4 h 26.0 h INFO 11:19:44,493 ProgressMeter - X:51035366 216.0 20.7 m 9.5 w 1.2% 27.8 h 27.4 h INFO 11:20:49,749 ProgressMeter - X:51035366 216.0 21.8 m 10.0 w 1.2% 29.2 h 28.8 h INFO 11:21:53,414 ProgressMeter - X:51035366 216.0 22.8 m 10.5 w 1.2% 30.6 h 30.3 h INFO 11:22:58,174 ProgressMeter - X:51035366 216.0 23.9 m 11.0 w 1.2% 32.1 h 31.7 h INFO 11:24:01,211 ProgressMeter - X:51035366 216.0 25.0 m 11.5 w 1.2% 33.5 h 33.1 h INFO 11:25:05,051 ProgressMeter - X:51035366 216.0 26.0 m 12.0 w 1.2% 34.9 h 34.5 h INFO 11:26:07,782 ProgressMeter - X:51035366 216.0 27.1 m 12.4 w 1.2% 36.3 h 35.9 h INFO 11:27:10,933 ProgressMeter - X:51035366 216.0 28.1 m 12.9 w 1.2% 37.8 h 37.3 h INFO 11:28:20,854 ProgressMeter - X:51035366 216.0 29.3 m 13.5 w 1.2% 39.3 h 38.8 h INFO 11:29:28,165 ProgressMeter - X:51035366 216.0 30.4 m 14.0 w 1.2% 40.8 h 40.3 h INFO 11:30:28,575 ProgressMeter - X:51035366 216.0 31.4 m 14.4 w 1.2% 42.2 h 41.6 h INFO 11:31:36,673 ProgressMeter - X:51035366 216.0 32.5 m 14.9 w 1.2% 43.7 h 43.1 h INFO 11:32:45,497 ProgressMeter - X:51035366 216.0 33.7 m 15.5 w 1.2% 45.2 h 44.7 h INFO 11:33:49,205 ProgressMeter - X:51035366 216.0 34.8 m 16.0 w 1.2% 46.7 h 46.1 h INFO 11:34:49,226 ProgressMeter - X:51035366 216.0 35.8 m 16.4 w 1.2% 48.0 h 47.4 h INFO 11:35:54,571 ProgressMeter - X:51035366 216.0 36.8 m 16.9 w 1.2% 49.5 h 48.8 h INFO 11:36:59,402 ProgressMeter - X:51035366 216.0 37.9 m 17.4 w 1.2% 50.9 h 50.3 h INFO 11:38:03,427 ProgressMeter - X:51035366 216.0 39.0 m 17.9 w 1.2% 52.3 h 51.7 h INFO 11:39:12,036 ProgressMeter - X:51035366 216.0 40.1 m 18.4 w 1.2% 53.9 h 53.2 h INFO 11:40:15,472 ProgressMeter - X:51035366 216.0 41.2 m 18.9 w 1.2% 55.3 h 54.6 h INFO 11:41:22,184 ProgressMeter - X:51035366 216.0 42.3 m 19.4 w 1.2% 56.8 h 56.1 h INFO 11:42:24,992 ProgressMeter - X:51035366 216.0 43.4 m 19.9 w 1.2% 58.2 h 57.5 h INFO 11:43:30,745 ProgressMeter - X:51035366 216.0 44.4 m 20.4 w 1.2% 59.7 h 58.9 h INFO 11:44:41,392 ProgressMeter - X:51035366 216.0 45.6 m 21.0 w 1.2% 61.3 h 60.5 h INFO 11:45:51,136 ProgressMeter - X:51035366 216.0 46.8 m 21.5 w 1.2% 62.8 h 62.0 h INFO 11:46:59,056 ProgressMeter - X:51035366 216.0 47.9 m 22.0 w 1.2% 64.3 h 63.5 h INFO 11:48:09,266 ProgressMeter - X:51035366 216.0 49.1 m 22.5 w 1.2% 65.9 h 65.1 h INFO 11:49:16,701 ProgressMeter - X:51035366 216.0 50.2 m 23.1 w 1.2% 67.4 h 66.6 h INFO 11:50:24,150 ProgressMeter - X:51035366 216.0 51.3 m 23.6 w 1.2% 68.9 h 68.1 h INFO 11:51:31,883 ProgressMeter - X:51035366 216.0 52.5 m 24.1 w 1.2% 70.5 h 69.6 h INFO 11:52:40,234 ProgressMeter - X:51035366 216.0 53.6 m 24.6 w 1.2% 72.0 h 71.1 h INFO 11:53:46,785 ProgressMeter - X:51035366 216.0 54.7 m 25.1 w 1.2% 73.5 h 72.6 h INFO 11:54:53,194 ProgressMeter - X:51035366 216.0 55.8 m 25.6 w 1.2% 75.0 h 74.0 h Here is an example of the GVCF for 3 samples in one of the problem haploid regions: X 51035345 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:78:2:0,78 .:9:99:3:0,112 X 51035346 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035347 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035348 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035349 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035350 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035351 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035352 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035353 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035354 . T <NON_REF> . . END=51035355 GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:2:45:2:0,45 .:9:99:3:0,112 X 51035356 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:41:1:0,41 .:9:99:3:0,112 X 51035357 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:40:1:0,40 .:9:99:3:0,112 X 51035358 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:39:1:0,39 .:9:99:3:0,112 X 51035359 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:38:1:0,38 .:9:99:3:0,112 X 51035360 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:41:1:0,41 .:9:99:3:0,112 X 51035361 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:39:1:0,39 .:9:99:3:0,112 X 51035362 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:38:1:0,38 .:9:99:3:0,112 X 51035363 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:7:99:4:0,120 .:1:38:1:0,38 .:9:99:3:0,112 X 51035364 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:1:38:1:0,38 .:9:99:3:0,112 X 51035365 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035366 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035367 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035368 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035369 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035370 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035371 . A C,<NON_REF> . . DP=246;MQ=60.00 GT:AD:DP:GQ:MIN_DP:PL:SB .:0,4,0:4:99:.:126,0,126:0,0,1,3 .:.:2:72:2:0,72,72 .:.:9:99:3:0,112,112 X 51035372 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035373 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:4:90:4:0,90 .:2:72:2:0,72 .:9:99:3:0,112 X 51035374 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:72:2:0,72 .:9:99:3:0,112 X 51035375 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:72:2:0,72 .:9:99:3:0,112 X 51035376 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:54:2:0,54 .:9:99:3:0,112 X 51035377 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:2:45:2:0,45 .:2:71:2:0,71 .:9:99:3:0,112 X 51035378 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:71:2:0,71 .:9:99:3:0,112 X 51035379 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:81:2:0,81 .:9:99:3:0,112 X 51035380 . G <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:78:2:0,78 .:9:99:3:0,112 X 51035381 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:45:2:0,45 .:9:99:3:0,112 X 51035382 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:PL .:18:99:3:0,105 .:2:45:2:0,45 .:9:99:3:0,112 Any help is greatly appreciated. Please let me know if you need any other information. Kindest regards, Chris Created 2015-07-14 10:11:09 | Updated 2015-07-14 10:19:20 | Tags: genotypegvcfs The GenotypeGVCFs crash for some reasons I am not clear ! it throws the message below: ... (many lines here) ... INFO 17:25:12,879 ProgressMeter - 1:58001 0.0 12.3 m 1224.3 w 0.0% 7.9 w 7.9 w INFO 17:28:21,843 ProgressMeter - 1:58001 0.0 13.5 m 1339.8 w 0.0% 8.6 w 8.6 w INFO 17:30:43,033 ProgressMeter - 1:58001 0.0 17.8 m 1770.2 w 0.0% 11.4 w 11.4 w ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:189) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:335) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:319) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:278) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:224) at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:147) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createNewResource(ReferenceOrderedDataSource.java:226) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createNewResource(ReferenceOrderedDataSource.java:185) at org.broadinstitute.gatk.engine.datasources.rmd.ResourcePool.iterator(ResourcePool.java:84) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.seek(ReferenceOrderedDataSource.java:168) at org.broadinstitute.gatk.engine.datasources.providers.RodLocusView.(RodLocusView.java:82) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.getLocusView(TraverseLociNano.java:129) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:80) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedConstructorAccessor16.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:408) at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:185) ... 18 more Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded at htsjdk.tribble.index.interval.IntervalTreeIndex$ChrIndex.read(IntervalTreeIndex.java:206) at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:364) at htsjdk.tribble.index.interval.IntervalTreeIndex.(IntervalTreeIndex.java:52) ... 22 more ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: java.lang.reflect.InvocationTargetException ##### ERROR ------------------------------------------------------------------------------------------ And I restart the program GenotypeGVCFs at the SAME position and SAME Arguments. The starting ProgressMeter is far from the beginning position,like this: ... ... INFO 17:38:37,766 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:38:37,766 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 17:38:37,767 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 17:38:37,888 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 17:39:17,370 ProgressMeter - Starting 0.0 39.0 s 65.5 w 100.0% 39.0 s 0.0 s INFO 17:40:01,886 ProgressMeter - 1:3000101 0.0 84.0 s 139.1 w 0.8% 2.9 h 2.9 h ... ... Is it a BUG or NOT ??? By the way, after a half hour, The GenotypeGVCFs crash again at another position because of java.lang.reflect.InvocationTargetException ... INFO 17:58:52,940 ProgressMeter - 1:3054001 0.0 20.3 m 2009.2 w 0.8% 41.2 h 40.9 h ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.RuntimeException: java.lang.reflect.InvocationTargetException ... Could you give me any suggestion? THX Created 2015-06-19 19:20:58 | Updated | Tags: haplotypecaller rnaseq genotypegvcfs gvcf Hello, I was wondering if there is a way to output all annotations for all sites when running HaplotypeCaller with BP_RESOLUTION. Currently it outputs all annotations for only called variants. Thanks in advance. Created 2015-06-03 10:22:44 | Updated | Tags: genotypegvcfs combinegvcfs Hi, I would like to be sure of the difference between those 2 tools. From what I understand, GenotypeGVCFs somehow re-calculate likelihood and parameters (QUAL, DP, MQ...) for each variant positions present in at least 1 input sample. Right ? And is it not the case for CombineGVCFs? Thank you for your help, Fabrice Created 2015-06-01 15:12:54 | Updated | Tags: genotypegvcfs I'm running GenotypeGVCF after calling SNPs and Indels through HaplotypeCaller, and my command is as follows: java -Xmx10g -Djava.io.tmpdir=pwd/tmp -jar ./GATK/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ./hg19/ucsc.hg19.fasta \ --variant ./SF-16/concated.g.vcf \ --variant ./SF-17/concated.g.vcf \ --variant ./SF-18/concated.g.vcf \ --variant ./SF-19/concated.g.vcf \ --variant ./SF-20/concated.g.vcf \ -stand_call_conf 30 \ -stand_emit_conf 10 \ -o ./pedigree_multisample.vcf # and the ERROR information is: ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Line 468810703: there aren't enough columns for line chr11 123591281 . A  . . END (we expected 9 tokens, and saw 8 ) ##### ERROR ------------------------------------------------------------------------------------------ ========================================================================== when I checked the position(chr11 123591281) in the five vcf files, there is no abnormal column numbers. ./SF-16/concated.g.vcf: chr11 123591281 . A . . END=123591282 GT:DP:GQ:MIN_DP:PL 0/0:21:60:21:0,60,900 ./SF-17/concated.g.vcf: chr11 123591270 . C . . END=123591292 GT:DP:GQ:MIN_DP:PL 0/0:25:75:25:0,72,787 ./SF-18/concated.g.vcf: chr11 123591279 . T . . END=123591285 GT:DP:GQ:MIN_DP:PL 0/0:23:68:22:0,66,747 ./SF-19/concated.g.vcf: chr11 123591274 . G . . END=123591281 GT:DP:GQ:MIN_DP:PL 0/0:28:78:27:0,72,895 ./SF-20/concated.g.vcf: chr11 123591279 . T . . END=123591287 GT:DP:GQ:MIN_DP:PL 0/0:25:69:24:0,66,990 so I don't know what's wrong with my command or what's wrong with my datas, can you give me some advice ? Created 2015-05-20 13:06:46 | Updated | Tags: haplotypecaller java genotypegvcfs Dear GATK, I used the HaplotypeCaller with "-dcov 500 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000" to produce 60 gvcf files, that worked fine. However, GenotypeGVCFs gets stuck on a position and runs out of memory after about 24hours, even when I allocate 240Gb. Testing a short region of 60kb does not help. Here was my command line: software/jre1.7.0_25/bin/java -Xmx240g -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R Reference.fasta -L chrom14:2240000-2300000 --variant 60samples_gvcf.list -o output.vcf If I split my list of 60 gvcf files into two lists of 30 samples each, GenotypeGVCFs works fine for both batches within 15 minutes (~10Gb of memory). I tested with 47 samples, it took 8 hours (31gb of memory) for a 60kb region. Once I use more than ~55 samples, it takes forever and crashes. Any help will be much appreciated! Thanks, Antoine Created 2015-05-07 09:22:47 | Updated 2015-05-07 10:09:41 | Tags: haplotypecaller format genotypegvcfs I have a potential bug running GATK GenotypeGVCFs. It complains that there is a DP in the INFO field, but in my haplotypecaller-generated -mg.g.vcf.gz's I do not have a DP in the info, I do have DP in the FORMAT field though, but that's present in the headers as shown below the error output. Any idea what could be the problem? INFO 18:30:12,694 HelpFormatter - -------------------------------------------------------------------------------- INFO 18:30:12,698 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-geee94ec, Compiled 2015/03/09 14:27:22 INFO 18:30:12,699 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 18:30:12,699 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 18:30:12,706 HelpFormatter - Program Args: -l INFO -T GenotypeGVCFs -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -o /dev/stdout -ploidy 2 --num_threads 32 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SeqCap/ex100/110624_MM10_exome_L2R_D02_EZ_HX1-ex100.bed --max_alternate_alleles 20 -V:3428_10_14_SRO_185_TGGCTTCA-mg,VCF 3428_10_14_SRO_185_TGGCTTCA-mg.g.vcf.gz -V:3428_11_14_SRO_186_TGGTGGTA-mg,VCF 3428_11_14_SRO_186_TGGTGGTA-mg.g.vcf.gz -V:3428_12_13_SRO_422_TTCACGCA-mg,VCF 3428_12_13_SRO_422_TTCACGCA-mg.g.vcf.gz -V:3428_13_13_SRO_492_AACTCACC-mg,VCF 3428_13_13_SRO_492_AACTCACC-mg.g.vcf.gz -V:3428_14_13_SRO_493_AAGAGATC-mg,VCF 3428_14_13_SRO_493_AAGAGATC-mg.g.vcf.gz -V:3428_15_14_SRO_209_AAGGACAC-mg,VCF 3428_15_14_SRO_209_AAGGACAC-mg.g.vcf.gz -V:3428_16_14_SRO_218_AATCCGTC-mg,VCF 3428_16_14_SRO_218_AATCCGTC-mg.g.vcf.gz -V:3428_17_14_SRO_201_AATGTTGC-mg,VCF 3428_17_14_SRO_201_AATGTTGC-mg.g.vcf.gz -V:3428_18_13_SRO_416_ACACGACC-mg,VCF 3428_18_13_SRO_416_ACACGACC-mg.g.vcf.gz -V:3428_19_14_SRO_66_ACAGATTC-mg,VCF 3428_19_14_SRO_66_ACAGATTC-mg.g.vcf.gz -V:3428_1_13_SRO_388_GTCGTAGA-mg,VCF 3428_1_13_SRO_388_GTCGTAGA-mg.g.vcf.gz -V:3428_20_14_SRO_68_AGATGTAC-mg,VCF 3428_20_14_SRO_68_AGATGTAC-mg.g.vcf.gz -V:3428_21_14_SRO_210_AGCACCTC-mg,VCF 3428_21_14_SRO_210_AGCACCTC-mg.g.vcf.gz -V:3428_22_14_SRO_256_AGCCATGC-mg,VCF 3428_22_14_SRO_256_AGCCATGC-mg.g.vcf.gz -V:3428_23_14_SRO_270_AGGCTAAC-mg,VCF 3428_23_14_SRO_270_AGGCTAAC-mg.g.vcf.gz -V:3428_24_13_SRO_452_ATAGCGAC-mg,VCF 3428_24_13_SRO_452_ATAGCGAC-mg.g.vcf.gz -V:3428_2_13_SRO_399_GTCTGTCA-mg,VCF 3428_2_13_SRO_399_GTCTGTCA-mg.g.vcf.gz -V:3428_3_13_SRO_461_GTGTTCTA-mg,VCF 3428_3_13_SRO_461_GTGTTCTA-mg.g.vcf.gz -V:3428_4_13_SRO_462_TAGGATGA-mg,VCF 3428_4_13_SRO_462_TAGGATGA-mg.g.vcf.gz -V:3428_5_13_SRO_465_TATCAGCA-mg,VCF 3428_5_13_SRO_465_TATCAGCA-mg.g.vcf.gz -V:3428_6_13_SRO_402_TCCGTCTA-mg,VCF 3428_6_13_SRO_402_TCCGTCTA-mg.g.vcf.gz -V:3428_7_13_SRO_474_TCTTCACA-mg,VCF 3428_7_13_SRO_474_TCTTCACA-mg.g.vcf.gz -V:3428_8_13_SRO_531_TGAAGAGA-mg,VCF 3428_8_13_SRO_531_TGAAGAGA-mg.g.vcf.gz -V:3428_9_14_SRO_166_TGGAACAA-mg,VCF 3428_9_14_SRO_166_TGGAACAA-mg.g.vcf.gz INFO 18:30:12,714 HelpFormatter - Executing as roel@utonium.nki.nl on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13. INFO 18:30:12,714 HelpFormatter - Date/Time: 2015/05/06 18:30:12 INFO 18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- INFO 18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- INFO 18:30:15,963 GenomeAnalysisEngine - Strictness is SILENT INFO 18:30:16,109 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 18:30:29,705 IntervalUtils - Processing 101539431 bp from intervals WARN 18:30:29,726 IndexDictionaryUtils - Track 3428_10_14_SRO_185_TGGCTTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,727 IndexDictionaryUtils - Track 3428_11_14_SRO_186_TGGTGGTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,727 IndexDictionaryUtils - Track 3428_12_13_SRO_422_TTCACGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,728 IndexDictionaryUtils - Track 3428_13_13_SRO_492_AACTCACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,728 IndexDictionaryUtils - Track 3428_14_13_SRO_493_AAGAGATC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,728 IndexDictionaryUtils - Track 3428_15_14_SRO_209_AAGGACAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,729 IndexDictionaryUtils - Track 3428_16_14_SRO_218_AATCCGTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,729 IndexDictionaryUtils - Track 3428_17_14_SRO_201_AATGTTGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,730 IndexDictionaryUtils - Track 3428_18_13_SRO_416_ACACGACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,730 IndexDictionaryUtils - Track 3428_19_14_SRO_66_ACAGATTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,730 IndexDictionaryUtils - Track 3428_1_13_SRO_388_GTCGTAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,731 IndexDictionaryUtils - Track 3428_20_14_SRO_68_AGATGTAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,731 IndexDictionaryUtils - Track 3428_21_14_SRO_210_AGCACCTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,731 IndexDictionaryUtils - Track 3428_22_14_SRO_256_AGCCATGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,732 IndexDictionaryUtils - Track 3428_23_14_SRO_270_AGGCTAAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,732 IndexDictionaryUtils - Track 3428_24_13_SRO_452_ATAGCGAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,732 IndexDictionaryUtils - Track 3428_2_13_SRO_399_GTCTGTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,733 IndexDictionaryUtils - Track 3428_3_13_SRO_461_GTGTTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,733 IndexDictionaryUtils - Track 3428_4_13_SRO_462_TAGGATGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,733 IndexDictionaryUtils - Track 3428_5_13_SRO_465_TATCAGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,734 IndexDictionaryUtils - Track 3428_6_13_SRO_402_TCCGTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,734 IndexDictionaryUtils - Track 3428_7_13_SRO_474_TCTTCACA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,734 IndexDictionaryUtils - Track 3428_8_13_SRO_531_TGAAGAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation WARN 18:30:29,735 IndexDictionaryUtils - Track 3428_9_14_SRO_166_TGGAACAA-mg doesn't have a sequence dictionary built in, skipping dictionary validation INFO 18:30:29,749 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine INFO 18:30:29,878 GenomeAnalysisEngine - Preparing for traversal INFO 18:30:29,963 GenomeAnalysisEngine - Done preparing for traversal INFO 18:30:29,964 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 18:30:29,965 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 18:30:29,966 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 18:30:30,562 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 18:31:00,420 ProgressMeter - 1:4845033 0.0 30.0 s 50.3 w 0.0% 46.7 h 46.7 h ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalStateException: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:176) at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115) at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:221) at org.broadinstitute.gatk.engine.io.storage.VariantContextWriterStorage.add(VariantContextWriterStorage.java:182) at org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub.add(VariantContextWriterStub.java:269) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:351) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:119) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:279) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR
##### ERROR MESSAGE: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
##### ERROR ------------------------------------------------------------------------------------------

for f in *.g.vcf.gz; do echo -e "\n-- $f --"; zcat "$f" | sed -n -r "/^#.*DP/p;/^1\t4839315\t/{p;q;}"; done

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:21:0,0,432
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:20:0:20:0,0,410
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,773
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:25:2:25:0,3,790
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,837
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:23:31:23:0,31,765
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-0.578;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:21:39:0,39,488,60,491,512:20,0,0,0
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:18:0:18:0,0,514
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,810
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,812
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:25:0,0,624
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0.08    .       ClippingRankSum=-0.189;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:17:20:20,0,311,62,320,382:14,0,3,0
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     6.76    .       ClippingRankSum=-0.374;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:25:43:43,0,401,102,417,519:20,0,3,2
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-1.095;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:23:1:0,1,395,56,406,460:19,0,0,0
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:28:0,0,626
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     5.99    .       ClippingRankSum=-0.584;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:18:42:42,0,293,84,305,388:13,1,3,1
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:22:0,0,558
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       GA,<NON_REF>    6.76    .       ClippingRankSum=0.850;MLEAC=1,0;MLEAF=0.500,0.00        GT:DP:GQ:PL:SB  0/1:19:43:43,0,262,87,274,361:12,3,4,0
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GA      G,<NON_REF>     16.82   .       ClippingRankSum=-0.784;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:21:54:54,0,352,102,367,470:16,0,4,1
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:26:0:25:0,0,419
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:30:0:30:0,0,771
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:34:77:34:0,78,1136
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:26:0:20:0,0,397
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       GAA     G,GA,<NON_REF>  22.75   .       ClippingRankSum=-2.181;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00        GT:DP:GQ:PL:SB  0/2:11:22:60,22,209,0,87,104,63,153,113,176:4,2,3,0

Created 2015-04-09 18:55:24 | Updated | Tags: allelebalancebysample variantannotator genotypegvcfs strandbiasbysample

I am struggling with adding AlleleBalanceBySample and StrandBiasBySample to my joint VCF file. The data through joint VCF were generated with GATK 3.2-2, and I have tried using the VariantAnnotator as well as repeating the GenotypeGVCF step specifying the annotations. I have also tried using the GenotypeGVCFs tool from 3.3-0, but am trying to avoid running the HC again in the interest of time. For the strand bias by sample I can see why the VariantAnnotator approach would fail, as the information is not contained in the joint VCF file, but it is there in the GVCF. For the allele balance by sample it seems the info is already in the joint VCF, so either VariantAnnotation or rerunning GenotypeGVCFs seem like they should work.

I should add that when I do try to add these annotations using GenotypeGVCFs, they are noted in the header ##FORMAT section, but not in the actual F:O:R:M:A:T.

Any clues or hints (or admonitions!) much appreciated.

-erikt

Created 2015-03-26 13:23:26 | Updated | Tags: genotypegvcfs

I'm trying to use genotypeGCFs on 350 small gVCF files (from bacterial genomes) and I'm getting this "clock drift" error:

INFO 13:02:46,006 ProgressMeter - chr1:44001 0.0 3.0 h 15250.3 w 2.0% 6.3 d 6.2 d INFO 13:06:30,421 ProgressMeter - chr1:44001 0.0 3.0 h 15250.3 w 2.0% 6.4 d 6.3 d INFO 14:38:28,521 ProgressMeter - chr1:44001 0.0 3.1 h 15250.3 w 2.0% 6.6 d 6.5 d WARN 16:24:14,924 SimpleTimer - Clock drift of -1,425,286,806,691,525,163 - -1,425,286,778,358,860,637 = 28,332,664,526 nanoseconds detected, vs. max allowable drift of 5,000,000,000. Assuming checkpoint/restart event. WARN 18:23:08,307 SimpleTimer - Clock drift of -1,425,286,778,358,827,793 - -1,425,286,806,691,525,163 = 28,332,697,370 nanoseconds detected, vs. max allowable drift of 5,000,000,000. Assuming checkpoint/restart event. INFO 22:55:24,502 ProgressMeter - chr1:44001 0.0 0.0 s 0.0 s 2.0% 0.0 s 0.0 s INFO 02:46:17,060 ProgressMeter - chr1:44001 0.0 14.1 h 15250.3 w 2.0% 4.3 w 4.2 w

Any ideas what that is?

Created 2015-03-26 08:10:25 | Updated | Tags: genotypegvcfs

Hi there!

i would like to run the whole pipeline on one sample for testing the pipeline. here are some wrong in the genotyping step.

commands are:

# Step9. Call Variants

java -Xmx5g -jar $gatk -T HaplotypeCaller -R$ref_dir/ucsc.hg19.fasta -I bam_dir/T-SZ-03-1.dedupped.realigned.recal.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -oother_dir/sample.raw.snps.indels.vcf

# Joint Genotyping

java -Xmx5g -jar $gatk -T GenotypeGVCFs -R$ref_dir/ucsc.hg19.fasta --variant $other_dir/sample.raw.snps.indels.vcf -o$other_dir/test.raw.snps.indels.g.vcf

Error in GenotypeGVCFs :

INFO 09:25:33,478 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:25:33,480 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 09:25:33,480 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 09:25:33,481 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 09:25:33,484 HelpFormatter - Program Args: -T GenotypeGVCFs -R /media/LAB636/01.06.job/Ref/ucsc.hg19.fasta --variant /media/LAB636/01.06.job/test/data/other/sample.raw.snps.indels.vcf -o /media/LAB636/01.06.job/test/data/other/test.raw.snps.indels.g.vcf INFO 09:25:33,486 HelpFormatter - Executing as starlcx@starlcx-desktop on Linux 2.6.32-38-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13. INFO 09:25:33,486 HelpFormatter - Date/Time: 2015/03/26 09:25:33 INFO 09:25:33,487 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:25:33,487 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:25:33,823 GenomeAnalysisEngine - Strictness is SILENT INFO 09:25:33,893 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 09:25:34,936 GenomeAnalysisEngine - Preparing for traversal INFO 09:25:34,943 GenomeAnalysisEngine - Done preparing for traversal INFO 09:25:34,944 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 09:25:34,944 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 09:25:34,945 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 09:25:35,005 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files

##### ERROR ------------------------------------------------------------------------------------------

there is my confusion：

1. As you know, I run the pipeline on just one sample. Whether do I need to do joint genotyping after call variants with HC ?
2. The documents here mentions : Genotypes any number of gVCF files that were produced by the Haplotype Caller into a single joint VCF file. So is this means the input here should be gVCF files? But the output of HC are vcf file. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php 3.what leads to the error message above?

Thanks! :)

Created 2015-03-21 09:14:04 | Updated | Tags: genotypegvcfs

Hi Team,
I'm getting WARN  21:19:30,478 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation when processing gzipped g.vcf files produced by HaplotypeCaller (via -o foo.g.vcf.gz, as suggested by @Geraldine_VdAuwera in blog post 3893) with GenotypeGVCFs.
This results in dramatic increases in run time (makes sense if GenotypeGVCFs un-compresses the files), and memory requirements (why ??) for GenotypeGVCFs compared to processing the gvcf for same bam files if HC outfiles are unzipped. Most batches that previously completed with 4x8GB RAM now produce java.lang.OutOfMemoryError: Java heap space errors even with 4X64GB!

Could you please advise whether this warning is expected behaviour? If yes, what exactly is missing (can't see much difference in unzipped vs gzipped vcf headers), and can this be added somehow?

Created 2015-03-20 19:25:30 | Updated | Tags: genotypegvcfs multi-allelic

In the output of GenotypeGVCFs, are the ALT alleles in those multi-allelic sites ordered by descending frequencies?

Created 2015-03-19 14:50:22 | Updated | Tags: haplotypecaller genotypegvcfs

Hello!

I would like to run GenotypeGVCFs on 209 WES, called with HC (--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000).

When I run GenotypeGVCFs, with this command (computing nodes have 8 cores and 24G of memory) :

java -Xmx24g -jar GATK_JAR \ -R Homo_sapiens.GRCh37_decoy.fa \ -T GenotypeGVCFs \ -nt 8 \ -V gvcf.all.list \ -o calls.vcf It estimates a huge runtime and just dies hanging: INFO 10:27:00,790 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:27:00,795 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 10:27:00,796 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 10:27:00,796 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 10:27:00,800 HelpFormatter - Program Args: -R Homo_sapiens.GRCh37_decoy.fa -T GenotypeGVCFs -nt 8 -V gvcf.all.list -o calls.vcf INFO 10:27:00,810 HelpFormatter - Executing as emixaM@r107-n50 on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-ea-b07. INFO 10:27:00,810 HelpFormatter - Date/Time: 2015/03/19 10:27:00 INFO 10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:27:04,719 GenomeAnalysisEngine - Strictness is SILENT INFO 10:27:04,882 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 10:27:41,565 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 8 processors available on this machine INFO 10:27:43,169 GenomeAnalysisEngine - Preparing for traversal INFO 10:27:43,179 GenomeAnalysisEngine - Done preparing for traversal INFO 10:27:43,179 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:27:43,180 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:27:43,180 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 10:27:44,216 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 10:28:15,035 ProgressMeter - 1:1000201 0.0 31.0 s 52.7 w 0.0% 27.0 h 27.0 h INFO 10:29:17,386 ProgressMeter - 1:1068701 0.0 94.0 s 155.8 w 0.0% 76.7 h 76.6 h INFO 10:30:18,055 ProgressMeter - 1:1115101 0.0 2.6 m 256.1 w 0.0% 5.0 d 5.0 d  What did I do wrong? Cheers! Created 2015-03-10 18:27:56 | Updated | Tags: haplotypecaller genotypegvcfs gvcf I run the following command for "GenotypeGVCFs" for 3 VCF files output of HaplotypeCaller as below: java data/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \ -R data/ucsc.hg19.fasta \ -T GenotypeGVCFs \ --variant data/47V_post.ERC.vcf \ --variant data/48V_post.ERC.vcf \ --variant data/49V_post.ERC.vcf \ --out data/Combined_geno_3files.vcf but in a final VCF output there is no rsID information and all rows are "." what is the problem? I am really confused. Could you please advise how to get SNP-ID in the output VCF Thanks Created 2015-03-09 16:48:43 | Updated | Tags: dp gt genotypegvcfs Hi, Sometimes when a sample has no reads, it still can have a genotype. I used GenotypeGVCFs (without CombineGVCFs) to generate a vcf with several samples. If I extract data for one sample, I sometimes see missing genotype with no read : GT:AD:DP:GQ:PGT:PID:PL ./.:0,0:0:.:.:.:.  which seem ok to me. Note that DP = 0. but more rarely, I see a genotype with no read : GT:AD:DP:GQ:PGT:PID:PL 0/0:0,0:.:3:.:.:0,3,45 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,0:.:3:1|1:32486973_A_AG:45,3,0  and this is more questionnable. Note that DP = "." Can you explain how genotype can be called without reads ? (I suspect haplotype to be involved, but without read I do not understand) any idea about the DP value ? why has it a value with missing genotype ? thank you. Created 2015-03-02 11:38:32 | Updated | Tags: genotypegvcfs strandallelecountsbysample Dear GATK team: Here is just a quick follow up question on the SAC (StrandAlleleCountsBySample) annotation. I've successfully get the SAC annotation using GenotypeGVCFs with the nighty build, but there is something I notice to be a little wired. This is a line from the final vcf i got from running GenotypeGVCFs. Really glad that the SAC annotation is there, but in this example, for a biallelic site, is it right to have the SAC annotation with 3 sets of counts for "reference/first_allele/second_allele" while there is actually no second alternative allele? Or is it have something to do with the allele from GVCF format. Many thanks! The example: 1 752894 rs3131971 T C 30876.92 . BaseQRankSum=-7.360e-01;DB;DP=1430;MLEAC=589;MLEAF=0.785;MQ=32.47;MQ0=0;MQRankSum=0.727;ReadPosRankSum=-3.580e-01;SOR=4.437 GT:AD:DP:GQ:PGT:PID:PL:SAC 0/1:2,4:.:54:.:.:118,0,54:0,2,0,4,0,0 ./.:2,0:2 0/1:2,4:.:53:.:.:119,0,53:0,2,0,4,0,0 1/1:0,2:.:6:.:.:55,6,0:0,0,0,2,0,0 BTW, if I have a follow up question on a previous one I've asked, should I just reply to the old question or open a new one like this? Thanks again! Created 2015-03-02 09:39:51 | Updated | Tags: haplotypecaller genotypegvcfs Hi GATK team, In the documentation of GenotypeGVCFs it is writen: "Input - One or more Haplotype Caller gVCFs to genotype" I have 3 questions regarding this tool: 1. I wonder, what is the meaning of running it on one sample? 2. I tried to run it on one sample, and noticed that the genotype quality is different than the one in the original gvcf file from HC. What is causing this difference? I'm asking this since running the tool on one sample means that there are no other samples to consider in recalculating the quality. 3. Last question - From your experience, what is the best way to analyze one exome sample? Should I run HC with the default genotype_mode parameter and do hard filtering? Should I run HC in GVCF mode, run GenotypeGVCFs and than do hard filtering? Any other suggestion? Thank you for the answer, Maya Created 2015-03-01 10:46:23 | Updated | Tags: genotypegvcfs strandbiasbysample strandallelecountsbysample Hi there, Thank you guys for developing & maintaining such a good software! I'm actually just starting to use GATK with Haplotypecaller. For me, there's a very useful annotation "StrandBiasBySample" or "StrandAlleleCountsBySample" that gives all the alignment information for a variant. It can be generated by HC, but after joint genotyping, it's just gone. There is a previous post asking about how to add this annotation to the final vcf, and Sheila mentioned she's already put in a request for adding this. So I'm just wandering, have you guys fixed this in one of the nighty build? Many Thanks!!! Created 2015-02-25 10:31:07 | Updated | Tags: genotypegvcfs combinegvcfs Hi, In the documentation for CombineGVCFs it says: CombineGVCFs is meant to be used for hierarchical merging of gVCFs that will eventually be input into GenotypeGVCFs. One would use this tool when needing to genotype too large a number of individual gVCFs; instead of passing them all in to GenotypeGVCFs, one would first use CombineGVCFs on smaller batches of samples and then pass these combined gVCFs to GenotypeGVCFs. Do you have any guidelines for this? I am trying to use genotypeGVCFS on 12 gVCF files and it doesn't work, so can you advise how I should I "pre-merge" them? Two batches of 6? Three batches of 3? Thanks, Mike Created 2015-02-19 14:29:26 | Updated | Tags: haplotypecaller genotypegvcfs Hi Sheila and Geraldine When I run HaplotypeCaller (v3.3-0-g37228af) in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF I get the error: Invalid command line: Argument ERC/gt_mode has a bad value: you cannot request reference confidence output and GENOTYPE_GIVEN_ALLELES at the same time It is however strange that GENOTYPE_GIVEN_ALLELES is mentioned in the --max_alternate_alleles section of the GenotypeGVCFs documentation. Maybe I'm missing something? Thanks, Gerrit Created 2015-02-04 08:52:18 | Updated 2015-02-04 08:53:33 | Tags: haplotypecaller illumina genotypegvcfs combinegvcfs Hi GATK-ers, I have been given ~2000 gVCFs generated by Illumina (one sample per gVCF). Though they are in standard gVCF format, they were generated by an Illumina pipeline (https://support.basespace.illumina.com/knowledgebase/articles/147078-gvcf-file if you're really curious) and not the Haplotype Caller. As a result (I think ... ), the GATK doesn't want to process them (I have tried CombineGVCFs and GenotypeGVCFs to no avail). Is there a GATK walker or some other tool that will make my gVCFs GATK-friendly? I need to be able to merge this data together to make it analyze-able because in single-sample VCF format it's pretty useless at the moment. My only other thought has been to expand all the ref blocks of data and then merge everything together, but this seems like it will result in the creation of a massive amount of data. Any suggestions you may have are greatly appreciated!!! Sara Created 2015-02-02 21:24:31 | Updated | Tags: vqsr dbsnp vqslod genotypegvcfs gvcf From my whole-genome (human) BAM files, I want to obtain: For each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data. Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs: CHROM POS ID REF ALT chr1 1 . A # my data chr1 1 . A T # dbSNP I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T. or T/T. The call of isn't useful to me for the reason explained below. How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller. My detailed reason for needing this is below: For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2. 3. We do not have sufficient evidence to make any call for position n. I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not. Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs with associated GQs and PLs, and GT=./.. However, GenotypeGVCF removes all of these, meaning that whenever the call by HaplotypeCaller was ./. (due to lack of evidence for variation), it isn't carried forward for use in VQSR. Consequently, this seems to distinguish only between these two cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown). This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know". Thanks in advance! Douglas Created 2015-01-23 16:55:57 | Updated | Tags: vqsr haplotypecaller bam gatk genotypegvcfs variant-calling Hi, I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately. Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of? Thanks very much in advance, Alva Created 2015-01-13 12:40:29 | Updated 2015-01-13 12:44:57 | Tags: genotypegvcfs I am seeing an unexpected heterozygous genotype in the GenotypeGVCFs output. The region is at the edge of a long homopolymer, which presumably plays a role, but it is still not clear where the unexpected heterozygote is coming from. The final VCF is below. The unexpected call is 2:152402516A>T in Sample1. The call is 0/1, but I don't see any evidence for where that alternate allele is coming from. The input VCFs are below. Any suggestions? How is GenotypeGVCFs interpreting this situation? I am running the latest GATK 3.3. Jar. Final VCF #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 2 152402515 . T TA 86.23 . AC=2;AF=0.500;AN=4;BaseQRankSum=-5.700e-02;ClippingRankSum=0.321;DP=131;FS=0.000;GQ_MEAN=166.50;GQ_STDDEV=44.55;MLEAC=1;MLEAF=0.250;MQ=59.89;MQ0=0;MQRankSum=2.42;NCC=0;QD=1.31;ReadPosRankSum=0.649 GT:AD:DP:GQ:PL 0/1:10,8:34:99:203,0,198 0/1:37,11:56:99:135,0,1429 2 152402516 . A T 380.44 . AC=2;AF=0.500;AN=4;BaseQRankSum=-3.647e+00;ClippingRankSum=-2.597e+00;DP=130;FS=3.866;GQ_MEAN=204.50;GQ_STDDEV=136.47;MLEAC=2;MLEAF=0.500;MQ=59.73;MQ0=0;MQRankSum=0.754;NCC=0;QD=5.76;ReadPosRankSum=1.43 GT:AD:DP:GQ:PL 0/1:10,0:34:99:108,0,157 0/1:43,23:66:99:301,0,1142 Sample1 VCF #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 2 152402511 . G <NON_REF> . . END=152402514 GT:DP:GQ:MIN_DP:PL 0/0:38:57:33:0,51,765 2 152402515 . TA T,TAA,TAAA,TAAAA,<NON_REF> 0.10 . BaseQRankSum=-0.057;ClippingRankSum=0.321;DP=55;MLEAC=0,0,1,0,0;MLEAF=0.00,0.00,0.500,0.00,0.00;MQ=59.89;MQ0=0;MQRankSum=-1.002;ReadPosRankSum=-0.646 GT:AD:DP:GQ:PL:SB 0/3:10,8,8,7,1,0:34:21:224,189,625,21,87,219,0,109,141,388,143,293,250,447,708,116,220,156,201,330,273:0,10,0,7 2 152402517 . A <NON_REF> . . END=152402517 GT:DP:GQ:MIN_DP:PL 0/0:55:0:55:0,0,1431 2 152402518 . A <NON_REF> . . END=152402621 GT:DP:GQ:MIN_DP:PL 0/0:49:99:37:0,101,1442 Sample2 VCF #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample2 2 152402513 . C CT,<NON_REF> 0 . BaseQRankSum=-0.168;ClippingRankSum=0.437;DP=48;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.772;ReadPosRankSum=0.034 GT:AD:DP:GQ:PL:SB 0/0:35,2,0:37:99:0,103,1228,143,1234,1275:6,29,0,0 2 152402514 . T <NON_REF> . . END=152402514 GT:DP:GQ:MIN_DP:PL 0/0:63:99:63:0,114,1710 2 152402515 . TA T,TAA,TAAA,<NON_REF> 78.73 . BaseQRankSum=-1.332;ClippingRankSum=-0.242;DP=76;MLEAC=0,1,0,0;MLEAF=0.00,0.500,0.00,0.00;MQ=59.74;MQ0=0;MQRankSum=2.423;ReadPosRankSum=0.649 GT:AD:DP:GQ:PL:SB 0/2:37,4,11,4,0:56:99:135,256,1876,0,804,1429,116,1061,1435,1759,157,933,805,963,890:2,35,0,11 2 152402516 rs199791504 A T,<NON_REF> 272.77 . BaseQRankSum=-3.647;ClippingRankSum=-2.597;DB;DP=75;MLEAC=1,0;MLEAF=0.500,0.00;MQ=59.73;MQ0=0;MQRankSum=0.754;ReadPosRankSum=1.426GT:AD:DP:GQ:PL:SB 0/1:43,23,0:66:99:301,0,1142,429,1210,1639:5,38,1,22 2 152402517 . A T,<NON_REF> 0 . BaseQRankSum=-0.810;ClippingRankSum=-0.661;DP=79;MLEAC=0,0;MLEAF=0.00,0.00;MQ=59.75;MQ0=0;MQRankSum=-1.160;ReadPosRankSum=1.259 GT:AD:DP:GQ:PL:SB 0/0:67,4,0:71:99:0,122,2156,201,2168,2247:6,61,0,0 2 152402518 . A <NON_REF> . . END=152402621 GT:DP:GQ:MIN_DP:PL 0/0:67:99:48:0,105,1710 Created 2015-01-07 21:16:08 | Updated | Tags: haplotypecaller genotypegvcfs Hi I am using GATK version 3.3 (latest release) and following the best practice for variant calling. Steps i am using is 1. Run haplotye caller and get the gVCF file 2. Run GenotypeGVCF across the samples to the get the VCF file The Question is about the Genotype calling. Here is an example: GVCF file looks like this for a particular position chr1 14677 . G . . END=14677 GT:DP:GQ:MIN_DP:PL 0/0:9:0:9:0,0,203 VCF file looks like this after genotyping with multiple samples. chr1 14677 . G A 76.56 PASS . . GT:AD:DP:GQ:PL 0/1:8,1:9:4:4,0,180 In GVCF file there was no support for the alternate allele, but VCF file shows that there are 1 read supporting the alternate call. Thanks ! Saurabh Created 2014-12-16 14:44:31 | Updated | Tags: java genotypegvcfs bcftools Hi, after running 5 exomes with GATK-v3.3 and HaplotypeCaller, I encountered a very low titv ration in my samples (~2.1) as VaraintEval report indicated. I tried running varaint filtration in these samples but I didn't see any imporvement in titv ratio nor any filtering done. therefore I filtered these with bcftools, after which the titv ratio improved to 2.5. Then when I tried running GenotypeGVCFs on these samples filtered with bcftools, I encountered the following error: ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double at java.lang.Double.compareTo(Double.java:49) at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:290) at java.util.ComparableTimSort.sort(ComparableTimSort.java:157) at java.util.ComparableTimSort.sort(ComparableTimSort.java:146) at java.util.Arrays.sort(Arrays.java:472) at java.util.Collections.sort(Collections.java:155) at org.broadinstitute.gatk.utils.MathUtils.median(MathUtils.java:999) at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.combineAnnotationValues(ReferenceConfidenceVariantContextMerger.java:73) at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:158) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:202) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:121) at org.broadinstitute.gatk.engine.traversals.TraverseLociNanoTraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:310) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-11-17-g58cfab1): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: java.lang.Integer cannot be cast to java.lang.Double ##### ERROR ------------------------------------------------------------------------------------------ any advice on solving this incident will be much appreciated Victoria Created 2014-12-15 10:17:16 | Updated | Tags: ploidy genotypegvcfs Hello. I've run Haplotype caller for 19 samples in -ploidy 19 (19 is as high as I can get without getting an error for that, but that is a discussion opened elsewhere). Then, I've tried GenotypeGVCFs and get an ERROR (both command and ERROR below). Just to add, I've run both HaplotypeCaller and GenotypeGVCFs with and without -maxAltAlleles 4 to try to limit the referred combination; however, the output is exactly the same ERROR message at the end of GenotypeGVCFs. Do you think there might be a bug or am I doing something wrong? java -Djava.io.tmpdir=$mytmp -Xmx28g -jar GenomeAnalysisTK.jar -R $ref -T GenotypeGVCFs -o$gVCF.ploidy.vcf -V ploidy.list -ploidy 19 -maxAltAlleles 4

ERROR MESSAGE: the combination of ploidy (19) and number of alleles (21) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyz e this locus

Created 2014-12-12 12:53:50 | Updated | Tags: ad dp genotypegvcfs

Hi,

I am running the GATK 3.2-2 pipeline and I am getting some odd results from the genotype VCF stage. It appears that incorrect AD results are being output.

E.g. on the single sample gvcf I have:

This agrees with what I see in IGV for this site (see attached image).

However, when I joint call this site along with other gvcfs, the output for that sample looks like:

0/1:20,0:38:99:560,0,633

i.e. It appears to not output the expected 20,18 - but it still seems to be making the correct genotype call so I am assuming it is just an output bug and not something that is affecting calling, but it would be good to know for sure. Either way it is problematic as it is affecting our downstream depth filtering. This isn't the only example of this happening in our data. I am trying to identify more by looking for occasions where the AD fields sum to less than DP, am I right in thinking this should never be the case as AD is unfiltered compared to the filtered DP?

Thanks a lot

Dan

Created 2014-12-09 21:47:52 | Updated | Tags: vcf genotypegvcfs gvcf variant-calling

Hi,

I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:

16050036 16050612 16050822 16050933 16051556 16051968 16051994 16052080 16052167 16052239 16052250 16052357 etc.

whereas the initial set of sites in my final vcf file is:

16050822 16050933 16051347 16051497 16051556 16051968 16051994 16052080 16052167 16052169 16052239 16052357 etc.

First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.

I also got this warning 3 times when running GenotypeGVCFs: WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).

FYI, this is the command I used for GenotypeGVCFs: java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22 (only running this for chr22)

Created 2014-12-03 16:57:35 | Updated | Tags: haplotypecaller genotypegvcfs variant-calling

If I want the variants to be called only if they fit the following criteria:

1) Min. total coverage for consideration of heterozygous is 10.

2) Min. coverage of each of the two observed major basecalls to be called heterozygous is 5.

3) Min. percentage of each of the two observed major basecalls in order to be called heterozygous is 20.

4) Min. coverage in order for a position to be called homozygous is 6.

which command-line arguments in which tools (HaplotpyeCaller or GenotypeGVCFs) can I use to accomplish these? I cannot seem to find the proper arguments in the documentation. I apologize if I overlook.

Thank you

Created 2014-12-02 11:21:53 | Updated 2014-12-02 11:23:39 | Tags: genotypegvcfs

Hi, I'm using GATKv3.3 to run the protocol of HC followed by CombineGVCFs in six batches of ~190 individuals each and finally GenotypeGVCFs of these six gVCFs. My failed command is below followed by the error message. In addition, I can tell you that my output file (1135g_.vcf) is empty even when it crashed after more than two hours of running time.

Kindly, Alberto

java -Djava.io.tmpdir=$mytmp -Xmx232g -jar$EBROOTGATK/GenomeAnalysisTK.jar -R $ref -T GenotypeGVCFs -nt 40 -o 1135g_.vcf -V 1135g_listof.list ##### ERROR ------------------------------------------------------------------------------------------ ERROR stack trace java.lang.ArrayIndexOutOfBoundsException: 10000 at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.generatePL(ReferenceConfidenceVariantContextMerger.java:357) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.mergeRefConfidenceGenotypes(ReferenceConfidenceVariantContextMerger.jav a:331) at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:134) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:200) at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:119) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutorWorker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) ERROR ------------------------------------------------------------------------------------------ ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ERROR ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ERROR If not, please post the error message, with stack trace, to the GATK forum. ERROR Visit our website and forum for extensive documentation and answers to ERROR commonly asked questions http://www.broadinstitute.org/gatk ERROR ERROR MESSAGE: 10000 ERROR ------------------------------------------------------------------------------------------ Created 2014-11-26 22:46:02 | Updated | Tags: haplotypecaller genotypegvcfs Hi, I just finished running HaplotypeCaller version 3.3-0 separately on 6 exome samples with the new best practices. java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19/hg19_Ordered.fa -I K87/HG19_Analysis/K87-929_final.recalibrated_final.bam --dbsnp dbsnp_138_hg19_Ordered.vcf --pair_hmm_implementation VECTOR_LOGLESS_CACHING -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --output_mode EMIT_VARIANTS_ONLY -gt_mode DISCOVERY --pcr_indel_model CONSERVATIVE -o ./Haplotypes_929.vcf Many Variant sites are called as homozygous alt (1/1), but none of these sites that are processed to infer haplotype are called as homozygous alt in their PGT field, they are all called as hets, PGT=0|1. for example: GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,29,0:29:93:0|1:121483392_C_G:1331,93,0,1331,93,1331:0,0,13,16 The allelic depths agree with the phased genotype but out of all 6 exomes processed, not a single 1/1 is also phased as 1|1. I checked all output vcfs with a simple grep combo: grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '0|1' - | wc -l = 19046 grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|0' - | wc -l = 79 grep 'PGT' Haplotypes_929.vcf | grep '1/1' - | grep '1|1' - | wc -l = 0 This seemed odd, but I continued with GenotyeGVCF: java -Xmx32g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg19/hg19_Ordered.fa -V Haplotypes_450.vcf -V Haplotypes_452.vcf -V Haplotypes_925.vcf -V Haplotypes_926.vcf -V Haplotypes_927.vcf -V Haplotypes_929.vcf -D dbsnp_138_hg19_Ordered.vcf -ped K87/HG19_Analysis/K87_6.ped -o Haplotypes_K87_GVCFs.vcf I'm looking at the output vcf as it's being generated and now there are homozygous alt calls but they conflict with the associated Allelic Depths: .... GT:AD:DP:GQ:PGT:PID:PL .... 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 ..... Full Line: chr1 33957152 rs4403594 T G 3166.96 . AC=12;AF=1.00;AN=12;DB;DP=99;FS=0.000;GQ_MEAN=48.50;GQ_STDDEV=27.55;MLEAC=12;MLEAF=1.00;MQ=39.65;MQ0=0;NCC=0;QD=32.32;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,9:9:27:.:.:330,27,0 1/1:0,5:5:15:.:.:141,15,0 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 1/1:0,20:20:60:.:.:722,60,0 1/1:0,24:24:71:.:.:685,71,0 1/1:0,11:11:33:.:.:366,33,0 Can you help me interpret what seems to me as conflicting results? Cheers, Patrick Created 2014-11-20 06:39:53 | Updated | Tags: combinevariants genotypegvcfs gvcftools Hi, I am merging several genome vcf files apparently created by gvcftools [1]. I noticed that CombineVariants is incorrectly summing allele counts (AC) in some corner cases, and was wondering if the format [2] was supported? If the format should be supported by the CombineVariants tool (gvcf is VCF 4.1 conformant), I have isolated a minimal example that should be of help for bugfixing. Thanks, Paweł Created 2014-11-17 11:59:46 | Updated | Tags: haplotypecaller genotypegvcfs combinegvcfs Hi all, I am attempting to use the HaplotyperCaller / CombineGVCFs / GenotypeGVCFs to call variants on chrom X and Y of 769 samples (356 males, 413 females) sequenced at 12x coverage (WG sequening, but right not only calling X and Y). I have called the samples according to the best practises using the HaplotypeCaller, using ploidy = 1 for males on X and Y and ploidy =2 for females on X, e.g.: INFO 16:28:45,750 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T HaplotypeCaller -L X -ploidy 1 -minPruning 3 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I /target/gpfs2/gcc/groups/gonl/projects/trio-analysis/rawdata_release2/A102.human_g1k_v37.trio_realigned.bam --sample_name A102a -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/A102a.chrX.hc.g.vcf Then I have used CombineGVCFs to combine my samples in batches of 100 samples. Now I am attempting to genotype them and I face the same issue on both X (males + females) and Y (males only): It starts running fine and then just hang on a certain position. At first it crashed asking for additional memory but now with 24Gb or memory it simply stays at a single position for 24hrs until my job eventually stops due to walltime. Here is the chrom X output: INFO 15:00:39,501 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T GenotypeGVCFs -ploidy 1 --dbsnp /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf -stand_call_conf 10 -stand_emit_conf 10 --max_alternate_alleles 4 -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.vcf -L X -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.1.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.2.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.3.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf INFO 15:00:39,507 HelpFormatter - Executing as lfrancioli@targetgcc15-mgmt on Linux 3.0.80-0.5-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 15:00:39,507 HelpFormatter - Date/Time: 2014/11/12 15:00:39 INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:39,508 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:00:40,951 GenomeAnalysisEngine - Strictness is SILENT INFO 15:00:41,153 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:57:53,416 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf.idx INFO 17:09:39,597 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf.idx INFO 18:21:00,656 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf.idx INFO 19:30:46,624 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf.idx INFO 20:22:38,368 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf.idx WARN 20:26:45,716 FSLockWithSharedLockAcquisitionTask - WARNING: Unable to lock file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx because an IOException occurred with message: No locks available.
INFO  20:26:45,718 RMDTrackBuilder - Could not acquire a shared lock on index file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx, falling back to using an in-memory index for this GATK run.
INFO  20:33:29,491 IntervalUtils - Processing 155270560 bp from intervals
INFO  20:33:29,628 GenomeAnalysisEngine - Preparing for traversal
INFO  20:33:29,635 GenomeAnalysisEngine - Done preparing for traversal
INFO  20:33:29,636 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  20:33:29,637 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  20:33:29,638 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  20:33:29,948 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
INFO  20:33:59,642 ProgressMeter -         X:65301         0.0    30.0 s      49.6 w        0.0%    19.8 h      19.8 h
INFO  20:34:59,820 ProgressMeter -         X:65301         0.0    90.0 s     149.1 w        0.0%    59.4 h      59.4 h
...
INFO  20:52:01,064 ProgressMeter -        X:177301         0.0    18.5 m    1837.7 w        0.1%    11.3 d      11.2 d
INFO  20:53:01,066 ProgressMeter -        X:177301         0.0    19.5 m    1936.9 w        0.1%    11.9 d      11.9 d
...
INFO  14:58:25,243 ProgressMeter -        X:177301         0.0    18.4 h   15250.3 w        0.1%    96.0 w      95.9 w
INFO  14:59:38,112 ProgressMeter -        X:177301         0.0    18.4 h   15250.3 w        0.1%    96.1 w      96.0 w
INFO  15:00:47,482 ProgressMeter -        X:177301         0.0    18.5 h   15250.3 w        0.1%    96.2 w      96.1 w
=>> PBS: job killed: walltime 86440 exceeded limit 86400

I would really appreciate if you could give me some pointer as how to handle this situation.

Thanks! Laurent

Created 2014-11-14 03:30:08 | Updated | Tags: variantannotator genotypegvcfs

I have constructed a VCF using GenotypeGVCFs. Due to downsampling and filtering imposed by HaplotypeCaller/GenotypeGVCFs, four fields of interest within the VCF do not reflect the data that went into building the file. I would like to update a these fields to reflect all read data used in the pipeline, but I have been able to do so for just 3 of the 4. I have been able to update INFO MQ0, INFO DP, and sample-level (FORMAT) AD, but I have not been able to update sample-level (FORMAT) DP.

I have run VariantAnnotator as follows.

inVCFfn=../vcf/raw.vcf
outVCFfn=../vcf/raw.annotated.vcf

gatk=/srv/gs1/software/gatk/gatk-3.3.0/GenomeAnalysisTK.jar
ref=/srv/gs1/projects/scg/Resources/GATK/b37/human_g1k_v37_decoy.fasta
dbsnp=/srv/gs1/projects/scg/Resources/GATK/b37/dbsnp_137.b37.vcf

java -Xmx10g \
-jar $gatk \ -T VariantAnnotator \ -R$ref \
-L $inVCFfn \ --variant$inVCFfn \
--dbsnp $dbsnp \ --out$outVCFfn \
--annotation Coverage \
--annotation DepthPerAlleleBySample \
--annotation MappingQualityZero \
[-I fileName.1.bam … -I fileName.810.bam]

The documentation for --annotation Coverage indicates that it should update both the INFO DP and the sample-level (FORMAT) DP, but the sample-level update seems not to be occurring. For example:

$awk '($2==7157834) { print $9, "|",$53 }' raw.vcf

$awk '($2==7157834) { print $9, "|",$53 }' raw.annotated.vcf
GT:AD:DP:GQ:PL | 0:131,0:4:99:0,135

(Aside: I'm working with a haploid chromosome, hence the single-value GT and two-value AD and PL fields). Although AD was updated from "4,0" to "131,0", the sample-level "DP" remained at "4". Of course I could approximate DP as the sum of AD values, but since AD is pre-filter and sample-level DP is post-filter, this wouldn't quite be correct. I should note that when I run HaplotypeCaller/GenotypeGVCFs on a subset of the data, the sample-level DP values are very close to the sum of AD's, so it's not a matter of most reads being filtered out, but rather that they have been downsampled. Am I missing something?

Thanks! David

P.S. The INFO field DP and MQ0 are indeed updated as expected:

$awk '($2==7157834) { print $8 }' raw.vcf AC=1;AF=6.993e-03;AN=143;DP=609;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=0;NCC=0;QD=30.16;SOR=1.259$ awk '($2==7157834) { print$8 }' raw.annotated.vcf
AC=1;AF=6.993e-03;AN=143;DP=10750;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=2;NCC=0;QD=30.16;SOR=1.259

Created 2014-11-04 05:50:02 | Updated | Tags: genotypegvcfs

In the past, I’ve found MQ0 ("Total Mapping Quality Zero Reads") to be a very useful INFO feature of putatively variable sites. However, in my experience, GenotypeGVCFs reports every site to have MQ0=0, even those for which UnifiedGenotyper reports MQ0 > 0 when run on the same data. Is this a bug, or am I missing something?

Thanks!

Created 2014-10-29 16:20:48 | Updated 2014-10-29 16:21:18 | Tags: haplotypecaller genotypegvcfs joint-calling

Hi,

We have some exomes processed with GATK from one exome capture platform (Nimblegen SeqCap on HiSeq), and now I am going to analyze a small batch of exomes sequenced using a different platform (Nextera on NextSeq). I was wondering if and how GenotypeGVCFs can cope with gvcfs produced on two (or more) different exome capture platforms? Is joint-calling of such "heterogeneous" samples advised, or should I rather genotype the small set of equally processed bams separately?

The problems I could foresee were:

• differing targets (each batch of samples will have gaps in coverage due to target-constrained (-L) haplotypecalling). If this should cause problems for GenotypeGVCFs the haplotypes could be called again on superset of targets, or easier, the calling restricted to the intersection.
• the alignment algorithm is different (BWA BWTSW vs BWA MEM) so the mapping quality could potentially differ (haven't checked). Is this something GATK is compensating for, like base qualities?
• the sequencer and chemistry, but here I hope that the BQSR should help in removing the variation

Thanks, Paweł

Created 2014-10-23 17:22:04 | Updated | Tags: genotypegvcfs

Hi,

I'm trying to call genotypes at a set of given positions from ~5,000 samples. To do so, I first generate gvcf files using HaplotypeCaller for the region around a gene for all samples separately with HaplotypeCaller. Then I'm merging the single gvcf files into batches of 1000 and then I'm generating the genotypes at ~180 sites using GenotypeGVCFs and the flag --includeNonVariantSites. Everything runs without errors, but in the end I noticed that ~20 sites were missing from the final VCF file. I then looked into these sites and the single files in detail and found out that there are simply no genotypes called for some samples at some positions.

I attached a small example. When I'm running the following command, no variant is in the output file:  java -jar GenomeAnalysisTK.jar \ -R hg19.fa \ -L region.bed --includeNonVariantSites \ -T GenotypeGVCFs \ --variant wrong.gvcf --variant correct.gvcf \ -o out.vcf

The same command works when running it with only "correct.gvcf" as an input, but obviously not when running it with only "wrong.gvcf". I noticed that wrong.gvcf might show a variant at this exact position.

Best, Thomas

Created 2014-10-06 20:07:44 | Updated | Tags: genotypegvcfs

Dear all,

I have been using a relatively recent nightly build of GATK to deal with some previous bug fix, and I now see some odd things in my VCF format, after using genotypegVCF. Precisely some rows have this INFO field: GT:AD:DP:GQ:PGT:PID:PL

with these new PGT and PID which are present only some of the time. I have no idea what this means (and it also messes up my data processing). Does anyone know what these fields are about? And, unless they are very useful, how to get rid of them?

Created 2014-09-30 15:46:58 | Updated | Tags: haplotypecaller genotypegvcfs

Following the the recommended best practice and using GATK v3.2, the raw vcf file created by haplotyecaller/GenotypeGVCFs does not contain any of AC, AN and AF tags. Is this an expected behaviour?

This is a line of the g.vcf file generated from HaplotypeCaller:
chr21 36104982 . AT A,ATT,<NON_REF> 0.39 . BaseQRankSum=-1.693;DP=35;MLEAC=1,0,0;MLEAF=0.500,0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=1.874;ReadPosRankSum=-0.023 GT:AD:GQ:PL:SB 0/1:23,6,2,0:27:27,0,499,67,411,529,96,459,496,525:10,13,2,4

And the raw vcf file from GenotypeGVCFs: chr21 36122885 . A AT 7871.94 . BaseQRankSum=0.151;DP=493;FS=5.433;MLEAC=24;MLEAF=0.750;MQ=60.00;MQ0=0;MQRankSum=0.129;QD=19.53;ReadPosRankSum=0.257 GT:AD:DP:GQ:PL 0/1:13,8:.:99:162,0,238 1/1:0,21:.:62:536,62,0 1/1:0,24:.:71:610,71,0 0/1:12,12:.:99:245,0,207 1/1:0,27:.:80:693,80,0 1/1:0,34:.:99:877,101,0 0/1:31,13:.:99:210,0,624 0/0:30,0:30:65:0,65,1125 0/1:6,11:.:96:225,0,96 0/1:10,5:.:87:87,0,196 0/1:4,15:.:39:340,0,39 1/1:2,35:.:66:895,66,0 1/1:0,53:.:99:1376,161,0 1/1:0,13:.:39:328,39,0 1/1:0,24:.:74:662,74,0 1/1:1,29:.:65:702,65,0

My command for HaplotypeCaller:
java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R myRef.fa -I my.bam -L chr21 --dbsnp dbSNP_chr21.vcf --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -stand_emit_conf 10 -stand_call_conf 30 -mbq 20 -nct 8 -A FisherStrand -A QualByDepth -A HaplotypeScore --annotation Coverage -o raw.g.vcf

And GenotypeGVCFs:
java -Xmx8g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R myRef.fa -V myVCF.list --dbsnp dbSNP_chr21.vcf -A FisherStrand -A QualByDepth -A HaplotypeScore -A Coverage -L chr21 -o myraw.vcf

Created 2014-09-19 14:34:45 | Updated | Tags: haplotypecaller malformedvcf genotypegvcfs

Hi,

I've tried my best to find the problem but without luck, so my first post...

First the background:

I have used HaplotypeCaller to make GVCFs per individual for 2 cohorts of 270 and 300 individuals each.
I then used CombineGVCFs to merged all the individuals in each cohort across 10Mb chunks across the genome.

These steps appear to have worked OK! Then I have been attempting to call genotypes from pairs GVCFs (one from each cohort) per region using GenotypeGVCFs with a command such as:

java -jar -Xmx6G /gpfs0/apps/well/gatk/3.2-2/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ../hs37d5/hs37d5.fa \ --variant /well/hill/kate/exomes//haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf \ --variant /well/hill/kate/exomes//haploc/sepsis300/sepsis300_1:230000000-240000000_merged_gvcf.vcf \ -o /well/hill/kate/exomes//haploc/cap270-sepsis300/cap270-sepsis300_1:230000000-240000000_raw.vcf

This seems to start fine (and works perfectly to completion for some regions), but then throws a error for some of the files. For this example the error is:

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 519534: ./.:29:75:25:0,63,945 is not a valid start position in the VCF format, for input source: /well/hill/kate/exomes/haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf

So to try and solve this I have grepped "./.:29:75:25:0,63,945 " from the VCF file (expecting to find a truncated line), but could find nothing wrong with the lines containing this expression. I looked carefully at the first line containing this after the last position that was written in the output - nothing out of the ordinary that I can see.

Created 2014-09-19 10:45:49 | Updated 2014-09-19 10:52:00 | Tags: mappingqualityranksumtest readposranksumtest mqranksum readposranksum genotypegvcfs

On 2000 samples I have run HC3.2, CGVCFs3.2, GGVCFs3.2 and VR3.2.

For the GenotypeGVCFs step I used the current default annotations:

InbreedingCoeff FisherStrand QualByDepth ChromosomeCounts GenotypeSummaries

And these non-default annotations:

MappingQualityRankSumTest ReadPosRankSumTest

When running VariantRecalibrator and plotting each of the dimensions I noticed all of the non-default annotations taking on discrete values; see bottom of this post. Is it no longer recommended to use ReadPosRankSum and MQRankSum for VR? Should I calculate these annotation with VariantAnnotator instead of GenotypeGVCFs? If I have to run VariantAnnotator, should I then run it separately for SNPs and INDELs cf. my previous question about annotations being different, when applied to BOTH and SNPs: http://gatkforums.broadinstitute.org/discussion/2620

Thank you.

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep ReadPosRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1 41649 ReadPosRankSum=0.731 41760 ReadPosRankSum=0.550 46305 ReadPosRankSum=0.720 47060 ReadPosRankSum=0.00 87348 ReadPosRankSum=0.406 105254 ReadPosRankSum=0.736 116426 ReadPosRankSum=0.727 164855 ReadPosRankSum=0.358 zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep "MQ=" | sort | uniq -c | awk '$1>5000' | sort -k1n,1
5802 MQ=57.05
8382 MQ=29.00
8525 MQ=56.62
10069 MQ=51.77
10574 MQ=53.95
10682 MQ=47.12
10818 MQ=56.04
11553 MQ=55.21
802603 MQ=60.00

java -Djava.io.tmpdir=$LSCRATCH -Xmx10g -jar /home/apps/Logiciels/GATK/3.2-2/GenomeAnalysisTK.jar -l INFO -T ApplyRecalibration -ts_filter_level 99.0 -mode SNP -input snv.vcf -recalFile 96Exomes.HC.TruSeq.snv.RECALIBRATED -tranchesFile 96Exomes.HC.TruSeq.snv.tranches -o 96Exomes.HC.TruSeq.snv.recal.vcf -R ~/References/hg19/hg19.fasta # HARD FILTER (RNASeq) java -Djava.io.tmpdir=$LSCRATCH -Xmx2g -jar /home/apps/Logiciels/GATK/3.1-1/GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R ~/References/hg19/hg19.fasta -V 96RNAseq.STAR.q1.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o 96RNAseq.STAR.q1.FS30.QD2.vcf

Here are some examples for RNAseq :

chr1 6711349 . G A 79.10 PASS BaseQRankSum=-1.369e+00;ClippingRankSum=1.00;DP=635;FS=1.871;MLEAC=1;MLEAF=5.495e-03;MQ=60.00;MQ0=0;MQRankSum=-4.560e-01;ReadPosRankSum=-1.187e+00 GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,280 ./.:0,0:0 0/0:9,0:9:21:0,21,248 0/0:7,0:7:21:0,21,196 0/0:7,0:7:21:0,21,226 0/0:8,0:8:21:0,21,227 0/0:8,0:8:21:0,21,253 0/0:7,0:7:21:0,21,218 0/0:9,0:9:27:0,27,282 1/1:0,0:5:15:137,15,0 0/0:2,0:2:6:0,6,47 0/0:28,0:28:78:0,78,860 0/0:7,0:7:21:0,21,252 0/0:2,0:2:6:0,6,49 0/0:5,0:5:12:0,12,152 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,126 0/0:9,0:9:21:0,21,315 0/0:7,0:7:21:0,21,256 0/0:7,0:7:21:0,21,160 0/0:8,0:8:21:0,21,298 0/0:20,0:20:60:0,60,605 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,67 0/0:2,0:2:6:0,6,71 0/0:14,0:14:20:0,20,390 0/0:7,0:7:21:0,21,223 0/0:7,0:7:21:0,21,221 0/0:4,0:4:12:0,12,134 0/0:2,0:2:6:0,6,54 ./.:0,0:0 0/0:4,0:4:9:0,9,118 0/0:8,0:8:21:0,21,243 0/0:6,0:6:15:0,15,143 0/0:8,0:8:21:0,21,244 0/0:7,0:7:21:0,21,192 0/0:2,0:2:6:0,6,54 0/0:13,0:13:27:0,27,359 0/0:8,0:8:21:0,21,245 0/0:7,0:7:21:0,21,218 0/0:12,0:12:36:0,36,354 0/0:8,0:8:21:0,21,315 0/0:7,0:7:21:0,21,215 0/0:2,0:2:6:0,6,49 0/0:10,0:10:24:0,24,301 0/0:7,0:7:21:0,21,208 0/0:7,0:7:21:0,21,199 0/0:2,0:2:6:0,6,47 0/0:3,0:3:9:0,9,87 0/0:2,0:2:6:0,6,73 0/0:7,0:7:21:0,21,210 0/0:8,0:8:22:0,22,268 0/0:7,0:7:21:0,21,184 0/0:7,0:7:21:0,21,213 0/0:5,0:5:9:0,9,135 0/0:7,0:7:21:0,21,200 0/0:4,0:4:12:0,12,118 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,232 0/0:7,0:7:21:0,21,217 0/0:8,0:8:21:0,21,255 0/0:9,0:9:24:0,24,314 0/0:8,0:8:21:0,21,221 0/0:9,0:9:24:0,24,276 0/0:9,0:9:21:0,21,285 0/0:3,0:3:6:0,6,90 0/0:2,0:2:6:0,6,57 0/0:13,0:13:20:0,20,385 0/0:2,0:2:6:0,6,48 0/0:11,0:11:27:0,27,317 0/0:8,0:8:21:0,21,315 0/0:9,0:9:24:0,24,284 0/0:7,0:7:21:0,21,228 0/0:14,0:14:33:0,33,446 0/0:2,0:2:6:0,6,64 0/0:2,0:2:6:0,6,72 0/0:7,0:7:21:0,21,258 0/0:10,0:10:27:0,27,348 0/0:7,0:7:21:0,21,219 0/0:9,0:9:21:0,21,289 0/0:20,0:20:57:0,57,855 0/0:4,0:4:12:0,12,146 0/0:7,0:7:21:0,21,205 0/0:12,0:14:36:0,36,1030 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,60 0/0:7,0:7:21:0,21,226 0/0:7,0:7:21:0,21,229 0/0:8,0:8:21:0,21,265 0/0:4,0:4:6:0,6,90 ./.:0,0:0 0/0:7,0:7:21:0,21,229 0/0:2,0:2:6:0,6,59 0/0:2,0:2:6:0,6,56 chr1 7992047 . T C 45.83 SnpCluster BaseQRankSum=1.03;ClippingRankSum=0.00;DP=98;FS=0.000;MLEAC=1;MLEAF=0.014;MQ=60.00;MQ0=0;MQRankSum=-1.026e+00;ReadPosRankSum=-1.026e+00 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,70 0/0:2,0:2:6:0,6,45 0/0:3,0:3:6:0,6,87 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 ./.:1,0:1 ./.:0,0:0 0/0:2,0:2:6:0,6,55 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,61 0/0:2,0:2:6:0,6,49 ./.:0,0:0 ./.:0,0:0 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,52 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,69 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,37 ./.:0,0:0 0/0:2,0:2:6:0,6,67 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:11,0:11:24:0,24,360 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49 0/0:2,0:2:6:0,6,68 0/0:2,0:2:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,50 0/0:3,0:3:6:0,6,90 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:4:6:0,6,50 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:7,0:7:21:0,21,231 0/0:2,0:2:6:0,6,64 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,70 ./.:0,0:0 0/0:6,0:6:15:0,15,148 ./.:0,0:0 ./.:0,0:0 1/1:0,0:2:6:90,6,0 ./.:0,0:0 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,74 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,58 0/0:2,0:2:6:0,6,71 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:6:0,6,49

For Exome Seq now : chr2 111878571 . C T 93.21 PASS DP=634;FS=0.000;MLEAC=1;MLEAF=5.319e-03;MQ=60.00;MQ0=0;VQSLOD=14.19;culprit=MQ GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,243 0/0:4,0:4:9:0,9,135 0/0:7,0:7:18:0,18,270 0/0:7,0:7:21:0,21,230 0/0:16,0:16:48:0,48,542 0/0:8,0:8:21:0,21,315 0/0:6,0:6:18:0,18,186 0/0:5,0:5:15:0,15,168 0/0:6,0:6:15:0,15,225 0/0:10,0:10:30:0,30,333 0/0:7,0:7:21:0,21,239 0/0:6,0:6:18:0,18,202 0/0:6,0:6:15:0,15,225 0/0:7,0:7:21:0,21,225 0/0:8,0:8:24:0,24,272 0/0:5,0:5:15:0,15,168 1/1:0,0:13:13:147,13,0 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,256 0/0:14,0:14:4:0,4,437 0/0:3,0:3:9:0,9,85 0/0:4,0:4:12:0,12,159 0/0:7,0:7:21:0,21,238 0/0:5,0:5:15:0,15,195 0/0:7,0:7:15:0,15,225 0/0:12,0:12:36:0,36,414 0/0:4,0:4:12:0,12,156 0/0:7,0:7:0:0,0,190 0/0:2,0:2:6:0,6,64 0/0:7,0:7:21:0,21,242 0/0:7,0:7:21:0,21,234 0/0:8,0:8:24:0,24,267 0/0:7,0:7:21:0,21,245 0/0:7,0:7:21:0,21,261 0/0:6,0:6:18:0,18,204 0/0:8,0:8:24:0,24,302 0/0:5,0:5:15:0,15,172 0/0:9,0:9:24:0,24,360 0/0:18,0:18:51:0,51,649 0/0:5,0:5:15:0,15,176 0/0:2,0:2:6:0,6,70 0/0:14,0:14:33:0,33,495 0/0:4,0:4:9:0,9,135 0/0:8,0:8:21:0,21,315 0/0:4,0:4:12:0,12,149 0/0:4,0:4:6:0,6,90 0/0:10,0:10:27:0,27,405 0/0:3,0:3:6:0,6,90 0/0:4,0:4:12:0,12,133 0/0:14,0:14:6:0,6,431 0/0:4,0:4:12:0,12,151 0/0:5,0:5:15:0,15,163 0/0:3,0:3:9:0,9,106 0/0:7,0:7:21:0,21,237 0/0:7,0:7:21:0,21,268 0/0:8,0:8:21:0,21,315 0/0:2,0:2:6:0,6,68 ./.:0,0:0 0/0:3,0:3:9:0,9,103 0/0:7,0:7:21:0,21,230 0/0:3,0:3:6:0,6,90 0/0:9,0:9:26:0,26,277 0/0:7,0:7:21:0,21,236 0/0:5,0:5:15:0,15,170 ./.:1,0:1 0/0:15,0:15:45:0,45,653 0/0:8,0:8:24:0,24,304 0/0:6,0:6:15:0,15,225 0/0:3,0:3:9:0,9,103 0/0:2,0:2:6:0,6,79 0/0:7,0:7:21:0,21,241 0/0:4,0:4:12:0,12,134 0/0:3,0:3:6:0,6,90 0/0:5,0:5:15:0,15,159 0/0:4,0:4:12:0,12,136 0/0:5,0:5:12:0,12,180 0/0:11,0:11:21:0,21,315 0/0:13,0:13:39:0,39,501 0/0:3,0:3:9:0,9,103 0/0:8,0:8:24:0,24,257 0/0:2,0:2:6:0,6,73 0/0:8,0:8:24:0,24,280 0/0:4,0:4:12:0,12,144 0/0:4,0:4:9:0,9,135 0/0:8,0:8:24:0,24,298 0/0:4,0:4:12:0,12,129 0/0:5,0:5:15:0,15,184 0/0:2,0:2:6:0,6,62 0/0:2,0:2:6:0,6,65 0/0:9,0:9:27:0,27,337 0/0:7,0:7:21:0,21,230 0/0:7,0:7:21:0,21,239 0/0:5,0:5:0:0,0,113 0/0:11,0:11:33:0,33,369 0/0:7,0:7:21:0,21,248 0/0:10,0:10:30:0,30,395

Created 2014-06-12 13:23:05 | Updated | Tags: genotypegvcfs combinegvcfs

I have run HaplotypeCaller (Version=3.1-1-g07a4bf8) on 99 WGS samples and created my GVCFs. I did this in 60+ intervals to enable faster turn around time. I then merged the GVCFs with CombineGVCFs (Version=3.1-1-g07a4bf8) However, when I try to call variants with java -jar $GATK -T GenotypeGVCFs -V WGS.gvcf.gz -o VCF/WGS.vcf.gz -R$GENOME -nt 8 it errors out complaining about PL values. The stack trace is here:

##### ERROR stack trace

java.lang.IllegalStateException: The original PLs do not have enough values; accessing index 10011 but size is 10000 at org.broadinstitute.sting.utils.variant.GATKVariantContextUtils.generatePLs(GATKVariantContextUtils.java:1593) at org.broadinstitute.sting.utils.variant.GATKVariantContextUtils.mergeRefConfidenceGenotypes(GATKVariantContextUtils.java:1530) at org.broadinstitute.sting.utils.variant.GATKVariantContextUtils.referenceConfidenceMerge(GATKVariantContextUtils.java:1095) at org.broadinstitute.sting.gatk.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:183) at org.broadinstitute.sting.gatk.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:110) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722)

##### ERROR ------------------------------------------------------------------------------------------

In the short term, is there a hidden setting I can use to skip over these, rather than completely throwing up?

Created 2014-04-18 22:38:18 | Updated | Tags: genotypegvcfs

Trying to do variantcalling using GATKs new pipeline using HaplotypeCaller and GenotypeGVCF.

Made gVCFs using HC and now trying to do joint genotyping using GenotypeGVCF

java -Xmx5G -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hs37d5.fa -V:VCF GVCF.list -L SureSelect_AllExon_V5_wo_chr.bed --dbsnp dbsnp_137.hg19.vcf -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A Coverage -A AlleleBalance -A AlleleBalanceBySample -A BaseCounts -A BaseQualityRankSumTest -A ChromosomeCounts -A ClippingRankSumTest -A Coverage -A DepthPerAlleleBySample -A FisherStrand -A GCContent -A HaplotypeScore -A HardyWeinberg -A HomopolymerRun -A InbreedingCoeff -A LowMQ -A MappingQualityRankSumTest -A MappingQualityZero -A MappingQualityZeroBySample -A NBaseCount -A QualByDepth -A RMSMappingQuality -A SpanningDeletions -A TandemRepeatAnnotator -A VariantType -o JointGenotyped.vcf

Error

##### ERROR MESSAGE: Code exception (see stack trace for error itself)

Not sure why this is happening.

Thanks, Tinu

Created 2014-04-17 21:28:34 | Updated | Tags: genotypegvcfs combinegvcfs

Hi there:

I'm trying to figure out if I need to run CombineGVCFs before GenotypeGVCFs. The documentation said "One would use this tool when needing to genotype too large a number of individual gVCFs". Is there a ballpark number for "too large a number of individuals"? For example, if I have 1000 individuals, should I use CombineGVCFs first? If so, is there a recommended chuck size? It seems CombineGVCFs does not take nt/nct option and when I tried to combine 600 files it's estimated to take a week.

Thanks.

Created 2014-04-14 21:55:57 | Updated | Tags: genotypegvcfs

I am using GATK 3.0 Generated GVCFs using HaplotypeCaller and now trying to do Joint Genotyping using GenotypeGVCFs.

I have 70+ GVCFs with me, which I want to Joint Genotype. Is there a way to give all of these GVCFs together as input (--variant) to GATK-GenotypeGVCF. Something like '-I' option for BAM list. I know it is possible to use --variant VCF1 --variant VCF2 etc. But that would be tedious to give inputs like that when you have these many or more GVCFs

Thanks, Tinu

Created 2014-04-02 14:47:06 | Updated 2014-04-02 14:47:46 | Tags: genotypegvcfs

commands are:

java -Xmx10g -jar $gatk -T HaplotypeCaller \ -R$refGenome \
--dbsnp $dbSNP \ -o s1.raw.var.g.vcf.gz \ -I s1.bam \ -pairHMM VECTOR_LOGLESS_CACHING \ --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 && java -Xmx10g -jar$gatk -T GenotypeGVCFs  \
-R \$refGenome \
--variant s1.raw.var.g.vcf.gz \
-o s1.raw.var.vcf.gz

Error in GenotypeGVCFs :

##### ERROR MESSAGE: Line 115: there aren't enough columns for line ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½uNï¿½ï¿½ï¿½ï¿½Þ«rfï¿½ebhï¿½ï¿½ï¿½ï¿½ï¿½NH(ï¿½ï¿½ï¿½ï¿½Ù»atï¿½ ï¿½Eï¿½;J1ï¿½ï¿½ï¿½ï¿½Xï¿½ï¿½Ê©áï¿½ï¿½ï¿½Çï¿½ï¿½ï¿½ï¿½ï¿½>ï¿½ï¿½@ï¿½ï¿½,0ï¿½ï¿½' (we

expected 9 tokens, and saw 2 ), for input source: s1.raw.var.g.vcf.gz