# Tagged with #gvcf 3 documentation articles | 3 announcements | 57 forum discussions

Created 2016-04-01 19:25:14 | Updated | Tags: haplotypecaller rnaseq joint-discovery gvcf joint-calling

We have not yet validated the joint genotyping methods (HaplotypeCaller in -ERC GVCF mode per-sample then GenotypeGVCFs per-cohort) on RNAseq data. Our standard recommendation is to process RNAseq samples individually as laid out in the RNAseq-specific documentation.

However, we know that a lot of people have been trying out the joint genotyping workflow on RNAseq data, and there do not seem to be any major technical problems. You are welcome to try it on your own data, with the caveat that we cannot guarantee correctness of results, and may not be able to help you if something goes wrong. Please be sure to examine your results carefully and critically.

If you do pursue this, you will need to pre-process your samples according to our RNA-specific documentation, then switch to the GVCF workflow at the HaplotypeCaller stage. For filtering, it will be up to you to determine whether hard filtering or VQSR produces the best results. We have not tested any of this, so we cannot provide a recommendation. Be prepared to do a lot of analysis to validate the quality of your results.
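As a rough sketch of the hand-off described above, the following Python snippet only assembles the two GATK 3 command lines: HaplotypeCaller in `-ERC GVCF` mode for each sample, then a single GenotypeGVCFs run over the cohort. The jar path, file names, and helper function names are placeholders invented for illustration. The flags are the ones quoted elsewhere on this page, and any RNAseq-specific arguments from the RNAseq documentation would still need to be added.

```python
from typing import Dict, List

GATK_JAR = "GenomeAnalysisTK.jar"  # assumed location of the GATK 3 jar


def haplotypecaller_gvcf_cmd(reference: str, bam: str, out_gvcf: str) -> List[str]:
    """Per-sample step: HaplotypeCaller emitting a gVCF (-ERC GVCF)."""
    return [
        "java", "-jar", GATK_JAR,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-ERC", "GVCF",
        "-o", out_gvcf,
    ]


def genotype_gvcfs_cmd(reference: str, gvcfs: List[str], out_vcf: str) -> List[str]:
    """Per-cohort step: GenotypeGVCFs with one -V argument per sample gVCF."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    return cmd + ["-o", out_vcf]


# Hypothetical sample names and BAM files, for illustration only.
samples: Dict[str, str] = {"sampleA": "sampleA.bam", "sampleB": "sampleB.bam"}

per_sample = [
    haplotypecaller_gvcf_cmd("ref.fasta", bam, f"{name}.g.vcf")
    for name, bam in samples.items()
]
joint = genotype_gvcfs_cmd("ref.fasta", [f"{name}.g.vcf" for name in samples], "cohort.vcf")

# Print the commands instead of running them.
for cmd in per_sample + [joint]:
    print(" ".join(cmd))
```

The commands are printed rather than executed so that they can be reviewed, or handed to a job scheduler, before anything runs.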

Good luck!

Created 2014-04-10 21:57:10 | Updated 2015-02-20 18:10:09 | Tags: haplotypecaller reference-model gvcf

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFs), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

### How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

• Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
• Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the less confident of these two models.
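To make the PL-to-GQ step concrete, here is a minimal Python sketch. It assumes the usual convention that GQ is the difference between the two smallest PL values, capped at 99; that cap and the helper names are assumptions for illustration, not a copy of the HaplotypeCaller code. The PL vectors used for the two models are made-up inputs.

```python
from typing import Sequence

MAX_GQ = 99  # assumed cap; GQ never exceeds 99 in the example records on this page


def gq_from_pls(pls: Sequence[int], cap: int = MAX_GQ) -> int:
    """GQ as the gap between the two smallest PLs, capped at `cap`."""
    if len(pls) < 2:
        raise ValueError("need at least two genotype likelihoods")
    lowest, second = sorted(pls)[:2]
    return min(cap, second - lowest)


def less_confident(snp_pls: Sequence[int], indel_pls: Sequence[int]) -> Sequence[int]:
    """Return the PL vector of whichever model yields the lower GQ."""
    return min((snp_pls, indel_pls), key=gq_from_pls)


# Hypothetical PLs from the SNP model and the indel model at one site.
snp_model = [0, 72, 1080]
indel_model = [0, 37, 909]

chosen = less_confident(snp_model, indel_model)
print(chosen, gq_from_pls(chosen))
```

Single-site records in the example gVCF on this page are consistent with this rule (for instance, PL `0,72,1080` is listed with GQ 72). Block records that summarize several sites need not follow it, since their GQ and PL values describe a whole interval.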

We use a symbolic allele, <NON_REF>, to indicate that the site is not homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

For details of the gVCF format, please see the document that explains what is a gVCF.

Created 2014-04-03 20:20:08 | Updated 2014-10-22 19:22:34 | Tags: haplotypecaller genotypegvcfs combinegvcfs gvcf joint-analysis

### Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

### Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

### General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

### The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock lines of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -ERC GVCF option.

### Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -ERC GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta

Toward the middle, after the ##FORMAT lines, you see the ##GVCFBlock lines (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)

which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).
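The banding can be illustrated with a short Python sketch that reads the four `##GVCFBlock` lines from the example header and looks up which band a given GQ falls into. The function names are made up for this example, and the regular expression matches only the header format shown on this page.

```python
import re
from typing import List, Optional, Tuple

Band = Tuple[int, int]  # (minGQ inclusive, maxGQ exclusive)

BLOCK_RE = re.compile(
    r"^##GVCFBlock=minGQ=(\d+)\(inclusive\),maxGQ=(\d+)\(exclusive\)$"
)


def parse_bands(header_lines: List[str]) -> List[Band]:
    """Collect the GQ bands declared in the header, sorted by lower bound."""
    bands = []
    for line in header_lines:
        match = BLOCK_RE.match(line.strip())
        if match:
            bands.append((int(match.group(1)), int(match.group(2))))
    return sorted(bands)


def band_for(gq: int, bands: List[Band]) -> Optional[Band]:
    """Return the band containing `gq`, or None if no band covers it."""
    for low, high in bands:
        if low <= gq < high:
            return (low, high)
    return None


header = [
    "##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)",
    "##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)",
    "##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)",
    "##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)",
]

bands = parse_bands(header)
for gq in (0, 12, 42, 99):
    print(gq, band_for(gq, bands))
```

Sorting the parsed bands makes the relationship to the boundaries [5, 20, 60] easier to see, because the header lists the lines in text order rather than numeric order.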

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

#### Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
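As an illustration of how the END tag can be used, the sketch below parses a few of the records shown above and reports how many sites each one covers. It is a simplified reader written for this example, not a general VCF parser: it splits on whitespace (real VCF files are tab-delimited), and the function and key names are invented here.

```python
from typing import Dict, Union

Record = Dict[str, Union[str, int]]


def parse_record(line: str) -> Record:
    """Extract span, record type and GQ from one single-sample gVCF line."""
    fields = line.split()
    pos, ref, alt = int(fields[1]), fields[3], fields[4]
    info, fmt, sample = fields[7], fields[8], fields[9]

    # INFO is a ';'-separated list of key=value pairs.
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    # FORMAT keys line up one-to-one with the sample's ':'-separated values.
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))

    is_block = alt == "<NON_REF>"
    end = int(info_map["END"]) if "END" in info_map else pos + len(ref) - 1
    return {
        "kind": "non-variant block" if is_block else "variant",
        "start": pos,
        "end": end,
        "n_sites": end - pos + 1,
        "gq": int(sample_map["GQ"]),
    }


lines = [
    "20 10000000 . T <NON_REF> . . END=10000116 GT:DP:GQ:MIN_DP:PL 0/0:44:99:38:0,89,1385",
    "20 10000439 . T G,<NON_REF> 1737.77 . DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB 1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0",
    "20 10001426 . A <NON_REF> . . END=10001427 GT:DP:GQ:MIN_DP:PL 0/0:49:0:48:0,0,1282",
    "20 10001428 . T <NON_REF> . . END=10001428 GT:DP:GQ:MIN_DP:PL 0/0:49:21:49:0,21,315",
]

records = [parse_record(line) for line in lines]
for rec in records:
    print(rec["start"], rec["end"], rec["n_sites"], rec["kind"], rec["gq"])
```

The first block spans 117 sites (10000000 through 10000116), while the last two blocks are adjacent but stay separate because their GQ values (0 and 21) fall into different bands.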

Created 2015-10-15 14:36:34 | Updated | Tags: best-practices presentations gvcf

Geraldine Van der Auwera presented this talk as part of the Broad Institute's Medical and Population Genetics (MPG) Primers series.

This talk provides a high-level overview of the workflow for performing variant discovery on high-throughput sequencing data, as described in the GATK Best Practices and implemented in the Broad's production pipelines.

The following points are emphasized in this presentation:

• Informational content of data file formats and flow of information throughout the pipeline
• Concepts involved in the data transformations (processing steps and analysis methods)
• Motivation and key mechanics of the GVCF workflow for scalable joint variant discovery
• Relation of the GATK Best Practices to the Broad's production pipeline implementation

The presentation slide deck is available at this link.

Created 2014-08-13 22:15:40 | Updated 2014-08-13 22:17:48 | Tags: conferences reference-model gvcf meetings

Here's my abstract for the upcoming Genome Science UK meeting in Oxford, where I'll be talking about our hot new workflow for variant discovery. The slide deck will be posted in the Presentations section as usual after the conference.

### Analyzing large cohorts without losing your mind: GATK's new reference model pipeline for variant discovery

Variant discovery is greatly empowered by the ability to analyse large cohorts of samples rather than single samples taken in isolation, but doing so presents considerable challenges. Variant callers that operate per-locus (such as Samtools and GATK’s UnifiedGenotyper) can handle fairly large cohorts (thousands of samples) and produce good results for SNPs, but they perform poorly on indels. More recently developed callers that operate using assembly graphs (such as Platypus and GATK’s HaplotypeCaller) perform much better on indels, but their runtime and computational requirements tend to increase exponentially with cohort size, limiting their application to cohorts of hundreds at most. In addition, traditional multisample calling workflows suffer from the so-called “N+1 problem”, where full cohort analysis must be repeated each time new samples are added.

To overcome these challenges, we developed an innovative workflow that decouples the two steps in the multisample variant discovery process: identifying evidence of variation in each sample, and interpreting that evidence in light of the evidence gathered for the entire cohort. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and much faster) on one sample at a time. This decoupling hinges on the use of a novel method for reference confidence estimation that produces a genomic VCF (gVCF) intermediate for each sample.

The new workflow enables fast, highly accurate and computationally cheap variant discovery in cohort sizes that were previously intractable: it has already been applied successfully to a cohort of nearly one hundred thousand samples. This replaces previous brute-force approaches and lowers the threshold of accessibility of sophisticated cohort analysis methods for all, including researchers who do not have access to large amounts of computing power.

Created 2014-04-04 01:48:46 | Updated | Tags: appistry rnaseq webinar gvcf gatk3

Our partners at Appistry are putting on another webinar next week, and this one's going to be pretty special in our view -- because we're going to be doing pretty much all the talking!

Titled "Speed, Cohorts, and RNAseq: An Insider Look into GATK 3" (see that link for the full program), this webinar will be all about the GATK 3 features, of course. And lest you think this is just another marketing pitch (no offense, marketing people), rest assured that we will be diving into the gory technical details of what happens under the hood. This is a great opportunity to get the inside scoop on how the new features (RNAseq, GVCF pipeline etc) work -- all the stuff that's fit to print, but that we haven't had time to write down in the docs yet. So don't miss it if that's the sort of thing that floats your boat! Or if you miss it, be sure to check out the recording afterward.

As usual the webinar is completely free and open to everyone (not just Appistry customers or prospective for-profit users). All you need to do is register now and tune in on Thursday 4/10.

Talk to you then!

Created 2016-05-12 13:54:59 | Updated 2016-05-12 14:28:40 | Tags: gvcf

Hi,

I have noticed that most of my alternate alleles are followed by a "*" allele. In the following example, the GT is 0/6 which seems to correspond to REF/*.

For example: chr7 54646419 . GCCGGACCCTGTTGGAGGAGGCTGGGTGGTTGCATGGTGCCGTGA GCCCGGACCCTGTTGGAGGAGGCTGGGTGGTTGCATGGTGCCGTGA,GTGTTGGAGGAGGCTGGGTGGTTGCATGGTGCCGTGA,GGGCTGGGTGGTTGCATGGTGCCGTGA,GCGGACCCTGTTGGAGGAGGCTGGGTGGTTGCATGGTGCCGTGA,GACCGGACCCTGTTGGAGGAGGCTGGGTGGTTGCATGGTGCCGTGA,* 182245.29 GT:AD:DP:GQ:PL 0/6:840,0,0,0,0,0,73:917:99:170,2590,35627,2590,35627,35627,2590,35627,35627,35627,2590,35627,35627,35627,35627,2590,35627,35627,35627,35627,35627,0,33783,33783,33783,33783,33783,34747

How do I interpret these * alleles?

Thank you.
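The reading of `0/6` as REF/* in the question above can be checked mechanically. The sketch below maps GT indices onto the allele list and uses the genotype ordering from the VCF specification (index `k*(k+1)/2 + j` for genotype `j/k`) to find the matching PL entry. The long allele strings are replaced by short stand-in labels, and the helper names are invented for this example. For context, the VCF 4.2 specification reserves `*` for an allele that is missing because of an upstream deletion.

```python
import re
from typing import List


def alleles_for_gt(ref: str, alts: List[str], gt: str) -> List[str]:
    """Translate GT indices (0 = REF, 1..n = ALT alleles in order) to alleles."""
    alleles = [ref] + alts
    return [alleles[int(index)] for index in re.split(r"[/|]", gt)]


def diploid_pl_index(j: int, k: int) -> int:
    """Position of genotype j/k in the PL vector (VCF ordering, j <= k)."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


# Stand-in labels for the five long ALT alleles in the record, followed by '*'.
alts = ["ALT1", "ALT2", "ALT3", "ALT4", "ALT5", "*"]

# PL values copied from the record in the question (28 values for 7 alleles).
pls = [170, 2590, 35627, 2590, 35627, 35627, 2590, 35627, 35627, 35627,
       2590, 35627, 35627, 35627, 35627, 2590, 35627, 35627, 35627, 35627,
       35627, 0, 33783, 33783, 33783, 33783, 33783, 34747]

called = alleles_for_gt("REF", alts, "0/6")
index = diploid_pl_index(0, 6)
print(called, index, pls[index])
```

The PL at the computed index is 0, which matches the called genotype: the most likely genotype always has a normalized PL of 0.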

Created 2016-03-19 17:20:42 | Updated | Tags: haplotypecaller dp m gvcf mq genotypegvcf

Hello! I had a question about the difference between using HaplotypeCaller's --emitRefConfidence GVCF vs BP_RESOLUTION. Maybe the answer is obvious or in the forum somewhere already but I couldn't spot it...

First, some context: I'm working with GATK v. 3.5.0 on a haploid organism. I have 34 samples, of which 5 are very similar to the reference (they are backcrosses) while the rest are strains from a wild population. Originally I used --emitRefConfidence GVCF followed by GenotypeGVCFs. While checking the output VCF file, I realized that the five backcrosses had a much lower DP on average than the other samples (this cannot be explained by differences in read numbers or anything like that, since they were run in the same lane, etc). I assume this happened because there are long tracts without any variants compared to the reference in those samples, and the GVCF blocks end up assigning a lower depth to a large number of sites in those samples compared to the much more polymorphic ones. In any case, I figured I could just get all sites using BP_RESOLUTION so as to obtain the "true" DP values per site. However, when I tried to do that, the resulting VCF file had very low MQ values! Can you explain why this happened?

This is the original file with --emitRefConfidence GVCF:

$ bcftools view -H 34snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   309.4   .   AC=4;AF=0.235;AN=17;DP=582;FS=0;MLEAC=4;MLEAF=0.235;MQ=40;QD=34.24;SOR=2.303
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=603;FS=0;MLEAC=2;MLEAF=0.065;MQ=44.44;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-0.762;ClippingRankSum=0.762;DP=660;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=29.53;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

And this is with --emitRefConfidence BP_RESOLUTION:

$ bcftools view -H 34allgenome_snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   307.28  .   AC=4;AF=0.211;AN=19;DP=602;FS=0;MLEAC=4;MLEAF=0.211;MQ=8.23;QD=34.24;SOR=2.204
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=750;FS=0;MLEAC=2;MLEAF=0.065;MQ=5.53;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-1.179;ClippingRankSum=0.762;DP=796;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=4.8;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

I find it particularly strange since the mapping quality of the backcrosses should in fact be slightly better on average (around 59 for the original BAM file) than that of the other, more polymorphic samples (around 58)...

Thank you very much!

Created 2016-03-17 14:42:53 | Updated | Tags: unifiedgenotyper haplotypecaller multi-sample gvcf

Hi, I'm using GATK version 3.4 for SNP calling and I have some questions about it. My data set has 500 samples, and I used genome data as the reference for bowtie/GATK.

1) I called SNPs per sample (gVCF) with HaplotypeCaller and then combined the gVCFs. However, the combination takes a long time: GATK wants to recreate the gvcf.idx files (4 of my GATK jobs are stuck at this step), and one combination job finished only after about 20 days of computation. I also tried '-nct' to speed this up, but it still gets stuck preparing the idx files.

2) For the data set whose combination finished, I also ran UnifiedGenotyper with Gr.sorted.bam as input to call SNPs. The output from Gr.sorted.bam has 5 times as many SNPs as the gVCF combination, and most of the missing SNPs can be found in the individual gVCF files but are absent from the final result.

Could you help me with these? Thank you!

Created 2016-03-11 09:57:18 | Updated | Tags: haplotypecaller gvcf

Hello, It seems that running HaplotypeCaller with -ERC GVCF and then running GenotypeGVCFs -stand_call_conf 30 -stand_emit_conf 30 gives a different vcf than HaplotypeCaller -stand_call_conf 30 -stand_emit_conf 30 (on a single sample). Is that expected? I tried versions 3.2 and 3.5.

Tag-along question: I've tried to be a good citizen and post questions on existing topics in this forum (instead of starting yet another duplicate thread) but these never get answered. Is it always better to post a new question?

Created 2016-02-22 09:29:54 | Updated 2016-02-22 09:31:14 | Tags: gvcf

Dear GATK team,

Do you have any recommendations or best practice guidelines on the GVCFGQBands setting? This used to be (gatk < 3.2) [0,5,20,60] but has been changed to one block per GQ value in gatk > 3.3. We would like to decrease the file size of the g.vcf files without losing too much resolution / sensitivity. In the gvcf docs the 'old' defaults are still mentioned: https://www.broadinstitute.org/gatk/guide/article?id=4017

Thanks, Robert

Created 2016-02-10 04:42:23 | Updated | Tags: best-practices genotypegvcfs gvcf

I am using the GATK pipeline to call variants by aligning reads to a draft quality reference genome that is ~367000 scaffolds. I split the scaffolds up into 50 intervals and successfully (and pretty quickly) generated GVCFs for 25 individuals using the -L option. However, I am having the worst of times with GenotypeGVCFs. After running for nearly 2 days on the first interval list, GenotypeGVCFs has not even output a file. Based on another post in the forum, I removed the scaffolds that are NOT in the interval from the GVCF header, and that sped up the process slightly - I have a combined VCF file with just the header generated after about 18 hours. Not sure how much longer the process has as the progress meter doesn't seem to be making any sense.

Is there any known way(s) to optimize this process?

Currently using the following command: java -Djava.io.tmpdir=/data/lwwvd/genoGVCF.tmp -XX:ParallelGCThreads=4 -Xmx15g -jar /usr/local/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -nt 16 -T GenotypeGVCFs -R ../ref_genomes/bbu_ref_UMD_CASPUR_WB_2.0.fa -L interval_lists/bbub.refctgs.49.interval_list -V ./1095/1095.49.g.vcf.gz -V ./189/189.49.g.vcf.gz -V ./190/190.49.g.vcf.gz -V ./196/196.49.g.vcf.gz -V ./246/246.49.g.vcf.gz -V ./337/337.49.g.vcf.gz -V ./581/581.49.g.vcf.gz -V ./583/583.49.g.vcf.gz -V ./662/662.49.g.vcf.gz -V ./701/701.49.g.vcf.gz -V ./850/850.49.g.vcf.gz -V ./92764/92764.49.g.vcf.gz -V ./92765/92765.49.g.vcf.gz -V ./92766/92766.49.g.vcf.gz -V ./92767/92767.49.g.vcf.gz -V ./92768/92768.49.g.vcf.gz -V ./92769/92769.49.g.vcf.gz -V ./92770/92770.49.g.vcf.gz -V ./92771/92771.49.g.vcf.gz -V ./92774/92774.49.g.vcf.gz -V ./92775/92775.49.g.vcf.gz -V ./92776/92776.49.g.vcf.gz -V ./92777/92777.49.g.vcf.gz -V ./92778/92778.49.g.vcf.gz -V ./92795/92795.49.g.vcf.gz -o BBUB.combined.49.vcf

Created 2016-02-08 15:12:24 | Updated | Tags: haplotypecaller multisample gvcf runtime-error erc

I am trying to run the HaplotypeCaller on a bam file with multiple samples. It runs successfully without the ERC GVCF option, e.g.

java -jar /home/unix/csmillie/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I test.bam

But when I try running it with the ERC GVCF option, I get an error:

java -jar /home/unix/csmillie/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I test.bam --emitRefConfidence GVCF --sample_name TCGGCTGAGAAC

I am using Java 1.7. I have validated the bam file with Picard. The bam file has the appropriate header, with tab-separated read groups that look like this: @RG ID:3 SM:TCGGCTGAGAAC PL:Illumina

The stack trace is below. If anyone can help I would really appreciate it! I am running this on an interactive node on the Broad cluster, in case it helps with debugging. Thanks!

hw-uger-1001:~/data/csmillie/test $ java -jar /home/unix/csmillie/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I test.bam --emitRefConfidence GVCF --sample_name TCGGCTGAGAAC
INFO 09:56:10,853 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:56:10,855 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 09:56:10,855 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 09:56:10,855 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 09:56:10,859 HelpFormatter - Program Args: -T HaplotypeCaller -R ref.fasta -I test.bam --emitRefConfidence GVCF --sample_name TCGGCTGAGAAC
INFO 09:56:10,877 HelpFormatter - Executing as csmillie@hw-uger-1001.broadinstitute.org on Linux 2.6.32-573.12.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_71-b14.
INFO 09:56:10,877 HelpFormatter - Date/Time: 2016/02/08 09:56:10
INFO 09:56:10,878 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:56:10,878 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:56:11,500 GenomeAnalysisEngine - Strictness is SILENT
INFO 09:56:12,598 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 09:56:12,606 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 09:56:12,760 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.15
INFO 09:56:12,972 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 09:56:13,128 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 09:56:13,732 GenomeAnalysisEngine - Done preparing for traversal
INFO 09:56:13,733 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 09:56:13,733 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 09:56:13,734 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 09:56:13,734 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
INFO 09:56:13,735 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output
INFO 09:56:14,806 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NullPointerException
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isGVCF(HaplotypeCaller.java:1251)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.initializeReferenceConfidenceModel(HaplotypeCaller.java:728)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.initialize(HaplotypeCaller.java:659)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

Created 2016-02-05 15:25:48 | Updated 2016-02-05 15:28:04 | Tags: haplotypecaller gvcf no-calls

Hi,

I have generated a gVCF for an exome (with non-variant block records) from a BAM file belonging to the 1000Genomes data. I am using GATK tools version 3.5-0-g36282e4 and I have run the HaplotypeCaller as follows:

time java -jar $gatk_dir/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R $reference \
-I $bamfile \
-ploidy 2 \
-stand_call_conf 20 \
-stand_emit_conf 10 \
-ERC GVCF \
-o output.g.vcf.gz

For the purposes of my analysis, I need to be able to tell from this gVCF whether positions are no-called, homozygous reference, or variant sites, or whether they were not targeted in the exome sequencing.

However, I cannot do this with the gVCF file I obtained, because it contains only variant site records or non-variant block records where the GT tag is always "0/0".

So I have a few questions regarding the non-variant block records:

1. Why does the output file not contain any no-call ("./.") records?

2. Shouldn't regions where there are no reads have the tag GT equal to "./." instead of "0/0"?

3. How can regions without reads (not targeted) be distinguished from regions with reads that were not called?

4. When looking at the bam file with IGV, non-variant blocks displayed in gVCF contain regions with reads. What is the explanation for such behaviour?

Sofia

Created 2016-02-05 10:00:42 | Updated | Tags: gvcf gatk3-5

I have noticed that sometimes g.vcf files will have calls where the DP field is completely missing from the FORMAT string, even though a variant is called with good GQ. When this happens, the INFO string does have a DP=0, as can be seen below.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  102637-001-093
gi|194447306|ref|NC_011083.1|   415537  .   A   <NON_REF>   .   .   END=415547  GT:DP:GQ:MIN_DP:PL  0:2:86:2:0,86
gi|194447306|ref|NC_011083.1|   415548  .   G   A,<NON_REF> 13.22   .   DP=0;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=0.00  GT:GQ:PL:SB 1:43:43,0,43:0,0,0,0
gi|194447306|ref|NC_011083.1|   415549  .   A   <NON_REF>   .   .   END=415549  GT:DP:GQ:MIN_DP:PL  0:2:86:2:0,86

gi|194447306|ref|NC_011083.1|   4683672 .   A   <NON_REF>   .   .   END=4683672 GT:DP:GQ:MIN_DP:PL  0:2:0:2:0,0
gi|194447306|ref|NC_011083.1|   4683673 .   AATC    A,<NON_REF> 6.95    .   DP=0;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQ=0.00  GT:GQ:PL:SB 1:45:45,0,45:0,0,0,0
gi|194447306|ref|NC_011083.1|   4683677 .   A   <NON_REF>   .   .   END=4683677 GT:DP:GQ:MIN_DP:PL  0:1:0:1:0,0

I'm using gatk version 3.5-0-g36282e4, and I've attached the files I used. The exact command I used is as follows:

bamfile=093_interval.bam
output=093_interval.g.vcf
bamout=093_interval_bamout.bam
reference=NC_011083.fasta

gatk -T HaplotypeCaller \
--sample_ploidy 1 \
-R $reference \
-I $bamfile \
-o $output \
-ERC GVCF \
-bamout $bamout

How can I make sense of this? To my mind, DP=0 (from the INFO field at least) should mean there are no reads, and therefore no call can be made, right?
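To find every record showing this pattern, something along these lines could be used. It is only a sketch, assuming tab-separated gVCF lines with a single sample column; the function name is made up for illustration.

```python
def records_missing_format_key(gvcf_lines, key="DP"):
    """Yield (chrom, pos, info) for data lines whose FORMAT column lacks `key`."""
    for line in gvcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Column 9 (index 8) is FORMAT, a colon-separated list of keys.
        if key not in cols[8].split(":"):
            yield cols[0], int(cols[1]), cols[7]
```

Printing the INFO column next to each hit makes it easy to confirm whether the INFO-level DP is 0 in all of them.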

Created 2016-02-03 11:28:05 | Updated | Tags: gvcf gatk3-5

Some background: I'm trying to directly convert haploid g.vcf files to fasta format, to utilise all sequencing information I have for downstream analysis, and not just the SNPs.

In one of my samples, I've encountered two records that overlap the same position, which makes parsing the g.vcf file a lot harder.

gi|194447306|ref|NC_011083.1| 346996 . CAACA C,<NON_REF> 3723.97 . BaseQRankSum=0.371;ClippingRankSum=1.279;DP=89;MLEAC=1,0;MLEAF=1.00,0.00;MQRankSum=0.536;RAW_MQ=266004.00;ReadPosRankSum=-0.701 GT:AD:DP:GQ:PL:SB 1:1,83,0:84:99:3763,0,3808:0,1,12,71
gi|194447306|ref|NC_011083.1| 347000 . A ATGTC,<NON_REF> 3723.97 . BaseQRankSum=0.825;ClippingRankSum=-1.732;DP=84;MLEAC=1,0;MLEAF=1.00,0.00;MQRankSum=-0.990;RAW_MQ=250875.00;ReadPosRankSum=-0.676 GT:AD:DP:GQ:PL:SB 1:1,83,0:84:99:3763,0,3808:0,1,12,71

The last 'A' of the ref at 346996 is the same position as the 'A' of the ref at position 347000. I was wondering if this construction is allowed in a vcf file?

I've looked at the vcf4.2 standard, but I couldn't find a definitive answer. On the one hand, multiple records with the same POS are explicitly allowed. On the other hand, the following suggests to me that records should not overlap, at least not for single-sample haploid g.vcf files: "ALT haplotypes are constructed from the REF haplotype by taking the REF allele bases at the POS in the reference genotype and replacing them with the ALT bases. In essence, the VCF record specifies a-REF-t and the alternative haplotypes are a-ALT-t for each alternative allele."

If I simply take the above records, and replace the REF with ALT, I won't get the true haplotype for my sample, since the two records overlap. So my haplotype reconstructed that way will be longer than the actual haplotype.
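A small sketch for locating such overlaps before attempting any haplotype reconstruction. It assumes tab-separated, position-sorted records, and treats the reference span of a record as POS to POS + len(REF) - 1, or POS to END for non-variant blocks; the function name is hypothetical.

```python
def find_overlapping_records(gvcf_lines):
    """Yield (previous_pos, current_pos) pairs whose reference spans overlap."""
    prev_chrom, prev_pos, prev_end = None, None, 0
    for line in gvcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, info = cols[0], int(cols[1]), cols[3], cols[7]
        end = pos + len(ref) - 1
        for item in info.split(";"):
            if item.startswith("END="):
                end = int(item[4:])
        if chrom == prev_chrom and pos <= prev_end:
            yield prev_pos, pos
        # Track the record that reaches furthest along the current contig.
        if chrom != prev_chrom or end > prev_end:
            prev_chrom, prev_pos, prev_end = chrom, pos, end
```

For the two records quoted above, the first spans 346996 to 347000 (REF `CAACA`), so the record at 347000 is reported as overlapping it.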

Created 2016-02-02 19:26:18 | Updated | Tags: gatk gvcf merging

Hi Team,

I need to know something that is hopefully simple to implement. I have to run GATK HaplotypeCaller on a large BAM file, so I have to run it in batches of 4 hours at most. I've done some tests to determine the appropriate size for the subsampled BAM. So let's say I need to run 10 jobs. Each one of those will output a gVCF file, all of which belong to the same individual in this experiment. I have read that there are several tools to parse and merge VCF files into a single one, such as CatVariants, CombineGVCFs and CombineVariants. I'm inclined to use CombineGVCFs since gVCF is the output I have, but my understanding is that it is meant for merging different individuals, which is not what I have to do. So which approach should I use?

Thanks for the inputs!!!

Alejandro

Created 2016-01-21 12:08:25 | Updated | Tags: gvcf

I was looking at a g.vcf file, and I noticed that there are many alternating calls, with one having GQ of 99, and the next having a GQ of 0. Looking at the other fields, I can find no reason why the GQ of these calls should alternate like that.

For example, these two calls are adjacent, and cover only one position each. The only difference appears to be that the GQ=0 call has a slightly higher read depth.

gi|194447306|ref|NC_011083.1| 25644 0 A <NON_REF> 0 0 END=25644 GT:DP:GQ:MIN_DP:PL 0:34:99:34:0,1391
gi|194447306|ref|NC_011083.1| 25645 0 G <NON_REF> 0 0 END=25645 GT:DP:GQ:MIN_DP:PL 0:35:0:35:0,0

Here, the GQ=0 call actually spans more positions, and has a higher depth than the GQ=99 call.

gi|194447306|ref|NC_011083.1| 25646 0 T <NON_REF> 0 0 END=25646 GT:DP:GQ:MIN_DP:PL 0:35:99:35:0,1480
gi|194447306|ref|NC_011083.1| 25647 0 C <NON_REF> 0 0 END=25655 GT:DP:GQ:MIN_DP:PL 0:39:0:35:0,0

This pattern repeats throughout the file. Can someone point me to an explanation as to why this is?

Created 2016-01-20 16:30:48 | Updated | Tags: haploid gvcf

I'm working on haploid bacteria, and I would like to create a fasta file of all positions for which I have data. I have looked at FastaAlternateReferenceMaker, but it seems to output reference alleles when the read depth at that position is 0. If that is the case, I would prefer N instead, since no coverage means we have no data at all for that position. So it might be ref, or it might be something else, and I'd rather not assume.

I think it should be possible to create a fasta file from the g.vcf file, since it is supposed to contain data on all positions, not just SNPs. Before I start working on a program, I would like to know:

1. Is there a better/existing way to do what I described above?
2. Am I correct in thinking that the g.vcf file is the best input filetype for this?

I know there are ways to convert bam to fasta, but for me that has several drawbacks:

1. I would have to deal with counting coverage myself
2. I would have to create the reverse-complement while reading the bam file, and detect things like PCR-duplicates and bad mappings
3. I won't have the benefit of gatk's fancy on the fly realignment step, or any of the other statistical computations that go into generating g.vcf files
4. bam files are big, g.vcf files are small, so using g.vcf files will be a lot faster

What do you think? I'd like to know my assumptions are correct before I start working on this.
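As a rough illustration of the masking part of this idea only, the sketch below writes N wherever no g.vcf record reports enough depth. It assumes a single contig, tab-separated records with one sample column, and a hypothetical `min_dp` cutoff; it deliberately does not apply the ALT alleles of variant records (those positions keep the reference bases), so it is a starting point rather than a full g.vcf-to-fasta converter.

```python
def mask_uncovered(reference, gvcf_lines, min_dp=1):
    """Return `reference` with N at every position lacking a record with DP >= min_dp."""
    masked = ["N"] * len(reference)
    for line in gvcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        pos, ref, info = int(cols[1]), cols[3], cols[7]
        end = pos + len(ref) - 1
        for item in info.split(";"):
            if item.startswith("END="):
                end = int(item[4:])
        sample = dict(zip(cols[8].split(":"), cols[9].split(":")))
        dp = sample.get("DP", "0")  # records without a FORMAT DP count as uncovered
        if dp.isdigit() and int(dp) >= min_dp:
            # 1-based inclusive [pos, end] maps to the 0-based slice [pos - 1, end).
            masked[pos - 1:end] = reference[pos - 1:end]
    return "".join(masked)
```

Using the block-level DP means a whole block is kept or masked together; MIN_DP could be used instead for a stricter cutoff.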

Created 2016-01-06 13:01:18 | Updated | Tags: combinevariants haplotypecaller best-practices dbsnp gatk combinegvcfs gvcf

Hi guys,

I have recently jointly called 27 full genomes using the GenotypeGVCFs approach. While I was trying to extract some chromosomes from the final file, I got this error:

The provided VCF file is malformed at approximately line number 16076: Unparsable vcf record with allele *.

I looked into the file and found that some of the multi-allelic sites have * as an allele, as seen in the attached picture.

I feel the problem could be that the program realised that more than one allele is present at that position but could not ascertain which allele. I may be wrong, but what do you think I can do to solve this problem?

LAWAL

Created 2016-01-04 15:36:35 | Updated | Tags: bug error combinegvcfs gvcf

Hi GATK team,

I am running the best practices pipeline on a large set of WES data. On some of the CombineGVCFs steps I am getting errors with the GATK v3.5.

Here is the command:

java -Xmx51200m -jar /home/ndaranalysis/GenomeAnalysisTK.jar -T CombineGVCFs -R /data/Build37/human_g1k_v37.fasta \
-o output.g.vcf.gz  -A AS_BaseQualityRankSumTest -A AS_FisherStrand -A AS_MappingQualityRankSumTest \
-A AS_QualByDepth -A AS_RMSMappingQuality -A AS_ReadPosRankSumTest -A AS_StrandOddsRatio -A AlleleBalance \
-A ClippingRankSumTest -A AS_InbreedingCoeff -A GCContent -A TandemRepeatAnnotator  -G StandardAnnotation \
-L /data/Exome/seqcap_ez_exome_v2.bed  -im ALL -ip 50 --logging_level ERROR -V input1.g.vcf -V input2.g.vcf ...

Here is the error:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IllegalArgumentException: Features added out of order: previous (TabixFeature{referenceIndex=9, start=135440296, end=135440296, featureStartFilePosition=356127518800018, featureEndFilePosition=-1}) > next (TabixFeature{referenceIndex=9, start=193026, end=193026, featureStartFilePosition=356127518807875, featureEndFilePosition=-1})
70)
a:200)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR
##### ERROR MESSAGE: Features added out of order: previous (TabixFeature{referenceIndex=9, start=135440296, end=135440296, featureStartFilePosition=356127518800018, featureEndFilePosition=-1}) > next (TabixFeature{referenceIndex=9, start=193026, end=193026, featureStartFilePosition=356127518807875, featureEndFilePosition=-1})
##### ERROR ------------------------------------------------------------------------------------------

Thanks!
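The message says that a record at start=193026 was encountered after one at start=135440296 on the same contig. One way to narrow down where that ordering arises is to check each input file for the first record that goes backwards. This is a sketch only: the function name is made up, and `contig_order` is assumed to list the contig names in the order given by the reference dictionary.

```python
def first_out_of_order(vcf_lines, contig_order):
    """Return (line_number, chrom, pos) of the first unsorted record, or None."""
    # A contig missing from contig_order raises KeyError, which is itself informative.
    rank = {name: i for i, name in enumerate(contig_order)}
    previous = (-1, 0)
    for number, line in enumerate(vcf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        current = (rank[chrom], int(pos))
        if current < previous:
            return number, chrom, int(pos)
        previous = current
    return None
```

If every input file passes this check, the ordering problem would have to come from somewhere other than the record order of the inputs.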

Created 2015-12-29 05:30:09 | Updated | Tags: genotypegvcfs gvcf

ask the joint genotyping tool, GenotypeGVCFs

Created 2015-12-21 19:30:11 | Updated | Tags: haplotypecaller gvcf

Hi,

I have three general questions about using HaplotypeCaller (I know I could have tested this myself, but I figured it would be more reliable to get an answer from the people who are developing the tool):

1. For single sample analysis, is the vcf generated directly from HC the same as the vcf generated using GenotypeGVCFs on the gvcf generated from HC?
2. For multi-sample analysis, in terms of speed, how is the performance of running GenotypeGVCFs on each gvcf, compared with combining all gvcfs to run joint-calling, assuming we can get all gvcfs in parallel (say for 500 samples)?
3. It seems the gvcf can be generated in two modes, -ERC GVCF or -ERC BP_RESOLUTION. How different is the one generated using -ERC BP_RESOLUTION from a vcf with all variant calls, reference calls and missing calls? And in terms of file size, say for NA12878 whole genome, how does the gvcf from -ERC GVCF compare with the one from -ERC BP_RESOLUTION?

Thank you very much for your attention; any information from you will be highly appreciated.

Created 2015-12-07 22:04:39 | Updated | Tags: genotypegvcfs gvcf

We would like to reduce the size of the output file when using --includeNonVariantSites. Can GenotypeGVCFs output in gVCF format when using --includeNonVariantSites?

Thanks, Carlos

Created 2015-10-26 12:17:38 | Updated | Tags: selectvariants genotypegvcfs gvcf

We are currently working on a project, and we have found that some of our samples were contaminated after the gVCF merging phase. Is it possible to remove samples from a merged gVCF (likely using SelectVariants), or would we need to re-merge only the good gVCFs into a new merged gVCF? (Note that we're actually working with a double-merged gVCF file containing ~5,000 samples, so re-merging would be potentially costly).

Thanks,

John Wallace

Created 2015-10-22 04:26:36 | Updated | Tags: gvcf

A position that exists in the target region file "./target.bed" did not exist in the gVCF file, but after GenotypeGVCFs, a SNP turned up at this position.

java -Xmx15g -Djava.io.tmpdir=pwd/tmp \
-jar ./GATK/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ./hg19/ucsc.hg19.fasta \
-I ./output.recal.cleaned.bam \
--dbsnp ./Data/dbsnp_138.hg19.excluding_sites_after_129.vcf \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-L ./target.bed \
-o ./SNP_Indel_HaplotypeCaller.g.vcf

Then I used "GenotypeGVCFs" to generate a vcf file which contains only variants. The command line was as follows:

# ==============================================
java -Xmx10g -Djava.io.tmpdir=pwd/tmp -jar ./GATK/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R ./hg19/ucsc.hg19.fasta \
--variant ./SNP_Indel_HaplotypeCaller.g.vcf \
-stand_call_conf 30 \
-stand_emit_conf 10 \
-o ./pedi_merged.vcf

In the file "pedi_merged.vcf", I found many variants which cannot be found in the corresponding gVCF file, such as

chr10 126089432 . G . . END=126089433 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,139
chr10 126089435 . T . . END=126089437 GT:DP:GQ:MIN_DP:PL 0/0:5:15:5:0,15,171

We can see that not only the SNP but the whole position "chr10 126089434" is absent from the gVCF file, yet after "GenotypeGVCFs" we get a SNP which had no information in the corresponding gVCF file.

When I used the "HaplotypeCaller" walker to generate the gVCF file, I used the "-L ./target.bed" argument. The file "./target.bed" contains the position "chr10 126089434":

# ==============================================
chr10 126089161 126089800

So we can see that a position that exists in "./target.bed" did not exist in the gVCF file, but after GenotypeGVCFs, a SNP turned up at this position! Can anyone tell me whether something is wrong with my command line, or whether there is some other problem with GATK "HaplotypeCaller"?

btw, my GATK version is "The Genome Analysis Toolkit (GATK) v3.3-0-g37228af"

Created 2015-10-21 13:46:12 | Updated 2015-10-21 13:56:18 | Tags: haplotypecaller genotypegvcfs gvcf gatk3-4

Hi all, I'm currently confused about the SNPs called as shown below. If I am not mistaken, the first row shows gatk called a 34 bp insertion in sample 001 at position 3229753. It didn't call anything for sample 001 at position 3229754, but then for position 3229756, it calls another 15bp insertion for sample 001, which overlaps completely with the first insertion.

I have three questions about this.

1) Is my interpretation of the data shown below correct?
2) If this is correct, is this expected behaviour for gatk? What kind of circumstances are expected to generate these results?
3) How can I interpret these conflicting calls? Should I just pick the call with the highest confidence and ignore the other? What about if a lower-confidence call is a substring of a previous call in another sample?

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001 002 003 004
gi|ref| 3229753 0 A AACTTGCCTGCCACGCTTTTCTTTATACTTAACCC 9635.2 0 AC=3;AF=1.00;AN=3;DP=304;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=59.86;QD=29.65;SOR=0.779 GT:AD:DP:GQ:PL 1:0,48:48:99:2153,0 1:0,84:84:99:3696,0 .:0,0 1:0,85:85:99:3813,0
gi|ref| 3229754 0 A ACTTGCCTGCCACGCTTTTCTTTATACTTAACCCAGGCGCTAATTCATCTGCAACG 3012.2 0 AC=1;AF=1.00;AN=1;DP=291;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.91;QD=28.35;SOR=0.910 GT:AD:DP:GQ:PL .:0,0 .:0,0 1:0,69:69:99:3039,0 .:0,0
gi|ref| 3229756 0 G GCGCTAATTCATCTGC 3654.2 0 AC=3;AF=1.00;AN=3;DP=74;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=60.00;QD=28.36;SOR=0.747 GT:AD:DP:GQ:PL 1:0,17:17:99:854,0 1:0,25:25:99:1213,0 .:0,0 1:0,32:32:99:1614,0

Created 2015-10-10 23:04:35 | Updated | Tags: gvcf

Dear all,

I have a naive query which may have been discussed earlier. I tried to find it in the forum but did not succeed.

Consider gVCF files produced for 3 different samples (single-sample variant calling), where genotyping each gVCF to VCF generates a list of variant sites only. When looking for the variants shared between the 3 samples, if one of the samples has no variant at a particular site in its VCF file, how should that be interpreted? Is it missing due to lack of reads, or because the sample is REF?

Created 2015-09-30 17:06:21 | Updated | Tags: haplotypecaller gvcf quality-score

After applying the standard RNA-Seq pipeline (with STAR, etc) I called variants with the command:

java -jar GenomeAnalysisTK.jar
-T HaplotypeCaller
-R chromosome.fa
-I ./final.bam
-dontUseSoftClippedBases
--variant_index_type LINEAR
--variant_index_parameter 128000
--emitRefConfidence GVCF -o ./final.gvcf

On the resultant gVCF file, I ran a little python script to see the distribution of calling quality across the different called genotypes:

• x-axis is quality score rounded to the nearest integer
• y-axis is the number of variants at that quality score 

As you can see, it's mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice. What I didn't expect, however, is the periodicity. Is that normal? Now I presumably need to filter these variants on some quality score threshold, and from this I really don't know where. 0? 50? 75?

Code to generate this data:

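#!/usr/bin/env python3
import collections

# Count records per genotype and per QUAL score truncated to an integer.
data = collections.defaultdict(lambda: collections.defaultdict(int))
with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf') as f:
    for line in f:
        if line[0] == '#': continue
        line = line.split('\t')
        if line[5] == '.': continue  # skip records without a QUAL value
        gt = line[9][:3]
        # The nested defaultdict also counts the first record seen for each
        # genotype, which the original try/except KeyError version skipped.
        data[gt][int(float(line[5]))] += 1
for gt, qualities in data.items():
    print('\n' + gt)
    for qual, count in sorted(qualities.items()):
        print(qual, count)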

Created 2015-09-16 09:40:07 | Updated | Tags: gvcf

Dear GATK team,

There is a bug when running CombineGVCFs. I have 330 sample GVCF files that were all output in "-ERC GVCF" mode by HaplotypeCaller.

The error information is:

==========start at : Wed Sep 16 17:10:28 HKT 2015 ==========
INFO 17:10:33,406 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:10:33,412 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12
INFO 17:10:33,413 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:10:33,413 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:10:33,418 HelpFormatter - Program Args: -et NO_ET -K /ifshk5/PC_PA_EU/USER/zhangbaifeng/software/gatk/zhangbaifeng_genomics.cn.key -T CombineGVCFs -R /ifshk1/BC_CANCER/01bin/DNA/software/pipeline/CSAP_v5.2.4/Database/human_19/hg19_fasta_GATK/hg19.fasta --variant 1.sample.g.vcf --variant 2.sample.g.vcf ...--variant 330.sample.g.vcf -o /ifshk7/BC_RES/TECH/PMO/zhangbaifeng/330.snp.analysis/330.sample.GVCF/haplotypecaller/vcf/cohort.g.vcf
INFO 17:10:33,450 HelpFormatter - Executing as zhangbaifeng@login-0-3.local on Linux 2.6.18-194.blc amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14.
INFO 17:10:33,450 HelpFormatter - Date/Time: 2015/09/16 17:10:33
INFO 17:10:33,451 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:10:33,451 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:10:43,301 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:10:43,571 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 17:10:54,982 GenomeAnalysisEngine - Preparing for traversal
INFO 17:10:54,994 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:10:54,995 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:10:54,995 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 17:10:54,996 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime

##### ERROR ------------------------------------------------------------------------------------------

Could you tell me how to solve it ? Thanks very much.

Created 2015-09-11 10:31:16 | Updated | Tags: haplotypecaller gvcf

I came across some unusual variants called by HaplotypeCaller running in gvcf mode while working on human WGS data (the example gvcf line can be seen below). The genotype in almost all samples is undefined i.e. "./.", despite the good coverage reported in DP field (only one sample is identified as 0/1). Moreover, in "./." genotyped samples all reads fall into reference allele group of AD field, therefore I would anticipate "0/0" genotype rather than "./.". I have also inspected several bam files visually and did not find any obvious mapping problems. I have attached two IGV snapshots of the variant region: first is from an example "./." genotyped patient and second one is from the only patient with variant. The region seems to have good 25-30x coverage with majority of mapping qualities equal to 60. However, apparently there is some other insertion nearby. The GATK version I am using is 2015.1-3.4.0-1-ga5ca3fc and reference genome is GRCh38.

Could you please explain why the inferred genotype is "./." instead of "0/0" ?

Best,

Ewa

chr1 100474610 rs568102277 T TG 358.91 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.54;ClippingRankSum=0.419;DB;DP=4026;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=1.36;QD=13.29;ReadPosRankSum=0.814;SOR=0.551 GT:AD:DP:GQ:PGT:PID:PL ./.:36,0:36 ./.:37,0:37 ./.:33,0:33 ./.:30,0:30 ./.:36,0:36 ./.:32,0:32 ./.:36,0:36 ./.:32,0:32 ./.:31,0:31 ./.:37,0:37 ./.:27,0:27 ./.:34,0:34 ./.:38,0:38 ./.:28,0:28 ./.:29,0:29 ./.:31,0:31 ./.:25,0:25 ./.:24,0:24 ./.:19,0:19 ./.:41,0:41 ./.:24,0:24 ./.:27,0:27 ./.:26,0:26 ./.:28,0:28 ./.:31,0:31 ./.:38,0:38 ./.:27,0:27 ./.:22,0:22 ./.:31,0:31 ./.:27,0:27 ./.:29,0:29 ./.:28,0:28 ./.:34,0:34 ./.:20,0:20 ./.:26,0:26 ./.:33,0:33 ./.:26,0:26 ./.:26,0:26 ./.:31,0:31 ./.:32,0:32 ./.:34,0:34 ./.:27,0:27 ./.:28,0:28 ./.:37,0:37 ./.:38,0:38 ./.:25,0:25 ./.:31,0:31 ./.:37,0:37 ./.:31,0:31 ./.:32,0:32 ./.:30,0:30 ./.:38,0:38 ./.:36,0:36 ./.:32,0:32 ./.:40,0:40 ./.:32,0:32 ./.:42,0:42 ./.:37,0:37 ./.:29,0:29 ./.:42,0:42 ./.:31,0:31 ./.:36,0:36 ./.:35,0:35 ./.:31,0:31 ./.:35,0:35 ./.:32,0:32 ./.:30,0:30 ./.:30,0:30 ./.:36,0:36 ./.:34,0:34 ./.:28,0:28 ./.:37,0:37 ./.:34,0:34 ./.:24,0:24 ./.:31,0:31 ./.:33,0:33 ./.:36,0:36 ./.:37,0:37 ./.:48,0:48 ./.:25,0:25 ./.:39,0:39 ./.:26,0:26 ./.:23,0:23 ./.:39,0:39 ./.:29,0:29 ./.:33,0:33 ./.:37,0:37 ./.:27,0:27 ./.:29,0:29 ./.:42,0:42 ./.:28,0:28 ./.:29,0:29 ./.:30,0:30 ./.:39,0:39 ./.:39,0:39 ./.:35,0:35 ./.:31,0:31 ./.:29,0:29 ./.:23,0:23 ./.:30,0:30 ./.:24,0:24 ./.:29,0:29 ./.:26,0:26 ./.:19,0:19 ./.:26,0:26 ./.:16,0:16 ./.:27,0:27 ./.:24,0:24 ./.:34,0:34 ./.:28,0:28 ./.:41,0:41 ./.:41,0:41 ./.:39,0:39 ./.:24,0:24 0/1:11,16:27:99:1|0:100474609_G_GT:381,0,245 ./.:36,0:36 ./.:26,0:26 ./.:27,0:27 ./.:29,0:29 ./.:29,0:29 ./.:28,0:28 ./.:24,0:24 ./.:19,0:19 ./.:31,0:31 ./.:33,0:33 ./.:23,0:23 ./.:25,0:25 ./.:31,0:31 ./.:34,0:34 ./.:26,0:26

Created 2015-08-18 20:05:07 | Updated | Tags: haplotypecaller gvcf wgs

Hi,

I am doing gVCF calls for whole genome samples and I notice that the gVCF-calling jobs for some of the samples fail at random genomic locations. If I resubmit those failed jobs, they either finish successfully or fail again at a different genomic location ('genomic location' info from the "ProgressMeter" line inside the logs).

• I am doing one gVCF job per WGS sample. Right now there are more than 70% of jobs that are failing. Is there anything that should be changed on the parameters?
• Do you have something like a SOP for best practises on doing HaplotypeCaller calling for WGS samples? I understand the process is very similar to exome sequencing gVCF calling but somehow I see many more job failures with gVCF calling on WGS samples.

I am using the following parameters for gVCF call:

java -Xmx128g -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit -jar GenomeAnalysisTK.jar
-T HaplotypeCaller
-I file.bam
-nct 8
-R human_g1k_v37.fasta
-o /ttemp/file.g.vcf
-L b37_wgs.intervals
--emitRefConfidence GVCF
--variant_index_type LINEAR --variant_index_parameter 128000
-dcov 250
-minPruning 3
-stand_call_conf 30
-stand_emit_conf 30
-G Standard -A AlleleBalance -A Coverage
-A HomopolymerRun -A QualByDepth

Compute: One full node ("256GB RAM, 20 cores" per node) per single sample WGS gvcf job. GATK version being used is "3.1"

P.S. I am also testing out the latest version of GATK (3.4) without “-dcov” option to see if that resolves the issue.

Thanks,

Shalabh

Created 2015-08-11 18:52:55 | Updated | Tags: haplotypecaller gvcf

Dear GATK team,

I wish to get gVCF files for each data set, but I am not sure if I should still use the --output_mode EMIT_ALL_SITES argument in my command lines. In your previous thread, I found you mentioned that "HaplotypeCaller used to have that option, but it was removed when we introduced the reference model (gVCF) option. Have a look at the documentation that explains this here: http://www.broadinstitute.org/gatk/guide/article?id=2940". I clicked on the link, but it was not accessible. So I wish to confirm that I am using the right arguments in my command. Here is my command. I have removed the --output_mode option. Will that be all right?

java -Xmx12g -jar GATK_JARS/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ucsc.hg19.fasta \
-I sample1.realigned.dedup.sorted.bam \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 20 \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-o raw_var_sample1.g.vcf

Created 2015-07-23 08:14:30 | Updated | Tags: haplotypecaller gvcf wgs

Hello,

Does anyone know a rough estimate of the file size of a gvcf produced at BP_RESOLUTION by the HaplotypeCaller for a whole genome sequencing experiment? Perhaps a rather simple question, but I cannot find it elsewhere on the forum or other places like seqanswers.

Thanks in advance,

Created 2015-07-22 11:46:00 | Updated | Tags: combinevariants haplotypecaller best-practices vcf gatk genotypegvcfs combinegvcfs gvcf

I was trying to combine sets of vcf files for all my samples so that I have one single vcf output, using the command below:

java -d64 -Xmx48g -jar {GATK}/GenomeAnalysisTK.jar \
-R {REF} \
-T GenotypeGVCFs \
--variant A.g.vcf \
--variant B.g.vcf \
--variant C.g.vcf \
-stand_emit_conf 30 \
-stand_call_conf 30 \
-o genotype.vcf

but I got this error message: "The following invalid GT allele index was encountered in the file: END=21994810".
I have tried to locate where the problem could be coming from but I do not understand this. Could you please advise me.

Created 2015-06-19 19:20:58 | Updated | Tags: haplotypecaller rnaseq genotypegvcfs gvcf

Hello,

I was wondering if there is a way to output all annotations for all sites when running HaplotypeCaller with BP_RESOLUTION. Currently it outputs all annotations for only called variants.

Thanks in advance.

Created 2015-06-05 07:41:51 | Updated | Tags: haplotypecaller gvcf

Hi,

I first use HaplotypeCaller to call variants with --emitRefConfidence GVCF, and then the tool GenotypeGVCFs on only the sample. In other words, I did not apply GenotypeGVCFs on a cohort but on the sample itself. There was a record called in the first step that was not output in the second step. The particular record is shown below:

chr1 78435701 rs202224025 TA T,TAA, 23.75 . BaseQRankSum=1.312;DB;DP=62;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=1.101;ReadPosRankSum=0.773 GT:AD:GQ:PL:SB 0/2:31,7,8,0:36:61,36,882,0,568,734,154,759,709,863:1,30,0,8

The command lines used are below:

java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /Software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R /Data/bundle_2.8_hg19/ucsc.hg19.fasta \
-I /input/chr1.bam \
-ERC GVCF \
-variant_index_type LINEAR \
-variant_index_parameter 128000 \
--dbsnp /Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf \
-A StrandOddsRatio -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -A RMSMappingQuality \
-o /output/chr1.gvcf.vcf \
-L chr1

java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /Software/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R /Data/bundle_2.8_hg19/ucsc.hg19.fasta \
--variant /output/chr1.gvcf.vcf \
-A StrandOddsRatio -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -A RMSMappingQuality \
-o /output/chr1.gatkHC.vcf \
--dbsnp /Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf \
-stand_call_conf 30.0 \
-stand_emit_conf 10.0 \
-L /BED/chr1.bed

I thought that this variant should be emitted into the final vcf, since the qual is 23.75, which is greater than 10.0 (set by -stand_emit_conf). Do I misunderstand something here?

Thank you!

Created 2015-05-28 02:35:01 | Updated | Tags: gvcf

Hello,

I ran the Genotype Caller using the GATK version GenomeAnalysisTK-3.3-0, but now that we are in the filtering process we are having trouble and we found this error in the gvcf. This is an example of the error (in bold), where it says that the variant has a second alternate allele when it does not. It is affecting the filtering process and I am wondering how we could fix it.

Thanks!

-Paulina-

Final GVCF

5 37036492 . C T 301.07 PASS AC=3;AF=0.065;AN=46;BaseQRankSum=-7.200e-01;ClippingRankSum=0.00;DP=201;FS=2.105;GQ_MEAN=27.70;GQ_STDDEV=19.31;InbreedingCoeff=-0.1151;MLEAC=3;MLEAF=0.065;MQ=60.00;MQ0=0;MQRankSum=-3.540e-01;NCC=7;QD=13.69;ReadPosRankSum=-7.200e-01;SOR=0.242;VQSLOD=5.78;culprit=MQ GT:AD:DP:GQ:PL 0/2:13,0,3:16:19:19,57,345,0,288,279 0/0:19,0,0:19:0:0,0,400,0,400,400 0/1:3,2:5:34:34,0,74 0/0:2,0:2:6:0,6,69 0/2:3,0,1:4:12:12,21,75,0,54,51 0/0:5,0:5:15:0,15,176 0/2:17,0,3:20:12:12,63,458,0,395,386 0/0:16,0,0:16:0:0,0,359,0,359,359 0/0:1,0:1:3:0,3,35 0/0:2,0:2:6:0,6,67 0/0:2,0:2:6:0,6,60 0/0:13,0,0:13:5:0,5,368,5,368,368 0/0:17,0,0:17:16:0,16,485,16,485,485 0/0:4,0:4:12:0,12,137 0/0:3,0:3:0:0,0,37

Pan001N GVCF

5 37036492 . C . . END=37036492 GT:DP:GQ:MIN_DP:PL 0/0:3:0:3:0,0,37

Created 2015-04-01 14:17:52 | Updated | Tags: vqsr haplotypecaller best-practices gvcf

I am currently processing ~100 exomes and following the Best Practice recommendations for Pre-processing and Variant Discovery. However, there are a couple of gaps in the documentation, as far as I can tell, regarding exactly how to proceed with VQSR with exome data. I would be grateful for some feedback, particularly regarding VQSR.
The issues are similar to those discussed on this thread: http://gatkforums.broadinstitute.org/discussion/4798/vqsr-using-capture-and-padding but my questions aren't fully addressed there (or elsewhere on the Forum as far as I can see).

Prior Steps:

1) All samples processed with same protocol (~60Mb capture kit) - coverage ~50X-100X
2) Alignment with BWA-MEM (to whole genome)
3) Remove duplicates, indel-realignment, bqsr
4) HC to produce gVCFs (-ERC)
5) Genotype gVCFs

This week I have been investigating VQSR, which has generated some questions.

Q1) Which regions should I use from my data for building the VQSR model?

Here I have tried 3 different input datasets:

a) All my variant positions (11 million positions)
b) Variant positions that are in the capture kit (~326k positions) - i.e. used bedtools intersect to only extract variants from (1)
c) Variant positions that are in the capture kit with padding of 100nt either side (~568k positions) - as above but bed has +/-100 on regions + uniq to remove duplicate variants that are now in more than one bed region

For each of the above, I have produced "sensitive" and "specific" datasets:

"Specific": --ts_filter_level 90.0 \ for both SNPs and INDELs
"Sensitive": --ts_filter_level 99.5 \ for SNPs, and --ts_filter_level 99.0 \ for INDELs (as suggested in the definitive FAQ https://www.broadinstitute.org/gatk/guide/article?id=1259 )

I also wanted to see what effect, if any, the "-tranche" argument has - i.e. does it just allow for ease of filtering, or does it affect the model generated, since it was not clear to me. I applied either 5 tranches or 6:

5-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 \ for both SNPs and INDELs
6-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 \ for both SNPs and INDELs

To compare the results I then used bed intersect to get back to the variants that are within the capture kit (~326k, as before).
The output is shown in the spreadsheet image below.

What the table appears to show me is that at the "sensitive" settings (orange background), the results are largely the same - the difference between "PASS" in the set at the bottom where all variants were being used, and the others, is mostly accounted for by variants being pushed into the 99.9-100 tranche. However, when trying to be specific (blue background), the difference between using all variants, or just the capture region/capture+100, is marked. Also surprising (at least for me) is the huge difference in "PASS" in cells E15 and E16, where the only difference was the number of tranches given to the model (note that there is very little difference in the analogous cells in Rows 5/6 and Rows 10/11).

Q2) Can somebody explain why there is such a difference in "PASS" rows between All-SPEC and the Capture(s)-Spec?

Q3) Can somebody explain why 6 tranches resulted in ~23k more PASSes than 5 tranches for the All-SPEC?

Q4) What does "PASS" mean in this context - a score =100? Is it an observation of a variant position in my data that has been observed in the "truth" set? It isn't actually described in the header of the VCF, though presumably the following corresponds: FILTER=

Q5) Similarly, why do no variants fall below my lower tranche threshold of 90? Is it because they are all reliable at least to this level?

Q6) Am I just really confused? :-(

Thanks in advance for your help! :-)

Created 2015-03-11 13:48:14 | Updated | Tags: haplotypecaller gvcf scaffolds

Hi Team,

1 BAM = 1 individual

My question is regarding the HaplotypeCaller and scaffolds in a BAM file. When I want to do the individual SNP-calling procedure (--emitRefConfidence GVCF) before the Joint Genotyping, I found that with my number of scaffolds the process is computationally quite costly. I now ran the HaplotypeCaller for every BAM on just a single scaffold (by using -L).

Question is: Do you see any downside in this approach regarding the result quality?
Or are the scaffolds treated independently anyway, so that my approach is fine? The next step would be to combine the gVCFs into a single one again (corresponding to the original BAM) and then do joint genotyping on a cohort of gVCFs (-> cohort of individuals).

Thanks a lot! Alexander

Created 2015-03-10 18:27:56 | Updated | Tags: haplotypecaller genotypegvcfs gvcf

I ran the following command for "GenotypeGVCFs" on 3 VCF files output by HaplotypeCaller:

java -jar data/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \
-R data/ucsc.hg19.fasta \
-T GenotypeGVCFs \
--variant data/47V_post.ERC.vcf \
--variant data/48V_post.ERC.vcf \
--variant data/49V_post.ERC.vcf \
--out data/Combined_geno_3files.vcf

But in the final VCF output there is no rsID information and every row shows "." instead. What is the problem? I am really confused. Could you please advise how to get SNP IDs in the output VCF?

Thanks

Created 2015-03-09 18:47:53 | Updated | Tags: haplotypecaller combinegvcfs gvcf

I used the following command to combine 3 VCF files which are outputs of HaplotypeCaller:

java -jar data/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \
-R data/ucsc.hg19.fasta \
-T CombineGVCFs \
--variant data/47V_post.ERC.vcf \
--variant data/48V_post.ERC.vcf \
--variant data/49V_post.ERC.vcf \
--out data/Combined_3files.vcf

However, after combining all 3 files, I can only see ./. genotypes in the final output VCF. What is the problem? How can I fix this?
Thanks

Created 2015-02-23 13:25:09 | Updated 2015-02-23 14:06:33 | Tags: haplotypecaller gvcf stand-emit-conf stand-call-conf

I am using HC 3.3-0-g37228af to generate GVCFs, including the parameters (full command below):

stand_emit_conf 10
stand_call_conf 30

The process completes fine, but when I look at the header of the gVCF produced, they are shown as follows:

standard_min_confidence_threshold_for_calling=-0.0
standard_min_confidence_threshold_for_emitting=-0.0

After trying various tests, it appears that setting these values is incompatible with -ERC GVCF (which requires "-variant_index_type LINEAR" and "-variant_index_parameter 128000").

1) Can you confirm if this is expected behaviour, and why this should be so?
2) Is this another case where the GVCF is an intermediate file, and hence every possible variant is emitted initially?
3) Regardless of the answers above, is stand_call_conf equivalent to requiring a GQ of 30?

java -Xmx11200m -Djava.io.tmpdir=TMPDIR -jar /apps/GATK/3.3-0/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-I /E000007/target_indel_realignment/E000007.6.bqsr.bam \
-R /project/production/Indexes/samtools/hsapiens.hs37d5.fasta \
-et NO_ET \
-K /project/production/DAT/apps/GATK/2.4.9/ourkey \
-dt NONE \
-L 10 \
-A AlleleBalance \
-A BaseCounts \
-A BaseQualityRankSumTest \
-A ChromosomeCounts \
-A ClippingRankSumTest \
-A Coverage \
-A DepthPerAlleleBySample \
-A DepthPerSampleHC \
-A FisherStrand \
-A GCContent \
-A HaplotypeScore \
-A HardyWeinberg \
-A HomopolymerRun \
-A ClippingRankSumTest \
-A LikelihoodRankSumTest \
-A LowMQ \
-A MappingQualityRankSumTest \
-A MappingQualityZero \
-A MappingQualityZeroBySample \
-A NBaseCount \
-A QualByDepth \
-A RMSMappingQuality \
-A StrandBiasBySample \
-A StrandOddsRatio \
-A VariantType \
-ploidy 2 \
--min_base_quality_score 10 \
-ERC GVCF \
-variant_index_type LINEAR \
-variant_index_parameter 128000 \
--GVCFGQBands 20 \
--standard_min_confidence_threshold_for_calling 30 \
--standard_min_confidence_threshold_for_emitting 10

Created 2015-02-13 17:15:34 | Updated | Tags: haplotypecaller gvcf

I have 13 whole exome sequencing samples, and unfortunately, I'm having a hard time getting HaplotypeCaller to complete within the time frame the cluster I use allows (150 hours). I use 10 nodes at a time with 10gb ram with 8 cores per node. Is there any way to speed up this rate? I tried using HaplotypeCaller in GVCF mode with the following command:

java -d64 -Xmx8g -jar $GATKDIR/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R $REF --dbsnp $DBSNP \
-I 7-27_realigned.bam \
-o 7-27_hg19.vcf \
-U ALLOW_UNSET_BAM_SORT_ORDER \
-gt_mode DISCOVERY \
-mbq 20 \
-stand_emit_conf 20 -G Standard -A AlleleBalance -nct 16 \
--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Am I doing something incorrectly? Is there anything I can tweak to minimize the runtime? What is the expected runtime for WES on a standard setup (a few cores and some RAM)?

Created 2015-02-05 21:44:46 | Updated 2015-02-05 21:45:35 | Tags: haplotypecaller gvcf

Hi, I have noticed that every time I repeat a gVCF call on the same sample (~same BAM file), the output gVCF files are not exactly the same. They are almost identical, but there will be a few differences here and there, and there will be a minute difference in unix file sizes as well. Is that something that is expected?

Shalabh Suman

Created 2015-02-02 21:24:31 | Updated | Tags: vqsr dbsnp vqslod genotypegvcfs gvcf

From my whole-genome (human) BAM files, I want to obtain: for each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data.

Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs:

CHROM POS ID REF ALT
chr1 1 . A <NON_REF> # my data
chr1 1 . A T # dbSNP

I would like to know the confidence (in terms of GQ and/or PL) of calling A/A, A/T, or T/T. The call of <NON_REF> isn't useful to me for the reason explained below.

How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller.

My detailed reason for needing this is below:

For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases:

1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2.
3. We do not have sufficient evidence to make any call for position n.

I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not.

Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs with associated GQs and PLs, and GT=./.. However, GenotypeGVCFs removes all of these, meaning that whenever the call by HaplotypeCaller was ./. (due to lack of evidence for variation), it isn't carried forward for use in VQSR. Consequently, this seems to distinguish only between these two cases:

1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1.
2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown).

This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know".

Thanks in advance! Douglas

Created 2015-01-27 21:59:14 | Updated 2015-01-27 21:59:46 | Tags: best-practices snp gatk combinegvcfs gvcf

Hi, I am combining gvcf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing.

For example, at position 16050036, this is what comes up in the final gvcf file:

22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0

But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site:

22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0

The command I'm using to generate these files is:

java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L ${numchr}

where numchr is a variable previously defined (indicating the chromosome number).

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?

Thanks, Alva

Created 2014-12-09 21:47:52 | Updated | Tags: vcf genotypegvcfs gvcf variant-calling

Hi,

I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:

16050036 16050612 16050822 16050933 16051556 16051968 16051994 16052080 16052167 16052239 16052250 16052357 etc.

whereas the initial set of sites in my final vcf file is:

16050822 16050933 16051347 16051497 16051556 16051968 16051994 16052080 16052167 16052169 16052239 16052357 etc.

First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.

I also got this warning 3 times when running GenotypeGVCFs:

WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used

How do I relax the requirement of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).

FYI, this is the command I used for GenotypeGVCFs: java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22 (only running this for chr22)
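The comparison described in this post can be made systematic with a set difference. The following is a minimal sketch in Python that uses only the positions quoted above; both lists end in "etc." in the post, so the result covers this initial window only, and the variable names are illustrative.

```python
# Positions quoted in the post (chr22). Both lists are truncated in the original.
gvcf_sites = {
    16050036, 16050612, 16050822, 16050933, 16051556, 16051968,
    16051994, 16052080, 16052167, 16052239, 16052250, 16052357,
}
final_vcf_sites = {
    16050822, 16050933, 16051347, 16051497, 16051556, 16051968,
    16051994, 16052080, 16052167, 16052169, 16052239, 16052357,
}

# Sites seen in the per-sample gVCF test file but absent from the final VCF.
missing_from_final = sorted(gvcf_sites - final_vcf_sites)
# Sites present in the final VCF but not in the filtered test file.
only_in_final = sorted(final_vcf_sites - gvcf_sites)

print("missing from final VCF:", missing_from_final)
print("only in final VCF:", only_in_final)
```

Listing the two differences explicitly recovers the sites that were highlighted in bold in the original post.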

Created 2014-12-05 15:27:21 | Updated 2014-12-05 18:08:01 | Tags: haplotypecaller best-practices vcf gatk gvcf

Hi,

I used HaplotypeCaller in GVCF mode to generate a single sample GVCF, but when I checked my vcf file I see that the reference allele is not showing up:

22  1   .   N   <NON_REF>   .   .   END=16050022    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
22 16050023    .   C   <NON_REF>   .   .   END=16050023    GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,37
22  16050024    .   A   <NON_REF>   .   .   END=16050026    GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,73
22  16050027    .   A   <NON_REF>   .   .   END=16050035    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,110
22  16050037    .   G   <NON_REF>   .   .   END=16050037    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,109
22  16050038    .   A   <NON_REF>   .   .   END=16050039    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,153

I am not sure where to start troubleshooting for this, since all the steps prior to using HaplotypeCaller did not generate any obvious errors.

The basic command that I used was: java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Have you encountered this problem before? Where should I start troubleshooting?

Thanks very much in advance, Alva
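Each line in the excerpt above is a reference block: REF holds only the base at POS, ALT is the symbolic <NON_REF> allele, and END in the INFO column marks the last position of the block. A minimal sketch for reading such lines, assuming whitespace-separated columns and a single sample column (`parse_block` is an illustrative helper, not part of GATK):

```python
# Reference-block lines copied from the post above (columns are whitespace-separated).
GVCF_LINES = """\
22 1 . N <NON_REF> . . END=16050022 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
22 16050023 . C <NON_REF> . . END=16050023 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,37
22 16050024 . A <NON_REF> . . END=16050026 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,73
22 16050027 . A <NON_REF> . . END=16050035 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,110
22 16050037 . G <NON_REF> . . END=16050037 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,109
22 16050038 . A <NON_REF> . . END=16050039 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,153
"""


def parse_block(line):
    """Parse one single-sample gVCF reference-block line into a small dict."""
    chrom, pos, _id, ref, _alt, _qual, _filter, info, fmt, sample = line.split()
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    start = int(pos)
    end = int(info_map.get("END", start))  # a block spans POS..END
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "first_ref_base": ref,  # REF shows only the base at POS
        "gt": sample_map["GT"],
        "dp": int(sample_map["DP"]),
        "gq": int(sample_map["GQ"]),
    }


blocks = [parse_block(line) for line in GVCF_LINES.splitlines()]
for b in blocks:
    length = b["end"] - b["start"] + 1
    print(f'{b["chrom"]}:{b["start"]}-{b["end"]} ({length} bp) '
          f'GT={b["gt"]} DP={b["dp"]} GQ={b["gq"]}')
```

Printed this way, every line in the excerpt reads as a hom-ref (0/0) interval with its own depth and GQ rather than as a single-base call.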

Created 2014-09-26 06:33:13 | Updated | Tags: haplotypecaller indel gvcf

I have WES data for 235 samples, aligned using bwa aln. I followed GATK's Best Practices for marking duplicates, realignment around indels, and BQSR until I got the recalibrated bamfiles. All of this was done with GATK 2.7-2. After that I generated gvcf files from the recalibrated bamfiles, using HaplotypeCaller from GATK 3.1-1, followed by GenotypeGVCFs and VQSR.

I came across one INDEL, which passes VQSR,

The variant is:

chr14 92537378 . G GC 11330.02 PASS AC=11;AF=0.023;AN=470;BaseQRankSum=0.991;ClippingRankSum=-6.820e-01;DP=13932;FS=0.000;GQ_MEAN=137.20;GQ_STDDEV=214.47;InbreedingCoeff=-0.0311;MLEAC=11;MLEAF=0.023;MQ=43.67;MQ0=0;MQRankSum=-1.073e+00;NCC=0;QD=18.16;ReadPosRankSum=-1.760e-01;VQSLOD=3.70;culprit=FS

Individual genotypes seem very good. To name a few: 1800 0/0:65,0:65:99:0,108,1620 0/0:86,0:86:99:0,120,1800 0/1:23,25:48:99:1073,0,1971 0/1:51,39:90:99:1505,0,4106

But when I checked the variants in IGV, something strange popped out.

The pictures (and complete post) can be viewed here: http://alittlebitofsilence.blogspot.hk/2014/09/to-be-solved-gatk-haplotypecaller-indel.html

Judging from the alignment visualized by IGV, there are no insertions at that site, but a SNP in the next position, in all samples. But HaplotypeCaller calls a G->GC insertion in some samples and not in others. My questions are:

1) In variant calling, HaplotypeCaller does a local de novo assembly in some regions. Is the inconsistency between the bamfiles and the gvcf because of the re-assembly?
2) Why is the insertion called in some samples but not others, although the position in all samples looks similar?
3) There are other variants called adjacent or near this site, but later filtered by VQSR. Does that indicate the variants from this region cannot be trusted?

chr14 92537353 . C CG 303342.41 VQSRTrancheINDEL0.00to90.00+
chr14 92537354 . C CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,CTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG 142809.26 VQSRTrancheINDEL0.00to90.00+
chr14 92537378 . G GC 11330.02 PASS
chr14 92537379 rs12896583 T TGCTGCTGCTGCTGC,C 13606.25 VQSRTrancheINDEL0.00to90.00+
chr14 92537385 rs141993435 CTTT C 314.64 VQSRTrancheINDEL0.00to90.00+
chr14 92537387 rs12896588 T G 4301.60 VQSRTrancheSNP99.90to100.00
chr14 92537388 rs12896589 T C 4300.82 VQSRTrancheSNP99.90to100.00
chr14 92537396 . G GC,GCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGC 4213.61 VQSRTrancheINDEL0.00to90.00+
chr14 92537397 . T TGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTGCTG,TGCTGCTGCTGCTG,TGCTGCTGCTG,TGCTGCTG 2690.79 VQSRTrancheINDEL0.00to90.00+

4) I used GATK 2.7 for processing before calling gvcf files. Is there any possibility that if I change to 3.2, the bamfiles would look more similar to the gvcf result? I have tried GATK 3.2-2 for generating gvcfs from the GATK 2.7 bamfiles, and the results are the same.

Created 2014-08-07 06:50:19 | Updated | Tags: haplotypecaller gvcf

Hello,

I ran HaplotypeCaller on a few projects. In the output gvcf file, the "Filter" field is always ".". From the gvcf header, I understand that Filter can also be "LowQual". From the documentation, I understand that it depends on the stand_emit_conf and stand_call_conf parameters. I used stand_emit_conf 10 and stand_call_conf 30. Do I understand this field correctly? When is this filter active?

Thank you, Maya

Created 2014-07-01 14:41:58 | Updated | Tags: haplotypecaller genotypegvcfs gvcf

I'm running HaplotypeCaller in GVCF mode, followed by GenotypeGVCF. For some individuals, I have more than one aligned bam file. Do I need to combine the individual's bam files in HaplotypeCaller to produce one GVCF, or can I produce multiple GVCFs per individual and combine them in GenotypeGVCF?

Thanks.

Created 2014-06-30 18:53:29 | Updated | Tags: haplotypecaller gvcf

Hello,

It seems the banded gvcf produced by HaplotypeCaller is missing many positions in the genome. We are not able to find half the specific positions we want to look at. To be clear, these positions are not even contained in any band, they are simply not there. For example, we were looking for position 866438 on chromosome 1 and these are the relevant lines in the gvcf:

1 866433 . G C,<NON_REF> 412.77 . BaseQRankSum=-1.085;DP=16;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.519;ReadPosRankSum=0.651 GT:AD:GQ:PL:SB 0/1:15,1,0:99:441,0,23572,546,23612,24157:0,0,0,0
1 866434 . G <NON_REF> . . END=866435 GT:DP:GQ:MIN_DP:PL 0/0:22:45:21:0,45,675
1 866436 . ACGTG A,<NON_REF> 403.73 . BaseQRankSum=-0.510;DP=17;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.714;ReadPosRankSum=0.434 GT:AD:GQ:PL:SB 0/1:16,1,0:99:441,0,23572,546,23612,24157:0,0,0,0
1 866441 . A <NON_REF> . . END=866441 GT:DP:GQ:MIN_DP:PL 0/0:21:0:21:0,0,402

We looked at the corresponding bam files and there seems to be good coverage for these positions. Moreover, if HaplotypeCaller is run in BP_RESOLUTION mode, then these positions are present in the resulting vcf. Any idea what might be going wrong?

Thanks,

Juber
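One way to check whether a position falls inside any of the quoted records is to compute the reference span of each one: a reference band covers POS..END, while a variant record covers POS..POS+len(REF)-1. The sketch below does that for the four records in this post. It is only an illustration: the INFO strings are abbreviated, `record_span` is a hypothetical helper, and the post itself does not confirm that this explains every missing position.

```python
def record_span(pos, ref, info):
    """Return (start, end) of the reference positions covered by one record."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    if "END" in fields:                 # reference band: POS..END
        return pos, int(fields["END"])
    return pos, pos + len(ref) - 1      # variant record: span of the REF allele


# (POS, REF, INFO) taken from the four records above; INFO is abbreviated.
records = [
    (866433, "G", "BaseQRankSum=-1.085;DP=16"),
    (866434, "G", "END=866435"),
    (866436, "ACGTG", "BaseQRankSum=-0.510;DP=17"),
    (866441, "A", "END=866441"),
]

target = 866438
covering = []
for pos, ref, info in records:
    start, end = record_span(pos, ref, info)
    if start <= target <= end:
        covering.append((pos, ref, start, end))

print(covering)
```

Under these assumptions, position 866438 lies inside the span of the ACGTG deletion record that starts at 866436, so it would not get a separate reference band.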

Created 2014-06-26 22:44:19 | Updated | Tags: haplotypecaller combinegvcfs gvcf

Hi, this is pretty much a feature request for something I think would be useful. I mentioned it briefly at the Brussels workshop and it seems like it might be possible.

In a couple of projects I'm involved in we have done low coverage (2-10x whole genome) exploratory sequencing for a large number of individuals (similar to 1K genomes, around 1,200 individuals between the two projects) and have recently processed these individuals using the new N+1 pipeline, generating gVCFs.

Now going forward we are adding additional sequence for a decent number of these individuals (from the same PCR free library) to improve genome coverage and the accuracy of genotypes (target 30x) in individuals and Trios of interest. We thus want to combine the new sequence (20x) with the older sequence (6-10x) to get as much coverage as possible. To do this I understand that currently I would need to rerun the GATK HaplotypeCaller on both the old and new BAMs at once, generating a new gVCF then track down the individuals in our previous combined gVCFs and remove them so I can Genotype the old gVCF minus the low coverage samples + the new gVCFs. Following that process I have to reprocess the old data multiple times and subset old combined gVCF files if new data comes in which is rather painful and computationally wasteful.

Ideally it would instead be possible to run GATK HaplotypeCaller just on the new sequence generating a second new gVCF that only has data for the new 20x coverage, then combine it somehow with the old gVCF merging the data from both the old and new gVCFs and resulting in a single final VCF record for this sample which has utilised the data from both the old and new gVCFs. I guess this could either be run as a separate tool to merge/combine old and new gVCFs or be done automatically by the GenotypeGVCFs tool.

This would also be useful from a work flow point of view, as we have limited computational resources and storage it's preferable that we process data as soon as it comes off the sequencer through to the gVCF stage to save space and allow us to archive the BAM files while keeping the gVCFs for when we run GenotypeGVCF on all the current data. At the moment I have to keep the BAMs for an individual in working space until I'm sure I've got all the sequence for that individual (and as mentioned above that can change in the future) then generate the gVCFs. Being able to flow sequence data through the cluster to gVCF stage as soon as it becomes available and then later merge the gVCFs when additional lanes are completed would make things a lot simpler from a resource management and pipeline design.

If this is possible it would be greatly appreciated if it could be implemented. Thanks!

Created 2014-06-19 13:50:30 | Updated | Tags: depthofcoverage gvcf

Dear all,

I have a large number of gVCF files, either single samples or combined in approximately 100 samples per combined gVCF file. I would like to compute something like the average depth for a set of regions of interest from the combined or single gVCF files. I can see that I could try to get a VCF output at every position, and use that to infer the percentage with a given read depth, but that seems mightily cumbersome.

So my question: is there an equivalent to the DepthOfCoverage GATK module that takes as input gVCF files? ideally combined gVCF and extract per sample average/median/minimum depth but otherwise I can work with single sample gVCF data.
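As a rough illustration of the kind of summary being asked for, the sketch below computes a length-weighted mean depth over a region directly from gVCF reference bands. It is not a GATK tool: `mean_depth` is a hypothetical helper, the bands are (start, end, DP) values taken from a single-sample chr22 excerpt quoted in another post on this page, and because the DP on a band summarises the whole band, the result only approximates per-base depth.

```python
def mean_depth(blocks, start, end):
    """Length-weighted mean DP over [start, end].

    blocks: iterable of (block_start, block_end, dp), 1-based inclusive.
    Returns (mean, covered_bases); bases with no record are left out of the mean.
    """
    covered = 0
    weighted = 0
    for b_start, b_end, dp in blocks:
        lo, hi = max(start, b_start), min(end, b_end)
        if lo <= hi:                    # block overlaps the region
            n = hi - lo + 1
            covered += n
            weighted += n * dp
    mean = weighted / covered if covered else None
    return mean, covered


# (start, end, DP) for five reference bands; 16050036 has no band in the excerpt.
blocks = [
    (16050023, 16050023, 1),
    (16050024, 16050026, 2),
    (16050027, 16050035, 3),
    (16050037, 16050037, 3),
    (16050038, 16050039, 4),
]

mean, covered = mean_depth(blocks, 16050023, 16050039)
print(f"mean depth {mean:.4f} over {covered} covered bases")
```

The same loop could track the minimum (from MIN_DP) or collect per-base values for a median, at the cost of expanding each band.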

Created 2014-06-04 17:34:15 | Updated | Tags: gvcf haplotype-caller

I used to run Haplotype Caller with stand_call_conf and --min_base_quality_score, but have shifted over to using Haplotype Caller to generate a gvcf (I use emitRefConfidence BP_RESOLUTION, as I need ALL sites), then GenotypeGVCFs to create the actual vcf.

The stand_call_conf and --min_base_quality_score options don't work in the GenotypeGVCFs tool. Is it not possible to use these options if I do the two-step process (HaplotypeCaller and GenotypeGVCFs)?

Created 2014-05-30 16:18:45 | Updated | Tags: haplotypecaller genotypegvcfs gvcf

Hi

For targeted re-sequenced data, can we do variant calling using HaplotypeCaller in gVCF mode followed by using GenotypeGVCFs for joint genotyping ?

Would the following best practices work for targeted re-sequencing as well, except for VQSR, as for targeted resequencing GATK advises another way of filtering? https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq

Thanks, Tinu

Created 2014-05-27 14:55:25 | Updated | Tags: gvcf

I see something like this in my gvcf file:

chr1 17697 . G C,<NON_REF> 72.77 . BaseQRankSum=0.322;ClippingRankSum=-1.517;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=40.00;MQ0=0;MQRankSum=0.322;ReadPosRankSum=0.956 GT:AD:DP:GQ:PL:SB 0/1:5,3,0:8:99:101,0,178,116,187,303:0,5,0,3

Apparently NON_REF was treated as an allele and was assigned a read depth (0 in this case) and genotype likelihoods in combination with the G and C alleles. It is very confusing!
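To see how the six PL values line up with three alleles, the sketch below enumerates diploid genotypes in the order defined by the VCF specification (genotype j/k with j <= k sits at index k(k+1)/2 + j) and pairs them with the values from the record above. It assumes the ALT column was C,<NON_REF>, since the symbolic allele appears to have been lost from the quoted line; `genotype_order` is an illustrative helper.

```python
def genotype_order(n_alleles):
    """Diploid genotype order used for PL in the VCF spec: 0/0, 0/1, 1/1, 0/2, ..."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


# Alleles assumed for the record at chr1:17697 (REF first, then ALT alleles).
alleles = ["G", "C", "<NON_REF>"]
ad = [5, 3, 0]                          # AD: one read count per allele
pl = [101, 0, 178, 116, 187, 303]       # PL: one value per possible genotype

depth_by_allele = dict(zip(alleles, ad))
pl_by_genotype = {
    f"{alleles[j]}/{alleles[k]}": value
    for (j, k), value in zip(genotype_order(len(alleles)), pl)
}

print(depth_by_allele)
for genotype, value in pl_by_genotype.items():
    print(genotype, value)
```

Read this way, the PL of 0 belongs to G/C, which matches the 0/1 genotype call, and the three genotypes involving <NON_REF> account for the last three PL values.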

Created 2014-05-20 09:18:54 | Updated | Tags: depthofcoverage haplotypecaller gvcf

Dear all,

I am interested in creating genome-wide gVCFs from various sources, including exome samples. The rationale is that it is not completely clear what the target region actually is all the time, and it would be good to keep some flexibility about what I capture.

Nevertheless, these single sample gVCF files can become quite large, but most of the lines in the file are used to store very low depth data ( DP <= 2) which is probably not critical for downstream analysis. I would be interested to have, in the HaplotypeCaller gVCF data generation process, an additional parameter that sets as missing regions with, say, 2 reads or less. That would considerably reduce the storage at little cost in terms of discarded information.

Does such a parameter exist already? Would it make sense at all to add this? I am not sure but right now it seems to me it might be useful.

Vincent

Created 2014-05-05 18:20:14 | Updated 2014-05-05 18:20:37 | Tags: gvcf

I'm looking at a trio (sick child and parents) of whole-genomes processed with HaplotypeCaller for GVCF reference-model etc. Command line parameters are identical, but the child's files are about twice as large. I'm not sure there's a problem, but it is concerning and I'm investigating what's going on.

The total number of lines is almost double for each genomic region, but the increase is only in the hom-ref GVCF spans. The mean length of a GVCF record in the child is 30 nucleotides, and 54 in the parents. Total sequencing depth is the same, so I guess this means the reference quality is more variable in the child. They do seem to have been processed differently on the wet-lab side, and we're investigating that too. I just wanted to ask the community if anyone has seen something like this. It was very alarming initially to see a trio with final VCF data sizes of 4.5GB, 5.5GB, and 8.5GB from the same sequencing run, but I'm happy to learn the number of heterozygous sites is similar among samples, and this is just the 0/0 sites being inflated. What could that mean?

Created 2014-04-25 19:25:41 | Updated | Tags: haplotypecaller gvcf

Hello everyone!

I am using GATK as a new pipeline for our laboratory. I'm getting some results from HaplotypeCaller that I have some questions about.

In particular, I'm wondering why I have this number of comma-delimited values in the AD field.

Here is an example output: GT:AD:DP:GQ:PL:SB 0/1:19,3,0:22:26:26,0,625,83,634,717:10,9,2,1

I understand this means I have a heterozygous call at this position (0/1). But I'm wondering why I have 3 values in the AD field (19, 3, 0). My reading of the samtools output suggests that there could be 4 outputs (forward ref, reverse ref, forward alt, and reverse alt). Why do I only have 3 results?
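The sample string in this post can be unpacked field by field to see where each count lives. AD carries one value per allele at the site (REF first, then each ALT; in a gVCF the last ALT is the symbolic <NON_REF> allele), which is why there are three values here. The sketch below also splits SB, assuming its four values are ordered ref-forward, ref-reverse, alt-forward, alt-reverse; that ordering and the allele labels are assumptions for illustration.

```python
fmt = "GT:AD:DP:GQ:PL:SB"
sample = "0/1:19,3,0:22:26:26,0,625,83,634,717:10,9,2,1"

# Pair each FORMAT key with its value for this sample.
fields = dict(zip(fmt.split(":"), sample.split(":")))
ad = [int(x) for x in fields["AD"].split(",")]
sb = [int(x) for x in fields["SB"].split(",")]

# Labels are placeholders; the post does not show the REF/ALT columns.
alleles = ["REF", "ALT", "<NON_REF>"]
ad_by_allele = dict(zip(alleles, ad))

# Assumed SB order: ref forward, ref reverse, alt forward, alt reverse.
ref_fwd, ref_rev, alt_fwd, alt_rev = sb

print(ad_by_allele)
print("ref reads by strand:", ref_fwd, ref_rev, "total", ref_fwd + ref_rev)
print("alt reads by strand:", alt_fwd, alt_rev, "total", alt_fwd + alt_rev)
```

With that ordering, the strand-specific counts the poster expected are in SB, and they add up to the per-allele totals in AD (19 reference reads and 3 alternate reads).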

Created 2014-03-18 22:26:05 | Updated | Tags: haplotypecaller bam multiple-inputs gvcf

i have been using HaplotypeCaller in gVCF mode on a cohort of 830 samples spread over 2450 bams. the number of bams per sample varies from 1-4. for samples with <=3 bams, the routine works perfectly. but for samples with 4 bams, the jobs always crash and I receive the error:

ERROR MESSAGE: Invalid command line: Argument emitRefConfidence has a bad value: Can only be used in single sample mode currently

is this a bug? are there any options i can use to avoid this error. i suppose it is possible that there is an issue with my bams, but it seems odd that the error occurs systematically with 4 bam samples and never for samples with 3 or fewer bams.

thanks for any help!

Created 2014-03-07 15:10:33 | Updated | Tags: gvcf incrementaljvd

I was wondering if you had any guidance on a practical number of gVCFs to use.

I work primarily in a service context, so the experiments I work with range in size from about 2 samples up to a few hundred. In the past, I've put the smaller ones together into ad hoc "cohorts" of 30-50 samples so that I could use joint calling. I could continue with that same model here, but it seems that I could also maintain a running gVCF of every sample I've ever processed (stratified by genome and capture method and vetted for basic QC first, of course). This master set would reach into the hundreds of samples pretty quickly - probably by the end of the month - and could be into the thousands by the end of the year.

Does this seem like a good idea? Have you encountered a practical upper limit on number of samples, or do you have a general feel of "You're wasting your time above around N samples"?