Tagged with #variant calling
0 documentation articles | 0 announcements | 34 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.

Created 2015-09-17 01:02:25 | Updated 2015-09-17 01:03:59 | Tags: snp vcf
Comments (3)

I am dealing with my data using vcftools, in the generated vcf file, I found that when DP=0, why it stills give a Genotype, like 0/1 0/0 or something? I attached a pic, a screenshot from a vcf file. there are 2 lines showed DP=0,and still have a GT value! The question bothered me for a long time, HELP!!!

Created 2015-09-14 16:27:46 | Updated | Tags: haplotypecaller vcf bam rnaseq
Comments (2)


I'm using GATK to call variants in my RNA-Seq data. I'm noticing something strange, perhaps someone can help? For a number of sites the VCF is reporting things I cannot replicate from BAMs. How can I recover the reads that contribute to a variant call? Here is an example for 1 site in 1 sample, but I've observed this at many sites/samples:

$ grep 235068463 file.vcf chr1 235068463 . T C 1795.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.530;ClippingRankSum=-0.535;DP=60;FS=7.844;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.401;QD=29.93;ReadPosRankSum=3.557 GT:AD:DP:GQ:PL 0/1:5,55:60:44:1824,0,44 60 reads, 5 T, 55 C.
But loading the bam in IGV, I do not see any T reads. Similarly: $ samtools view -uh file.md.realn.bam chr1:235068463-235068463 |samtools mpileup - |grep 235068463 [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 chr1 235068463 N 60 cCCccccCCCcccccCcccccccccCCCccCCCCCcCcccccCCCcCcCCccCCCCccCC >CA@B@>A>BA@BCABACCC:@@ACABBBCAACBBCABCB@CABBAB?>A?CBBAAAABA There are just 60 C's at that location. How do I decide what the genotype here is? C/C or C/T ?

For methodology I'm using gatk/3.2.0. I tried using HC from gatk/3.3.1 and got the same result. The bam and vcf files come from the final two lines: -2 pass STAR -Mark Dups -SplitNCigarReads -RealignerTargetCreator -IndelRealigner -BaseRecalibrator -PrintReads -MergeSamFiles.jar -Mark Dups -RealignerTargetCreator -IndelRealigner -HaplotyeCaller

Thanks, Kipp

Created 2015-05-29 16:00:04 | Updated | Tags: haplotypecaller
Comments (5)

Hello, I am trying to run HaplotypeCaller on my processed bam file but it keeps running out of memory. I have tried increasing the heap size to 90G but it still crash. This might have to do with the type of analysis I am doing... The sample are pool of pigs (6 individual so a ploidy of 12 for this particular sample) that have been sequenced on targeted regions, I use the bed file that has been given with the kit to narrow done the calling to the targeted regions. I have also reduce the number of alternative allele from 6 to 3. But it still crash after a while. Is there any other parameters I should try to modify to reduce the memory usage? I have attached the log file if you want to have a look at all the parameters. Cheers,


Created 2015-03-29 18:00:14 | Updated | Tags: best-practices snp vcf gatk error
Comments (12)


I have discovered some unusual calls in my VCF file after using HaplotypeCaller. I am using version 3.3 of GATK. I applied VQSR as well as the genotype refinement workflow (CalculateGenotypePosteriors, etc.) to refine the calls, but the unusual calls did not get removed. I also calculated the number of Mendelian error just in the biallelic SNPs in my final VCF file (using PLINK) and found unusually high percentage for each of the 3 families I am studying: 0.153%, 0.167%, and 0.25%. The percentage of triallelic SNPs is also very high: 0.111%. Why are the error rates so high?

I used the following commands to call the variants and generate the initial VCF file:

HaplotypeCaller (generate gvcf files for each individual for each chromosome

java -Xmx1g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_${ROOT}.bam -o ${outpath}raw_${ROOT}.vcf --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

GenotypeGVCFs (generate vcf files for each chr)

java -Xmx1g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs -V vcfs.chr${numchr}.new.list -o final_chr${numchr}.vcf -L ${numchr}

CatVariants (generate 1 vcf file with all inds and all chrs)

java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted


java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches

Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches

Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf

Genotype Refinement Workflow

java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf

Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS)

The entire example is below:

1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0

Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high?

Thank you very much in advance, Alva

Created 2015-03-20 05:19:22 | Updated | Tags: gatk
Comments (3)


If I get it right, anomalous read pairs refer to those where only one read is mapped. I wonder whether GATK would ever use these reads to do the calling.

Thank you!

Created 2015-03-11 20:38:44 | Updated | Tags: vqsr best-practices variantfiltration
Comments (3)

Hi all - I'm stumped and need your help. I'm following the GATK best practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 others samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but are not listed in the VCF. I discovered that the variant is being filtered out..reason being VQSRTranchesSNP99.00to99.90. The genotype is homozygous variant, DP is 243, Qual is 524742.54 and its known in dbSNP. I suspect this is happening to other variants.

How do I adjust VQSR or how tranches are used and variants get placed in? I supposed I need to fine tune my parameters...but I would think something as obvious as this variant would pass Filtering.

Created 2015-01-23 16:55:57 | Updated | Tags: vqsr haplotypecaller bam gatk genotypegvcfs
Comments (2)


I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Thanks very much in advance,


Created 2014-12-09 21:47:52 | Updated | Tags: vcf genotypegvcfs gvcf
Comments (17)


I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:

16050036 16050612 16050822 16050933 16051556 16051968 16051994 16052080 16052167 16052239 16052250 16052357 etc.

whereas the initial set of sites in my final vcf file is:

16050822 16050933 16051347 16051497 16051556 16051968 16051994 16052080 16052167 16052169 16052239 16052357 etc.

First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.

I also got this warning 3 times when running GenotypeGVCFs: WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).

FYI, this is the command I used for GenotypeGVCFs: java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22 (only running this for chr22)

Thank you in advance, Alva

Created 2014-12-03 16:57:35 | Updated | Tags: haplotypecaller genotypegvcfs
Comments (2)

If I want the variants to be called only if they fit the following criteria:

1) Min. total coverage for consideration of heterozygous is 10.

2) Min. coverage of each of the two observed major basecalls to be called heterozygous is 5.

3) Min. percentage of each of the two observed major basecalls in order to be called heterozygous is 20.

4) Min. coverage in order for a position to be called homozygous is 6.

which command-line arguments in which tools (HaplotpyeCaller or GenotypeGVCFs) can I use to accomplish these? I cannot seem to find the proper arguments in the documentation. I apologize if I overlook.

Thank you

Created 2014-11-19 04:35:45 | Updated | Tags:
Comments (4)

I used GATK 3.3 for calling SNPs for my exome data. I also restricted my SNPs to be called specifying a BED file. When I annotated the vcf file i get lot of SNPs as intronic. What can be its possible reason?

Created 2014-09-15 21:28:59 | Updated | Tags: indelrealigner unifiedgenotyper indels gatk
Comments (3)

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.

Created 2014-08-14 19:37:20 | Updated | Tags: best-practices rnaseq
Comments (2)


Thank you for providing guidelines on RNA-Seq variant discovery. For our data, we are currently playing with multiple mapping methods and have noticed that 2-step alignments work "better" than 1-step alignments. By 2-step alignments, I mean using STAR as step1 and then take the unmapped from this and use another aligner (like Bowtie2) for alignment. If I use such a methodology, will there be an issue in variant calling when during splitting cigar strings I ask it convert the 255 MAPQ to another value (like 60 in the best practices example), since bowtie2 gives different MAPQ scores. Sorry if this seems like a stupid question, but I am just a little curious how such a thing might affect the variant calls. Any insights/comment on this will be greatly appreciated.


Created 2014-08-11 09:50:56 | Updated | Tags: unifiedgenotyper vcf qual
Comments (1)

I used the EMIT_ALL_SITES option with Unified Genotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. But for monomorphic sites, (at this site, we have an homozygote for the reference allele), what is the definition of the quality score ? and how is it computed ? Many thanks for your explanation.

Created 2014-05-05 16:45:37 | Updated | Tags: haplotypecaller error
Comments (4)

I'm having trouble running the haplotypecaller on a few of my samples. Before I go full scale on all of my samples I am attempting the pipeline on a small number of my samples. When I try running haplotypecaller on my samples individually with the command below I get a stack trace error (also below). Usually I can figure out the stack trace errors and correct them, but I am having trouble with this one. With this error, the program runs for a time then errors out (sometime it runs longer then other times). It doesn't occur for every sample and when it does occur, if I re-run that same individual and same command again it sometimes goes to completion without errors. Other times I might have to run that individual/command multiple times before I can get it to complete without errors. Any insight would be greatly appreciated.

Command: java -Xmx16g -jar ~/programs/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../miliaris_ref.fa -I realigned/27861_realigned.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --max_alternate_alleles 2 -mbq 30 -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 4.0 -stand_emit_conf 3.0 -o rawSNPS_Q4/27861_rawSNPS_Q4.vcf -nct 8 -A HaplotypeScore -A FisherStrand -A BaseQualityRankSumTest -A MappingQualityRankSumTest -A ReadPosRankSumTest -A QualByDepth -A VariantType -A LowMQ

Error: . . . INFO 12:36:15,578 ProgressMeter - comp45464_c0_seq1:533 2.82e+06 2.5 m 53.0 s 7.1% 35.2 m 32.7 m INFO 12:36:45,579 ProgressMeter - comp45609_c0_seq1:269 2.84e+06 3.0 m 63.0 s 7.2% 41.9 m 38.9 m INFO 12:37:15,580 ProgressMeter - comp48164_c0_seq1:493 3.09e+06 3.5 m 67.0 s 7.8% 44.9 m 41.4 m WARN 12:37:24,038 ExactAFCalc - this tool is currently set to genotype at most 2 alternate alleles in a given context, but the context at comp48440_c0_seq1:357 has 6 alter$ INFO 12:37:45,581 ProgressMeter - comp53167_c0_seq2:450 3.67e+06 4.0 m 65.0 s 9.2% 43.3 m 39.3 m

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException at java.lang.String.checkBounds(String.java:374) at java.lang.String.(String.java:314) at net.sf.samtools.util.StringUtil.bytesToString(StringUtil.java:301) at net.sf.samtools.BAMRecord.decodeReadName(BAMRecord.java:331) at net.sf.samtools.BAMRecord.getReadName(BAMRecord.java:220) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.readthreading.ReadThreadingGraph.addRead(ReadThreadingGraph.java:543) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.createGraph(ReadThreadingAssembler.java:163) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.assemble(ReadThreadingAssembler.java:112) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.LocalAssemblyEngine.runLocalAssembly(LocalAssemblyEngine.java:168) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.assembleReads(HaplotypeCaller.java:961) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:825) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk

Created 2014-04-16 14:11:37 | Updated | Tags: unifiedgenotyper haplotypecaller bug gatk
Comments (12)

HI, I'd like to report a weird result from HaplotypeCaller.

We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.

However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected)

Please see the VCF table:

There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.

Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype.

Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter.

Many Thanks


Created 2014-03-12 08:28:47 | Updated | Tags: unifiedgenotyper gatk
Comments (7)

Hi, we observe an unexpected deletion call (GATK UnifiedGenotyper 3.0 and 2.5) in a setup with three samples called together. In the file call.txt you find the call. From the pileup (pileup.txt, position 19:57325555) we would expect, that the indel would only be called for the first sample. Is there another source of information, that makes GATK believe, that the deletion also occurs in the second and third sample?

Thanks in advance, Johannes

Created 2014-02-13 14:30:47 | Updated | Tags:
Comments (13)

Dear GATK Team,

I have encountered a problem while doing a variant calling. I received raw reads from sequencing company and wanted to do some analysis myself. I did everything according to GATK best practices and everything was fine until I looked at VCF file for called variants and I did not find the variant which was reported to me from sequencing company as a disease causing mutation and therefore was the only one I especially searched for. It is a 1bp heterozygous deletion, which can be seen in IGV picture attached to this post. I have tried to change parameters for both HaplotypeCaller and UnifiedGenotyper, but nothing has made them to call this indel.

The last command I used was:

java -jar ~/NGS_programs/GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ~/NGS_data/hg19/ucsc.hg19.fasta -I KO002_P.merged.dedup.realn.bam -stand_call_conf 20 -stand_emit_conf 1 -L chr6:157097064-157533913 -o ARID1B/ARID1B_UG_emit1_call20_INDEL.vcf -glm INDEL

I also have tried lowering the -minIndelCnt, -minIndelFrac arguments, and maximized the --max_deletion_fraction while using UG. But nothing helped.

Do you have any ideas what could be the problem?

Thank you in advance!


Created 2014-02-12 20:31:19 | Updated | Tags: variantrecalibrator workflow multi-sample
Comments (1)

Hi there,

We are sequencing a set of regions that covers about 1.5 megabases in total. We're running into problems with VQSR -- VariantRecalibrator says there are too few variants to do recalibration. To give a sense of numbers, in one sample we have about 3000 SNVs and 600 indels.

We seem to have too few indels to do VQSR on them and have a couple of questions:

  1. Can we combine multiple samples to increase the number of variants, or does VariantRecalibrator need to work on each sample individually?

  2. If we do not use VQSR for indels, should we also avoid VQSR for the SNPs?

  3. The other question is whether joint or batch variant calling across several samples would help us in this case?

Thanks in advance!

Created 2014-02-10 19:57:41 | Updated 2014-02-10 20:24:46 | Tags: variantannotator variantstovcf variantfiltration vcf
Comments (5)


I have annotated my vcf file of 20 samples from Unified genotyper using the following steps.

Unified genotyper->Variantrecalibration->Applyrecalibration->VariantAnnotator

My question is how should I proceed if I have to select rare variants (MAF<1%) for the candidate genes that I have,for each of these 20 samples?

Created 2014-01-31 13:44:10 | Updated | Tags: unifiedgenotyper
Comments (2)

Hi, I encounter a strange error when using UnifiedGenotyper, The command that I use is: /usr/bin/java -Xmx4g -jar /home/shengyu_ni/bin/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /mnt/genotyping/sendru/human_g1k_v37.fasta -I bqsr.bam -o haplogroup.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -alleles /mnt/genotyping/sendru/isogg.vcf -nt 4 -ploidy 1 -L /mnt/genotyping/sendru/comprey.interval_list

and the error mesage is:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -1 at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoods.subsetToAlleles(GeneralPloidyGenotypeLikelihoods.java:380) at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:294) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR ------------------------------------------------------------------------------------------

I already notice that it can be some compatible problem with my file: isogg.vcf, because when I replace that file with another vcf file, it runs successfully, but I can not spot what makes the difference. so I list the header of the vcf file as the following,

additionally, this vcf file works fine with realignment







INFO=<ID=ISOGG,Number=0,Type=Flag,Description="Used as y chromosome haplogroup in ISOGG. ">

FILTER=<ID=PASS,Description="Only pass snp is kept, the other possibles are private and investigation. ">






















































































Y 2649696 M236 G T . PASS ISOGG Y 2655180 M176 C G . PASS ISOGG Y 2656127 Z12426 G A . PASS ISOGG Y 2656959 L1233 G C . PASS ISOGG

Thank you.

Created 2014-01-13 11:55:08 | Updated | Tags: vqsr haplotypecaller exome
Comments (6)

We are running GATK HaplotypeCaller on ~50 whole exome samples. We are interested in rare variants - so we ran GATK in single sample mode instead of multi sample as you recommend, however we would like to take advantage of VQSR. What would you recommend? Can we run VQSR on the output from GATK single sample?

Additionally, we are likely to run extra batches of new exome samples. Should we wait until we have them all before running them through the GATK pipeline?

Many thanks in advance.

Created 2014-01-10 15:39:54 | Updated 2014-01-10 16:16:06 | Tags: ad dp
Comments (1)


I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.

GT:AD:DP:GQ:PL 0/1:52,37:88:99:1078,0,1486

However, after select variants to exclude non variant and variants not passing Filter, the AD changed and eliminated the alternative reads though the DP remained unchanged.

GT:AD:DP:GQ:PL 0/1:51,0:88:99:0,99,1283

I think I recall another post having a similar issue due to multithreaded use of select variants


APologies for not commenting on this post instead as I had already posted this prior to seeing the other post!



Created 2013-09-04 14:27:42 | Updated 2013-09-04 14:46:07 | Tags: haplotypecaller
Comments (7)

Looking closely at our vcf file produced by HaplotypeCaller we noticed disagreement between PL values and GT in some variants. For example.

1   762589  rs71507461  G   C   21714.69    PASS    [clipped]   GT:AD:DP:GQ:PL  1/1:0,33:33:99:1464,99,0    **0/1**:32,38:70:99:**0,9,2274**    **0/1**:78,42:120:99:**323,30,0**   **1/1**:11,5:96:99:**123,0,324**    1/1:2,84:86:99:3923,271,0   1/1:2,104:106:99:4945,348,0
1   762592  rs71507462  C   G   21714.69    PASS    [clipped]   GT:AD:DP:GQ:PL  1/1:0,60:32:99:1464,99,0    0/1:586,0:67:99:1471,0,2274 0/1:77,42:119:99:1667,0,6152    1/1:1,95:96:99:4462,310,0   1/1:2,85:87:99:3923,271,0   1/1:2,101:103:99:4945,348,0
1   762601  rs71507463  T   C   21714.69    PASS    [clipped]   GT:AD:DP:GQ:PL  1/1:1675,144:32:99:1464,99,0    0/1:30,38:68:99:1471,0,2274 0/1:79,42:121:99:1667,0,495 1/1:20,24:90:99:4462,0,476  1/1:2,83:85:99:3923,271,0   1/1:3,107:110:99:4945,348,0

What is the relationship between GT and PL - I initially thought PL determined GT? What other factors go into GATK deciding on a particular GT? For example, does GATK take into account the GT of nearby variants to determine the GT of an individual variant?

Thank you in advance for your help.


Created 2013-08-20 15:43:47 | Updated | Tags: annotation interpretation
Comments (1)

I have been trying to find documentation for understanding the annotations and numbers in the VCF file. Something like below, can someone guide me how to understand/interpret the numbers? How is the quality of the variant calling for this particular indel?

AC=2; AF=0.100; AN=20; BaseQRankSum=6.161; ClippingRankSum=-2.034; DP=313; FS=5.381; InbreedingCoeff=-0.1180; MLEAC=2; MLEAF=0.100; MQ=58.49; MQ0=0; MQRankSum=-0.456; QD=1.46; ReadPosRankSum=-4.442; VQSLOD=0.348; topculprit=ReadPosRankSum

Created 2013-08-12 08:53:33 | Updated | Tags:
Comments (3)

i have sequenced two exomes from two affected individuals from the same family. i have called their variants with the recommended UnifiedGenotyper protocol for each sample individually (was done a year and half ago..), and then did theintersection to find teh shared ones. i was wondering, if you recommend calling them at one step, togther? is there any added value for doing so (in term of reducing false positives, accuracy etc..).

Created 2013-07-19 06:44:49 | Updated | Tags: haplotypecaller hg19
Comments (3)

Hi, I’m trying to use GATK for variant calling. I have had some problems preparing the hg19 reference genome. I haven’t done the alignment myself, but gotten the .bam files already done. I how ever know that this should be the same reference used for the alignment. I have downloaded the hg19 from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz and downloaded the new version of mitochondrial genome (NC_012920.1). and then tried: cat chr*.fa > hg19.fa

After this I have followed the instructions on (how to) Prepare a reference for use with BWA and GATK.

When I try to call the variants using HaplotypeCaller (as instructed on: (howto) Call variants on a diploid genome with the HaplotypeCaller):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I 8526RU.rmdup.bam -L 20 --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY -stand_emit_conf 10 -stand_call_conf 30 -o raw_variants8526RU.vcf

I get the error message: “Badly formed genome loc: Contig '20' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?” Can you tell me what the problem is? And how to fix this? I know there have been some similar questions considering the contigs, but I haven’t been able to solve the problem based on them.

Thank you, Sini

Created 2013-07-17 08:53:06 | Updated 2013-07-17 09:49:26 | Tags: haplotypecaller
Comments (2)

Hi there,

I'm starting to use use the HaplotypeCaller to identify variants in my exome projects, but I was wondering how it initially determines if a region has the potential to be variable. I couldn't find any useful documentation, online...so could anyone of you help me to basically understand this first step?

Thank you in advance.

Created 2013-07-02 06:38:26 | Updated | Tags: gatk2 vcf pipeline bug
Comments (9)

Hello all,

We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.

We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:

scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.

In this case, individual 3 has a homozygous reference genotype, despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous.Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.

A reverse example:

scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.

Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.

I've also uploaded screenshots of IGV if that I hope will help clarify the problems we're having. We have used default parameters of GATK in almost all cases, and we did not used VQSR, as we did not have a list of high confidence SNPs at the time.

Created 2013-06-26 16:28:06 | Updated | Tags: unifiedgenotyper vcf bam multi-sample
Comments (5)

we are running tests trying to get UG to produce 1 vcf per sample when inputting bams from multiple subjects. our situation is complicated slightly by the fact that each sample has 3 bams. when we input all 6 bams into UG, hoping to output 2 vcfs (1 per sample) we instead get a single vcf. we found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.

details include: 1) we are inputting 6 bams for our test, 3 per sample for 2 samples. 2) bams were generated using Bioscope from targeted capture reads sequenced on a Solid 4. 3) as recommended in the post above we checked out the @RG statements in the bam headers using Samtools -- lines for the 6 bams are as follows:

sample 1:

@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1

@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1

@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1

sample 2:

@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1

@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1

@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1

Based on the former post, I would have expected each of these bams to generate a separate vcf as it appears the ids are all different (which would not have been desirable either, as we are hoping to generate 2 vcfs in this test). Thus, it is not clear if/how we should use Picard tool AddOrReplaceReadGroups to modify the @RG headers?

Does that make sense? Any advice?

Created 2013-06-03 22:42:48 | Updated 2013-06-03 22:44:22 | Tags: gatkpapergenotyper bowtie
Comments (6)


I'm trying to call variants from bowtie-aligned reads, I used PrintReads with ReassignMappingQuality filter to give all reads a mapping score of 60 to replace default value of 255. However, I'm wondering if this assignment would introduce any bias in variant calling.

Thanks a lot!

Created 2013-05-09 13:49:37 | Updated | Tags: unifiedgenotyper haplotypecaller
Comments (5)


I am adapting a two-year old pipeline, which includes UnifiedGenotyper to call variants. I would like to update this to the HaplotypeCaller. In the command for the UnifiedGenotyper there is an option --metrics_file. I cannot find reference to this in the GATK documentation - do you know what it is and whether I can use it with HaplotypeCaller as well?

Also, do I understand correctly that using HaplotypeCaller to call variants means you don't have to carry out the local realignment around indels step prior to this?

Thanks very much,


Created 2013-04-26 08:38:39 | Updated | Tags: unifiedgenotyper
Comments (3)

Hi there,

I just had a look at the code of the UnifiedGenotyper and how its Variant Calling algorithm is implemented (very well documented by the way :) . But now I wonder how GATK reads in the SAM file and finds out where the differences to the reference (SNPs, Indels) are that are then examined in the Variant Calling. I only see that the UnifiedGenotyper gets a set of alleles, but not where the alleles are actually determined. I have also found out that GATK is using Samtools for parsing SAM files, but have not found the point where the actual reads are parsed and processed (e.g., by using the CIGAR string). Are you maybe doing a local realignment before the actual Variant Calling from which you get the alleles?

I would appreciate if you could guide me to the right direction where this happens in the code.

Best regards,


Created 2013-04-22 14:51:02 | Updated | Tags: unifiedgenotyper trio exome
Comments (2)


I have exome data for a few sets of parent-offspring trios, in which offspring have phenotypically related but probably genetically different diseases. Their parents are unaffected so I am particularly interested in identifying de novo mutations. My plan was to analyse each individual separately up to the variant calling phase and then to input three (mother, father, child) analysis-ready BAMs into the UnifiedGenotyper along with a ped file. My questions are:

1) Can you tell me whether the UnifiedGenotyper uses pedigree information in the ped file to call genotypes more accurately? In other words, is this better than calling variants jointly without supplying the ped file? If not, does GATK recommend any external tools for doing this step?

2) It is better to call variants jointly using all the trios (even though they are not related and probably don't share the same disease-causing mutations)?

Best wishes,


Created 2013-03-04 15:05:36 | Updated | Tags: unifiedgenotyper snp
Comments (1)

Hi to all, I ran UnifiedGenotyper of three exome seq samples and phased with familial pedigree. During the manual filtration I saw several inconsistency. For example I get this output from UnifiedGenotyper and phasing:

chr2 38525498 rs76204302 T G 66.79 . AC=2;AF=0.333;AN=6;BaseQRankSum=-7.191;DB;DP=180;Dels=0.00;FS=208.951;HaplotypeScore=13.8978;MLEAC=2;MLEAF=0.333;MQ=60.00;MQ0=0;MQRankSum=2.030;QD=0.52;ReadPosRankSum=1.325;SB=-2.263e+00 GT:AD:DP:GQ:PL 0/0:30,22:52:39:0,39,729 0/1:9,7:16:5:5,0,280 0/1:55,57:112:98:98,0,1169

The order of samples are father, mother and son.How is possible that the father with, respectively 30 bases REF and 22 bases VAR is called 0/0? Thanks