
The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation.

GATK variant manipulation tools are able to recognize the following types of alleles:

• SNP (single nucleotide polymorphism)
• INDEL (insertion/deletion)
• MIXED (combination of SNPs and indels at a single position)
• MNP (multi-nucleotide polymorphism, e.g. a dinucleotide substitution)
• SYMBOLIC (generally, a very large allele or one that's fuzzy and not fully modeled; i.e. there's some event going on here but we don't know what exactly)

Of our two variant callers, UnifiedGenotyper is the more limited, as it only calls SNPs and indels, and does so separately (even if you run in calling mode BOTH, the program performs separate calling operations internally). The HaplotypeCaller is more sophisticated and calls different types of variants at the same time. So in addition to SNPs and indels, it is capable of emitting mixed records by default. It is also capable of emitting MNPs and symbolic alleles, but the modes to do so are not enabled by default and they are not part of our recommended best practices for the tool.
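As an illustration of those categories, here is a minimal sketch of how a REF/ALT pair might be classified. This is not GATK's actual logic; real VCF alleles share padding bases, which GATK trims before classifying, while this sketch just compares raw lengths.

```python
# Illustrative only: a simplified classifier for the allele types listed above.
def classify(ref, alt):
    if alt.startswith("<") and alt.endswith(">"):
        return "SYMBOLIC"                 # e.g. <DEL>, a large/fuzzy event
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"                      # e.g. a dinucleotide substitution
    return "INDEL"

def classify_record(ref, alts):
    # A record mixing allele types (e.g. a SNP and an indel) is MIXED.
    types = {classify(ref, a) for a in alts}
    return types.pop() if len(types) == 1 else "MIXED"

print(classify_record("A", ["G"]))          # SNP
print(classify_record("AT", ["GC"]))        # MNP
print(classify_record("ATG", ["A"]))        # INDEL
print(classify_record("AT", ["A", "GT"]))   # MIXED (one indel, one MNP allele)
```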

The GATK currently does not handle SVs (structural variations) or CNVs (copy number variations), but there are some third-party software packages built on top of GATK that provide this functionality. See GenomeSTRiP for SVs and XHMM for CNVs.

Sometimes a variant call you expect is not made. You may expect a call based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possible explanations, including the possibility that GATK is missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience the problem most often lies elsewhere. So before you post this issue in our support forum, please follow these troubleshooting guidelines, which will hopefully help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, the original bam file is on top, and on the bottom is the bam file after reassembly. In this case there seem to be many SNPs present; after reassembly, however, we find there is really a large deletion!

2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.
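The two points above can be shown with a toy pileup (the threshold value here is made up for illustration, not GATK's actual default):

```python
# Sketch: why the depth the caller actually uses can be lower than the raw
# pileup depth, and why low-quality ALT support can vanish entirely.
MIN_BASE_QUALITY = 17  # illustrative threshold, not GATK's exact default

pileup = [  # (base, base quality) observed at the site of interest
    ("A", 30), ("A", 25), ("T", 8), ("T", 6), ("T", 5), ("A", 31),
]

raw_depth = len(pileup)
usable = [(b, q) for b, q in pileup if q >= MIN_BASE_QUALITY]
alt_support = sum(1 for b, q in usable if b == "T")

print("raw depth:", raw_depth)        # 6 -- what you see in a genome browser
print("usable depth:", len(usable))   # 3 -- what the caller actually works with
print("usable ALT support:", alt_support)  # 0 -- the "variant" disappears
```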

3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
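Both rules can be sketched as follows (illustrative, not GATK code):

```python
# Sketch: a base's effective quality is capped by the mapping quality of its
# read, and reads with mapping quality 255 ("unknown") are dropped entirely.
def effective_qual(base_qual, mapping_qual):
    if mapping_qual == 255:       # mapping quality unknown -> read is ignored
        return None
    return min(base_qual, mapping_qual)

print(effective_qual(35, 60))   # 35  -- well-mapped read, base quality stands
print(effective_qual(35, 10))   # 10  -- poorly mapped read caps the base
print(effective_qual(35, 255))  # None -- read ignored altogether
```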

4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This limit can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page for the default value). Note however that genotyping sites with many alternate alleles is computationally expensive, and the cost grows combinatorially with the number of alternate alleles, so raising the limit will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.
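To see how the genotype space grows, note that the number of candidate diploid genotypes at a site with A alternate alleles is C(A+2, 2), and the likelihood work grows accordingly. A quick sketch:

```python
from math import comb

# Sketch: counting candidate diploid genotypes as alternate alleles are added.
def n_diploid_genotypes(n_alt_alleles):
    n_alleles = n_alt_alleles + 1       # the alternates plus the reference
    return comb(n_alleles + 1, 2)       # unordered pairs with repetition

for a in (1, 2, 6, 10, 20):
    print(a, "alt alleles ->", n_diploid_genotypes(a), "genotypes")
# 1 alt allele -> 3 genotypes; 6 -> 28; 20 -> 231
```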

5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to show reference bias, which can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site, you are probably seeing false positives.


Hi,

I have discovered some unusual calls in my VCF file after using HaplotypeCaller. I am using version 3.3 of GATK. I applied VQSR as well as the genotype refinement workflow (CalculateGenotypePosteriors, etc.) to refine the calls, but the unusual calls did not get removed. I also calculated the number of Mendelian errors just in the biallelic SNPs in my final VCF file (using PLINK) and found an unusually high percentage for each of the 3 families I am studying: 0.153%, 0.167%, and 0.25%. The percentage of triallelic SNPs is also very high: 0.111%. Why are the error rates so high?
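For reference, the consistency check behind such a Mendelian-error count can be sketched as follows (a toy version for biallelic sites; PLINK's actual accounting differs in its details):

```python
# Sketch: a child genotype is Mendelian-consistent if one of its alleles can
# come from each parent. Genotypes are pairs of allele indices, e.g. (0, 1).
def mendelian_error(child, mother, father):
    return not any(
        sorted((m, f)) == sorted(child)
        for m in mother for f in father
    )

trio_sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: het child from hom-ref x hom-alt
    ((1, 1), (0, 0), (0, 1)),  # error: child cannot inherit two ALT alleles
]
errors = sum(mendelian_error(c, m, f) for c, m, f in trio_sites)
print(f"{errors}/{len(trio_sites)} sites are Mendelian errors")  # 1/2
```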

I used the following commands to call the variants and generate the initial VCF file:

CatVariants (generate 1 vcf file with all inds and all chrs)

java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted

VQSR

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches

Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches

Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf

Genotype Refinement Workflow

java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf

Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS)

The entire example is below:

1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0

Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high?

Thank you very much in advance, Alva
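One plausible reading of why the zero-coverage genotype above is PASS: the VariantFiltration command shown applies only the expression GQ < 20.0 at the genotype level, so depth is never consulted. A minimal sketch of that per-genotype logic (illustrative, not GATK's implementation):

```python
# Sketch: genotype-level filtering against a single expression (GQ < 20.0).
# Depth never enters the test, so a zero-depth genotype with GQ >= 20
# still comes out as PASS.
def apply_g_filter(genotypes, threshold=20.0, filter_name="lowGQ"):
    # genotypes: list of (GT, DP, GQ) tuples
    return [
        (gt, dp, gq, filter_name if gq < threshold else "PASS")
        for gt, dp, gq in genotypes
    ]

# Three genotypes taken from the record above: (GT, DP, GQ)
calls = [("2/2", 0, 22), ("2/2", 1, 5), ("0/0", 22, 0)]
for gt, dp, gq, ft in apply_g_filter(calls):
    print(f"{gt} DP={dp} GQ={gq} -> {ft}")
# 2/2 DP=0 GQ=22 -> PASS; the other two -> lowGQ
```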

Hello,

when I run the UnifiedGenotyper to call INDELs, I get the following error (detailed command line output below): HAPLOTYPE_MAX_LENGTH must be > 0 but got 0

When I call only SNPs instead, it does not occur. I have searched for an answer to why this is happening, but cannot figure out the reason. Could this be a bug? I get the error whether or not I run the Realigner first.

Did you observe this problem before?

Command line output: java -Xmx10g -jar ~/work/tools/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10

INFO 15:50:47,987 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:50:47,990 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 15:50:47,990 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 15:50:47,990 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 15:50:47,996 HelpFormatter - Program Args: -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10
INFO 15:50:48,002 HelpFormatter - Executing as rebecca@rebecca-ThinkPad-T440s on Linux 3.13.0-44-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_65-b32.
INFO 15:50:48,002 HelpFormatter - Date/Time: 2015/03/12 15:50:47
INFO 15:50:48,002 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:50:48,002 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:50:48,132 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:50:48,402 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 15:50:48,409 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 15:50:48,447 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO 15:50:48,594 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 15:50:48,622 GenomeAnalysisEngine - Done preparing for traversal
INFO 15:50:48,622 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 15:50:48,622 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 15:50:48,623 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
INFO 15:51:06,585 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR stack trace

java.lang.IllegalArgumentException: HAPLOTYPE_MAX_LENGTH must be > 0 but got 0
at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:97)
at org.broadinstitute.gatk.utils.pairhmm.N2MemoryPairHMM.initialize(N2MemoryPairHMM.java:60)
at org.broadinstitute.gatk.utils.pairhmm.LoglessPairHMM.initialize(LoglessPairHMM.java:66)
at org.broadinstitute.gatk.utils.pairhmm.PairHMM.computeLikelihoods(PairHMM.java:194)
at org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:461)
at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:201)
at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:124)
at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:270)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoods(UnifiedGenotypingEngine.java:317)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotypingEngine.java:201)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:379)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:151)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------

Thanks for all your help thus far! I'm not sure I'd have been able to figure out SNP calling without everyone here! I have one question, however.

Does anyone know of a tool that will determine if a SNP/INDEL is inherited or de novo? Is there a GATK walker that will aid in this?

Hi! I am currently evaluating different methods to select tagSNPs for a gene. Is it possible to identify tagSNPs for a gene with GATK by scanning a given list of SNPs? Thank you very much.

I am working with plants, and so cannot follow most of your human/cancer-specific workflows. My question is probably not GATK specific, but I would be very grateful for any kind of answer. I have used BWA to align a set of reads, and also the contigs assembled from these reads (de novo). I have used samtools mpileup and GATK UnifiedGenotyper on the aligned reads. Both produce a list of ONLY SNPs, the same ones (aside from sensitivity). I have also run the samtools steps on the aligned contigs, resulting in a list of ONLY indels. I can clearly see both SNPs and indels in IGV, whether looking at the reads or the contigs. Generally I see that reads are used for variant calling, so is it "wrong" to use contigs? And why do I not get ANY indels in either tool's output when using reads? Any ideas?

Hey,

we are working with GATK 3.2.2 and are focusing on optimizing the SNP and indel calling for our data. As part of this, we also determine the frequency of a given variant. However, we came across some strange results. In the vcf files there is of course the DP in the INFO field, representing the unfiltered depth, and the DP in the FORMAT field, representing the filtered depth. We also calculated the depth ourselves from the bam file and looked it up in IGV. One especially strange case is the following: for one sample, GATK calls a deletion. The unfiltered DP (INFO field) is DP=644. The filtered DP (FORMAT field) is DP=505, consisting of 438 reads with the original genotype and 67 reads containing the variant. However, in our raw bam file and in IGV things are different: there we see 499 reads in total, 425 containing a C (reference), one with an N, and 73 with a deletion. We checked, and the results are reproducible. We see similar differences elsewhere as well: there are many cases in which the number of reads containing the variant or the reference allele, as we count it and as it is observable in IGV, is smaller than the reported number of filtered reads. How can this happen? I'm really looking forward to your answers.

Sarah

Hi,

I am combining gvcf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the genotypes were missing.

For example, at position 16050036, this is what comes up in the final gvcf file: 22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0

But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site: 22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0

The command I'm using to generate these files is:

java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L ${numchr}

where numchr is a variable defined previously (indicating the chromosome number). It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem? Thanks, Alva

We are analysing monozygotic twins using the HaplotypeCaller, but we see some weird things. Not all SNPs are called, even though the input bam file clearly shows the SNP, and the same SNP is called in the twin. I had the HaplotypeCaller output a bamfile, and it seems that the HaplotypeCaller does not use any reads at this SNP. In the IGV screenshot there are the gvcf files of both samples and the pre-HaplotypeCaller bam and the post-HaplotypeCaller bam. The pre-bams went through indel realignment and base recalibration as in the best practices. There are multiple locations like this, so this screenshot is just an example. Does anyone know what causes this, and perhaps how to solve it?

Hi, GATK Team

I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel:

6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214

This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position:

6 16327909 . ATGCTGATGCTGC ATGC,CTGCTGATGCTGC,A 1390.70 .
AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358 GT:AD:DP:GQ:PL

However, this mode of representing a SNP is causing processing and analysis problems further downstream after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way? Thanks, Grace

Hi, I am starting to work with IGV, but I have some doubts about how to identify a good SNP in this program. First I downloaded the new soybean genome from Phytozome (Gmax_275_v2.0.fa and Gmax_275_Wm82.a2.v1.gene.gff3 files), and then I uploaded my files (sample.vcf, sample.bam and sample.bam.bai) into the program. I indexed the files the program needed, so that's OK! But my doubt is which parameters should I consider for a good SNP? For example, what do I need to see under Alleles, Genotypes and Variant Attributes? See the example below.

Chr: Chr06
Position: 35170948
ID: .
Reference: C*
Alternate: T
Qual: 160
Type: SNP
Is Filtered Out: No
Alleles:
No Call: 0
Allele Num: 2
Allele Count: 4
Allele Frequency: 1
Minor Allele Fraction: 1
Genotypes:
Non Variant: 0 - No Call: 0 - Hom Ref: 0
Variant: 1 - Het: 0 - Hom Var: 1
Variant Attributes:
AF1: 1
RPB: 5.557190e-01
VDB: 1.587578e-01
Depth: 18
FQ: -54
DP4: [1, 1, 6, 8]
AC1: 2
Mapping Quality: 25
PV4: [1, 0.22, 1, 0.24]

Hi, I am running Variant Recalibration on indels; prior to this I completed VariantRecalibrator and ApplyRecalibration on SNPs, so the input file is the recalibrated VCF file from the ApplyRecalibration step for SNPs. Below is the error I am getting.

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator

The command I am using for this is:

jre1.7.0_40/bin/java -Djava.io.tmpdir=./rb_2905_VCF/tmp -Xmx2g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T VariantRecalibrator -R dbdata/human_g1k_v37.fasta -input ${input_file} --maxGaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.indels.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -recalFile $destdir/${input_file%recal.snps.vcf}recal.indel.recal -tranchesFile $destdir/${input_file%recal.snps.vcf}recal.indel.tranches -rscriptFile $destdir/${input_file%recal.snps.vcf}recal.indel.plots.R

If I remove the options -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum,then I get this error:

ERROR ------------------------------------------------------------------------------------------

Hi! I have worked for some time on an mRNAseq set, single-end. It's a high-quality set with lots of biological replicates (200+).

My question is, how could I best contribute to the methodology used for SNPs call in mRNAseq? What do we need tested to improve this method?

So I have used the latest GATK best practices pipeline for variant detection on non-human organisms, but now I am trying to do it for human data. I downloaded the Broad bundle and I was able to run all of the steps up to and including ApplyRecalibration. However, now I am not exactly sure what to do. The VCF file that is generated contains these FILTER values:

.

PASS

VQSRTrancheSNP99.90to100.00

I am not sure what these mean. Does the "VQSRTrancheSNP99.90to100.00" filter mean that that SNP falls below the specified truth sensitivity level? Does "PASS" mean that it is above that level? Or is it vice versa? And what does "." mean? Which ones should I keep as "good" SNPs?

I'm also having some difficulty fully understanding how the VQSLOD is used.... and what does the "culprit" mean when the filter is "PASS"?

A final question.... I've been using this command to actually create a file with only SNPs that PASSed the filter:

java -Xmx2g -jar /share/apps/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T SelectVariants -R ~/broad_bundle/ucsc.hg19.fasta --variant Pt1.40300.output.recal_and_filtered.snps.chr1.vcf -o Pt1.40300.output.recal_and_filtered.passed.snps.chr1.vcf -select 'vc.isNotFiltered()'

Is this the correct way to get PASSed SNPs? Is there a better way? Any help you can give me would be highly appreciated. Thanks!
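For cross-checking a selection like the one above, the same PASS-only logic can be sketched by hand. Note that `vc.isNotFiltered()` keeps records whose FILTER is PASS or "." (no filter applied), which this sketch mirrors:

```python
# Sketch: keep VCF records whose FILTER column is PASS or "." (unfiltered).
def passing_records(vcf_lines):
    for line in vcf_lines:
        if line.startswith("#"):
            yield line                       # always keep header lines
            continue
        filter_field = line.split("\t")[6]   # FILTER is the 7th column
        if filter_field in ("PASS", "."):
            yield line

lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t10\tVQSRTrancheSNP99.90to100.00\t.",
]
print(len(list(passing_records(lines))))  # header + 1 passing record = 2
```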

• Nikhil Joshi

I have a question about SNP calling from GATK.

We want to call the SNP from many individuals. We hope that we could have the right confident Genotype SNP. But when I check the SNP results, I find some problem, such as the following example:

scaffold_1 12816 . C T 5075.21 TruthSensitivityTranche99.90to100.00 AC=20;AF=0.417;AN=48;BaseQRankSum=14.475;DB;DP=2186;Dels=0.00;FS=1.453;HRun=1;HaplotypeScore=1.2508;InbreedingCoeff=0.8258;MQ=23.69;MQ0=948;MQRankSum=-11.287;QD=10.78;ReadPosRankSum=13.705;SB=-276.66;VQSLOD=-17.8239;culprit=ReadPosRankSum
0/0:204,0:204:99:0,366,3726 **1/1:19,20:39:27.03:249,27,0 ** 0/0:110,0:110:99:0,219,2229 0/1:54,20:74:27.15:27,0,871 0/0:98,0:98:99:0,156,1701 0/0:208,0:211:99:0,490,5141 0/0:147,0:148:99:0,279,2915 ./. 1/1:6,22:28:27.04:282,27,0 1/1:0,31:31:39.04:360,39,0 0/0:220,0:220:99:0,424,4397 ./. 1/1:0,30:30:75.07:692,75,0 0/1:25,12:37:99:101,0,637 1/1:0,36:36:69.07:636,69,0 0/0:39,0:39:87.10:0,87,852 1/1:23,61:84:99:1107,120,0 1/1:0,38:38:78.08:736,78,0 ./. ./. 0/0:131,0:131:99:0,213,2230 1/1:6,37:43:57.08:593,57,0 1/1:12,19:31:30.04:293,30,0 0/0:27,8:35:54.08:0,54,537 ./. 0/0:64,0:65:96.13:0,96,993 0/0:116,1:117:99:0,207,2124 ./. 0/0:5,0:5:15.03:0,15,162 0/0:95,0:95:99:0,192,2012

I have 30 individuals for which I need to call SNPs. I find that the GT differs based on the read depth. For one individual: 1/1:19,20:39:27.03:249,27,0. The two alleles are covered by 19 and 20 reads, so the genotype should be 0/1, a heterozygote, but GATK calls it a homozygote. What is the problem?

There are lots of cases like this in my data. How can I make sure whether GATK's SNPs are right or wrong? I hope you can give me some help.

Best Zhiqiang
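One note on the example above: the reported GT is chosen from the PL (phred-scaled genotype likelihood) field rather than from the raw allele-depth ratio; the genotype with PL 0 is the most likely one. Here PL=249,27,0 favors hom-var despite the 19/20 allele balance, plausibly because many reads at this site carry little weight (MQ=23.69, MQ0=948). A minimal sketch of that relationship, for a single ALT allele:

```python
# Sketch: the GT in a VCF follows the PL field -- the genotype whose PL is 0
# is the most likely -- not the raw AD ratio.
def genotype_from_pl(pl):
    order = ["0/0", "0/1", "1/1"]   # standard PL ordering for one ALT allele
    return order[pl.index(min(pl))]

print(genotype_from_pl([249, 27, 0]))  # the genotype from the post: 1/1
print(genotype_from_pl([27, 0, 871]))  # a het call from the same record: 0/1
```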

I am running the quality score recalibration and got a version error like this:

java -Xmx4g -jar ~/GenomeAnalysisTK-2.1-3/GenomeAnalysisTK.jar -l INFO -R hg19.fa --DBSNP snp137.txt -I sdD1a_bam.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile sdD1a.recal_data.csv

ERROR ------------------------------------------------------------------------------------------

ERROR ------------------------------------------------------------------------------------------

Is there a solution for that? Thanks.

I have the genomes of several isolates of a parasite, and I would like to investigate synonymous/non-synonymous substitution to identify potential antigens, as well as SNPs genome-wide, and I am wondering how well BWA/GATK are suited for this purpose. I've been told that BWA is only very good with sequences <2% divergent, and some of the antigens in this species are known to be >20% divergent. However, I also know that GATK does local realignment around indels. So I would specifically like to know: is BWA/GATK good for looking at substitutions/SNPs in highly variable genes, and if not, which other alignment tools are compatible and appropriate for this purpose?

I've had the same DNA sample sequenced using two different library prep methods, NEB-Next and Illumina-Nextera, the latter of which generates twice as many reads as the former. When genotypes for the two samples are called individually using the UnifiedGenotyper, the read depth (DP) for the Illumina-Nextera sample is roughly twice that of the NEB-Next one, but the quality score (QUAL) remains similar. I was expecting QUAL to reflect the read depth, but obviously this is not the case. Could you enlighten me?

Hi team,

This is two separate questions:

1. Starting with a vcf file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverage should be excluded, how does one decide what counts as very high and very low? e.g. 5% on either side?

2. I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script, but I'm trying to do it earlier in the GATK process. I've tried -L 2L -L 2R -L 3L ...etc., -L 2L 2R 3L ...etc., and -L 2L, 2R, 3R ...etc., but the result is either an input error message or chromosome 2L only.

Many thanks and apologies if I've missed anything in the instructions.

Cheers,

Blue
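For the first question above, one common heuristic is to derive the cutoffs empirically from the DP distribution itself. Below is a minimal sketch; the percentile fractions and the `dp_cutoffs`/`percentile` helper names are illustrative choices, not a GATK recommendation:

```python
# Sketch: pick low/high DP cutoffs from the empirical depth distribution
# itself. The percentile fractions are illustrative, not a GATK default.

def percentile(sorted_values, frac):
    """Nearest-rank percentile of an already-sorted list."""
    idx = min(int(frac * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

def dp_cutoffs(depths, low_frac=0.05, high_frac=0.95):
    """Return (low, high) depth cutoffs at the given percentile fractions."""
    s = sorted(depths)
    return percentile(s, low_frac), percentile(s, high_frac)

# Toy depths, roughly bell-shaped with one shallow and one deep outlier:
depths = [8, 12, 15, 18, 20, 21, 22, 23, 25, 28, 30, 35, 60]
low, high = dp_cutoffs(depths, 0.1, 0.9)
kept = [d for d in depths if low <= d <= high]   # drops 8 and 60
```

Whether 5%, 10%, or a multiple of the mean depth is appropriate depends on the shape of the distribution for your own data.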

Hi :), I'm new to NGS and have a question regarding filtering. I have 13 individuals, 3 of them with a coverage of 11x-15x, the rest with a coverage of 5x-7x. I did multi-sample SNP calling and hard filtering, the latter because there are no known SNPs so far. Now I am not sure how to set the minimum SNP quality (QUAL). The Best Practices suggest using at least 30 for individuals with a coverage of at least 30, but 4 if the coverage is below 10. So what is the best way to set the QUAL filter?

many thanks in advance

In working with a mapping that has very large coverage in certain places, I have noticed that UnifiedGenotyper often calls SNPs even if the alternate allele is present in only a minuscule (<1%) fraction of the reads at a position. It is difficult to filter these SNPs out because QUAL is high for them, and QD, which is low in these situations, is also low for many other (good) SNPs.

There is already a fractional threshold option for indel calling, -minIndelFrac. Is there any similar option for SNPs, and if not, what is your reasoning for omitting it, and what steps do you recommend for filtering out these SNPs?
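In the absence of a built-in fractional threshold for SNPs, the filter the post asks for can be approximated downstream from the per-sample AD field. A minimal sketch, assuming a biallelic site and an illustrative 1% threshold (the helper names are hypothetical):

```python
# Sketch: flag SNP calls whose alternate-allele fraction (computed from
# the AD genotype field) falls below a minimum, independently of QUAL.
# The 1% threshold is illustrative, not a GATK default.

def alt_fraction(ad):
    """ad = (ref_depth, alt_depth) from the AD genotype field."""
    ref, alt = ad
    total = ref + alt
    return alt / total if total else 0.0

def passes_min_alt_frac(ad, min_frac=0.01):
    return alt_fraction(ad) >= min_frac

# A very deep site with only 4 alt reads out of 1000 would be flagged:
print(passes_min_alt_frac((996, 4)))   # False
print(passes_min_alt_frac((60, 40)))   # True
```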

Is the strand bias Fisher's exact test p-value the only p-value we can obtain for a SNP/INDEL call? Is there an option to calculate the actual HWE p-value instead of the HWE proportions?

~Thanks, Rini

I have bisulfite-treated sequence mapped using Bismark and Bowtie2, and I'd like to call SNPs and indels from it. I have used Bis-SNP to call SNPs, but it doesn't call indels; can I use GATK to call indels from the mapped data? Do you have any support for bisulfite data? Another question, please: the data is a mix from 6 different people; do you have any support for pooled data? Thanks for your help.

I was wondering, in the VCF output from UnifiedGenotyper, what metrics go into the variant quality score, QUAL?

I'm assuming that depth (DP) is one of them, but I can't find further information. Apologies if I've missed it.

Attached are some plots of the depth and quality distributions for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error; otherwise, I've just attached the graphs for general interest.

Basically, I have an odd-looking distribution of variant quality scores (see attached png) and was wondering how concerned I should be and how I can rectify it.

The input data for the graph is the QUAL values from a UnifiedGenotyper VCF output file. The four samples in the VCF file are one Drosophila reference line and three more that are outcrosses of the reference line and thus are heterozygous for the reference allele.

My fastq read-mapping pipeline includes adapter and low-quality base removal, and local realignment. I've also attached a pdf showing the read quality distribution from one of the samples, which also looks a bit odd.

The Unified Genotyper calls SNPs relative to the specified, publicly-available reference assembly.

How can I call SNPs (in many samples, with UG) relative to an in-house individual, which I have sequenced at high-coverage?

My current solution is to perform a de novo assembly on the in-house reference individual using e.g. Velvet, and then simply use the fasta as the reference for UG.

Can the publicly available reference assembly still be useful here for speeding up the mapping and filling in missing parts?

My organism is Drosophila melanogaster.

Hi to all, I ran UnifiedGenotyper on three exome-seq samples and phased with a familial pedigree. During manual filtration I saw several inconsistencies. For example, I get this output from UnifiedGenotyper and phasing:

chr2 38525498 rs76204302 T G 66.79 . AC=2;AF=0.333;AN=6;BaseQRankSum=-7.191;DB;DP=180;Dels=0.00;FS=208.951;HaplotypeScore=13.8978;MLEAC=2;MLEAF=0.333;MQ=60.00;MQ0=0;MQRankSum=2.030;QD=0.52;ReadPosRankSum=1.325;SB=-2.263e+00 GT:AD:DP:GQ:PL 0/0:30,22:52:39:0,39,729 0/1:9,7:16:5:5,0,280 0/1:55,57:112:98:98,0,1169

The order of the samples is father, mother, son. How is it possible that the father, with 30 REF bases and 22 VAR bases, is called 0/0? Thanks

Giuliano
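One detail worth noting about the record above: the genotype call tracks the PL field rather than the raw AD counts, and the called genotype corresponds to the lowest (best) PL. GATK's actual model is more involved (it includes priors and base qualities), but the correspondence can be sketched as:

```python
# Sketch: for a biallelic site, PL values are ordered 0/0, 0/1, 1/1, and
# the called genotype is the one with the lowest (best) PL -- not a direct
# function of the AD counts. This is only the convention, not GATK's model.

def genotype_from_pl(pl):
    genotypes = ["0/0", "0/1", "1/1"]
    return genotypes[pl.index(min(pl))]

# Father's sample field from the post: 0/0:30,22:52:39:0,39,729
print(genotype_from_pl((0, 39, 729)))   # "0/0" despite AD = 30,22
# Son's sample field: 0/1:55,57:112:98:98,0,1169
print(genotype_from_pl((98, 0, 1169)))  # "0/1"
```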

I have used UnifiedGenotyper to call SNPs. I found some SNPs that have been called from low-quality reads on chromosomes X and Y. Is it possible not to take low-quality reads into account while calling SNPs with UnifiedGenotyper? Or do I need to quality-filter the BAM files beforehand?

I have run into this situation a couple of times now: I need to extract a set of private SNPs from a multisample VCF file, for example in a forward genetics knockout screen of a large set of samples.

It is possible with vcf-contrast from vcf-tools:

vcf-contrast +sample1 -sample2 -sample3 -n input.vcf > private_sample1.vcf

vcf-contrast -sample1 +sample2 -sample3 -n input.vcf > private_sample2.vcf

vcf-contrast -sample1 -sample2 +sample3 -n input.vcf > private_sample3.vcf

After this I would still have to filter out the private 0/0 calls, and doing this for a large multisample VCF means entering this command for every combination, which is not really nice.

Surely this must be possible with GATK. Does anyone know how to do this with GATK?

Maybe it is somewhere in SelectVariants? The --discordance option looked promising, but there is something about the samples needing to be the same? Or is it possible to write another variant walker or a JEXL expression?

P.S. By accident I also posted this question in the XHMM category; an admin could remove it there.
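As an alternative to running vcf-contrast once per combination, the private-genotype logic itself is simple enough to sketch directly. This is a simplified illustration, not GATK functionality: it assumes unphased, biallelic GT strings and the `private_carrier` helper name is hypothetical.

```python
# Sketch: find the sample carrying a private non-reference genotype at a
# site, i.e. exactly one sample is non-ref and all others are hom-ref or
# missing. Assumes unphased biallelic GT strings (e.g. "0/0", "0/1").

def private_carrier(genotypes):
    """Return the index of the single non-ref sample, or None."""
    nonref = [i for i, gt in enumerate(genotypes)
              if gt not in ("0/0", "./.")]
    return nonref[0] if len(nonref) == 1 else None

samples = ["sample1", "sample2", "sample3"]
gts = ["0/1", "0/0", "0/0"]
idx = private_carrier(gts)
print(samples[idx] if idx is not None else "not private")  # sample1
```

Applied per VCF record, this also skips the private 0/0 calls automatically, since hom-ref genotypes are never counted as carriers.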

Hi all, I have used GATK UnifiedGenotyper to generate a raw.vcf file. Now I want to use GATK VQSR to get a more accurate result, and I am following this protocol:

snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP);
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL);
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP);
analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL);


I want to know: would it be better to separate the SNPs and indels before running VQSR, like this:

SNP.raw.vcf, INDEL.raw.vcf <- SeparateSNPINDEL(raw.vcf);
snp.model <- BuildErrorModelWithVQSR(SNP.raw.vcf, SNP);
indel.model <- BuildErrorModelWithVQSR(INDEL.raw.vcf, INDEL);
SNP_analysisReady.vcf <- ApplyRecalibration(SNP.raw.vcf, snp.model, SNP);
INDEL_analysisReady.vcf <- ApplyRecalibration(INDEL.raw.vcf, indel.model, INDEL);


Thanks a lot !

We are attempting to see whether using ReducedReads will help with the overwhelming file sizes for the SNP calling we are doing on whole-genome BAM files. We have been using a protocol similar to the one described in the Best Practices document: "Best: multi-sample realignment with known sites and recalibration." My question is: what is the best point in the pipeline to use ReducedReads?

I am implementing a tool that uses reads to identify potential cancer sites, but I need reliable genotype calls from non-cancerous tissue. Currently I'm trying to use GATK to make those calls, filter out the bad ones, and then use the remaining calls when I look over the raw reads from the cancer tissue.

The problem I'm finding is that GATK doesn't provide a GQ for sites that are homozygous for the reference, and I don't know how to correlate QUAL with the GQs of the homozygous non-ref calls. Is there a way that I'm overlooking to have GATK provide the GQ for homozygous-reference, non-variant sites?
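Where PLs are available, a GQ-like value can be recovered for any genotype, including hom-ref. By VCF convention, GQ is essentially the gap between the best and second-best normalized PL, capped at 99. A minimal sketch under that assumption (this recovers the conventional value; it is not a substitute for GATK emitting GQ itself):

```python
# Sketch: by VCF convention, GQ is the difference between the smallest and
# second-smallest normalized PL, capped at 99 -- and this applies equally
# to hom-ref calls whenever PLs are emitted for them.

def gq_from_pl(pl):
    best, second = sorted(pl)[:2]
    return min(second - best, 99)

print(gq_from_pl((0, 39, 729)))   # 39 (the father's GQ from the post above)
print(gq_from_pl((0, 150, 900)))  # 99 (capped)
```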

Leishmania has 36 chromosomes, but their copy number is unpredictable for each strain, and chromosome copy number can change very quickly. So what is an optimal ploidy setting for organisms with extensive aneuploidy? So far we just use the diploid setting. Some samples have consistently more heterozygous SNPs on higher-copy chromosomes, but this relationship does not hold in many other samples: there is no strong correlation between chromosome copy number and the abundance of heterozygous SNPs.