# Tagged with #snp 2 documentation articles | 0 announcements | 43 forum discussions

Created 2014-01-17 15:36:06 | Updated 2016-03-05 11:20:37 | Tags: snp variants symbolicallele indel cnv mnp sv

The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation.

#### Variant manipulation

GATK variant manipulation tools are able to recognize the following types of alleles:

• SNP (single nucleotide polymorphism)
• INDEL (insertion/deletion)
• MIXED (combination of SNPs and indels at a single position)
• MNP (multi-nucleotide polymorphism, e.g. a dinucleotide substitution)
• SYMBOLIC (such as the <NON-REF> allele used in GVCFs produced by HaplotypeCaller, the * allele used to signify the presence of a spanning deletion, or undefined events like a very large allele or one that's fuzzy and not fully modeled; i.e. there's some event going on here but we don't know what exactly)

Note that SelectVariants, the GATK tool most used for VCF subsetting operations, discriminates strictly between these categories. This means that if you use for example -selectType INDEL to pull out indels, it will only select pure INDEL records, excluding any MIXED records that might include a SNP allele in addition to the insertion or deletion alleles of interest. To include those you would have to also specify selectType MIXED in the same command.

#### Variant discovery

The HaplotypeCaller is a sophisticated variant caller that can call different types of variants at the same time. So in addition to SNPs and indels, it is capable of emitting mixed records by default, as well as symbolic representations for e.g. spanning deletions. It does emit physical phasing information, but in its current version, HC is not able to emit MNPs. If you would like to combine contiguous SNPs into MNPs, you will need to use the ReadBackedPhasing tool with the MNP merging function activated. See the tool documentation for details. Our older (and now deprecated) variant caller, UnifiedGenotyper, was even more limited. It only called SNPs and indels, and did so separately (even if you ran in calling mode BOTH, the program performed separate calling operations internally) so it was not able to recognize that SNPs and Indels should be emitted together as a joint record when they occur at the same site.

The general release version of GATK is currently not able to detect SVs (structural variations) or CNVs (copy number variations). However, the alpha version of GATK 4 (the next generation of GATK tools) includes tools for performing CNV (copy number variation) analysis in exome data. Let us know if you're interested in trying them out by commenting on this article in the forum.

There is also a third-party software package called GenomeSTRiP built on top of GATK that provides SV (structural variation) analysis capabilities.

Created 2012-07-30 17:37:12 | Updated 2015-10-30 19:34:24 | Tags: unifiedgenotyper haplotypecaller snp bamout

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

### 1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

### 2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.

### 3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

### 4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.

### 5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

### 6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site, you are probably seeing false positives.

No articles to display.

Created 2016-03-16 00:20:27 | Updated | Tags: haplotypecaller snp haploid chloroplast pooled-sequencing-haplotypecaller

I'm trying to call whole-chloroplast genome haplotypes for a pooled chloroplast-captured DNA sample from a non-model organism (no well-established variants). The reads are Illumina 100 bp PE reads, and have already undergone some clipping (adapter-trimming and quality control) and have been aligned to a reference genome. The pool represents 20 individuals. I want to know if there is a way in GATK to call the frequency of whole-genome haplotypes (or else, is there a way currently in existence elsewhere? ) If necessary, I can generate a panel of known haplotypes.

Currently, I have been using HaplotypeCaller to call SNPs and then filtering those by hand in Excel. I have already tried increasing the maximum active region size to larger than the whole reference genome (~150,000 bp), with a corresponding increase in the max reads per sample value, but this doesn't seem to have come up with whole-genome haplotypes.

Created 2016-01-23 13:25:11 | Updated | Tags: snp dbsnp138 dbsnp144 rsids

Hi GATK Team,

I have a vcf file with ID column (column no. 3) has rsIDs of dbSNP138 build (hg19). Now, I want to update the same vcf file with dbSNP144 build with the same genomic coordinates (hg19). How to that that? In my opinion, it is not the case of liftover as I want to keep the genomic coordinates same but need to update the column 3.

The vcf file is made from the best practices GATK pipeline. Resource bundles available at ftp GATK are used during the pipeline.

Thanks,

Waqas.

Created 2016-01-23 05:03:33 | Updated | Tags: variantfiltration snp gatk

Hi,

I am doing an analysis on chimp variants, and do not have enough variants to train the model in order to perform VQSR. I am using hard filters at this point, as per the recommendation. I keep reading that the recommended cutoff values are only recommendations and that I should expect to tweak the parameters further. However, how do I do this in an effective manner? I'm just not sure how to approach exploring the parameter space.

Should I try out different values around the recommended cutoffs and, down the line, analyze how many mutations I get/calculate certain values for quality control and check that they agree with what is expected (CpG transition rate, and so on)? Is there a way to inspect the data to see whether it is behaving well, or do I have to wait until down the line to see how things are going? Any insight into this would help tremendously.

Thanks! Alva

Created 2016-01-14 04:05:41 | Updated | Tags: readbackedphasing snp phasing mnp

Can the tool also work on MNP other than SNP? If not, is there any plan to expand?

Created 2015-12-30 23:50:57 | Updated 2015-12-30 23:51:22 | Tags: haplotypecaller snp multi-sample

I am trying to use GATK HP ( v3.5 ) to call SNPs in amplicon seq data of a small genome of around 600bp. As shown in the attachment, the variants are not called between location 50 and 60, despite high coverage across many samples. ( There are total 96 samples ).

https://www.dropbox.com/s/a4311lbti2sn9ze/seq.png?dl=0

The base qualities are above 30 and mapping quality is also 60 ( bwa mem ). I also did not remove duplicates ( not marked as well ) as its amplicon seq data.

The command I used was ( multisample SNP calling ) :

java -Xmx50g -jar GenomeAnalysisTK.jar -nct 2 -R -T HaplotypeCaller -I merged_samples.bam -o gatk_out_raw_snps_indels.vcf --min_base_quality_score 30

Its the same case even with base quality 20.

Created 2015-10-28 16:46:20 | Updated | Tags: snp fastaalternatereferencemaker

Hi everyone,

I have a question about using the GATK AlternateReferenceMaker and how it may affect down stream analysis. I apologize for the long post but I tried to include only relevant info. Any thoughts on analysis are helpful!

Here is basically what I did:

• Illumina sequencing of 300 libraries with methods similar to genotype-by-sequencing (like RAD-tag libraries without shearing)
• Created contigs of sequences within individuals (cap3) and compared across 300 individuals
• Selected contigs found in at least 5 individuals and mapped to a draft reference genome of sister species (bwa-mem)
• Used GATK AlternateReferenceMaker to insert differences and create a "pseudo"-reference for my species

• Then I have did SNP calling from raw sequences using GATK best practices (alignment of each individual to pseudo-reference, realignment, g.vcf, Haplotype Caller)
• I did hard filtering for quality, coverage, and missing data across SNPs
• I have sub-selected SNPs with minor allele frequencies > 0.25 to use in designing a SNP array for parentage (~800 SNPs)
• When I try to select the flanking regions around these SNPs using bedtools, it pulls the flanking regions from the pseudo-reference and many of them have a lot of N's... so my questions are:
1. How can I tell if these were produced from the process I used to create a "pseudo"-reference for my species? Could they be misalignments from sequence contigs to reference or between species? How would this affect the SNP calling proceess?

2. What is the best way to ground truth the sequencing region? Should I try to pull the alignments from the region around each SNP for each individual and look at them manually? How can I do this?

3. Finally, the sister species and my pseudo-reference are not indexed in the same way. I'm guessing I should've sorted them before indexing or specified the indexing to use when using the GATK AlternateReferenceMaker. Can I convert the indexing?

Created 2015-10-11 07:13:38 | Updated | Tags: snp

# GenotypeGVCFs (generate vcf files for each chr)

java -Xmx1g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs -V vcfs.chr${numchr}.new.list -o final_chr${numchr}.vcf -L {numchr} # CatVariants (generate 1 vcf file with all inds and all chrs) java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted # VQSR java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches # Used excludeFiltered here java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches # Used excludeFiltered here java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf # Genotype Refinement Workflow java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS) The entire example is below: 1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0 Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high? Thank you very much in advance, Alva Created 2015-03-12 16:15:59 | Updated | Tags: unifiedgenotyper snp error indel Hello, when I run the UnifiedGenotyper to call INDELs, I get the following error (detailed command line output below) HAPLOTYPE_MAX_LENGTH must be > 0 but got 0 When I call only SNPs instead, it doe not occur. I have searched to find an answer to why this is happening, but cannot figure out the reason. Could this be a bug? I get the error no matter if I run the Realigner before or not. Did you observe this problem before? Command line output: java -Xmx10g -jar ~/work/tools/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10 INFO 15:50:47,987 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:47,990 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 15:50:47,990 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:50:47,990 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:50:47,996 HelpFormatter - Program Args: -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10 INFO 15:50:48,002 HelpFormatter - Executing as rebecca@rebecca-ThinkPad-T440s on Linux 3.13.0-44-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_65-b32. INFO 15:50:48,002 HelpFormatter - Date/Time: 2015/03/12 15:50:47 INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,132 GenomeAnalysisEngine - Strictness is SILENT INFO 15:50:48,402 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 15:50:48,409 SAMDataSourceSAMReaders - Initializing SAMRecords in serial INFO 15:50:48,447 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 INFO 15:50:48,594 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 15:50:48,622 GenomeAnalysisEngine - Done preparing for traversal INFO 15:50:48,622 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 15:50:48,622 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 15:50:48,623 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:51:06,585 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalArgumentException: HAPLOTYPE_MAX_LENGTH must be > 0 but got 0 at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:97) at org.broadinstitute.gatk.utils.pairhmm.N2MemoryPairHMM.initialize(N2MemoryPairHMM.java:60) at org.broadinstitute.gatk.utils.pairhmm.LoglessPairHMM.initialize(LoglessPairHMM.java:66) at org.broadinstitute.gatk.utils.pairhmm.PairHMM.computeLikelihoods(PairHMM.java:194) at org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:461) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:124) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:270) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoods(UnifiedGenotypingEngine.java:317) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotypingEngine.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:379) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:151) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNanoTraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: HAPLOTYPE_MAX_LENGTH must be > 0 but got 0 ##### ERROR ------------------------------------------------------------------------------------------ Created 2015-03-09 21:03:42 | Updated | Tags: snp Thanks for all your help thus far! I'm not sure if I'd be able to figure out SNP calling without everyone here! I have one questions, however! Does anyone know of a tool that will determine if a SNP/INDEL is inherited or de novo? Is there a GATK walker that will aid in this? Created 2015-03-07 16:14:34 | Updated | Tags: selectvariants snp Hi! I am currently evaluating different methods to select tagSNPs for a gene. Is it possible to identify tagSNPs for a gene with GATK by scanning a given list of SNPs? Thank you very much. Created 2015-02-24 19:52:54 | Updated | Tags: unifiedgenotyper snp indels I am working with plants, and so cannot follow most of your human/cancer specific workflows etc. Also my question is probably not GATK specifc, but I would be very grateful for any kind of answer. I have used BWA to align a set of reads and the contigs assembled from these reads (de novo). I have used samtools mpileup and GATK unifiedGenotyper on the aligned reads. This results for both in a list of ONLY SNPs, the same ones (aside of sensitivity). I have also used the samtools steps on the aligned contigs, resulting in a list of ONLY indels. I can clearly see both SNPs and indels in IGV, both looking at the reads or the contigs. generally I see that reads are being used for variant calling, so is it "wrong" to use contigs? and why do I not get ANY indels in either tool's output when using reads? any ideas? Created 2015-02-12 13:25:05 | Updated | Tags: snp dp indel Hey, we are working with GATK 3.2.2 and focus on an optimization of the SNP and indel calling in the context of our data. Thereby, we also determine the frequency of a certain variant. However, we came across some strange results in this case. In the vcf files there is of course the DP in the info field, representing the unfiltered depth, and the DP in the format field, representing the filtered depth. Yet, we also calculated the depth by ourselves in the bam file and looked it up in the IGV. One especially strange case is the following: For a sample GATK calls a deletion. The unfiltered DP (info field) is DP=644. The filtered DP (format field) is DP=505. It consists of 438 reads with the original genotype and 67 reads containing the variant. However, in our raw bam file and in the IGV things are different. There we see 499 reads in total. 425 containing a C (reference), one with an N and 73 with a deletion. We checked and the results are reproduceable. Furthermore, we see similar differences as well. There are many cases in which the number of reads containing the variant or the reference allele as we count it and as it is observable in the IGV is smaller than the reported number of filtered reads. How can this happen? I'm really looking forward to your answers. Sarah Created 2015-01-27 21:59:14 | Updated 2015-01-27 21:59:46 | Tags: best-practices snp gatk combinegvcfs gvcf Hi, I am combining gcvf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing. For example, at position 16050036, this is what comes up in the final gvcf file: 22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0 But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site: 22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0 The command I'm using to generate these files is: java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr{numchr}.new.list -o mergeGvcf_${numchr}.vcf -L${numchr} where numchr is a variable previously defined (indicating the chromosome number).

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?

Thanks, Alva

Created 2014-08-21 16:53:33 | Updated | Tags: haplotypecaller snp

We are analysing monozygotic twins while using the HaplotypeCaller but we see some weird things. Not all SNPs are seen by the HaplotypeCaller while the bam file as input has clearly the SNP in there, this while in the twin it is called. I did let the HaplotypeCaller output a bamfile and it seems that the haplotypecaller does not seem to use any reads on this SNP. In the igv screenshot there are the gvcf files of both samples and the pre-haplotypecaller bam and the post-haplotypecaller bam. The pre-bams have used the indelrealignment and baserecalibration like in the best practice.

There multiple of this locations so this screenshot is just an example.

Someone know why this is caused and maybe how to solve this?

Created 2014-05-15 20:29:50 | Updated | Tags: haplotypecaller snp indels vcf-format

Hi, GATK Team

I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel:

6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214

This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position:

6   16327909    .   ATGCTGATGCTGC   ATGC,CTGCTGATGCTGC,A    1390.70 .   AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358   GT:AD:DP:GQ:PL

However, this mode of representing a SNP is causing processing and analysis problems further downstream after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way?

Thanks,

Grace

Created 2014-04-22 17:53:55 | Updated | Tags: snp vcf igv

Hi, I start working with IGV, but I have some doubts in how to identify a good SPN in this program. First I download the new Soybean Genome on Phytozome (Gmax_275_v2.0.fa and Gmax_275_Wm82.a2.v1.gene.gff3 files), and then I upload my files (sample.vcf, sample.bam and sample.bam.bai) into the program. I indexed which files that program needed, so that's OK! But my doubt is which parameters should I consider for a good SNP? For example, what I need to see on Alleles, Genotypes and Variant Attributes? See the example below.

Chr: Chr06 Position: 35170948 ID: . Reference: C* Alternate: T Qual: 160 Type: SNP Is Filtered Out: No

Alleles: No Call: 0 Allele Num: 2 Allele Count: 4 Allele Frequency: 1

Minor Allele Fraction: 1

Genotypes: Non Variant: 0

• No Call: 0
• Hom Ref: 0 Variant: 1
• Het: 0
• Hom Var: 1

Variant Attributes AF1: 1 RPB: 5.557190e-01 VDB: 1.587578e-01 Depth: 18 FQ: -54 DP4: [1, 1, 6, 8] AC1: 2 Mapping Quality: 25 PV4: [1, 0.22, 1, 0.24]

Created 2014-03-10 22:03:01 | Updated | Tags: unifiedgenotyper variantrecalibrator applyrecalibration snp indels

Hi,

I am running Variant Recalibration on Indels,prior to this I completed Variant Recalibration and ApplyRecalibration on SNPs. So,the input file is the recalibrated VCF file from Apply Recalibration step of SNP's.

Below is the error I am getting.

##### ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinsti

tute.org/discussion/49/using-variant-annotator

The command I am using for this is :

##### ERROR ------------------------------------------------------------------------------------------

Is there a solution for that? Thanks.

Created 2013-07-31 19:20:30 | Updated 2013-07-31 19:20:55 | Tags: snpeff snp bwa substitution

I have the genomes of several isolates of a parasite, and I would like to investigate synonymous/non-synonymous substitution for identifying potential antigens, as well as SNPs genome-wide and I am wondering how well BWA/GATK are suited for this purpose. I've been told that BWA is only very good with sequences <2% divergent, and some of the antigens in this specie are known to be >20% divergent. However, I also know that GATK does local realignments of indels. So I would specifically like to know - is BWA/GATK good for looking at substitutions/SNPs in highly variable genes, and if not which other alignment tools are compatible and appropriate for this purpose?

Created 2013-07-24 10:25:20 | Updated | Tags: unifiedgenotyper snp qual

I've had the same DNA sample sequenced using two different library prep methods, NEB-Next and Illumina-Nextera, the latter of which generates twice the number of reads than the former. When genotypes for the two samples are called individually using the UnifiedGenotyper, the read depth, DP for the Illumina-Nextera is roughly twice that of the NEB-Next but the quality score, QUAL, remains similar. I was expecting the QUAL to reflect the read depth but obviously this is not the case. Could you enlighten me?

Created 2013-07-22 11:09:51 | Updated | Tags: depthofcoverage snp vcf intervals non-human filter

Hi team,

This is two separate questions:

1. Starting with a vcf file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverages should be excluded, how does one decide what is very high and low. e.g. 5% either side ?

2. I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script but I'm trying to do this earlier on in the GATK processes. I've tried ...-L 2L -L 2R -L 3L ...etc, -L 2L 2R 3L ....etc and, -L 2L, 2R, 3R...etc but the result is either input error message or chromosome 2L only.

Many thanks and apologies if I've missed anything in the instructions.

Cheers,

Blue

Created 2013-07-04 14:49:10 | Updated | Tags: snp multi-sample hardfilters

Hi :), I'am new to NGS and have a questions reagarding the filtering. I have 13 individuals, 3 of them with a coverage of 11x-15x, the rest with a coverage of 5x-7x. I did multi-sample SNPcalling and hard filtering, the latter as there are no known SNPs so far. Now I am not sure how to set the minimal SNP quality (QUAL). On the best practise it is suggested to juse at least 30 for individuals with a coverage of at least 30, but 4 if the coverage is below 10. So what is the best way to set the QUAL filter?

Created 2013-06-20 11:57:47 | Updated | Tags: unifiedgenotyper snp filter gatk

In working with a mapping that has a very large coverage in certain places, I have noticed that the Unified Genotyper often calls SNPs even if the alternate allele is only present in a minuscule <1% fraction of the reads at a position. It is difficult to filter these SNPs out because QUAL is high for these SNPs the and QD, which is low in these situations, is also low for many other (good) SNPs.

There already is a fractional threshold option for indel calling, -minIndelFrac. Is there any similar option for SNPs – and if not, what is your reasoning for omitting this option and what steps do you recommend for filtering out these SNPs?

Created 2013-04-24 21:27:07 | Updated | Tags: snp p-value

Is Strand Bias-Fisher's p-value the only p-value we can obtain for a SNP/INDEL call? Do we have an option to calculate the actual HWE p-value instead of the HWE proportions?

~Thanks, Rini

Created 2013-04-19 15:53:12 | Updated | Tags: indelrealigner snp bisulfite

I have Bisulfite- treated sequence mapped using Bismark and Bowtie2 and I'd like to call SNPs and INDELs from it. I have used Bis-SNP to call SNPs but it doesn't call indels , can I use GATK to call indels from the mapped data? Do u have any support to Bisulfite data? Another question please, the data is a mix from 6 different people do u have any support fro pooled data? Thanks for your help.

Created 2013-03-26 11:34:41 | Updated | Tags: unifiedgenotyper snp snps

I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ?

I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.

Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.

Created 2013-03-20 11:49:18 | Updated | Tags: unifiedgenotyper snp

Basically I have an odd-looking distribution of my variant quality scores (see attached png), and was wondering how concerned should I be and how can I rectify it.

The input data from the graph is from UnifiedGenotyper vcf output file, QUAL values. The four samples in the vcf file are one Drosophila reference line, and three more which are outcrosses of the reference line and thus are heterozygous for the reference allele.

My fastq read-mapping pipeline includes adapter and low-quality base removal, and local re-alignment. I've also attached a pdf showing read quality distribution from one of the samples which also looks a bit odd.

Created 2013-03-13 10:18:39 | Updated 2013-03-13 10:19:56 | Tags: unifiedgenotyper snp reference assembly

The Unified Genotyper calls SNPs relative to the specified, publicly-available reference assembly.

How can I call SNPs (in many samples, with UG) relative to an in-house individual, which I have sequenced at high-coverage?

My current solution is to perform a de novo assembly on the in-house reference individual using e.g. Velvet, and then simply use the fasta as the reference for UG.

Can the publicly-available reference assembly still be useful here for speeding up the mapping and filling-in missing parts ?

My organism is Drosophila melanogaster.

Created 2013-03-04 15:05:36 | Updated | Tags: unifiedgenotyper snp variant-calling

Hi to all, I ran UnifiedGenotyper of three exome seq samples and phased with familial pedigree. During the manual filtration I saw several inconsistency. For example I get this output from UnifiedGenotyper and phasing:

chr2 38525498 rs76204302 T G 66.79 . AC=2;AF=0.333;AN=6;BaseQRankSum=-7.191;DB;DP=180;Dels=0.00;FS=208.951;HaplotypeScore=13.8978;MLEAC=2;MLEAF=0.333;MQ=60.00;MQ0=0;MQRankSum=2.030;QD=0.52;ReadPosRankSum=1.325;SB=-2.263e+00 GT:AD:DP:GQ:PL 0/0:30,22:52:39:0,39,729 0/1:9,7:16:5:5,0,280 0/1:55,57:112:98:98,0,1169

The order of samples are father, mother and son.How is possible that the father with, respectively 30 bases REF and 22 bases VAR is called 0/0? Thanks

Giuliano

Created 2013-02-13 07:39:15 | Updated | Tags: unifiedgenotyper snp snps

I have used UnifiedGenotyper to call SNPS. I found some SNPs that has been reported from low quality reads in chromosome X and chromosome Y. Is it possible not to take low quality reads into account while calling SNPs using UnifiedGenotyper? Or, do I need to do quality filtering of BAM files before hand ?

Created 2013-01-21 13:08:41 | Updated | Tags: selectvariants snp

I ran in to the situation now a couple of times that I need to extract a set of private SNPs from a multisample VCF file. For example in a forward genetics knockout screen of a large set of samples.

It is possible with vcf-contrast from vcf-tools:

vcf-contrast +sample1 -sample2 -sample3 -n input.vcf > private sample1.vcf

vcf-contrast -sample1 +sample2 -sample3 -n input.vcf > private sample2.vcf

vcf-contrast -sample1 -sample2 +sample3 -n input.vcf > private sample3.vcf

After this I still would have to filter out the private 0/0 calls and doing this for a large multisample VCF means entering this command for all the combinations which is not really nice.

Surely this must be possible with GATK. Does anyone know how to do this with GATK.

Maybe it is somewhere in the SelectVariants? The --discordance option looked promissing but there is something about that the samples should be the same? Or is it possible to write another variant walker or a JEXL expression?

P.S. By accident I also posted this question in the XHMM, an admin could remove it there.

Created 2012-11-16 07:11:18 | Updated 2013-01-07 20:02:39 | Tags: snp indels variantrecalibration

Hi All, I have used GATK UnifiedGenotyper to generate a raw.vcf file. Now I want to use GATK VQSR to get a more accurate result ,and I follow this protocol:

snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP);
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL);
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP);
analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL);

I wanna know will the it be better if I seperate the SNP and INDEL to run VQSR, like this:

SNP.raw.vcf , INDEL.raw.vcf <- SeperateSNPINDEL(raw.vcf);
snp.model <- BuildErrorModelWithVQSR(SNP.raw.vcf, SNP);
indel.model <- BuildErrorModelWithVQSR( INDEL.raw.vcf, INDEL);
INDEL_analysisReady.vcf <- ApplyRecalibration(INDEL.raw.vcf, INDEL.model, SNP);

Thanks a lot !

Created 2012-08-16 15:32:49 | Updated 2012-10-18 01:00:37 | Tags: best-practices reducereads snp bam

We are attempting to see if using ReducedReads will help with the overwhelming file sizes for the SNP calling we are doing on whole genome BAM files. We have been using a protocol similar to the one described in best practices document: Best: multi-sample realignment with known sites and recalibration. My question is what is the best point in the pipeline to use ReducedReads?

Created 2012-08-09 16:38:44 | Updated 2013-01-07 20:00:43 | Tags: unifiedgenotyper snp heterozygousreference

I am implementing a tool that uses reads to identify potential sites for cancer, but I need a reliable genotype call from non-cancerous tissue. Currently, I'm trying to use GATK to make those calls, filter out the bad, and then use the remaining calls when I look over the raw reads in the cancer tissue.

The problem I'm finding is that GATK doesn't provide a GQ for variants that are homozygous for the reference, and I don't know how to correlate the QUAL with the GQs from those homozygous non-ref. Is there a way that I'm overlooking to have GATK provide the GQ for homozygous reference non-variant sites?

Created 2012-07-24 06:49:35 | Updated 2012-07-24 13:04:31 | Tags: unifiedgenotyper ploidy snp

Leishmania has 36 chromosomes but their copy number is unpredictable for each strain and chromosome copy number can change very quickly. So what is an optimal ploidy setting for organisms with extensive aneuploidy? So far we use just diploid setting. Some samples have consistently more heterozygous SNPs in higher copy chromosomes but this relationship does not hold in many other samples: there is no strong correlation between chromosome copy number and abundance of heterozygous SNPs.