# Tagged with #indels 0 documentation articles | 0 announcements | 51 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.

Created 2016-01-06 10:44:22 | Updated 2016-01-06 11:14:58 | Tags: indels variant-calling deletion repeat-region

Hi,

I just want to know how the GATK calls a deletion of a base in a long repeat region? We came across a situation where GATK called a deletion in a sequence of CTAAAAAAAAAAAG , where TA > TA/A. We viewed it in IGV and came to see that the adjacent A to T have been called as the deletion. We are confused that the deletion can be on any of the A in the repeat, and how the caller predicts it is exactly the adjacent A to T. I am attaching the IGV view also. Please comment on this question.

Thank you, Athul

Created 2015-10-06 20:13:14 | Updated | Tags: vcf indels gatk

Have a long standing stable pipeline, which I need to tweak as can be explained and justified. What is the difference between the old file of known indels (from Mills paper(s)) - Homo_sapiens_assembly19.kown_indels.vcf And the current version of known indels with 1000G added - Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz

I know the newer reference is smaller - just sites. It also has significantly fewer records/line counts. What else was removed or added from the old version until now?

Created 2015-09-14 10:00:46 | Updated | Tags: haplotypecaller indels

Hi all,

I have the below INDEL call from GATK-3.3 Haplotype caller.

chr17   39190954    .   G   GCAGCAGCTTGGCTGGCAGCAGCTGGTCTCA 770.52  PASS    AC=1;AF=0.500;AN=2;DP=138;
FS=0.000;MLEAC=1;MLEAF=0.500;MQ=58.33;MQ0=0;QD=5.58;SOR=0.693   GT:AD:DP:GQ:PL  0/1:0,0:0:7:807,0,7

The command used:

java -Xmx10G -jar GenomeAnalysisTK.jar -R %s -T HaplotypeCaller -I %s -L %s -stand_emit_conf 10 -stand_call_conf 30
--genotyping_mode DISCOVERY -o %s

DP in the INFO field is 138 and AD from the FORMAT field is 0,0. I understand that DP and AD are unfiltered and filtered depths. However, having 0 reads is something alarming. Could someone help me to understand the differing read depths.

Created 2015-09-09 18:33:54 | Updated | Tags: indels asereadcounter

Hi, I'm trying to use ASEReadCounter to count reads per allele of some known indels, but got zero counts for altCount for all ~200,000 sites. But in haplotypecaller output vcf files, there're alternative allele reads. The ASEReadCounter output for SNPs (separate known snp site file) seem fine so far. Could you please let me know if ASEReadCounter can deal with indels?

Thank you very much! Bin

Created 2015-07-01 12:08:39 | Updated | Tags: ugcallvariants indels cnv

Hi,

I notice that although UnifiedGenotyper and HaplotypeCaller identify indels, none or reported greater than a length of around 50bp. Obviously integrating this ability into the already complicated algorithms is not easy, nor are such variants especially common - but they can easily be biologically-important, as well as interfered SNP calls, e.g. if a SNP is in a heterozygous deletion.

Do you know of any callers for larger indels/CNVs, specifically for multi-sample NGS projects ? I've looked at pindel and break-dancer (and probably a few others) but they seem mostly for single samples.

Created 2015-06-03 10:26:13 | Updated | Tags: haplotypecaller indels

Hi,

I am wondering whether GATK HC uses read pairs information (such as reads that are mapped with unexpected separation distances or orientation) to detect large structural variations ?

Thank you, Fabrice

Created 2015-05-29 19:52:11 | Updated | Tags: fastaalternatereference indels

Dear GATK team.

I would like to use the FastaAlternateReferenceMaker to generate draft genome assemblies of samples from plant species that are closely related to my reference. I know there are quite some large-scale insertions and deletions that I need to take into account.

My question is whether FastaAlternateReferenceMaker is able to process such large-scale (up to several kb in length) insertions and deletions?

The documentation states that it "works only for SNPs and for simple indels (but not for things like complex substitutions)". When are indels "simple" enough for the tool to be able to process them?

Thanks in advance for the information.

Robin

Created 2015-04-14 05:23:45 | Updated | Tags: variantrecalibrator indels

Hi,

I try to run VQSR on my VCF file produced by HaplotypeCaller. It worked pretty well when dealing with SNPs, but errors prompted up when recalibrating INDELs. I am running GATK version3.2. Commend lines are below:

/ifs2/BC_MD/DEV/WorkFlow/Sofware/jre1.7.0_45/bin/java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /ifs2/BC_MD/DEV/WorkFlow/Sofware/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T VariantRecalibrator -R /ifs2/BC_MD/DEV/WorkFlow/Data/bundle_2.8_hg19/ucsc.hg19.fasta -input ./mem_with_a/HC_raw.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 /ifs2/BC_MD/DEV/WorkFlow/Data/bundle_2.8_hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /ifs2/BC_MD/DEV/WorkFlow/Data/bundle_2.8_hg19/dbsnp_138.hg19.vcf -an QD -an FS -an ReadPosRankSum -an MQRankSum -mode INDEL --maxGaussians 4 -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile ./mem_with_a/HC_indel.VQSR.recal -tranchesFile ./mem_with_a/HC_indel.VQSR.tranches

/ifs2/BC_MD/DEV/WorkFlow/Sofware/jre1.7.0_45/bin/java -Xmx5g -Djava.io.tmpdir=pwd/tmp -jar /ifs2/BC_MD/DEV/WorkFlow/Sofware/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T ApplyRecalibration -R /ifs2/BC_MD/DEV/WorkFlow/Data/bundle_2.8_hg19/ucsc.hg19.fasta -input ./mem_with_a/HC_raw.vcf -mode INDEL --ts_filter_level 99.0 -recalFile ./mem_with_a/HC_indel.VQSR.recal -tranchesFile ./mem_with_a/HC_indel.VQSR.tranches -o ./mem_with_a/HC_indel.vcf

Error messages are there:

java.lang.IllegalArgumentException: No data found. at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138) at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: No data found. ##### ERROR ------------------------------------------------------------------------------------------ Thank you! Emma Created 2015-03-04 09:33:57 | Updated | Tags: indels Hi, I am trying to use the IndelGenotyper with the following command: java -Xmx4g -jar$GATK/IndelGenotyper.36.3336-GenomeAnalysisTK.jar -T IndelGenotyperV2 --somatic \ -I $DIR/Normal.bam$DIR/Tumor.bam \ --verboseOutput \ -o $DIR/Tumor.Indels \ -R$REF_GENOMEdecoy \ -L BED_FILE \ -nt 4 \ --minCoverage 30 \ --minNormalCoverage 30 However, I got this error message: Invalid argument value 'Tumor.bam' at position 5. Many thanks for your feedback. Created 2015-02-24 19:52:54 | Updated | Tags: unifiedgenotyper snp indels I am working with plants, and so cannot follow most of your human/cancer specific workflows etc. Also my question is probably not GATK specifc, but I would be very grateful for any kind of answer. I have used BWA to align a set of reads and the contigs assembled from these reads (de novo). I have used samtools mpileup and GATK unifiedGenotyper on the aligned reads. This results for both in a list of ONLY SNPs, the same ones (aside of sensitivity). I have also used the samtools steps on the aligned contigs, resulting in a list of ONLY indels. I can clearly see both SNPs and indels in IGV, both looking at the reads or the contigs. generally I see that reads are being used for variant calling, so is it "wrong" to use contigs? and why do I not get ANY indels in either tool's output when using reads? any ideas? Created 2015-01-13 01:21:15 | Updated | Tags: unifiedgenotyper indels snps genotype-given-alleles I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools: bcftools norm -m +any I remember from another thread that UG treats SNPs and InDels separately: http://gatkforums.broadinstitute.org/discussion/3375/combine-snp-and-indel-vcf But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA? I only had a minor issue despite tabix indexing my bgzipped known alleles file: ##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz Feel free to delete the original post of this question. Created 2014-11-07 08:45:55 | Updated | Tags: indels iontorrent hardfiltration Hey I'm having the following problem: I am working with different sequencing datasets (Ion Torrent) resulting from targeted sequencing. As it is not the full exome, I have to use hard filtration to filter those SNPs and indels (detected by the Haplotype Caller) that seem to be reliable. Regarding the SNPs the results are quite satisfying. However, regarding the indels, I have a relatively long list of possible candidates. Unfortunately I don't know the biological truth, so it is very hard to adjust certain parameters to finally exclude those indels, which are false positives. Therefore, I was thinking of a dataset, resulting from sequencing with Ion Torrent, and a list with validated indels in this dataset. With this dataset it would be possible to optimize my pipeline for detecting the true indels, and not all those false positives. As you propose certain thresholds for the hard filtration of indels, I thought you might have some datasets with validated indels, on which your proposals rely?! If so, is there a place where I can find those datasets? Many thanks in advance for your help! Created 2014-09-15 21:28:59 | Updated | Tags: indelrealigner unifiedgenotyper indels gatk variant-calling I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing). Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project. indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538 missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125 I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called. I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on. Created 2014-07-15 19:50:25 | Updated | Tags: indels I’m wondering why a small indel called at 99%+ in 1000 Genomes and observed in homozygous state in many exomes sequenced by others could come up as not detected in EVS for a region with >6000 EVS exomes sequenced at >25 average read depth and nearby (+/-20bp) filter PASS status? The EVS FAQ notes that the GATK pipeline was used to call indels and “In general, the INDEL calls are less robust than the SNP calls and have a higher false positive rate. When applying the ESP data to research studies, users are advised to keep this difference in mind.” However, what is the estimated false negative rate for small indels--is there any assessment of that (such as what fraction of 1000G small indel variants with >95% allele frequency were also detected by GATK on EVS samples)? Specifically the missing variant is chr9: 133710819 insC (dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=11433463) -- no such variant exists in EVS, whereas 1000 genomes does have this insC variant and reports it at 99%+ allele frequency (minor allele is in GRCh37): ## INFO= ## INFO= 9 133710819 . T TC 385 PASS AN=2184;RSQ=0.3011;ERATE=0.0009;VT=INDEL;AVGPOST=0.9871;THETA=0.0006;LDAF=0.9926;AC=2174;AF=1.00;ASN_AF=1.00;AMR_AF=0.99;AFR_AF=0.98;EUR_AF=1. The EVS UW team has confirmed this insC variant exists in the ESP trace data. How often were small indels either missed or called by the EVS GATK indel pipeline but subsequently filtered out of EVS final called for some reason? Is there any means to characterize the false positive or false negative rates of the GATK indel caller used for EVS? Specifically, is there some (average) read depth threshold below which GATK would not call an indel? Is there above some (average) read depth threshold is the estimated false positive and false negative rates say <5%? Dan Created 2014-07-10 20:18:01 | Updated | Tags: unifiedgenotyper indels I performed Variant Calling using UnifiedGenotyper and HaplotypeCaller for Whole Exome Sequenced samples. GATK version used GATK 3.0 I had only 4 African samples, since did not have any controls I took 70 African BAMs from 1000Genomes. Used Agilent Sure Select WES Version 5 target bed file as interval file to limit the calls to exomes. I couldot find a single INDEL in UnifiedGenotyper called data and hence couldnot run VQSR for indels on it. But HaplotypeCaller called 20,456 INDELs. UnifiedGenotyper Called Data # Total Variants-278319 # SNPs-278319 # INDELs-0 HaplotypeCaller Called Data # Total Variants-291269 # SNPs-270813 # INDELs-20456 Is it possible that for variants called from 74 BAMs, UnifiedGenotyper could not find a single INDEL ?? Thanks, Tinu Created 2014-07-02 15:05:28 | Updated | Tags: indels Hi, I was just wondering if an updated version of indelocator is available for public use. The version on the CGA website is pretty old. I've seen other questions regarding whether or not it is available, but no recent answers. Created 2014-06-11 15:10:28 | Updated | Tags: indelrealigner realignertargetcreator indels resources Hello there, Would you please let us know how "IndelRealigner" makes use of "known" resources? I assume it already has the intervals of interest for realignment from the "RealignerTargetCreator". So it's not clear why it needs the resources again. Would it use them for making any sort of decision to reject or accept realigned indels? Also, please let us know what happens (algorithmically) if we don't provide the same resources used in "RealignerTargetCreator" to the program. Thank you Amin Zia Created 2014-05-23 15:26:18 | Updated 2014-05-23 15:32:12 | Tags: unifiedgenotyper haplotypecaller indels Hello, I used UnifiedGenotyper and HaplotypeCaller on my sample to get out INDELS and SNPS. There is a 147bp deletion in one of the genes(working with yeast genome)(see attached file). But this deletion is not reported by UnifiedGenotyper and HaplotypeCaller. Since I am working with yeast genome(which is comparatively very small compared to human genome), I am only interested in INDELS ranging from 150bp-200bp maximum. Is there a way by changing parameters that this can happen. I have read some of the previous questions about how haplotypecaller is unable to call large INDELS(in kb's), but I guess in my case INDELS max I am looking for is 200bp. Can this be achieved?? Also UnifiedGenotyper has an option to provide ploidy level. I could not locate that option in HaplotypeCaller. Any way to have that on HaplotypeCaller ? I ask you this because the yeast genome samples I work with are eseentially haploid. NOTE: My sample is paired end data with 100bp reads Hope to hear from you soon Regards Varun Created 2014-05-15 20:29:50 | Updated | Tags: haplotypecaller snp indels vcf-format Hi, GATK Team I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel: 6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214 This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position: 6 16327909 . ATGCTGATGCTGC ATGC,CTGCTGATGCTGC,A 1390.70 . AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358 GT:AD:DP:GQ:PL However, this mode of representing a SNP is causing processing and analysis problems further downstream after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way? Thanks, Grace Created 2014-04-30 17:38:38 | Updated | Tags: variantrecalibrator bug indels This crops up when running with "-mode INDEL". Not sure why there is no data. (See attached log file with stack trace.) All input files are non-empty (except the .R file). A similar execution using "-mode SNP" completes with no problems. Since I'm simply looking to get the scripting and flags correct, I've used a public data set. Could it be that I'm unlucky and chose something that has no indels from the reference, which is causing the error? Could there be a more graceful method of termination? Created 2014-03-26 19:55:01 | Updated | Tags: vqsr indels Hi all -- This should be a simple problem -- I cannot find a valid version of the Mills indel reference in the resource bundle, or anywhere else online! All versions of the reference VCF are stripped of genotypes and do not contain a FORMAT column or any additional annotations. I am accessing the Broad's public FTP, and none of the Mills VCF files in bundle folders 2.5 or 2.8 contain a full VCF. I understand that there are "sites only" VCF, but I can't seem to find anything else. Can anyone link me to a version that contains the recommended annotations for indel VQSR, or that can be annotated? Created 2014-03-10 22:03:01 | Updated | Tags: unifiedgenotyper variantrecalibrator applyrecalibration snp indels Hi, I am running Variant Recalibration on Indels,prior to this I completed Variant Recalibration and ApplyRecalibration on SNPs. So,the input file is the recalibrated VCF file from Apply Recalibration step of SNP's. Below is the error I am getting. ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinsti tute.org/discussion/49/using-variant-annotator The command I am using for this is : jre1.7.0_40/bin/java -Djava.io.tmpdir=./rb_2905_VCF/tmp -Xmx2g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T VariantRecalibrator -R dbdata/human_g1k_v37.fasta -input{input_file} --maxGaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.vcf - resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.indels.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -recalFile $destdir/${input_file%recal.snps.vcf}recal.indel.recal -tranchesFile $destdir/${input_file%recal.snps.vcf}recal.indel.tranches -rscriptFile $destdir/${input_file%recal.snps.vcf}recal.indel.plots.R

If I remove the options -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum,then I get this error:

##### ERROR ------------------------------------------------------------------------------------------

Created 2014-02-24 16:12:14 | Updated | Tags: variantrecalibrator vqsr applyrecalibration indels

Hi,

Given that there's no tranche plot generated for indels using VariantRecalibrator, how do we assess which tranche to pick for the next step, ApplyRecalibration? On SNP mode, I'm using tranche plots to evaluate the tradeoff between true and false positive rates at various tranche levels, but that's not possible with indels.

Thanks!

Grace

Created 2014-02-18 19:05:14 | Updated 2014-02-18 19:06:52 | Tags: unifiedgenotyper insertsizedistribution indels

Hi, I have run gatk UnifiedGenotyper on ~45 human whole exomes. I have noticed that the maximum size for called indels in these exomes is 60bp. My question is, is 60 a default for the maximum indel size? I can't seem to find information regarding the threshold for indel size called by UnifiedGenotyper. Is the maximum indel size a parameter that can be adjusted?

In case this is of any use; This is the command I am using:

java -jar /usr/local/gatk/2.6-4/GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T UnifiedGenotyper -glm INDEL -I bams.list -L Targets.bed -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 1000

Thank you very much for your support! Dalia

Created 2014-02-14 08:56:11 | Updated 2014-02-14 09:06:51 | Tags: unifiedgenotyper indels

Hi there,

I'm systematically comparing different read mappers right now. That is, I align reads and run MarkDup/IndelRealignment/Recalibration followed by the UnifiedGenotyper.

When using stampy, I only get SNP calls but no indels. I've extracted one example where the alignments clearly show a homozygous 14bp deletion that does not show up in the VCF. For comparison purposes, I've looked at GSNAP alignments at the same locus. There the deletion looks a bit less clear in the IGV but is nonetheless present in the VCF produced by UG (screenshot attached).

My call to UG (2.8-1-g932cd3a) is the following:

gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.stampy.bam -o example.stampy.vcf

gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.gsnap.bam -o example.gsnap.vcf

So my question boils down to: Why is the deletion called for GSNAP but not for STAMPY, although it looks much cleaner for STAMPY. I'm attaching example BAM files and UG output.

Your help is very much appreciated. Please let me know if there is any additional information I can provide.

Cheers, Tobi

Created 2014-02-04 21:03:21 | Updated | Tags: unifiedgenotyper indels

Hi all,

I am running UnifiedGenotyper GLM=Indel Ploidy=1, which produces my raw indels, with no apparent problems. I then run BaseRecalibrator and PrintReads to generate a new BAM, which i am then recalling the indels. This time, I am getting no indels in the VCF (only headers). Curiously, this problem is only present in all 13/52 isolates that are most distant from the reference (expecting ~50K indels), while the rest have all worked (~6K indels). Any clues what might be causing this problem? Thanks, Rhys

Created 2014-02-03 15:43:45 | Updated | Tags: haplotypecaller indels solid

We ran a recent version of Haplotyper Caller on our SOLiD targeted resequencing data and got a ridiculous number of indels. We took a closer look at some and there was absolutely no evidence for an indel at a called position, and wondered whether the internal realignment was doing something weird? Is this a known problem for SOLiD data? Our Illumina data works much better. It makes us now wary of using GATK for SOLiD data...is it just a filtering thing?

Created 2014-01-14 08:21:56 | Updated | Tags: indels

Hi I have been trying to get my hands on help docs or command line examples for the indelocator ... It does not seem to have any support yet still part of the tool for cancer sample analysis. Do you have any recommendations? If it is not recommended then why make it public?

Many of us would like to understand how to run it, we can probably do our own filtering on verbose output.

Thanks, Barak

Created 2013-11-23 21:44:31 | Updated | Tags: unifiedgenotyper indels

Dear all,

I find the behaviour of UG with reduced reads and indels very odd. I am using the 2.7.4 GATK and I am dealing with an indel that looks completely legit in the normal BAM: 11 47354522 C 16 .,..+3CAC,+3cac.,,+3cac,+3cac..+3CAC,.,,, :=>!!=>!!>!=?;>>

The reduced BAM mpileup looks like: 11 47354522 C 9 ,,+3cac..+3CAC,.,,, C!>!=?;>> which I don't really have an issue with. I presume the forward and reverse reads containing the indels were collapsed into two composite reads.

Now this indel absolutely refuses to be called, until I set the option -minIndelCnt 2 and then the indel is found and called. It is as if the UG could not see that the reads supporting the indel are collapsed. Similarly in the VCF file the call looks like: 0/1:8,2:10:85:85,0,407 So depth 10, which I cannot make sense of. In any case with 2 reads supporting the indel, it is not consistent with the unreduced BAM file.

Is there something I do not understand here, or is there a bug with the handling of indels and the GATK UG?

Thank you in advance for suggestions,

Created 2013-09-12 14:38:40 | Updated 2013-09-12 14:39:06 | Tags: indelrealigner indels

I have ~2,000 whole-genome-sequencing bams that I need to move through an alignment/deduping/recalibration pipeline. I'll be using bwa-mem for the initial alignment, and I'm wondering if you would still recommend carrying out the local realignment step (downstream) for these bams, given bwa-mem does some local/gapped alignment on its own. This is an important question for our group because the local realignment (using RealignerTargetCreator & IndelRealigner) for these bams will take a significant amount of time / computational resources. If I want to call short indels using UnifiedGenotyper (rather than HC, due to number of samples), does bwa-mem for primary alignment reduce the need for local realignment downstream? Any feedback/thoughts would be much appreciated - thanks!

Created 2013-09-08 01:15:58 | Updated | Tags: indels filters

Hi, I am trying to filter the indels using GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R reference.fa -V raw_indels.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 ||ReadPosRankSum < -20.0" --filterName "my_indels_filter" -o filtered_indels.vcf

But I donot get the filtered file, instead i get the same indels as before, with no error message, please guide me if I doing in right way.

Here is the output that I get:

chr1 2379 . A AGAG 1359.54 my_indels_filter AC=4;AF=0.500;AN=8;BaseQRankSum=5.734;DP=67;FS=0.000;HaplotypeScore=230.3405;MLEAC=4;MLEAF=0.500;MQ=22.94;MQ0=4;MQRankSum=2.033;QD=20.29;ReadPosRankSum=3.432;SB=-1.476e-03 GT:AD:DP:GQ:PL 0/1:15,11:17:99:567,0,305 0/1:10,4:12:99:207,0,411 0/1:9,4:9:99:205,0,252 0/1:5,8:10:35:432,0,35 chr1 2414 . C CCCA 91.56 my_indels_filter AC=5;AF=0.625;AN=8;BaseQRankSum=4.234;DP=43;FS=0.000;HaplotypeScore=151.6242;MLEAC=5;MLEAF=0.625;MQ=23.55;MQ0=5;MQRankSum=-1.458;QD=18.87;ReadPosRankSum=-1.129;SB=-1.476e-03 GT:AD:DP:GQ:PL 0/1:9,3:10:99:276,0,262 0/1:3,1:4:56:56,0,122 0/1:12,2:7:99:139,0,241 1/1:0,2:9:24:395,24,0

Created 2013-09-06 12:49:58 | Updated 2013-09-06 12:50:13 | Tags: unifiedgenotyper indels

Hello, I noticed that you can use emit_all_sites in UnifiedGenotyper to gather information on alternate alleles at every base provided in the interval file. Is there a way to do this same process but with indels? I want to pass in an interval list of indel locations and a bam file and for every site in the interval list get number of reads supporting the indel and number of reference. Thanks you.

Created 2013-08-13 04:31:16 | Updated | Tags: variantrecalibrator indels snps mode

For exome-sequencing, I was wondering how I should decide the "-mode" parameter? should I use "SNP" or "Indel" or both? I want to get both information in the final results, should I use "BOTH" -mode?

Created 2013-07-25 04:08:18 | Updated 2013-07-25 04:10:10 | Tags: haplotypecaller output indels

Dear Team, i am using Haplotype caller to identify the indels in my sequence.. i am running it in command line..

java -jar /illumina/data/galaxy/apps/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T HaplotypeCaller -R genome.fa -I S21_full.picard.bam -o indels_S21.vcf

but i couldnt find any entries in the output file, also the file size shows 0..

INFO 09:14:25,280 ProgressMeter - Location processed.active regions runtime per.1M.active regions completed total.runtime remaining

INFO 09:14:55,283 ProgressMeter - chr1:4710472 4.72e+06 30.0 s 6.0 s 0.2% 5.5 h 5.4 h

INFO 09:15:25,284 ProgressMeter - chr1:8280981 8.29e+06 60.0 s 7.0 s 0.3% 6.2 h 6.2 h

INFO 09:15:55,285 ProgressMeter - chr1:11311033 1.13e+07 90.0 s 7.0 s 0.4% 6.8 h 6.8 h

INFO 09:16:25,286 ProgressMeter - chr1:15877953 1.59e+07 120.0 s 7.0 s 0.5% 6.5 h 6.5 h

INFO 09:16:55,287 ProgressMeter - chr1:19156923 1.92e+07 2.5 m 7.0 s 0.6% 6.7 h 6.7 h

INFO 09:17:25,288 ProgressMeter - chr1:22167191 2.22e+07 3.0 m 8.0 s 0.7% 7.0 h 6.9 h

INFO 09:17:55,289 ProgressMeter - chr1:24574489 2.46e+07 3.5 m 8.0 s 0.8% 7.3 h 7.3 h

Thanks Sridhar

Created 2013-07-17 15:35:58 | Updated | Tags: unifiedgenotyper indels

If you are using UnifiedGenotyper to call Indels, and you are looking at an Indel which is hard to sequence (say 9 Ts versus 10 Ts) the sequencing may generate more different Indel alleles than there really are. What is the best way to deal with this situation if you actually only want one alt allele called? If you limit max_alt_alleles to 2, what does the software do?

Created 2013-07-16 17:56:17 | Updated | Tags: variantrecalibrator indels

I am trying to recalibrate my VCF files for Indels calling using the below command lines:

java -Xmx2G -jar ../GenomeAnalysisTK.jar -T VariantRecalibrator \ -R ../GATK_ref/hg19.fasta \ -input ./Variants/gcat_set_053_2.raw.snps.indels.vcf \ -nt 4 \ -resource:mills,known=false,training=true,truth=true,prior=12.0 ../GATK_ref/Mills_and_1000G_gold_standard.indels.hg19.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../GATK_ref/dbsnp_137.hg19.vcf \ -an DP -an FS -an ReadPosRankSum -an MQRankSum \ --maxGaussians 4 \ -percentBad 0.05 \ -minNumBad 1000 \ -mode INDEL \ -recalFile ./Variants/VQSR/gcat_set_053_2.indels.vcf.recal \ -tranchesFile ./Variants/VQSR/gcat_set_053_2.indels.tranches \ -rscriptFile ./Variants/VQSR/gcat_set_053_2.indels.recal.plots.R > ./Variants/VQSR/IndelRecal2-noAnnot.log

I got this error message, even after taking the recommendation (e.g. maxGaussians 4, --percentBad 0.05). What does this error message mean? my files have too few variants? It's exome-seq.

##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)

Created 2013-07-07 12:46:50 | Updated 2013-07-07 12:51:16 | Tags: bundle vcf indels

Hi all,

I have been looking for a documentation for the INFO column in the VCF of the Mills indels in the GATK resource bundle (Mills_and_1000G_gold_standard.indels.b37.sites.vcf.gz), but to no avail. I did a unique list of all the possible entries for the INFO, but I have no idea what they mean. Does anybody know what they mean? Thank you!!

set=Intersect1000GAll

set=Intersect1000GAll-MillsDoubleCenter

set=Intersect1000GAll-MillsDoubleCenter-MillsTracesUnknown

set=Intersect1000GAll-MillsTracesUnknown

set=Intersect1000GMinusBI

set=Intersect1000GMinusDI

set=Intersect1000GMinusDI-MillsDoubleCenter

set=Intersect1000GMinusOX

set=Intersect1000GMinusOX-MillsCenterUnknown-MillsTracesUnknown

set=Intersect1000GMinusOX-MillsDoubleCenter

set=Intersect1000GMinusOX-MillsDoubleCenter-MillsTracesUnknown

set=Intersect1000GMinusOX-MillsTracesUnknown

set=Intersect1000GMinusSI

set=Intersect1000GMinusSI-MillsDoubleCenter

set=Intersect1000GMinusSI-MillsDoubleCenter-MillsTracesUnknown

set=Intersect1000GMinusSI-MillsTracesUnknown

set=MillsAlleleMatch1000G

set=MillsAlleleMatch1000G-MillsDoubleCenter

set=MillsAlleleMatch1000G-MillsDoubleCenter-MillsTracesUnknown

set=MillsAlleleMatch1000G-MillsTracesUnknown

set=MillsCenterUnknown

set=MillsCenterUnknown-Intersect1000GAll-MillsTracesUnknown

set=MillsCenterUnknown-Intersect1000GMinusBI-MillsTracesUnknown

set=MillsCenterUnknown-Intersect1000GMinusDI-MillsTracesUnknown

set=MillsCenterUnknown-Intersect1000GMinusSI

set=MillsCenterUnknown-Intersect1000GMinusSI-MillsTracesUnknown

set=MillsCenterUnknown-MillsAlleleMatch1000G

set=MillsCenterUnknown-MillsAlleleMatch1000G-MillsTracesUnknown

set=MillsCenterUnknown-MillsTraces3Plus

set=MillsCenterUnknown-MillsTraces3Plus-Intersect1000GMinusSI

set=MillsCenterUnknown-MillsTraces3Plus-MillsAlleleMatch1000G

set=MillsCenterUnknown-MillsTracesUnknown

set=MillsDoubleCenter

set=MillsDoubleCenter-MillsTracesUnknown

set=MillsTraces3Plus

set=MillsTraces3Plus-Intersect1000GMinusSI

set=MillsTraces3Plus-MillsAlleleMatch1000G

set=MillsTracesUnknown

Created 2013-06-30 14:38:27 | Updated | Tags: selectvariants jexl indels

How can I select indels with lenght smaller than 10 bp from a vcf file?

I tried

java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fa --variant INDEL.vcf -o INDEL_maxLenght10.vcf -select 'vc.getIndelLengths().0 < 10'

but the output still contains all the Indels, also the ones larger than 10 bp.

Created 2013-05-13 14:10:51 | Updated | Tags: vqsr indels vqslod

Hi Mark, Eric -

First, I wanted to thank you guys for providing advice with respect to running VQSR. I am already sold and a huge fan of the method :-).

I was wondering if either of you could comment on VQSLOD and sensitivity filter tranche? To be more specific, if I set a filter threshold of 99% for sensitivity and VQSLOD < 0 I imagine that probably is not a good idea! However, a VQSLOD of 3 or 5 may be appropriate in the statistical sense, i.e. pretty confident that this is a real variant. Finally, I am thinking we should include VQSLOD in our statistical genetic association mapping methods. I wanted to get a sense from either of you what VQSLOD you would want to completely remove from analysis?

Best Wishes,

Manny.

Created 2013-04-16 11:30:01 | Updated | Tags: unifiedgenotyper indels gatk

Hi,

I was using GATK UnifiedGenotyper for SNPs/indels. My command is as follows:

java -Xmx16g -jar $gatkPath/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R$humanRefSequence \ -I "$sampleID".recal.bam \ --dbsnp$humanDbsnpFile \ -stand_call_conf 30.0 \ -stand_emit_conf 10.0 \ -dcov 250 \ -l INFO -log "$sampleID".GATKvariants.log \ -o "$sampleID".GATKvariants.raw.vcf \ --output_mode EMIT_ALL_SITES

When I am selecting Indels from the vcf file, it's producing an empty vcf file (which is working fine with SNPs). My command for selecting Indels are as follows:

java -Xmx10g -jar $gatkPath/GenomeAnalysisTK.jar \ -T SelectVariants \ -R$humanRefSequence \ --variant "$sampleID".GATKvariants.raw.vcf \ -o "$sampleID".GATKindels.raw.vcf \ -selectType INDEL \ -log "\$sampleID".SelectIndels.log

Do I need to use "-glm BOTH" in UnifiedGenotyper to output both SNPs and Indels? As from the documentation, I understand that the default value of -glm is set to SNP.

Created 2013-03-04 15:38:03 | Updated | Tags: depthofcoverage indels

Hello,

For matched tumor and normal pairs, we easily get insertion and deletion counts from the output of Somatic Indel Detector in GATK. However, when we run multiple samples from the same patient, sometimes calls are made in one sample but not another, so we might not have the numbers for all samples for all indel events. We can get the deletion counts from Depth of Coverage in GATK, but retrieving insertions is trickier.

Does you have a suggestion for how to solve this problem in an automated (ie non-IGV fashion)?

Additionally, as DepthofCoverage is being retired, what do you recommend that we use for getting SNP and deletion counts?

Thank you

Created 2013-02-08 03:58:15 | Updated | Tags: unifiedgenotyper indels

Hi, I'm using The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34

I was running the UnifiedGenotyper tools on a 100bp paired end read with the the command below:

Program Args: -T UnifiedGenotyper -R ref.fasta -I realignedBwa.bam -glm BOTH -log variantsCalling.log -o bwa_variants.vcf

I noticed that only SNPs were called and not a single indel was called. however, when i used a 150bp paired end read with the similar command above it worked fine, the SNPs and Indels were called.

To make sure that the realignedBWA.bam file that i used was not corrupt/malformed, i used two different other pileup program on this bam file and the SNPs and Indels were called nicely. Hope you can take a look at this. I attached together the log file just in case

Created 2013-01-14 11:07:56 | Updated 2013-01-14 14:20:02 | Tags: combinevariants merge indels

Hi. I want to merge two VCF files. Initially I was selected only indels(by select variant option). Now I want to merge these two VCF file which contains only INDELS. But When I run the command, I am getting the same error:

ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NumberFormatException: For input string: "."

I run this command:

java -jar -Xmx2g GenomeAnalysisTK.jar -R hg19_5.fasta -T CombineVariants -V indelsample1.vcf -V indelsample3.vcf -o indels1s3.vcf -genotypeMergeOptions UNIQUIFY

Could you please tell me what is the reason behind this? and how to merge two VCF file having INDELS?

Created 2013-01-11 09:17:53 | Updated 2013-01-11 14:24:02 | Tags: unifiedgenotyper indels

Hello,

Could anyone give me an example generic command for single-sample and multi-sample INDEL calling using the Unified Genotyper? I only found documentation for SNP calling. And when I tried something like that :

java -jar GenomeAnalysisTK.jar \
-R resources/Homo_sapiens_assembly19.fasta \
-T UnifiedGenotyper \
-I input.bam \
--dbsnp dbsnp_137.b37.vcf \
--genotype_likelihoods_model INDEL \
-o output.indels.raw.vcf \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0 \
-dcov 50

The number of indels detected and reported in my output vcf file is too low (only 400 indels for an exome).

Created 2013-01-04 16:58:20 | Updated 2013-01-07 19:13:17 | Tags: fisherstrand indels

I am filtering looking for rare variants and found some frameshift variants in an interesting gene. Some of them are noted as PASS in the QC column of the VCF and some are noted as Indel_FS . What exactly does that second notation mean? I am almost positive that these will validate given how they segregate in my subjects.

Created 2012-12-24 03:25:30 | Updated | Tags: unifiedgenotyper indels

When I use EMIT_ALL_CONFIDENT_SITES for SNPs, I get an expected very large list of genotypes regardless if the genotypes vary from the reference. When I use the same command line but I switch the model to Indels, I only get a VCF of variant sites. Is the EMIT_ALL_CONFIDENT_SITES option not compatible with Indel discovery?

I'm grateful for any clarification.

Created 2012-11-27 09:55:19 | Updated | Tags: unifiedgenotyper indels

Hi,

I have noticed, sometimes GATK UnifiedGenotyper calls overlapping INDELs even though there should be 3 non-overlapping (independent) calls.

I have attached 200bp region from my BAM, reference and vcf file, so you can reproduce this issue.

Cheers, Leszek

Created 2012-11-16 07:11:18 | Updated 2013-01-07 20:02:39 | Tags: snp indels variantrecalibration

Hi All, I have used GATK UnifiedGenotyper to generate a raw.vcf file. Now I want to use GATK VQSR to get a more accurate result ,and I follow this protocol:

snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP);
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL);
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP);
analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL);

I wanna know will the it be better if I seperate the SNP and INDEL to run VQSR, like this:

SNP.raw.vcf , INDEL.raw.vcf <- SeperateSNPINDEL(raw.vcf);
snp.model <- BuildErrorModelWithVQSR(SNP.raw.vcf, SNP);
indel.model <- BuildErrorModelWithVQSR( INDEL.raw.vcf, INDEL);
INDEL_analysisReady.vcf <- ApplyRecalibration(INDEL.raw.vcf, INDEL.model, SNP);

Thanks a lot !

Created 2012-10-10 15:49:51 | Updated 2012-10-18 01:17:18 | Tags: indels fixmate realignment

I recently used GATK, the latest version to carry out Indel realignment for my bam files. I have read at few places that one needs to fix the mate position information as the reads mapping position may change during realignment. My question is that does this step is taken care of y GATK automatically or one needs to run Picard fixmates before going for subsequent analysis.

Thanks

Created 2012-09-27 08:13:35 | Updated 2012-10-18 00:51:39 | Tags: unifiedgenotyper indels clipping

Hi,

I'm running the UnifiedGenotyper (gatk version 2.0-35) with -glm INDEL on a dataset. When I use -glm SNP everything works fine and I get my SNV calls, but if I use INDEL I get the following Error message:

It looks like something is up with the read coordinates. I ran this particular script with -L 1 to limit to chromosome 1 but when I check the first and last reads of this chromosome everything seems to be in order (at least they map within the chromosome coordinates).

Cheers, Paul

Created 2012-08-10 18:19:46 | Updated 2013-01-07 19:45:12 | Tags: unifiedgenotyper vcf indels

Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.

Here are deletions:

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0