# Tagged with #indels 1 documentation article | 0 announcements | 40 forum discussions

Note that the Somatic Indel Detector was previously called Indel Genotyper V2.0

For a complete, detailed argument reference, refer to the GATK document page here.

### Calling strategy

The Somatic Indel Detector can be run in two modes: single sample and paired sample. In the former mode, exactly one input bam file should be given, and indels in that sample are called. In the paired mode, the calls are made in the tumor sample, but in addition to that the differential signal is sought between the two samples (e.g. somatic indels present in tumor cell DNA but not in the normal tissue DNA). In the paired mode, the genotyper makes an initial call in the tumor sample in the same way as it would in the single sample mode; the call, however, is then compared to the normal sample. If any evidence (even very weak, so that it would not trigger a call in single sample mode) for the event is found in the normal, the indel is annotated as germline. Only when the minimum required coverage in the normal sample is achieved and there is no evidence in the normal sample for the event called in the tumor is the indel annotated as somatic.

The calls in both modes (recall that in paired mode the calls are made in tumor sample only and are simply annotated according to the evidence in the matching normal) are performed based on a set of simple thresholds. Namely, all distinct events (indels) at the given site are collected, along with the respective counts of alignments (reads) supporting them. The putative call is the majority vote consensus (i.e. the indel that has the largest count of reads supporting it). This call is accepted if 1) there is enough coverage (as well as enough coverage in matching normal sample in paired mode); 2) reads supporting the consensus indel event constitute a sufficiently large fraction of the total coverage at the site; 3) reads supporting the consensus indel event constitute a sufficiently large fraction of all the reads supporting any indel at the site. See details in the Arguments section of the tool documentation.

Theoretically, the Somatic Indel Detector can be run directly on the aligned short read sequencing data. However, it does not perform any deep algorithmic tasks such as searching for misplaced indels close to a given one, or correcting read misalignments given the presence of an indel in another read, etc. Instead, it assumes that all the evidence for indels (all the reads that support it), for the presence of the matching event in normal etc is already in the input and performs simple counting. It is thus highly, HIGHLY recommended to run the Somatic Indel Detector on "cleaned" bam files, after performing Local realignment around indels.

### Output

Brief output file (specified with -bed option) will look as follows:

chr1    556817  556817  +G:3/7
chr1    3535035 3535054 -TTCTGGGAGCTCCTCCCCC:9/21
chr1    3778838 3778838 +A:15/48
...


This is a .bed track that can be loaded into UCSC browser or IGV browser, the event itself and the <count of supporting reads>/<total coverage> are reported in the 'name' field of the file. The event locations on the chromosomes are 1-based, and the convention is that all events (both insertions and deletions) are assigned to the base on the reference immediately preceding the event (second column). The third column is the stop position of the event on the reference, or strictly speaking the base immediately preceding the first base on the reference after the event: the last deleted base for deletions, or the same base as the start position for insertions. For instance, the first line in the above example specifies an insertion (+G) supported by 3 reads out of 7 (i.e. total coverage at the site is 7x) that occurs immediately after genomic position chr1:556817. The next line specifies a 19 bp deletion -TTCTGGGAGCTCCTCCCCC supported by 9 reads (total coverage 21x) occuring at (after) chr1:3535035 (the first and last deleted bases are 3535035+1=3535036 and 3535054, respectively).

Note that in the paired mode all calls made in tumor (both germline and somatic) will be printed into the brief output without further annotations.

The detailed (verbose) output option is kept for backward compatibility with post-processing tools that might have been developed to work with older versions of the IndelGenotyperV2. All the information described below is now also recorded into the vcf output file, so the verbose text output is completely redundant, except for genomic annotations (if --refseq is used). Generated vcf file can be annotated separately using VCF post-processing tools.

The detailed output (-verbose) will contain additional statistics characterizing the alignments around each called event, SOMATIC/GERMLINE annotations (in paired mode), as well as genomic annotations (when --refseq is used). The verbose output lines matching the three lines from the example above could look like this (note that the long lines are wrapped here, the actual output file contains one line per event):

chr1    556817  556817  +G      N_OBS_COUNTS[C/A/T]:0/0/52      N_AV_MM[C/R]:0.00/5.27  N_AV_MAPQ[C/R]:0.00/35.17    \
N_NQS_MM_RATE[C/R]:0.00/0.08    N_NQS_AV_QUAL[C/R]:0.00/23.74   N_STRAND_COUNTS[C/C/R/R]:0/0/32/20  \
T_OBS_COUNTS[C/A/T]:3/3/7       T_AV_MM[C/R]:2.33/5.50  T_AV_MAPQ[C/R]:66.00/24.75   \
T_NQS_MM_RATE[C/R]:0.05/0.08    T_NQS_AV_QUAL[C/R]:20.26/11.61  T_STRAND_COUNTS[C/C/R/R]:3/0/2/2 \
SOMATIC GENOMIC
chr1    3535035 3535054 -TTCTGGGAGCTCCTCCCCC  N_OBS_COUNTS[C/A/T]:3/3/6 N_AV_MM[C/R]:3.33/2.67  N_AV_MAPQ[C/R]:73.33/99.00 \
N_NQS_MM_RATE[C/R]:0.00/0.00    N_NQS_AV_QUAL[C/R]:29.27/31.83  N_STRAND_COUNTS[C/C/R/R]:0/3/0/3  \
T_OBS_COUNTS[C/A/T]:9/9/21      T_AV_MM[C/R]:1.56/0.17  T_AV_MAPQ[C/R]:88.00/99.00    \
T_NQS_MM_RATE[C/R]:0.02/0.00   T_NQS_AV_QUAL[C/R]:30.86/25.25  T_STRAND_COUNTS[C/C/R/R]:2/7/2/10  \
GERMLINE  UTR TPRG1L
chr1    3778838 3778838 +A      N_OBS_COUNTS[C/A/T]:5/7/22      N_AV_MM[C/R]:5.00/5.20  N_AV_MAPQ[C/R]:54.20/81.20  \
N_NQS_MM_RATE[C/R]:0.00/0.01    N_NQS_AV_QUAL[C/R]:24.94/26.05  N_STRAND_COUNTS[C/C/R/R]:4/1/15/0  \
T_OBS_COUNTS[C/A/T]:15/15/48    T_AV_MM[C/R]:9.73/4.21  T_AV_MAPQ[C/R]:91.53/86.09  \
T_NQS_MM_RATE[C/R]:0.17/0.02    T_NQS_AV_QUAL[C/R]:30.57/25.19  T_STRAND_COUNTS[C/C/R/R]:15/0/32/1 \
GERMLINE   INTRON  DFFB


The fields are tab-separated. The first four fields confer the same event and location information as in the brief format (chromosome, last reference base before the event, last reference base of the event, event itself). Event information is followed by tagged fields reporting various collected statistics. In the paired mode (as in the example shown above), there will be two sets of the same statistics, one for normal (prefixed with 'N_') and one for tumor (prefixed with 'T_') samples. In the single sample mode, there will be only one set of statistics (for the only sample analyzed) and no 'N_'/'T_' prefixes. Statistics are stratified into (two or more of) the following classes: (C)onsensus-supporting reads (i.e. the reads that contain the called event, for which the line is printed); (A)ll reads that contain an indel at the site (not necessarily the called consensus); (R)eference allele-supporting reads, (T)otal=all reads.

For instance, the field T_OBS_COUNTS[C/A/T]:3/3/7 in the first line of the example above should be interpreted as follows: a) this is the OBS_COUNTS statistics for the (T)umor sample (this particular one is simply the read counts, all statistics are listed below); b) The statistics is broken down into three classes: [C/A/T]=(C)onsensus/(A)ll-indel/(T)otal coverage; c) the respective values in each class are 3, 3, 7. In other words, the insertion +G is observed in 3 distinct reads, there was a total of 3 reads with an indel at the site (i.e. only consensus was observed in this case with no observations for any other indel event), and the total coverage at the site is 7. Examining the N_OBS_COUNTS field in the same record, we can conclude that the total coverage in normal at the same site was 52, and among those reads there was not a single one carrying any indel (C/A/T=0/0/52). Hence the 'SOMATIC' annotation added towards the end of the line.

In paired mode the tagged statistics fields are always followed by GERMLINE/SOMATIC annotation (in single sample mode this field is skipped). If --refseq option is used, the next field will contain the coding status annotation (one of GENOMIC/INTRON/UTR/CODING), optionally followed by the gene name (present if the indel is within the boundaries of an annotated gene, i.e. the status is not GENOMIC).

#### List of annotations produced in verbose mode

NOTE: in older versions the OBS_COUNTS statistics was erroneously annotated as [C/A/R] (last class R, not T). This was a typo, and the last number reported in the triplet was still total coverage.

Duplicated reads, reads with mapping quality 0, or reads coming from blacklisted lanes are not counted and do not contribute to any of the statistics.

When no reads are available in a class (e.g. the count of consensus indel-supporting reads in normal sample is 0), all the other statistics for that class (e.g. average mismatches per read, average base qualities in NQS window etc) will be set to 0. For some statistics (average number of mismatches) this artificial value can be "very good", for some others (average base quality) it's "very bad". Needless to say, all those zeroes reported for the classes with no reads should be ignored when attempting call filtering.

• OBS_COUNTS[C/A/T] Observed counts of reads supporting the consensus (called) indel, all indels (consensus + any others), and the total coverage at the site, respectively.
• AV_MM[C/R] Average numbers of mismatches across consensus indel- and reference allele-supporting reads.
• AV_MAPQ[C/R] Average mapping qualities (as reported in the input bam file) of consensus indel- and reference allele-supporting reads.
• NQS_AV_QUAL[C/R] Average base quality computed across all bases falling into the 5bp window on each side of the indel and coming form all consensus- or reference-supporting reads, respectively.
• STRAND_COUNTS[C/C/R/R] Counts of consensus-supporting forward aligned, consensus-supporting rc-aligned, reference-supporting forward-aligned and reference-supporting rc-aligned reads, respectively.

### Creating a indel mask file

The output of the Somatic Indel Detector can be used to mask out SNPs near indels. To do this, we have a script that creates a bed file representing the masking intervals based on the output of this tool. Note that this script requires a full SVN checkout of the GATK, although the strategy is simple: for each indel, create an interval which extends N bases to either side of it.

python python/makeIndelMask.py <raw_indels> <mask_window> <output>
e.g.

No posts found with the requested search criteria.

I am working with plants, and so cannot follow most of your human/cancer specific workflows etc. Also my question is probably not GATK specifc, but I would be very grateful for any kind of answer. I have used BWA to align a set of reads and the contigs assembled from these reads (de novo). I have used samtools mpileup and GATK unifiedGenotyper on the aligned reads. This results for both in a list of ONLY SNPs, the same ones (aside of sensitivity). I have also used the samtools steps on the aligned contigs, resulting in a list of ONLY indels. I can clearly see both SNPs and indels in IGV, both looking at the reads or the contigs. generally I see that reads are being used for variant calling, so is it "wrong" to use contigs? and why do I not get ANY indels in either tool's output when using reads? any ideas?

I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools:

bcftools norm -m +any


I remember from another thread that UG treats SNPs and InDels separately:

http://gatkforums.broadinstitute.org/discussion/3375/combine-snp-and-indel-vcf


But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA?

I only had a minor issue despite tabix indexing my bgzipped known alleles file:

##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz


Feel free to delete the original post of this question.

Hey

I'm having the following problem: I am working with different sequencing datasets (Ion Torrent) resulting from targeted sequencing. As it is not the full exome, I have to use hard filtration to filter those SNPs and indels (detected by the Haplotype Caller) that seem to be reliable. Regarding the SNPs the results are quite satisfying. However, regarding the indels, I have a relatively long list of possible candidates. Unfortunately I don't know the biological truth, so it is very hard to adjust certain parameters to finally exclude those indels, which are false positives.

Therefore, I was thinking of a dataset, resulting from sequencing with Ion Torrent, and a list with validated indels in this dataset. With this dataset it would be possible to optimize my pipeline for detecting the true indels, and not all those false positives.

As you propose certain thresholds for the hard filtration of indels, I thought you might have some datasets with validated indels, on which your proposals rely?! If so, is there a place where I can find those datasets?

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.

I’m wondering why a small indel called at 99%+ in 1000 Genomes and observed in homozygous state in many exomes sequenced by others could come up as not detected in EVS for a region with >6000 EVS exomes sequenced at >25 average read depth and nearby (+/-20bp) filter PASS status?

The EVS FAQ notes that the GATK pipeline was used to call indels and “In general, the INDEL calls are less robust than the SNP calls and have a higher false positive rate. When applying the ESP data to research studies, users are advised to keep this difference in mind.” However, what is the estimated false negative rate for small indels--is there any assessment of that (such as what fraction of 1000G small indel variants with >95% allele frequency were also detected by GATK on EVS samples)?

Specifically the missing variant is chr9: 133710819 insC (dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=11433463) -- no such variant exists in EVS, whereas 1000 genomes does have this insC variant and reports it at 99%+ allele frequency (minor allele is in GRCh37):

## INFO=<ID=AN,Number=1,Type=Integer,Description="Total Allele Count">

9 133710819 . T TC 385 PASS AN=2184;RSQ=0.3011;ERATE=0.0009;VT=INDEL;AVGPOST=0.9871;THETA=0.0006;LDAF=0.9926;AC=2174;AF=1.00;ASN_AF=1.00;AMR_AF=0.99;AFR_AF=0.98;EUR_AF=1.

The EVS UW team has confirmed this insC variant exists in the ESP trace data. How often were small indels either missed or called by the EVS GATK indel pipeline but subsequently filtered out of EVS final called for some reason?

Is there any means to characterize the false positive or false negative rates of the GATK indel caller used for EVS?

Specifically, is there some (average) read depth threshold below which GATK would not call an indel?

Is there above some (average) read depth threshold is the estimated false positive and false negative rates say <5%?

Dan

I performed Variant Calling using UnifiedGenotyper and HaplotypeCaller for Whole Exome Sequenced samples.

GATK version used GATK 3.0

I had only 4 African samples, since did not have any controls I took 70 African BAMs from 1000Genomes. Used Agilent Sure Select WES Version 5 target bed file as interval file to limit the calls to exomes.

I couldot find a single INDEL in UnifiedGenotyper called data and hence couldnot run VQSR for indels on it. But HaplotypeCaller called 20,456 INDELs.

UnifiedGenotyper Called Data

# INDELs-0

HaplotypeCaller Called Data

# INDELs-20456

Is it possible that for variants called from 74 BAMs, UnifiedGenotyper could not find a single INDEL ??

Thanks, Tinu

Hello there,

Would you please let us know how "IndelRealigner" makes use of "known" resources? I assume it already has the intervals of interest for realignment from the "RealignerTargetCreator". So it's not clear why it needs the resources again. Would it use them for making any sort of decision to reject or accept realigned indels?

Also, please let us know what happens (algorithmically) if we don't provide the same resources used in "RealignerTargetCreator" to the program.

Thank you Amin Zia

Hello,

I used UnifiedGenotyper and HaplotypeCaller on my sample to get out INDELS and SNPS. There is a 147bp deletion in one of the genes(working with yeast genome)(see attached file). But this deletion is not reported by UnifiedGenotyper and HaplotypeCaller. Since I am working with yeast genome(which is comparatively very small compared to human genome), I am only interested in INDELS ranging from 150bp-200bp maximum. Is there a way by changing parameters that this can happen. I have read some of the previous questions about how haplotypecaller is unable to call large INDELS(in kb's), but I guess in my case INDELS max I am looking for is 200bp. Can this be achieved??

Also UnifiedGenotyper has an option to provide ploidy level. I could not locate that option in HaplotypeCaller. Any way to have that on HaplotypeCaller ? I ask you this because the yeast genome samples I work with are eseentially haploid.

NOTE: My sample is paired end data with 100bp reads

Hope to hear from you soon

Regards Varun

Hi, GATK Team

I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel:

6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214

This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position:

6   16327909    .   ATGCTGATGCTGC   ATGC,CTGCTGATGCTGC,A    1390.70 .   AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358   GT:AD:DP:GQ:PL


However, this mode of representing a SNP is causing processing and analysis problems further downstream after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way?

Thanks,

Grace

This crops up when running with "-mode INDEL". Not sure why there is no data. (See attached log file with stack trace.)

All input files are non-empty (except the .R file). A similar execution using "-mode SNP" completes with no problems. Since I'm simply looking to get the scripting and flags correct, I've used a public data set. Could it be that I'm unlucky and chose something that has no indels from the reference, which is causing the error? Could there be a more graceful method of termination?

Hi all --

This should be a simple problem -- I cannot find a valid version of the Mills indel reference in the resource bundle, or anywhere else online!

All versions of the reference VCF are stripped of genotypes and do not contain a FORMAT column or any additional annotations.

I am accessing the Broad's public FTP, and none of the Mills VCF files in bundle folders 2.5 or 2.8 contain a full VCF. I understand that there are "sites only" VCF, but I can't seem to find anything else.

Can anyone link me to a version that contains the recommended annotations for indel VQSR, or that can be annotated?

Hi,

I am running Variant Recalibration on Indels,prior to this I completed Variant Recalibration and ApplyRecalibration on SNPs. So,the input file is the recalibrated VCF file from Apply Recalibration step of SNP's.

Below is the error I am getting.

##### ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinsti

tute.org/discussion/49/using-variant-annotator

The command I am using for this is :

jre1.7.0_40/bin/java -Djava.io.tmpdir=./rb_2905_VCF/tmp -Xmx2g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T VariantRecalibrator -R dbdata/human_g1k_v37.fasta -input ${input_file} --maxGaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.vcf - resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.indels.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -recalFile$destdir/${input_file%recal.snps.vcf}recal.indel.recal -tranchesFile$destdir/${input_file%recal.snps.vcf}recal.indel.tranches -rscriptFile$destdir/{input_file%recal.snps.vcf}recal.indel.plots.R If I remove the options -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum,then I get this error: ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Argument with name '--use_annotation' (-an) is missing. ##### ERROR ------------------------------------------------------------------------------------------ Hi, Given that there's no tranche plot generated for indels using VariantRecalibrator, how do we assess which tranche to pick for the next step, ApplyRecalibration? On SNP mode, I'm using tranche plots to evaluate the tradeoff between true and false positive rates at various tranche levels, but that's not possible with indels. Thanks! Grace Hi, I have run gatk UnifiedGenotyper on ~45 human whole exomes. I have noticed that the maximum size for called indels in these exomes is 60bp. My question is, is 60 a default for the maximum indel size? I can't seem to find information regarding the threshold for indel size called by UnifiedGenotyper. Is the maximum indel size a parameter that can be adjusted? In case this is of any use; This is the command I am using: java -jar /usr/local/gatk/2.6-4/GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T UnifiedGenotyper -glm INDEL -I bams.list -L Targets.bed -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 1000 Thank you very much for your support! Dalia Hi there, I'm systematically comparing different read mappers right now. That is, I align reads and run MarkDup/IndelRealignment/Recalibration followed by the UnifiedGenotyper. When using stampy, I only get SNP calls but no indels. I've extracted one example where the alignments clearly show a homozygous 14bp deletion that does not show up in the VCF. For comparison purposes, I've looked at GSNAP alignments at the same locus. There the deletion looks a bit less clear in the IGV but is nonetheless present in the VCF produced by UG (screenshot attached). My call to UG (2.8-1-g932cd3a) is the following: gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.stampy.bam -o example.stampy.vcf gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.gsnap.bam -o example.gsnap.vcf  So my question boils down to: Why is the deletion called for GSNAP but not for STAMPY, although it looks much cleaner for STAMPY. I'm attaching example BAM files and UG output. Your help is very much appreciated. Please let me know if there is any additional information I can provide. Cheers, Tobi Hi all, I am running UnifiedGenotyper GLM=Indel Ploidy=1, which produces my raw indels, with no apparent problems. I then run BaseRecalibrator and PrintReads to generate a new BAM, which i am then recalling the indels. This time, I am getting no indels in the VCF (only headers). Curiously, this problem is only present in all 13/52 isolates that are most distant from the reference (expecting ~50K indels), while the rest have all worked (~6K indels). Any clues what might be causing this problem? Thanks, Rhys We ran a recent version of Haplotyper Caller on our SOLiD targeted resequencing data and got a ridiculous number of indels. We took a closer look at some and there was absolutely no evidence for an indel at a called position, and wondered whether the internal realignment was doing something weird? Is this a known problem for SOLiD data? Our Illumina data works much better. It makes us now wary of using GATK for SOLiD data...is it just a filtering thing? Dear all, I find the behaviour of UG with reduced reads and indels very odd. I am using the 2.7.4 GATK and I am dealing with an indel that looks completely legit in the normal BAM: 11 47354522 C 16 .,..+3CAC,+3cac.,,+3cac,+3cac..+3CAC,.,,, :=>!!=>!!>!=?;>> The reduced BAM mpileup looks like: 11 47354522 C 9 ,,+3cac..+3CAC,.,,, C!>!=?;>> which I don't really have an issue with. I presume the forward and reverse reads containing the indels were collapsed into two composite reads. Now this indel absolutely refuses to be called, until I set the option -minIndelCnt 2 and then the indel is found and called. It is as if the UG could not see that the reads supporting the indel are collapsed. Similarly in the VCF file the call looks like: 0/1:8,2:10:85:85,0,407 So depth 10, which I cannot make sense of. In any case with 2 reads supporting the indel, it is not consistent with the unreduced BAM file. Is there something I do not understand here, or is there a bug with the handling of indels and the GATK UG? Thank you in advance for suggestions, I have ~2,000 whole-genome-sequencing bams that I need to move through an alignment/deduping/recalibration pipeline. I'll be using bwa-mem for the initial alignment, and I'm wondering if you would still recommend carrying out the local realignment step (downstream) for these bams, given bwa-mem does some local/gapped alignment on its own. This is an important question for our group because the local realignment (using RealignerTargetCreator & IndelRealigner) for these bams will take a significant amount of time / computational resources. If I want to call short indels using UnifiedGenotyper (rather than HC, due to number of samples), does bwa-mem for primary alignment reduce the need for local realignment downstream? Any feedback/thoughts would be much appreciated - thanks! Hi, I am trying to filter the indels using GenomeAnalysisTK.jar -l INFO -T VariantFiltration -R reference.fa -V raw_indels.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 ||ReadPosRankSum < -20.0" --filterName "my_indels_filter" -o filtered_indels.vcf But I donot get the filtered file, instead i get the same indels as before, with no error message, please guide me if I doing in right way. Here is the output that I get: chr1 2379 . A AGAG 1359.54 my_indels_filter AC=4;AF=0.500;AN=8;BaseQRankSum=5.734;DP=67;FS=0.000;HaplotypeScore=230.3405;MLEAC=4;MLEAF=0.500;MQ=22.94;MQ0=4;MQRankSum=2.033;QD=20.29;ReadPosRankSum=3.432;SB=-1.476e-03 GT:AD:DP:GQ:PL 0/1:15,11:17:99:567,0,305 0/1:10,4:12:99:207,0,411 0/1:9,4:9:99:205,0,252 0/1:5,8:10:35:432,0,35 chr1 2414 . C CCCA 91.56 my_indels_filter AC=5;AF=0.625;AN=8;BaseQRankSum=4.234;DP=43;FS=0.000;HaplotypeScore=151.6242;MLEAC=5;MLEAF=0.625;MQ=23.55;MQ0=5;MQRankSum=-1.458;QD=18.87;ReadPosRankSum=-1.129;SB=-1.476e-03 GT:AD:DP:GQ:PL 0/1:9,3:10:99:276,0,262 0/1:3,1:4:56:56,0,122 0/1:12,2:7:99:139,0,241 1/1:0,2:9:24:395,24,0 Hello, I noticed that you can use emit_all_sites in UnifiedGenotyper to gather information on alternate alleles at every base provided in the interval file. Is there a way to do this same process but with indels? I want to pass in an interval list of indel locations and a bam file and for every site in the interval list get number of reads supporting the indel and number of reference. Thanks you. For exome-sequencing, I was wondering how I should decide the "-mode" parameter? should I use "SNP" or "Indel" or both? I want to get both information in the final results, should I use "BOTH" -mode? Dear Team, i am using Haplotype caller to identify the indels in my sequence.. i am running it in command line.. java -jar /illumina/data/galaxy/apps/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T HaplotypeCaller -R genome.fa -I S21_full.picard.bam -o indels_S21.vcf but i couldnt find any entries in the output file, also the file size shows 0.. INFO 09:14:25,280 ProgressMeter - Location processed.active regions runtime per.1M.active regions completed total.runtime remaining INFO 09:14:55,283 ProgressMeter - chr1:4710472 4.72e+06 30.0 s 6.0 s 0.2% 5.5 h 5.4 h INFO 09:15:25,284 ProgressMeter - chr1:8280981 8.29e+06 60.0 s 7.0 s 0.3% 6.2 h 6.2 h INFO 09:15:55,285 ProgressMeter - chr1:11311033 1.13e+07 90.0 s 7.0 s 0.4% 6.8 h 6.8 h INFO 09:16:25,286 ProgressMeter - chr1:15877953 1.59e+07 120.0 s 7.0 s 0.5% 6.5 h 6.5 h INFO 09:16:55,287 ProgressMeter - chr1:19156923 1.92e+07 2.5 m 7.0 s 0.6% 6.7 h 6.7 h INFO 09:17:25,288 ProgressMeter - chr1:22167191 2.22e+07 3.0 m 8.0 s 0.7% 7.0 h 6.9 h INFO 09:17:55,289 ProgressMeter - chr1:24574489 2.46e+07 3.5 m 8.0 s 0.8% 7.3 h 7.3 h Please suggest on the same.. Thanks Sridhar If you are using UnifiedGenotyper to call Indels, and you are looking at an Indel which is hard to sequence (say 9 Ts versus 10 Ts) the sequencing may generate more different Indel alleles than there really are. What is the best way to deal with this situation if you actually only want one alt allele called? If you limit max_alt_alleles to 2, what does the software do? I am trying to recalibrate my VCF files for Indels calling using the below command lines: java -Xmx2G -jar ../GenomeAnalysisTK.jar -T VariantRecalibrator \ -R ../GATK_ref/hg19.fasta \ -input ./Variants/gcat_set_053_2.raw.snps.indels.vcf \ -nt 4 \ -resource:mills,known=false,training=true,truth=true,prior=12.0 ../GATK_ref/Mills_and_1000G_gold_standard.indels.hg19.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../GATK_ref/dbsnp_137.hg19.vcf \ -an DP -an FS -an ReadPosRankSum -an MQRankSum \ --maxGaussians 4 \ -percentBad 0.05 \ -minNumBad 1000 \ -mode INDEL \ -recalFile ./Variants/VQSR/gcat_set_053_2.indels.vcf.recal \ -tranchesFile ./Variants/VQSR/gcat_set_053_2.indels.tranches \ -rscriptFile ./Variants/VQSR/gcat_set_053_2.indels.recal.plots.R > ./Variants/VQSR/IndelRecal2-noAnnot.log I got this error message, even after taking the recommendation (e.g. maxGaussians 4, --percentBad 0.05). What does this error message mean? my files have too few variants? It's exome-seq. ##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example) Hi all, I have been looking for a documentation for the INFO column in the VCF of the Mills indels in the GATK resource bundle (Mills_and_1000G_gold_standard.indels.b37.sites.vcf.gz), but to no avail. I did a unique list of all the possible entries for the INFO, but I have no idea what they mean. Does anybody know what they mean? Thank you!! set=Intersect1000GAll set=Intersect1000GAll-MillsDoubleCenter set=Intersect1000GAll-MillsDoubleCenter-MillsTracesUnknown set=Intersect1000GAll-MillsTracesUnknown set=Intersect1000GMinusBI set=Intersect1000GMinusDI set=Intersect1000GMinusDI-MillsDoubleCenter set=Intersect1000GMinusOX set=Intersect1000GMinusOX-MillsCenterUnknown-MillsTracesUnknown set=Intersect1000GMinusOX-MillsDoubleCenter set=Intersect1000GMinusOX-MillsDoubleCenter-MillsTracesUnknown set=Intersect1000GMinusOX-MillsTracesUnknown set=Intersect1000GMinusSI set=Intersect1000GMinusSI-MillsDoubleCenter set=Intersect1000GMinusSI-MillsDoubleCenter-MillsTracesUnknown set=Intersect1000GMinusSI-MillsTracesUnknown set=MillsAlleleMatch1000G set=MillsAlleleMatch1000G-MillsDoubleCenter set=MillsAlleleMatch1000G-MillsDoubleCenter-MillsTracesUnknown set=MillsAlleleMatch1000G-MillsTracesUnknown set=MillsCenterUnknown set=MillsCenterUnknown-Intersect1000GAll-MillsTracesUnknown set=MillsCenterUnknown-Intersect1000GMinusBI-MillsTracesUnknown set=MillsCenterUnknown-Intersect1000GMinusDI-MillsTracesUnknown set=MillsCenterUnknown-Intersect1000GMinusSI set=MillsCenterUnknown-Intersect1000GMinusSI-MillsTracesUnknown set=MillsCenterUnknown-MillsAlleleMatch1000G set=MillsCenterUnknown-MillsAlleleMatch1000G-MillsTracesUnknown set=MillsCenterUnknown-MillsTraces3Plus set=MillsCenterUnknown-MillsTraces3Plus-Intersect1000GMinusSI set=MillsCenterUnknown-MillsTraces3Plus-MillsAlleleMatch1000G set=MillsCenterUnknown-MillsTracesUnknown set=MillsDoubleCenter set=MillsDoubleCenter-MillsTracesUnknown set=MillsTraces3Plus set=MillsTraces3Plus-Intersect1000GMinusSI set=MillsTraces3Plus-MillsAlleleMatch1000G set=MillsTracesUnknown How can I select indels with lenght smaller than 10 bp from a vcf file? I tried java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fa --variant INDEL.vcf -o INDEL_maxLenght10.vcf -select 'vc.getIndelLengths().0 < 10' but the output still contains all the Indels, also the ones larger than 10 bp. Hi Mark, Eric - First, I wanted to thank you guys for providing advice with respect to running VQSR. I am already sold and a huge fan of the method :-). I was wondering if either of you could comment on VQSLOD and sensitivity filter tranche? To be more specific, if I set a filter threshold of 99% for sensitivity and VQSLOD < 0 I imagine that probably is not a good idea! However, a VQSLOD of 3 or 5 may be appropriate in the statistical sense, i.e. pretty confident that this is a real variant. Finally, I am thinking we should include VQSLOD in our statistical genetic association mapping methods. I wanted to get a sense from either of you what VQSLOD you would want to completely remove from analysis? Best Wishes, Manny. Hi, I was using GATK UnifiedGenotyper for SNPs/indels. My command is as follows: java -Xmx16g -jargatkPath/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R $humanRefSequence \ -I "$sampleID".recal.bam \ --dbsnp $humanDbsnpFile \ -stand_call_conf 30.0 \ -stand_emit_conf 10.0 \ -dcov 250 \ -l INFO -log "$sampleID".GATKvariants.log \ -o "$sampleID".GATKvariants.raw.vcf \ --output_mode EMIT_ALL_SITES When I am selecting Indels from the vcf file, it's producing an empty vcf file (which is working fine with SNPs). My command for selecting Indels are as follows: java -Xmx10g -jar$gatkPath/GenomeAnalysisTK.jar \ -T SelectVariants \ -R $humanRefSequence \ --variant "$sampleID".GATKvariants.raw.vcf \ -o "$sampleID".GATKindels.raw.vcf \ -selectType INDEL \ -log "$sampleID".SelectIndels.log

Do I need to use "-glm BOTH" in UnifiedGenotyper to output both SNPs and Indels? As from the documentation, I understand that the default value of -glm is set to SNP.

Hello,

For matched tumor and normal pairs, we easily get insertion and deletion counts from the output of Somatic Indel Detector in GATK. However, when we run multiple samples from the same patient, sometimes calls are made in one sample but not another, so we might not have the numbers for all samples for all indel events. We can get the deletion counts from Depth of Coverage in GATK, but retrieving insertions is trickier.

Does you have a suggestion for how to solve this problem in an automated (ie non-IGV fashion)?

Additionally, as DepthofCoverage is being retired, what do you recommend that we use for getting SNP and deletion counts?

Thank you

Hi, I'm using The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34

I was running the UnifiedGenotyper tools on a 100bp paired end read with the the command below:

Program Args: -T UnifiedGenotyper -R ref.fasta -I realignedBwa.bam -glm BOTH -log variantsCalling.log -o bwa_variants.vcf

I noticed that only SNPs were called and not a single indel was called. however, when i used a 150bp paired end read with the similar command above it worked fine, the SNPs and Indels were called.

To make sure that the realignedBWA.bam file that i used was not corrupt/malformed, i used two different other pileup program on this bam file and the SNPs and Indels were called nicely. Hope you can take a look at this. I attached together the log file just in case

Hi. I want to merge two VCF files. Initially I was selected only indels(by select variant option). Now I want to merge these two VCF file which contains only INDELS. But When I run the command, I am getting the same error:

ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NumberFormatException: For input string: "."


I run this command:

java -jar -Xmx2g GenomeAnalysisTK.jar -R hg19_5.fasta -T CombineVariants -V indelsample1.vcf -V indelsample3.vcf -o indels1s3.vcf -genotypeMergeOptions UNIQUIFY


Could you please tell me what is the reason behind this? and how to merge two VCF file having INDELS?

Hello,

Could anyone give me an example generic command for single-sample and multi-sample INDEL calling using the Unified Genotyper? I only found documentation for SNP calling. And when I tried something like that :

java -jar GenomeAnalysisTK.jar \
-R resources/Homo_sapiens_assembly19.fasta \
-T UnifiedGenotyper \
-I input.bam \
--dbsnp dbsnp_137.b37.vcf \
--genotype_likelihoods_model INDEL \
-o output.indels.raw.vcf \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0 \
-dcov 50


The number of indels detected and reported in my output vcf file is too low (only 400 indels for an exome).

I am filtering looking for rare variants and found some frameshift variants in an interesting gene. Some of them are noted as PASS in the QC column of the VCF and some are noted as Indel_FS . What exactly does that second notation mean? I am almost positive that these will validate given how they segregate in my subjects.

When I use EMIT_ALL_CONFIDENT_SITES for SNPs, I get an expected very large list of genotypes regardless if the genotypes vary from the reference. When I use the same command line but I switch the model to Indels, I only get a VCF of variant sites. Is the EMIT_ALL_CONFIDENT_SITES option not compatible with Indel discovery?

I'm grateful for any clarification.

Hi,

I have noticed, sometimes GATK UnifiedGenotyper calls overlapping INDELs even though there should be 3 non-overlapping (independent) calls.

I have attached 200bp region from my BAM, reference and vcf file, so you can reproduce this issue.

Cheers, Leszek

Hi All, I have used GATK UnifiedGenotyper to generate a raw.vcf file. Now I want to use GATK VQSR to get a more accurate result ,and I follow this protocol:

snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP);
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL);
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP);


I wanna know will the it be better if I seperate the SNP and INDEL to run VQSR, like this:

SNP.raw.vcf , INDEL.raw.vcf <- SeperateSNPINDEL(raw.vcf);
snp.model <- BuildErrorModelWithVQSR(SNP.raw.vcf, SNP);
indel.model <- BuildErrorModelWithVQSR( INDEL.raw.vcf, INDEL);


Thanks a lot !

I recently used GATK, the latest version to carry out Indel realignment for my bam files. I have read at few places that one needs to fix the mate position information as the reads mapping position may change during realignment. My question is that does this step is taken care of y GATK automatically or one needs to run Picard fixmates before going for subsequent analysis.

Thanks

Hi,

I'm running the UnifiedGenotyper (gatk version 2.0-35) with -glm INDEL on a dataset. When I use -glm SNP everything works fine and I get my SNV calls, but if I use INDEL I get the following Error message:

It looks like something is up with the read coordinates. I ran this particular script with -L 1 to limit to chromosome 1 but when I check the first and last reads of this chromosome everything seems to be in order (at least they map within the chromosome coordinates).

Cheers, Paul

Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.

Here are deletions:

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0