Note that the Somatic Indel Detector was previously called Indel Genotyper V2.0
For a complete, detailed argument reference, refer to the GATK document page here.
The Somatic Indel Detector can be run in two modes: single sample and paired sample. In the former mode, exactly one input bam file should be given, and indels in that sample are called. In the paired mode, the calls are made in the tumor sample, but in addition to that the differential signal is sought between the two samples (e.g. somatic indels present in tumor cell DNA but not in the normal tissue DNA). In the paired mode, the genotyper makes an initial call in the tumor sample in the same way as it would in the single sample mode; the call, however, is then compared to the normal sample. If any evidence (even very weak, so that it would not trigger a call in single sample mode) for the event is found in the normal, the indel is annotated as germline. Only when the minimum required coverage in the normal sample is achieved and there is no evidence in the normal sample for the event called in the tumor is the indel annotated as somatic.
The calls in both modes (recall that in paired mode the calls are made in tumor sample only and are simply annotated according to the evidence in the matching normal) are performed based on a set of simple thresholds. Namely, all distinct events (indels) at the given site are collected, along with the respective counts of alignments (reads) supporting them. The putative call is the majority vote consensus (i.e. the indel that has the largest count of reads supporting it). This call is accepted if 1) there is enough coverage (as well as enough coverage in matching normal sample in paired mode); 2) reads supporting the consensus indel event constitute a sufficiently large fraction of the total coverage at the site; 3) reads supporting the consensus indel event constitute a sufficiently large fraction of all the reads supporting any indel at the site. See details in the Arguments section of the tool documentation.
Theoretically, the Somatic Indel Detector can be run directly on the aligned short read sequencing data. However, it does not perform any deep algorithmic tasks such as searching for misplaced indels close to a given one, or correcting read misalignments given the presence of an indel in another read, etc. Instead, it assumes that all the evidence for indels (all the reads that support it), for the presence of the matching event in normal etc is already in the input and performs simple counting. It is thus highly, HIGHLY recommended to run the Somatic Indel Detector on "cleaned" bam files, after performing Local realignment around indels.
Brief output file (specified with -bed option) will look as follows:
chr1 556817 556817 +G:3/7 chr1 3535035 3535054 -TTCTGGGAGCTCCTCCCCC:9/21 chr1 3778838 3778838 +A:15/48 ...
This is a .bed track that can be loaded into UCSC browser or IGV browser, the event itself and the <count of supporting reads>/<total coverage> are reported in the 'name' field of the file. The event locations on the chromosomes are 1-based, and the convention is that all events (both insertions and deletions) are assigned to the base on the reference immediately preceding the event (second column). The third column is the stop position of the event on the reference, or strictly speaking the base immediately preceding the first base on the reference after the event: the last deleted base for deletions, or the same base as the start position for insertions. For instance, the first line in the above example specifies an insertion (+G) supported by 3 reads out of 7 (i.e. total coverage at the site is 7x) that occurs immediately after genomic position chr1:556817. The next line specifies a 19 bp deletion -TTCTGGGAGCTCCTCCCCC supported by 9 reads (total coverage 21x) occuring at (after) chr1:3535035 (the first and last deleted bases are 3535035+1=3535036 and 3535054, respectively).
Note that in the paired mode all calls made in tumor (both germline and somatic) will be printed into the brief output without further annotations.
The detailed (verbose) output option is kept for backward compatibility with post-processing tools that might have been developed to work with older versions of the IndelGenotyperV2. All the information described below is now also recorded into the vcf output file, so the verbose text output is completely redundant, except for genomic annotations (if --refseq is used). Generated vcf file can be annotated separately using VCF post-processing tools.
The detailed output (-verbose) will contain additional statistics characterizing the alignments around each called event, SOMATIC/GERMLINE annotations (in paired mode), as well as genomic annotations (when --refseq is used). The verbose output lines matching the three lines from the example above could look like this (note that the long lines are wrapped here, the actual output file contains one line per event):
chr1 556817 556817 +G N_OBS_COUNTS[C/A/T]:0/0/52 N_AV_MM[C/R]:0.00/5.27 N_AV_MAPQ[C/R]:0.00/35.17 \ N_NQS_MM_RATE[C/R]:0.00/0.08 N_NQS_AV_QUAL[C/R]:0.00/23.74 N_STRAND_COUNTS[C/C/R/R]:0/0/32/20 \ T_OBS_COUNTS[C/A/T]:3/3/7 T_AV_MM[C/R]:2.33/5.50 T_AV_MAPQ[C/R]:66.00/24.75 \ T_NQS_MM_RATE[C/R]:0.05/0.08 T_NQS_AV_QUAL[C/R]:20.26/11.61 T_STRAND_COUNTS[C/C/R/R]:3/0/2/2 \ SOMATIC GENOMIC chr1 3535035 3535054 -TTCTGGGAGCTCCTCCCCC N_OBS_COUNTS[C/A/T]:3/3/6 N_AV_MM[C/R]:3.33/2.67 N_AV_MAPQ[C/R]:73.33/99.00 \ N_NQS_MM_RATE[C/R]:0.00/0.00 N_NQS_AV_QUAL[C/R]:29.27/31.83 N_STRAND_COUNTS[C/C/R/R]:0/3/0/3 \ T_OBS_COUNTS[C/A/T]:9/9/21 T_AV_MM[C/R]:1.56/0.17 T_AV_MAPQ[C/R]:88.00/99.00 \ T_NQS_MM_RATE[C/R]:0.02/0.00 T_NQS_AV_QUAL[C/R]:30.86/25.25 T_STRAND_COUNTS[C/C/R/R]:2/7/2/10 \ GERMLINE UTR TPRG1L chr1 3778838 3778838 +A N_OBS_COUNTS[C/A/T]:5/7/22 N_AV_MM[C/R]:5.00/5.20 N_AV_MAPQ[C/R]:54.20/81.20 \ N_NQS_MM_RATE[C/R]:0.00/0.01 N_NQS_AV_QUAL[C/R]:24.94/26.05 N_STRAND_COUNTS[C/C/R/R]:4/1/15/0 \ T_OBS_COUNTS[C/A/T]:15/15/48 T_AV_MM[C/R]:9.73/4.21 T_AV_MAPQ[C/R]:91.53/86.09 \ T_NQS_MM_RATE[C/R]:0.17/0.02 T_NQS_AV_QUAL[C/R]:30.57/25.19 T_STRAND_COUNTS[C/C/R/R]:15/0/32/1 \ GERMLINE INTRON DFFB
The fields are tab-separated. The first four fields confer the same event and location information as in the brief format (chromosome, last reference base before the event, last reference base of the event, event itself). Event information is followed by tagged fields reporting various collected statistics. In the paired mode (as in the example shown above), there will be two sets of the same statistics, one for normal (prefixed with 'N_') and one for tumor (prefixed with 'T_') samples. In the single sample mode, there will be only one set of statistics (for the only sample analyzed) and no 'N_'/'T_' prefixes. Statistics are stratified into (two or more of) the following classes: (C)onsensus-supporting reads (i.e. the reads that contain the called event, for which the line is printed); (A)ll reads that contain an indel at the site (not necessarily the called consensus); (R)eference allele-supporting reads, (T)otal=all reads.
For instance, the field T_OBS_COUNTS[C/A/T]:3/3/7 in the first line of the example above should be interpreted as follows: a) this is the OBS_COUNTS statistics for the (T)umor sample (this particular one is simply the read counts, all statistics are listed below); b) The statistics is broken down into three classes: [C/A/T]=(C)onsensus/(A)ll-indel/(T)otal coverage; c) the respective values in each class are 3, 3, 7. In other words, the insertion +G is observed in 3 distinct reads, there was a total of 3 reads with an indel at the site (i.e. only consensus was observed in this case with no observations for any other indel event), and the total coverage at the site is 7. Examining the N_OBS_COUNTS field in the same record, we can conclude that the total coverage in normal at the same site was 52, and among those reads there was not a single one carrying any indel (C/A/T=0/0/52). Hence the 'SOMATIC' annotation added towards the end of the line.
In paired mode the tagged statistics fields are always followed by GERMLINE/SOMATIC annotation (in single sample mode this field is skipped). If --refseq option is used, the next field will contain the coding status annotation (one of GENOMIC/INTRON/UTR/CODING), optionally followed by the gene name (present if the indel is within the boundaries of an annotated gene, i.e. the status is not GENOMIC).
NOTE: in older versions the OBS_COUNTS statistics was erroneously annotated as [C/A/R] (last class R, not T). This was a typo, and the last number reported in the triplet was still total coverage.
Duplicated reads, reads with mapping quality 0, or reads coming from blacklisted lanes are not counted and do not contribute to any of the statistics.
When no reads are available in a class (e.g. the count of consensus indel-supporting reads in normal sample is 0), all the other statistics for that class (e.g. average mismatches per read, average base qualities in NQS window etc) will be set to 0. For some statistics (average number of mismatches) this artificial value can be "very good", for some others (average base quality) it's "very bad". Needless to say, all those zeroes reported for the classes with no reads should be ignored when attempting call filtering.
The output of the Somatic Indel Detector can be used to mask out SNPs near indels. To do this, we have a script that creates a bed file representing the masking intervals based on the output of this tool. Note that this script requires a full SVN checkout of the GATK, although the strategy is simple: for each indel, create an interval which extends N bases to either side of it.
python python/makeIndelMask.py <raw_indels> <mask_window> <output> e.g. python python/makeIndelMask.py indels.raw.bed 10 indels.mask.bed
I was using GATK UnifiedGenotyper for SNPs/indels. My command is as follows:
java -Xmx16g -jar $gatkPath/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R $humanRefSequence \ -I "$sampleID".recal.bam \ --dbsnp $humanDbsnpFile \ -stand_call_conf 30.0 \ -stand_emit_conf 10.0 \ -dcov 250 \ -l INFO -log "$sampleID".GATKvariants.log \ -o "$sampleID".GATKvariants.raw.vcf \ --output_mode EMIT_ALL_SITES
When I am selecting Indels from the vcf file, it's producing an empty vcf file (which is working fine with SNPs). My command for selecting Indels are as follows:
java -Xmx10g -jar $gatkPath/GenomeAnalysisTK.jar \ -T SelectVariants \ -R $humanRefSequence \ --variant "$sampleID".GATKvariants.raw.vcf \ -o "$sampleID".GATKindels.raw.vcf \ -selectType INDEL \ -log "$sampleID".SelectIndels.log
Do I need to use "-glm BOTH" in UnifiedGenotyper to output both SNPs and Indels? As from the documentation, I understand that the default value of -glm is set to SNP.
Many thanks in advance.
Hi. I want to merge two VCF files. Initially I was selected only indels(by select variant option). Now I want to merge these two VCF file which contains only INDELS. But When I run the command, I am getting the same error:
ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.NumberFormatException: For input string: "."
I run this command:
java -jar -Xmx2g GenomeAnalysisTK.jar -R hg19_5.fasta -T CombineVariants -V indelsample1.vcf -V indelsample3.vcf -o indels1s3.vcf -genotypeMergeOptions UNIQUIFY
Could you please tell me what is the reason behind this? and how to merge two VCF file having INDELS?
Thanks in advance.
When I use EMIT_ALL_CONFIDENT_SITES for SNPs, I get an expected very large list of genotypes regardless if the genotypes vary from the reference. When I use the same command line but I switch the model to Indels, I only get a VCF of variant sites. Is the EMIT_ALL_CONFIDENT_SITES option not compatible with Indel discovery?
I'm grateful for any clarification.
For matched tumor and normal pairs, we easily get insertion and deletion counts from the output of Somatic Indel Detector in GATK. However, when we run multiple samples from the same patient, sometimes calls are made in one sample but not another, so we might not have the numbers for all samples for all indel events. We can get the deletion counts from Depth of Coverage in GATK, but retrieving insertions is trickier.
Does you have a suggestion for how to solve this problem in an automated (ie non-IGV fashion)?
Additionally, as DepthofCoverage is being retired, what do you recommend that we use for getting SNP and deletion counts?
I am filtering looking for rare variants and found some frameshift variants in an interesting gene. Some of them are noted as PASS in the QC column of the VCF and some are noted as Indel_FS . What exactly does that second notation mean? I am almost positive that these will validate given how they segregate in my subjects.
I recently used GATK, the latest version to carry out Indel realignment for my bam files. I have read at few places that one needs to fix the mate position information as the reads mapping position may change during realignment. My question is that does this step is taken care of y GATK automatically or one needs to run Picard fixmates before going for subsequent analysis.
Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.
Here are deletions:
d02 264482 . TT T 389.93 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.224;DP=25;FS=0.000;HaplotypeScore=58.0966;MLEAC=2;MLEAF=1.00;MQ=47.51;MQ0=0;MQRankSum=1.714;QD=15.60;ReadPosRankSum=0.075;SB=-1.112e+02 GT:AD:DP:GQ:PL 1/1:15,9:20:9:431,9,0
Ld04 26597 . CCC C 2923.96 . AC=2;AF=1.00;AN=2;BaseQRankSum=-2.217;DP=98;FS=2.954;HaplotypeScore=145.2957;MLEAC=2;MLEAF=1.00;MQ=59.82;MQ0=0;MQRankSum=0.283;QD=29.84;ReadPosRankSum=1.150;SB=-1.306e+03 GT:AD:DP:GQ:PL 1/1:58,31:98:99:2966,195,0
Here are insertions:
Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0
Ld04 26880 . T TTATT 4973 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.552;DP=99;FS=2.935;HaplotypeScore=116.0082;MLEAC=2;MLEAF=1.00;MQ=59.46;MQ0=0;MQRankSum=0.192;QD=50.23;ReadPosRankSum=1.834;SB=-1.950e+03 GT:AD:DP:GQ:PL 1/1:49,38:98:99:4973,247,0
Here are SNPs
Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0
Ld04 19963 . A G 3571.01 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.895;DP=96;Dels=0.00;FS=0.000;HaplotypeScore=2.4678;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=1.769;QD=37.20;ReadPosRankSum=-0.316;SB=-1.523e+03 GT:AD:DP:GQ:PL 1/1:3,93:96:99:3571,205,0
I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools.
PS I used this command line of V 2.0-38:
java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf
I have noticed, sometimes GATK UnifiedGenotyper calls overlapping INDELs even though there should be 3 non-overlapping (independent) calls.
I have attached 200bp region from my BAM, reference and vcf file, so you can reproduce this issue.
I'm running the UnifiedGenotyper (gatk version 2.0-35) with -glm INDEL on a dataset. When I use -glm SNP everything works fine and I get my SNV calls, but if I use INDEL I get the following Error message:
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: START (149) > (100) STOP -- this should never happen -- call Mauricio! at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:512) [...]
It looks like something is up with the read coordinates. I ran this particular script with -L 1 to limit to chromosome 1 but when I check the first and last reads of this chromosome everything seems to be in order (at least they map within the chromosome coordinates).
Hi, I'm using The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34
I was running the UnifiedGenotyper tools on a 100bp paired end read with the the command below:
Program Args: -T UnifiedGenotyper -R ref.fasta -I realignedBwa.bam -glm BOTH -log variantsCalling.log -o bwa_variants.vcf
I noticed that only SNPs were called and not a single indel was called. however, when i used a 150bp paired end read with the similar command above it worked fine, the SNPs and Indels were called.
To make sure that the realignedBWA.bam file that i used was not corrupt/malformed, i used two different other pileup program on this bam file and the SNPs and Indels were called nicely. Hope you can take a look at this. I attached together the log file just in case
thanks in advance
Hi Mark, Eric -
First, I wanted to thank you guys for providing advice with respect to running VQSR. I am already sold and a huge fan of the method :-).
I was wondering if either of you could comment on VQSLOD and sensitivity filter tranche? To be more specific, if I set a filter threshold of 99% for sensitivity and VQSLOD < 0 I imagine that probably is not a good idea! However, a VQSLOD of 3 or 5 may be appropriate in the statistical sense, i.e. pretty confident that this is a real variant. Finally, I am thinking we should include VQSLOD in our statistical genetic association mapping methods. I wanted to get a sense from either of you what VQSLOD you would want to completely remove from analysis?
Hi All, I have used GATK UnifiedGenotyper to generate a raw.vcf file. Now I want to use GATK VQSR to get a more accurate result ,and I follow this protocol:
snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP); indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL); recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP); analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL);
I wanna know will the it be better if I seperate the SNP and INDEL to run VQSR, like this:
SNP.raw.vcf , INDEL.raw.vcf <- SeperateSNPINDEL(raw.vcf); snp.model <- BuildErrorModelWithVQSR(SNP.raw.vcf, SNP); indel.model <- BuildErrorModelWithVQSR( INDEL.raw.vcf, INDEL); SNP_analysisReady.vcf <- ApplyRecalibration(SNP.raw.vcf, snp.model, SNP); INDEL_analysisReady.vcf <- ApplyRecalibration(INDEL.raw.vcf, INDEL.model, SNP);
Thanks a lot !
Could anyone give me an example generic command for single-sample and multi-sample INDEL calling using the Unified Genotyper? I only found documentation for SNP calling. And when I tried something like that :
java -jar GenomeAnalysisTK.jar \ -R resources/Homo_sapiens_assembly19.fasta \ -T UnifiedGenotyper \ -I input.bam \ --dbsnp dbsnp_137.b37.vcf \ --genotype_likelihoods_model INDEL \ -o output.indels.raw.vcf \ -stand_call_conf 50.0 \ -stand_emit_conf 10.0 \ -dcov 50
The number of indels detected and reported in my output vcf file is too low (only 400 indels for an exome).
Thank you for any help you can provide me, Geoffroy