Using the Somatic Indel Detector
Posted in Methods and Workflows | Last updated on 2012-09-28 18:06:11


Comments (2)

Note that the Somatic Indel Detector was previously called Indel Genotyper V2.0

For a complete, detailed argument reference, refer to the GATK document page here.


Calling strategy

The Somatic Indel Detector can be run in two modes: single sample and paired sample. In the former mode, exactly one input bam file should be given, and indels in that sample are called. In the paired mode, the calls are made in the tumor sample, but in addition to that the differential signal is sought between the two samples (e.g. somatic indels present in tumor cell DNA but not in the normal tissue DNA). In the paired mode, the genotyper makes an initial call in the tumor sample in the same way as it would in the single sample mode; the call, however, is then compared to the normal sample. If any evidence (even very weak, so that it would not trigger a call in single sample mode) for the event is found in the normal, the indel is annotated as germline. Only when the minimum required coverage in the normal sample is achieved and there is no evidence in the normal sample for the event called in the tumor is the indel annotated as somatic.

The calls in both modes (recall that in paired mode the calls are made in tumor sample only and are simply annotated according to the evidence in the matching normal) are performed based on a set of simple thresholds. Namely, all distinct events (indels) at the given site are collected, along with the respective counts of alignments (reads) supporting them. The putative call is the majority vote consensus (i.e. the indel that has the largest count of reads supporting it). This call is accepted if 1) there is enough coverage (as well as enough coverage in matching normal sample in paired mode); 2) reads supporting the consensus indel event constitute a sufficiently large fraction of the total coverage at the site; 3) reads supporting the consensus indel event constitute a sufficiently large fraction of all the reads supporting any indel at the site. See details in the Arguments section of the tool documentation.

Theoretically, the Somatic Indel Detector can be run directly on the aligned short read sequencing data. However, it does not perform any deep algorithmic tasks such as searching for misplaced indels close to a given one, or correcting read misalignments given the presence of an indel in another read, etc. Instead, it assumes that all the evidence for indels (all the reads that support it), for the presence of the matching event in normal etc is already in the input and performs simple counting. It is thus highly, HIGHLY recommended to run the Somatic Indel Detector on "cleaned" bam files, after performing Local realignment around indels.


Output

Brief output file (specified with -bed option) will look as follows:

chr1    556817  556817  +G:3/7
chr1    3535035 3535054 -TTCTGGGAGCTCCTCCCCC:9/21
chr1    3778838 3778838 +A:15/48
...

This is a .bed track that can be loaded into UCSC browser or IGV browser, the event itself and the <count of supporting reads>/<total coverage> are reported in the 'name' field of the file. The event locations on the chromosomes are 1-based, and the convention is that all events (both insertions and deletions) are assigned to the base on the reference immediately preceding the event (second column). The third column is the stop position of the event on the reference, or strictly speaking the base immediately preceding the first base on the reference after the event: the last deleted base for deletions, or the same base as the start position for insertions. For instance, the first line in the above example specifies an insertion (+G) supported by 3 reads out of 7 (i.e. total coverage at the site is 7x) that occurs immediately after genomic position chr1:556817. The next line specifies a 19 bp deletion -TTCTGGGAGCTCCTCCCCC supported by 9 reads (total coverage 21x) occuring at (after) chr1:3535035 (the first and last deleted bases are 3535035+1=3535036 and 3535054, respectively).

Note that in the paired mode all calls made in tumor (both germline and somatic) will be printed into the brief output without further annotations.

The detailed (verbose) output option is kept for backward compatibility with post-processing tools that might have been developed to work with older versions of the IndelGenotyperV2. All the information described below is now also recorded into the vcf output file, so the verbose text output is completely redundant, except for genomic annotations (if --refseq is used). Generated vcf file can be annotated separately using VCF post-processing tools.

The detailed output (-verbose) will contain additional statistics characterizing the alignments around each called event, SOMATIC/GERMLINE annotations (in paired mode), as well as genomic annotations (when --refseq is used). The verbose output lines matching the three lines from the example above could look like this (note that the long lines are wrapped here, the actual output file contains one line per event):

chr1    556817  556817  +G      N_OBS_COUNTS[C/A/T]:0/0/52      N_AV_MM[C/R]:0.00/5.27  N_AV_MAPQ[C/R]:0.00/35.17    \
                 N_NQS_MM_RATE[C/R]:0.00/0.08    N_NQS_AV_QUAL[C/R]:0.00/23.74   N_STRAND_COUNTS[C/C/R/R]:0/0/32/20  \    
                 T_OBS_COUNTS[C/A/T]:3/3/7       T_AV_MM[C/R]:2.33/5.50  T_AV_MAPQ[C/R]:66.00/24.75   \
                 T_NQS_MM_RATE[C/R]:0.05/0.08    T_NQS_AV_QUAL[C/R]:20.26/11.61  T_STRAND_COUNTS[C/C/R/R]:3/0/2/2 \
                 SOMATIC GENOMIC
chr1    3535035 3535054 -TTCTGGGAGCTCCTCCCCC  N_OBS_COUNTS[C/A/T]:3/3/6 N_AV_MM[C/R]:3.33/2.67  N_AV_MAPQ[C/R]:73.33/99.00 \
                 N_NQS_MM_RATE[C/R]:0.00/0.00    N_NQS_AV_QUAL[C/R]:29.27/31.83  N_STRAND_COUNTS[C/C/R/R]:0/3/0/3  \
                 T_OBS_COUNTS[C/A/T]:9/9/21      T_AV_MM[C/R]:1.56/0.17  T_AV_MAPQ[C/R]:88.00/99.00    \
                 T_NQS_MM_RATE[C/R]:0.02/0.00   T_NQS_AV_QUAL[C/R]:30.86/25.25  T_STRAND_COUNTS[C/C/R/R]:2/7/2/10  \
                 GERMLINE  UTR TPRG1L
chr1    3778838 3778838 +A      N_OBS_COUNTS[C/A/T]:5/7/22      N_AV_MM[C/R]:5.00/5.20  N_AV_MAPQ[C/R]:54.20/81.20  \
                 N_NQS_MM_RATE[C/R]:0.00/0.01    N_NQS_AV_QUAL[C/R]:24.94/26.05  N_STRAND_COUNTS[C/C/R/R]:4/1/15/0  \
                 T_OBS_COUNTS[C/A/T]:15/15/48    T_AV_MM[C/R]:9.73/4.21  T_AV_MAPQ[C/R]:91.53/86.09  \
                 T_NQS_MM_RATE[C/R]:0.17/0.02    T_NQS_AV_QUAL[C/R]:30.57/25.19  T_STRAND_COUNTS[C/C/R/R]:15/0/32/1 \ 
                 GERMLINE   INTRON  DFFB

The fields are tab-separated. The first four fields confer the same event and location information as in the brief format (chromosome, last reference base before the event, last reference base of the event, event itself). Event information is followed by tagged fields reporting various collected statistics. In the paired mode (as in the example shown above), there will be two sets of the same statistics, one for normal (prefixed with 'N_') and one for tumor (prefixed with 'T_') samples. In the single sample mode, there will be only one set of statistics (for the only sample analyzed) and no 'N_'/'T_' prefixes. Statistics are stratified into (two or more of) the following classes: (C)onsensus-supporting reads (i.e. the reads that contain the called event, for which the line is printed); (A)ll reads that contain an indel at the site (not necessarily the called consensus); (R)eference allele-supporting reads, (T)otal=all reads.

For instance, the field T_OBS_COUNTS[C/A/T]:3/3/7 in the first line of the example above should be interpreted as follows: a) this is the OBS_COUNTS statistics for the (T)umor sample (this particular one is simply the read counts, all statistics are listed below); b) The statistics is broken down into three classes: [C/A/T]=(C)onsensus/(A)ll-indel/(T)otal coverage; c) the respective values in each class are 3, 3, 7. In other words, the insertion +G is observed in 3 distinct reads, there was a total of 3 reads with an indel at the site (i.e. only consensus was observed in this case with no observations for any other indel event), and the total coverage at the site is 7. Examining the N_OBS_COUNTS field in the same record, we can conclude that the total coverage in normal at the same site was 52, and among those reads there was not a single one carrying any indel (C/A/T=0/0/52). Hence the 'SOMATIC' annotation added towards the end of the line.

In paired mode the tagged statistics fields are always followed by GERMLINE/SOMATIC annotation (in single sample mode this field is skipped). If --refseq option is used, the next field will contain the coding status annotation (one of GENOMIC/INTRON/UTR/CODING), optionally followed by the gene name (present if the indel is within the boundaries of an annotated gene, i.e. the status is not GENOMIC).

List of annotations produced in verbose mode

NOTE: in older versions the OBS_COUNTS statistics was erroneously annotated as [C/A/R] (last class R, not T). This was a typo, and the last number reported in the triplet was still total coverage.

Duplicated reads, reads with mapping quality 0, or reads coming from blacklisted lanes are not counted and do not contribute to any of the statistics.

When no reads are available in a class (e.g. the count of consensus indel-supporting reads in normal sample is 0), all the other statistics for that class (e.g. average mismatches per read, average base qualities in NQS window etc) will be set to 0. For some statistics (average number of mismatches) this artificial value can be "very good", for some others (average base quality) it's "very bad". Needless to say, all those zeroes reported for the classes with no reads should be ignored when attempting call filtering.

  • OBS_COUNTS[C/A/T] Observed counts of reads supporting the consensus (called) indel, all indels (consensus + any others), and the total coverage at the site, respectively.
  • AV_MM[C/R] Average numbers of mismatches across consensus indel- and reference allele-supporting reads.
  • AV_MAPQ[C/R] Average mapping qualities (as reported in the input bam file) of consensus indel- and reference allele-supporting reads.
  • NQS_MM_RATE[C/R] Mismatch rate in small (currently 5bp on each side) window around the indel in consensus indel- and reference allele-supporting reads. The rate is obtained as average across all bases falling into the window, in all reads. Namely, if the sum of coverages from all the consensus-supporting reads, at every individual reference base in [indel start-5,indel start],[indel stop, indel_stop +5] intervals is, e.g. 100, and 5 of those covering bases are mismatches (regardless of what particular read they come from or whether they occur at the same or different positions), the NQS_MM_RATE[C] is 0.05. Note that this statistics was observed to behave very differently from AV_MM. The latter captures potential global problems with read-placement and/or overall read quality issues: when reads have too many mismatches, the alignments are problematic. Even if the vicinity of the indel is "clean" (low NQS_MM_RATE), high AV_MM indicates a potential problem (e.g. the reads could have come from a highly othologous pseudogene/gene copy that is not in the reference). On the other hand, even when AV_MM is low (especially for long reads), so that the overall placement of the reads seem to be reliable, NQS_MM_RATE may still be relatively high, indicating a potential local problem (few low quality/mismatching bases near the tip of the read, incorrect indel event etc).
  • NQS_AV_QUAL[C/R] Average base quality computed across all bases falling into the 5bp window on each side of the indel and coming form all consensus- or reference-supporting reads, respectively.
  • STRAND_COUNTS[C/C/R/R] Counts of consensus-supporting forward aligned, consensus-supporting rc-aligned, reference-supporting forward-aligned and reference-supporting rc-aligned reads, respectively.

Creating a indel mask file

The output of the Somatic Indel Detector can be used to mask out SNPs near indels. To do this, we have a script that creates a bed file representing the masking intervals based on the output of this tool. Note that this script requires a full SVN checkout of the GATK, although the strategy is simple: for each indel, create an interval which extends N bases to either side of it.

python python/makeIndelMask.py <raw_indels> <mask_window> <output>
e.g.
python python/makeIndelMask.py indels.raw.bed 10 indels.mask.bed

Return to top Comment on this article in the forum