Tool for calling indels in Tumor-Normal paired sample mode; this tool supports single-sample mode as well, but this latter functionality is now superseded by UnifiedGenotyper.
This is a simple, counts-and-cutoffs based tool for calling indels from aligned (preferably MSA-cleaned) sequencing data. Supported output formats are BED, extended verbose output (tab-separated text), and VCF. The latter two include additional statistics such as mismatches and base qualities around the calls, read strandedness (how many forward/reverse reads support the ref and indel alleles), etc. It is highly recommended to use these additional statistics for post-filtering the calls, because the tool is tuned for sensitivity: it will attempt to "call" anything remotely reasonable based only on read counts, and it generates all the additional metrics so that post-processing tools can make the final decision.
By default, calls are made from a matched tumor-normal pair of samples. In this case, two (sets of) input bam files must be specified using tagged -I command line arguments: normal and tumor bam(s) must be passed with -I:normal and -I:tumor, respectively. Indels are called from the tumor sample and annotated as germline if even weak evidence for the same indel (not necessarily a confident call) exists in the normal sample, or as somatic if the normal sample has coverage at the site but shows no indication of an indel. Note that, strictly speaking, calling is not even attempted in the normal sample: an indel present in normal that is not detected, or does not pass a threshold, in the tumor sample will not be reported.
To make indel calls and associated metrics for a single sample, run the tool with the --unpaired flag. Input bam tagging is not required in this case, and tags are completely ignored if still used: all input bams are merged on the fly and assumed to represent a single sample (the tool does not check the sample id in the read groups).
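The germline/somatic annotation rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as stated in the text, not the tool's actual code: the function name, the label strings, the "at least one supporting read" reading of "weak evidence", and the handling of sites with no normal coverage (which the text does not specify) are all assumptions.

```python
def annotate_call(tumor_call_passes, normal_cov, normal_indel_reads):
    """Illustrative annotation of a putative indel in paired mode.

    tumor_call_passes  -- True if the indel passed the thresholds in tumor
    normal_cov         -- read coverage at the site in the normal sample
    normal_indel_reads -- normal reads showing the same indel
    """
    # Calling is driven by the tumor sample only: nothing is reported
    # unless the indel is detected there.
    if not tumor_call_passes:
        return None
    # Assumption: "even weak evidence" is modeled as one or more reads.
    if normal_indel_reads > 0:
        return "GERMLINE"
    # Normal has coverage but no indication of the indel.
    if normal_cov > 0:
        return "SOMATIC"
    # Assumption: the text does not say what happens without normal coverage.
    return "UNDETERMINED"


labels = [
    annotate_call(True, 30, 1),   # weak evidence in normal
    annotate_call(True, 30, 0),   # covered in normal, no indel reads
    annotate_call(False, 30, 5),  # indel only in normal: not reported
]
```

Note how the third case returns nothing even though the normal sample carries the indel, matching the statement that calling is not attempted in normal.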
Which (putative) calls make it into the output file(s) is controlled by an expression or list of expressions passed with the -filter flag: if any of the expressions evaluates to TRUE, the site is discarded. Otherwise the putative call and all the associated statistics are printed to the output. Expressions recognize the following variables (in paired-sample somatic mode the variables are prefixed with T_ and N_ for Tumor and Normal, e.g. N_COV and T_COV are defined instead of COV): COV for coverage at the site, INDEL_F for the fraction of reads supporting the consensus indel at the site (relative to total coverage), INDEL_CF for the fraction of reads with the consensus indel relative to all reads with an indel at the site, and CONS_CNT for the count of reads supporting the consensus indel at the site. Conventional arithmetic and logical operations are supported. For instance, N_COV<4||T_COV<6||T_INDEL_F<0.3||T_INDEL_CF<0.7 instructs the tool to output only indel calls with at least 30% observed allelic fraction and with the consensus indel making up at least 70% of all indel observations at the site, and only at sites where tumor coverage and normal coverage are at least 6 and 4, respectively.
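To make the discard-if-any-TRUE semantics concrete, here is a minimal Python sketch that derives the variables named above from raw read counts and applies the example expression. It mimics the logic only; it does not use or reproduce the tool's expression engine. The helper names (`site_metrics`, `example_filter`, `is_discarded`), the input counts, and the choice to treat a zero denominator as a fraction of 0.0 are assumptions made for illustration.

```python
def site_metrics(cov, indel_reads, cons_cnt, prefix=""):
    """Build the filter variables for one sample at one site.

    cov         -- total coverage at the site (COV)
    indel_reads -- reads carrying any indel at the site
    cons_cnt    -- reads supporting the consensus indel (CONS_CNT)
    """
    return {
        prefix + "COV": cov,
        prefix + "CONS_CNT": cons_cnt,
        # Fraction of all reads that support the consensus indel.
        prefix + "INDEL_F": cons_cnt / cov if cov else 0.0,
        # Fraction of indel-carrying reads that support the consensus indel.
        prefix + "INDEL_CF": cons_cnt / indel_reads if indel_reads else 0.0,
    }


def example_filter(m):
    """N_COV<4||T_COV<6||T_INDEL_F<0.3||T_INDEL_CF<0.7"""
    return (m["N_COV"] < 4 or m["T_COV"] < 6
            or m["T_INDEL_F"] < 0.3 or m["T_INDEL_CF"] < 0.7)


def is_discarded(metrics, filters):
    """A site is discarded if ANY filter expression is true."""
    return any(f(metrics) for f in filters)


# Tumor: 20 reads, 9 with an indel, 8 of those with the consensus indel.
# Normal: 15 reads, none with an indel.
metrics = {**site_metrics(20, 9, 8, "T_"), **site_metrics(15, 0, 0, "N_")}
kept = not is_discarded(metrics, [example_filter])
```

With these hypothetical counts, T_INDEL_F is 8/20 = 0.4 and T_INDEL_CF is 8/9, so no clause of the expression fires and the site would be kept.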
Tumor and normal bam files (or single sample bam file(s) in --unpaired mode).
Indel calls with associated metrics.
java -Xmx2g -jar GenomeAnalysisTK.jar \
  -R ref.fasta \
  -T SomaticIndelDetector \
  -o indels.vcf \
  -verbose indels.txt \
  -I:normal normal.bam \
  -I:tumor tumor.bam
| Name | Type | Default value | Summary |
|---|---|---|---|
| Required | |||
| --out | VariantContextWriter | stdout | File to write variants (indels) in VCF format |
| Optional | |||
| --bedOutput | File | NA | Lightweight bed output file (only positions and events, no stats/annotations) |
| --filter_expressions | ArrayList[String] | [] | One or more logical expressions. If any of the expressions is TRUE, putative indel will be discarded and nothing will be printed into the output (unless genotyping at the specific position is explicitly requested, see -genotype). Default: T_COV<6\|\|N_COV<4\|\|T_INDEL_F<0.3\|\|T_INDEL_CF<0.7 |
| --maxNumberOfReads | int | 10000 | Maximum number of reads to cache in the window; if number of reads exceeds this number, the window will be skipped and no calls will be made from it |
| --metrics_file | PrintStream | NA | File to print callability metrics output |
| --refseq | String | NA | Name of RefSeq transcript annotation file. If specified, indels will be annotated with GENOMIC/UTR/INTRON/CODING and with the gene name |
| --verboseOutput | File | NA | Verbose output file in text format |
| --window_size | int | 200 | Size (bp) of the sliding window used for accumulating the coverage. May need to be increased to accommodate longer reads or longer deletions. A read fits into the window if its length on the reference (i.e. read length + length of deletion gap(s), if any) is smaller than the window size. Reads that do not fit are ignored, so long deletions cannot be called if the window is too small |
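The --window_size rule in the table above is simple arithmetic, sketched here in Python as a quick way to check whether a given read length and deletion size are compatible with a chosen window. The function names are hypothetical, and the sketch follows only the rule as worded in the summary (read length plus deletion gap lengths must be strictly smaller than the window size); it is not taken from the tool's source.

```python
def reference_span(read_length, deletion_lengths=()):
    """Length a read occupies on the reference: bases plus deletion gaps."""
    return read_length + sum(deletion_lengths)


def fits_window(read_length, deletion_lengths=(), window_size=200):
    """True if the read can be placed in the sliding window."""
    return reference_span(read_length, deletion_lengths) < window_size


# A 100 bp read with a 50 bp deletion spans 150 bp: fits the default window.
short_read_ok = fits_window(100, [50])
# A 150 bp read with a 60 bp deletion spans 210 bp: ignored at the default...
long_read_ok = fits_window(150, [60])
# ...but usable once the window is enlarged.
long_read_ok_wide = fits_window(150, [60], window_size=300)
```

A check like this can help decide up front whether the default of 200 is enough for the read lengths and deletion sizes expected in a data set.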
GATK version 2.3-9-ge5ebf34 built at 2013/01/11 22:47:55.