
SomaticIndelDetector

Tool for calling indels in Tumor-Normal paired sample mode; this tool supports single-sample mode as well, but this latter functionality is now superseded by UnifiedGenotyper.

Category Cancer-specific Variant Discovery Tools


Introduction

This is a simple, counts-and-cutoffs based tool for calling indels from aligned (preferably MSA-cleaned) sequencing data. Supported output formats are BED, extended verbose output (tab-separated), and VCF. The latter two include additional statistics such as mismatches and base qualities around the calls, read strandedness (how many forward/reverse reads support the ref and indel alleles), etc. It is highly recommended to use these additional statistics to post-filter the calls, as the tool is tuned for sensitivity: it will attempt to "call" anything remotely reasonable based only on read counts, and it generates the additional metrics so that post-processing tools can make the final decision.
By default, calls are made from a matched tumor-normal pair of samples. In this case, two (sets of) input bam files must be specified using tagged -I command line arguments: normal and tumor bam(s) must be passed with -I:normal and -I:tumor, respectively. Indels are called from the tumor sample and annotated as germline if even weak evidence for the same indel (not necessarily a confident call) exists in the normal sample, or as somatic if the normal sample has coverage at the site but no indication of an indel. Note that, strictly speaking, calling is not even attempted in the normal sample: if there is an indel in normal that is not detected or does not pass a threshold in the tumor sample, it will not be reported.
To make indel calls and associated metrics for a single sample, run the tool with the --unpaired flag. Input bam tagging is not required in this case, and tags are completely ignored if still used: all input bams are merged on the fly and assumed to represent a single sample (the tool does not check the sample id in the read groups).
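
The germline/somatic annotation rule for paired mode can be sketched in a few lines of Python. This is only an illustration of the rule as worded above, not the tool's code: the function name, the labels, and the choice to model "even weak evidence" as "at least one normal read carrying the indel" are assumptions, and the text does not say what happens when the normal sample has no coverage, so the sketch returns None in that case.

```python
from typing import Optional


def annotate_tumor_call(normal_cov: int, normal_indel_reads: int) -> Optional[str]:
    """Label an indel that has already been called in the tumor sample.

    normal_cov         -- coverage of the normal sample at the site
    normal_indel_reads -- normal reads showing the same indel (hypothetical input)
    """
    # Assumption: any supporting read in normal counts as "weak evidence".
    if normal_indel_reads > 0:
        return "GERMLINE"
    # Normal is covered at the site but shows no sign of the indel.
    if normal_cov > 0:
        return "SOMATIC"
    # Not specified in the text: no normal coverage at all.
    return None


print(annotate_tumor_call(normal_cov=20, normal_indel_reads=1))  # GERMLINE
print(annotate_tumor_call(normal_cov=20, normal_indel_reads=0))  # SOMATIC
```

Note that the function takes a tumor call as given: an indel present only in the normal sample never reaches this step, which matches the statement that such indels are not reported.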
Which putative calls make it into the output file(s) is controlled by an expression or list of expressions passed with the -filter flag: if any of the expressions evaluates to TRUE, the site is discarded. Otherwise the putative call and all associated statistics are printed to the output. Expressions recognize the following variables (in paired-sample somatic mode the variables are prefixed with T_ and N_ for tumor and normal, e.g. N_COV and T_COV are defined instead of COV): COV, the coverage at the site; INDEL_F, the fraction of reads supporting the consensus indel at the site (with respect to total coverage); INDEL_CF, the fraction of reads with the consensus indel with respect to all reads with an indel at the site; CONS_CNT, the count of reads supporting the consensus indel at the site. Conventional arithmetic and logical operations are supported. For instance, N_COV<4||T_COV<6||T_INDEL_F<0.3||T_INDEL_CF<0.7 instructs the tool to output only indel calls with at least 30% observed allelic fraction and with the consensus indel making up at least 70% of all indel observations at the site, and only at sites where tumor coverage and normal coverage are at least 6 and 4, respectively.
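
To make the variable definitions concrete, here is a small Python sketch that computes them from raw read counts and then applies the example expression. It is a hand transcription for illustration, not the tool's expression evaluator; the helper names are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here.

```python
def site_metrics(cov, indel_reads, cons_cnt, prefix=""):
    """Compute the filter variables for one sample at one site.

    cov         -- total coverage at the site
    indel_reads -- reads carrying any indel at the site
    cons_cnt    -- reads supporting the consensus indel
    prefix      -- "T_" or "N_" in paired mode, "" in --unpaired mode
    """
    return {
        prefix + "COV": cov,
        prefix + "CONS_CNT": cons_cnt,
        # Fraction of all reads that support the consensus indel.
        prefix + "INDEL_F": cons_cnt / cov if cov else 0.0,
        # Fraction of indel-carrying reads that agree with the consensus.
        prefix + "INDEL_CF": cons_cnt / indel_reads if indel_reads else 0.0,
    }


def is_discarded(m):
    """Transcription of N_COV<4||T_COV<6||T_INDEL_F<0.3||T_INDEL_CF<0.7."""
    return (m["N_COV"] < 4 or m["T_COV"] < 6
            or m["T_INDEL_F"] < 0.3 or m["T_INDEL_CF"] < 0.7)


normal = site_metrics(cov=15, indel_reads=0, cons_cnt=0, prefix="N_")

# 8 of 20 tumor reads support the consensus indel; 9 reads carry some indel.
kept = {**normal, **site_metrics(cov=20, indel_reads=9, cons_cnt=8, prefix="T_")}
print(is_discarded(kept))     # False: INDEL_F = 0.40, INDEL_CF ~ 0.89

# Only 4 of 20 tumor reads support the consensus indel.
dropped = {**normal, **site_metrics(cov=20, indel_reads=9, cons_cnt=4, prefix="T_")}
print(is_discarded(dropped))  # True: INDEL_F = 0.20 < 0.3
```

Because the expression describes what to discard, every threshold appears inverted relative to the acceptance criteria: "at least 30% allelic fraction" is written as T_INDEL_F<0.3.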

Input

Tumor and normal bam files (or single sample bam file(s) in --unpaired mode).

Output

Indel calls with associated metrics.

Examples

 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SomaticIndelDetector \
   -o indels.vcf \
   -verbose indels.txt \
   -I:normal normal.bam \
   -I:tumor tumor.bam
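
When a -filter expression is added, characters such as < and | are shell metacharacters and must be quoted. One way to sidestep quoting mistakes is to assemble the argument list in a script. The Python sketch below does this for both paired and --unpaired mode using only the flags documented on this page; the helper name build_command and the file names are hypothetical, and the untagged -I in unpaired mode follows from the statement that tagging is not required there.

```python
import shlex


def build_command(reference, out_vcf, bams, normal_bams=None, filter_expr=None):
    """Assemble a SomaticIndelDetector command line as an argument list.

    bams        -- tumor bam(s) in paired mode, or all bams of the single
                   sample when normal_bams is not given (--unpaired mode)
    normal_bams -- normal bam(s); omit to run in --unpaired mode
    """
    cmd = ["java", "-Xmx2g", "-jar", "GenomeAnalysisTK.jar",
           "-R", reference, "-T", "SomaticIndelDetector", "-o", out_vcf]
    if normal_bams:
        for bam in normal_bams:
            cmd += ["-I:normal", bam]
        for bam in bams:
            cmd += ["-I:tumor", bam]
    else:
        cmd.append("--unpaired")
        for bam in bams:
            cmd += ["-I", bam]
    if filter_expr:
        cmd += ["-filter", filter_expr]
    return cmd


paired = build_command("ref.fasta", "indels.vcf", ["tumor.bam"],
                       normal_bams=["normal.bam"],
                       filter_expr="T_COV<6||N_COV<4||T_INDEL_F<0.3||T_INDEL_CF<0.7")
single = build_command("ref.fasta", "indels.vcf", ["sample.bam"],
                       filter_expr="COV<6||INDEL_F<0.3")

# shlex.join shows the shell-quoted form; pass the list itself to
# subprocess.run(cmd, check=True) to execute without a shell.
print(shlex.join(paired))
print(shlex.join(single))
```

Note that the unpaired expression uses the unprefixed variable names (COV, INDEL_F), since the T_ and N_ prefixes apply only in paired mode.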
 

SomaticIndelDetector specific arguments

Name                  Type                  Default value  Summary
Required
--out                 VariantContextWriter  stdout         File to write variants (indels) in VCF format
Optional
--bedOutput           File                  NA             Lightweight bed output file (only positions and events, no stats/annotations)
--filter_expressions  ArrayList[String]     []             One or more logical expressions. If any of the expressions is TRUE, putative indel will be discarded and nothing will be printed into the output (unless genotyping at the specific position is explicitly requested, see -genotype). Default: T_COV<6||N_COV<4||T_INDEL_F<0.3||T_INDEL_CF<0.7
--maxNumberOfReads    int                   10000          Maximum number of reads to cache in the window; if number of reads exceeds this number, the window will be skipped and no calls will be made from it
--metrics_file        PrintStream           NA             File to print callability metrics output
--refseq              String                NA             Name of RefSeq transcript annotation file. If specified, indels will be annotated with GENOMIC/UTR/INTRON/CODING and with the gene name
--verboseOutput       File                  NA             Verbose output file in text format
--window_size         int                   200            Size (bp) of the sliding window used for accumulating the coverage. May need to be increased to accommodate longer reads or longer deletions. A read can be fit into the window if its length on the reference (i.e. read length + length of deletion gap(s) if any) is smaller than the window size. Reads that do not fit will be ignored, so long deletions cannot be called if the window is too small

Additional capabilities

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals. This capability is available to all GATK walkers.

Argument details

--bedOutput / -bed ( File )

Lightweight bed output file (only positions and events, no stats/annotations).

--filter_expressions / -filter ( ArrayList[String] with default value [] )

One or more logical expressions. If any of the expressions is TRUE, putative indel will be discarded and nothing will be printed into the output (unless genotyping at the specific position is explicitly requested, see -genotype). Default: T_COV<6||N_COV<4||T_INDEL_F<0.3||T_INDEL_CF<0.7.

--maxNumberOfReads / -mnr ( int with default value 10000 )

Maximum number of reads to cache in the window; if number of reads exceeds this number, the window will be skipped and no calls will be made from it.

--metrics_file / -metrics ( PrintStream )

File to print callability metrics output.

--out / -o ( VariantContextWriter with default value stdout )

File to write variants (indels) in VCF format.

--refseq / -refseq ( String )

Name of RefSeq transcript annotation file. If specified, indels will be annotated with GENOMIC/UTR/INTRON/CODING and with the gene name.

--verboseOutput / -verbose ( File )

Verbose output file in text format.

--window_size / -ws ( int with default value 200 )

Size (bp) of the sliding window used for accumulating the coverage. May need to be increased to accommodate longer reads or longer deletions. A read can be fit into the window if its length on the reference (i.e. read length + length of deletion gap(s) if any) is smaller than the window size. Reads that do not fit will be ignored, so long deletions cannot be called if the window is too small.
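
The fit rule can be checked by hand before choosing a window size. The Python sketch below applies the rule exactly as worded above (read length plus deletion gaps, strictly smaller than the window); the function name is hypothetical and the tool's internal bookkeeping may differ in detail.

```python
def fits_in_window(read_length, deletion_lengths=(), window_size=200):
    """Return True if a read's span on the reference fits the sliding window.

    read_length      -- number of bases in the read
    deletion_lengths -- lengths of the deletion gaps in the read's alignment
    window_size      -- value of --window_size (default 200)
    """
    reference_span = read_length + sum(deletion_lengths)
    # The description says "smaller than", so equality does not fit.
    return reference_span < window_size


print(fits_in_window(100))                          # True
print(fits_in_window(150, [60]))                    # False: span 210 >= 200
print(fits_in_window(150, [60], window_size=300))   # True
```

In the second case a 150 bp read spanning a 60 bp deletion would be ignored at the default setting, so the deletion could not be called; raising -ws to 300 lets the read in.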



GATK version 2.3-9-ge5ebf34 built at 2013/01/11 22:47:55.