SomaticCall Manual

SomaticCall is a program that finds single‐base differences (substitutions) between sequence data from tumor and matched normal samples. It is designed to be highly stringent, so as to achieve a low false positive rate. It takes as input a BAM file for each sample, and produces as output a list of differences (somatic mutations).

1. Installation

Download the software
Make sure you have a recent version of GCC (4.3.3 or later) and addr2line
You need GMP with C++ support (this is automatic if GCC is 4.3.3 or later)
You also need ‘samtools’ in your path. (Get samtools here.)
Type ‘./configure’ then ‘make’.
Make sure that the several executables created by make are all in your path.
If you have trouble, please see general help instructions here.

2. Generating BAM files as input

Currently we do not provide instructions for creating BAM files. However, please note the following:

It is absolutely critical that duplicates be flagged correctly. If you find that SomaticCall produces false positives, incorrect flagging of duplicates is the most likely cause.
The BAM file should include unplaced reads.
The quality scores in the BAM file should have been recalibrated.
We anticipate that the SomaticCall algorithm will not work well on data having many indel errors.
For the Illumina platform, we recommend using all reads, not just purity‐filtered (PF) reads.

3. Generating binary reference files files

To generate the REFHEAD.fastb and REFHEAD.lookup files, run

MakeLookupTable SOURCE=g OUT_HEAD=REFHEAD

where g is the fasta file for the genome reference, and REFHEAD is the output ‘head'.

This will create REFHEAD.fastb and REFHEAD.lookup.

4. Call mutations

Index the BAM files using ‘samtools index’.
Invoke SomaticCall with the following options

TUMOR_BAM= the name of your tumor BAM file (ends in .bam)
NORMAL_BAM= the name of your normal BAM file (ends in .bam)
REFHEAD= the files REFHEAD.fasta, REFHEAD.fastb, and REFHEAD.lookup all must exist and will be referred to as REFHEAD (i.e., the common prefix of these three files). See previous section.
DBSNP= name of file containing one line for each known SNP, where each line has two fields: the chromosome name, and the one‐based position of the SNP on the chromosome
MAX_THREADS= maximum number of threads to run in parallel
OUT= the name of the directory where the results should be placed (this directory should not be shared with any other process)

5. Find the mutations

They are in the file `OUT/mutation_reports'. Column one provides zero‐based coordinates.

6. Under the hood: description of the algorithm and intermediate files

Handling of alignments. They are discarded if they are marked as duplicates or assigned mapping quality score 0. The mapping quality score is not otherwise used.
Core statistical computation. We describe a test that is repeated several times in the process. Given aligned reads for the tumor and normal, the code SomaticMutation in SomaticCallTools.cc tests for a likely somatic mutation according to the following criteria:
1. either an adjusted quality score sum in the tumor for the mutant base must be at least 100 or the LOD score for mutant:ref + mutant:mutant vs ref:ref must be at least 6.3;
2. the quality score sum for the mutant base in the normal must be < 50 and the LOD score for ref:ref vs mutant:ref + mutant:mutant must be at least 2.3.
- This test is applied several times as the alignments are modified in the course of the program. Once a putative mutation fails the test, it is not considered again.
- There is also a pretest which checks that the adjusted quality score sum in the tumor is at least 60, but this is intended only as an optimization.
- Note that the adjusted quality score sum test of (i) does not actually affect the final output, but is included for experimental purposes.
Overall steps in the process.
1. First, dbSNP position are excluded, and a list of putative somatic mutations is obtained by applying SomaticMutation (yielding mutation_reports0).
2. Then we require that a graph assembly of the locus has no cycles and that all paths across it have the same length (yielding mutation_reports1).
3. All reads supporting the mutant allele in the tumor are then realigned at high stringency. We seed on 12-mers of frequency up to 1000, allow up to 4 errors (including indels of size up to 2), and require that the next best placement has more than 6 more errors. Then we call SomaticMutation again (yielding mutation_reports2).
4. Next we eliminate mutations for which the alternative base is supported by only one read start position (yielding mutation_reports3).
5. Then we search the entire set of reads from the normal sample with the goal of finding more reads at each putative mutation.
  1. To do this we first examine the given tumor alignments to find 21-mers from the tumor data that are centered at the mutation. Find the most frequent 3 such 21-mers. If there are no such 21-mers, reject the locus. (At this point we create mutation_reports4.)
  2. For each of the two 20-mers in the above 21-mers, find every instance in the normal reads.
  3. For each such instance, align the read to the reference at the locus of the somatic mutation, allowing up to 6 mismatches and indels.
  4. Align the read to the whole genome using 12-mers, not allowing indels. If an alignment is found that is nearly as good or better than the local placement, discard the read.
  5. Return the local placements of the surviving reads.
  6. Add in the additional normal reads and recall mutations.
  This process yields mutation_reports5.
6. Finally we filter for tumor score >= 6.3, normal_score 2.3, yielding mutation_reports5f. We also create a link mutation_reports that points to this. Each entry in these files has a reference to report_dir, a directory that contains files t.visual and n.visual, that can be viewed with less -r to show the reads that support a given mutation.