Single Nucleotide Polymorphism
A Single Nucleotide Polymorphism (or SNP, often pronounced snip; also called a point mutation) is a location within a genome at which a single nucleotide may take any one of two or more bases. SNPs are a simple example of polymorphism. Here is an example of two aligned sequences, containing two SNPs:
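As a minimal sketch, the following Python snippet lists the columns at which two aligned sequences differ. The sequences `seq_a` and `seq_b` and the helper `snp_positions` are made up for illustration and are not taken from any real data set; gap characters are assumed to be written as `-` and are not counted as SNPs.

```python
def snp_positions(seq_a, seq_b):
    """Return 0-based columns where two aligned sequences carry different bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        i
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b and a != "-" and b != "-"  # skip gap columns (assumed '-')
    ]


# Hypothetical aligned sequences that differ at exactly two columns.
seq_a = "ACGTTGCAAGCT"
seq_b = "ACGTAGCAAGTT"

for i in snp_positions(seq_a, seq_b):
    print(f"column {i}: {seq_a[i]} / {seq_b[i]}")
```

With these two sequences, the differing columns are 4 (T versus A) and 10 (C versus T).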
Approximately 0.1% of all bases in the human genome are SNPs; the remaining 99.9% are identical in virtually all humans, barring the occasional mutation. Locating SNPs is an important ongoing task in genomics. If all of the SNPs in the human genome can be located, it will no longer be necessary to sequence an individual's complete genome; it will only be necessary to examine the SNPs and determine which base is present at each, and the rest of the genome can be assumed. One caveat to locating SNPs through the assembly process is that it may be difficult to distinguish a genuine SNP from a sequencing error in which a single base is mis-called.
Conservative SNP picking
For some purposes (e.g. genome typing), one would like to be as conservative as possible when picking SNPs, i.e. the false positive rate should be as low as possible, with the understanding that some real events will also be lost. We accommodate this with a three-step approach (implemented in MapSnps and SnpAnalyzer):
Identifying SNPs via read-to-read alignments
First, we look for SNPs between overlapping reads (see MapSnps), using the neighborhood quality score (NQS) criterion (base qualities near the event have to be above a certain threshold). In addition, we require at least two reads voting for one base and at least two voting for another.
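The first step can be sketched in Python as follows. This is not the MapSnps implementation: the function names, the input layout (one `(base, quals, pos)` tuple per read covering the column) and the thresholds (quality 20 at the event, 15 on five flanking bases per side) are assumptions chosen for illustration, since the text only says that qualities must exceed "a certain threshold".

```python
from collections import Counter


def passes_nqs(quals, pos, min_center=20, min_flank=15, flank=5):
    """True if the base at `pos` and its `flank` neighbors on each side
    meet the (assumed) quality thresholds."""
    lo, hi = pos - flank, pos + flank
    if lo < 0 or hi >= len(quals):
        return False  # too close to the read end to judge the neighborhood
    if quals[pos] < min_center:
        return False
    return all(quals[i] >= min_flank for i in range(lo, hi + 1) if i != pos)


def call_snp(observations, min_votes=2):
    """observations: list of (base, quals, pos), one per read covering a column.

    Returns the sorted alleles if at least two different bases each receive
    `min_votes` NQS-passing votes; otherwise returns None.
    """
    votes = Counter(
        base for base, quals, pos in observations if passes_nqs(quals, pos)
    )
    alleles = sorted(base for base, n in votes.items() if n >= min_votes)
    return alleles if len(alleles) >= 2 else None
```

A column covered by two high-quality reads showing A and two showing G would be reported as a candidate; if one of the G reads has a low-quality neighbor, its vote is dropped and the column is rejected.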
Filtering by repeats
Next, we discard all events that lie in or close to repeats. Repeats are identified from repetitive 48-mers, and a SNP must be at least 500 bp away from the nearest repeat on both sides.
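A rough Python sketch of this filter is shown below. It assumes that a "repetitive" 48-mer is simply one that occurs more than once in the assembled sequence, and it ignores reverse complements; both are simplifications, and the function names are hypothetical.

```python
from collections import Counter


def repeat_intervals(seq, k=48):
    """Half-open intervals (start, end) of every k-mer that occurs more than once."""
    n = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n))
    return [(i, i + k) for i in range(n) if counts[seq[i:i + k]] > 1]


def distance_to_interval(pos, start, end):
    """Distance in bases from `pos` to the interval [start, end); 0 if inside."""
    if start <= pos < end:
        return 0
    return start - pos if pos < start else pos - (end - 1)


def filter_by_repeats(snps, intervals, min_dist=500):
    """Keep only SNP positions at least `min_dist` bases from every repeat."""
    return [
        p
        for p in snps
        if all(distance_to_interval(p, s, e) >= min_dist for s, e in intervals)
    ]
```

For a single repeat spanning positions 1000 to 1047, a SNP at position 600 (400 bp upstream) is discarded, while one at position 1547 (exactly 500 bp downstream) is kept.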
Filtering by disagreement
Since we still cannot completely rule out overcollapsing that went undetected by the repeat finder, we also require that a certain minimum number of reads agree with the contig consensus in the SNP region (extending 48 bases to each side), and that the number of disagreeing reads (which must match the contig consensus exactly over the 48 bases on each side) lies between a minimum and a maximum. By doing this, we discard regions in which the read coverage is unusually high or low. All surviving SNPs are then considered "trustworthy", meaning that we assume that the false positive rate is now very low.
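The disagreement filter might be sketched in Python as below. This is an illustration under stated assumptions rather than the SnpAnalyzer code: reads are treated as ungapped and given as `(start, sequence)` pairs in contig coordinates, and the count thresholds `min_agree`, `min_disagree` and `max_disagree` are placeholder values, since the text does not specify them.

```python
def classify_read(consensus, pos, read_start, read_seq, flank=48):
    """Label an ungapped read as 'agree', 'disagree', or None (not usable).

    'agree':    the read matches the consensus over the whole SNP region.
    'disagree': the read differs at `pos` but matches both flanks exactly.
    """
    lo, hi = pos - flank, pos + flank + 1
    if lo < 0 or hi > len(consensus):
        return None  # SNP region runs off the contig
    if read_start > lo or read_start + len(read_seq) < hi:
        return None  # read does not span the full SNP region
    window = read_seq[lo - read_start:hi - read_start]
    ref = consensus[lo:hi]
    if window == ref:
        return "agree"
    if window[:flank] == ref[:flank] and window[flank + 1:] == ref[flank + 1:]:
        return "disagree"
    return None  # mismatch in a flank: the read counts for neither side


def is_trustworthy(consensus, pos, reads,
                   min_agree=2, min_disagree=2, max_disagree=10, flank=48):
    """Apply the (assumed) count thresholds to the reads covering one SNP."""
    labels = [classify_read(consensus, pos, s, r, flank) for s, r in reads]
    n_agree = labels.count("agree")
    n_disagree = labels.count("disagree")
    return n_agree >= min_agree and min_disagree <= n_disagree <= max_disagree
```

The upper bound on disagreeing reads is what rejects unusually deep pileups, which may indicate collapsed repeat copies, while the lower bounds reject thinly covered regions.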