A sequencing error, or mis-call, occurs when a sequencing method calls one or more bases incorrectly, producing an inaccurate read. Because of the vagaries of molecular biology, no laboratory-based DNA sequencing method is perfectly accurate; all of them are known to mis-call bases occasionally.
The chance of a sequencing error is generally known and quantifiable, thanks to extensive testing and calibration of the sequencing machines. Each base in a read is assigned a quality score, which indicates the confidence that the base has been called correctly. Some sequencing methods are more reliable than others and so give higher quality scores. Sequencing errors are also more likely to appear toward the end of a read, far from the start of the insert, so quality scores there are typically lower.
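As an illustration of how a quality score expresses confidence, the sketch below assumes the widely used Phred scale, in which a score Q corresponds to an error probability of 10^(-Q/10), and assumes quality strings encoded as in FASTQ with an ASCII offset of 33. Neither convention is stated above, and the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Decode a quality string into integer scores (ASCII offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in qual]


# Example quality string for a four-base read; quality drops toward the end.
scores = decode_quality_string("II5+")
probs = [phred_to_error_prob(q) for q in scores]

for q, p in zip(scores, probs):
    print(f"Q{q}: estimated error probability {p:.4f}")
```

Under these assumptions, a score of 20 means roughly a 1 in 100 chance that the base is wrong, and a score of 40 roughly 1 in 10,000.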
Types of sequencing errors
A mismatch is a substitution of one base for another, e.g., an A for a C. Mismatches are different from single-nucleotide polymorphisms (SNPs), which are actual differences in the genome (due to polymorphism). It is not easy to distinguish mismatches from SNPs, especially at low coverage, where too few reads span a position to show whether the difference is consistent. Mismatches are often fixed during error correction.
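The role of coverage can be sketched with a toy classifier. It takes the bases that all reads report at one reference position and checks how often the most common non-reference base appears. The function name and both thresholds (min_depth, min_alt_fraction) are hypothetical values chosen for illustration; real variant callers also weigh quality scores and other evidence.

```python
from collections import Counter


def classify_position(ref_base, pileup, min_depth=10, min_alt_fraction=0.2):
    """Label one reference position from the bases observed across reads.

    pileup is a string with one base per read covering the position.
    The thresholds are illustrative assumptions, not recommended values.
    """
    depth = len(pileup)
    alt_counts = Counter(b for b in pileup if b != ref_base)
    if not alt_counts:
        return "matches reference"
    alt_base, alt_count = alt_counts.most_common(1)[0]
    if depth < min_depth:
        # Too few reads to tell a real difference from a mis-call.
        return "undetermined (low coverage)"
    if alt_count / depth >= min_alt_fraction:
        return f"candidate SNP ({ref_base}>{alt_base})"
    return "likely sequencing error"


print(classify_position("A", "AAAAAAAAAAAAAAAAAAAC"))  # one C in 20 reads
print(classify_position("A", "AAAAACCCCCAAAAACCCCC"))  # C in half of 20 reads
print(classify_position("A", "AAC"))                   # only 3 reads
```

A single discordant read among many is most simply explained as a mis-call, while a base seen in a large share of reads is more likely a true difference; with only a few reads, neither conclusion is safe.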
An indel, short for "insertion/deletion", occurs when a read contains a different number of bases from its reference at some point in the alignment. An insertion occurs when the read contains extra bases relative to the reference, while a deletion occurs when the read is missing bases. Indels, like mismatches, may be true indicators of polymorphism rather than the result of sequencing errors.
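To make the insertion/deletion distinction concrete, the following sketch counts both from a pairwise alignment written as two equal-length strings with "-" marking gaps. This representation and the function name are assumptions made for illustration.

```python
def count_indels(ref_aln, read_aln, gap="-"):
    """Count inserted and deleted bases in a gapped pairwise alignment.

    A gap in the reference row means the read has an extra base (insertion);
    a gap in the read row means the read is missing a base (deletion).
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    inserted = sum(1 for r, q in zip(ref_aln, read_aln) if r == gap and q != gap)
    deleted = sum(1 for r, q in zip(ref_aln, read_aln) if q == gap and r != gap)
    return inserted, deleted


# The read has one extra T after ACGT and is missing the G near the end.
inserted, deleted = count_indels("ACGT-ACGTA", "ACGTTAC-TA")
print(inserted, deleted)
```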
A homopolymer-length error is a type of indel characteristic of the 454 sequencing method. In 454 sequencing, each homopolymer run is called in a single flow and detected as a light signal. The brightness of the light indicates the length of the homopolymer. When the same base appears several times in a row, the exact brightness may be hard to resolve, so that (for example) the sequence AAAAA is called as AAAAAA, or vice versa.
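One simple way to spot this kind of error is to collapse each sequence into runs of identical bases and compare the run lengths. The sketch below does this for a read and a reference that are assumed to cover the same region end to end; the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse a sequence into (base, run_length) pairs, e.g. AAAC -> [(A, 3), (C, 1)]."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def homopolymer_length_differences(ref, read):
    """Return runs whose lengths differ, or None if the sequences differ in other ways."""
    ref_runs = run_length_encode(ref)
    read_runs = run_length_encode(read)
    if [b for b, _ in ref_runs] != [b for b, _ in read_runs]:
        return None  # the order of bases differs, so this is not purely a length error
    return [
        (base, n_ref, n_read)
        for (base, n_ref), (_, n_read) in zip(ref_runs, read_runs)
        if n_ref != n_read
    ]


# AAAAA in the reference is called as AAAAAA in the read.
diffs = homopolymer_length_differences("GCAAAAATG", "GCAAAAAATG")
print(diffs)
```

Each returned tuple gives the base, its run length in the reference, and its run length in the read.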