The quality score, or Q-score, of a base is a measure of confidence in that base's identity. All laboratory-based sequencing methods produce occasional sequencing errors; thus, while a read appears to be a definite sequence of bases, in reality we must treat each base with a degree of uncertainty. The quality score quantifies this uncertainty.
Quality scores are expressed on a logarithmic scale, like a decibel system. The quality score, Q, is related to the probability of a sequencing error, Perror, by the following equation:
- Perror = 1 / 10^(Q/10)
For example, a quality score of 40 (commonly expressed as "Q40") indicates a 1 / 10^4 = 0.01% chance of an error, and thus 99.99% confidence.
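The conversion between Q and Perror can be sketched in a few lines of Python. The function names here are illustrative and not part of Arachne; the inverse relation, Q = -10 * log10(Perror), follows directly from the equation above.

```python
import math


def error_probability(q):
    """Convert a quality score Q to the probability of a sequencing error."""
    return 10 ** (-q / 10)


def quality_score(p_error):
    """Convert an error probability back to a quality score Q."""
    return -10 * math.log10(p_error)


# Print the error probability for a few common quality scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: Perror = {error_probability(q):.4%}")
```

Every increase of 10 in Q lowers the error probability by a factor of ten, which is what makes the scale comparable to decibels.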
Determining quality scores
There is no a priori way to know a base's quality score. A simple heuristic method to estimate quality scores is to align a high-coverage set of reads, perform error correction, and then count the number of errors found. Allowances should be made for a base's location on a read, as sequencing machines are less reliable in the middle of an insert. More sophisticated methods, which examine the parametric output of the sequencing machines, are currently in development.
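The counting heuristic above might look roughly like the following Python sketch. It assumes the alignment and error correction have already been done, so that each read is paired with an equal-length corrected sequence; the function name, the input layout, and the pseudocount (used so that positions with no observed errors still get a finite score) are all assumptions made for illustration, not part of any particular tool.

```python
import math


def estimate_quality_by_position(pairs, pseudocount=1):
    """Estimate a quality score for each read position.

    pairs: iterable of (read, corrected) strings of equal length, where
    `corrected` is the error-corrected sequence the read was aligned to.
    """
    errors = {}
    totals = {}
    for read, corrected in pairs:
        for pos, (observed, expected) in enumerate(zip(read, corrected)):
            totals[pos] = totals.get(pos, 0) + 1
            if observed != expected:
                errors[pos] = errors.get(pos, 0) + 1

    scores = {}
    for pos, total in totals.items():
        # Hypothetical smoothing: the pseudocount keeps Perror above zero.
        p_error = (errors.get(pos, 0) + pseudocount) / (total + 2 * pseudocount)
        scores[pos] = -10 * math.log10(p_error)
    return scores


pairs = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("ACGT", "ACGT")]
scores = estimate_quality_by_position(pairs)
```

Grouping the counts by read position is one simple way to make the allowance for a base's location; with real data, far more reads per position would be needed for the estimates to be meaningful.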
Arachne takes quality scores as an input. Read files in fasta format typically have companion files in qual format that give the corresponding quality score for each base in the read. Note that the quality score is a property of each individual base in a read, rather than of the read itself.
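As an illustration, a qual file can be read with a short Python sketch like the one below. It assumes the commonly used layout in which each record has a fasta-style ">" header line followed by whitespace-separated integer scores, possibly wrapped over several lines; the function name and the sample records are hypothetical, and this is not Arachne's own reader.

```python
import io


def parse_qual(handle):
    """Return a dict mapping each read name to its list of quality scores."""
    scores = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the read name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            scores[name] = []
        elif name is not None:
            scores[name].extend(int(token) for token in line.split())
    return scores


# Made-up sample data standing in for a real qual file.
example = io.StringIO(">read1\n40 40 38 35\n30 22\n>read2\n12 15 20\n")
quals = parse_qual(example)
```

Each list holds one score per base, so its length should match the length of the corresponding read in the fasta file.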
The aggregate quality of a sequence (such as a read or a contig) is defined as the average quality score of all bases in that sequence.
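A minimal Python sketch of this definition follows; the function name is illustrative only.

```python
def aggregate_quality(scores):
    """Return the average quality score of all bases in a sequence."""
    if not scores:
        raise ValueError("cannot average an empty list of quality scores")
    return sum(scores) / len(scores)


read_scores = [40, 30, 20]
print(aggregate_quality(read_scores))
```

This is the mean of the log-scaled scores, as in the definition above, which is not the same as the quality score corresponding to the mean error probability.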