Quality score

From ArachneWiki


The quality score or Q-score of a base is a measure of confidence in that base's identity. All laboratory-based sequencing methods produce occasional errors; thus, although a read appears to be a definite sequence of bases, in reality each base must be treated with a degree of uncertainty. The quality score quantifies this uncertainty.

Quality scores are expressed on a logarithmic scale, like a decibel system. The quality score, $Q$, is related to the probability of a sequencing error, $P_{\text{error}}$, by the following equation:

$$P_{\text{error}} = \frac{1}{10^{Q/10}}$$

For example, a quality score of 40 (commonly expressed as "Q40") indicates a $1 / 10^{4} = 0.01\%$ chance of an error, and thus 99.99% confidence.
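The relationship above can be sketched in a few lines of Python. This is only an illustration of the formula; the function names `error_probability` and `quality_score` are hypothetical and are not part of Arachne.

```python
import math

def error_probability(q: float) -> float:
    """Probability that a base with quality score q was called incorrectly."""
    return 10 ** (-q / 10)

def quality_score(p_error: float) -> float:
    """Inverse mapping: the quality score that corresponds to p_error."""
    return -10 * math.log10(p_error)

# Print the error probability for a few commonly quoted quality scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P_error = {error_probability(q):.6f}")
```

Because the scale is logarithmic, every increase of 10 in the quality score corresponds to a tenfold drop in the error probability.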

Determining quality scores

There is no a priori way to know a base's quality score. A simple heuristic for estimating quality scores is to align a high-coverage set of reads, perform error correction, and then count the number of errors found. Allowances should be made for a base's position within a read, as sequencing machines are less reliable in the middle of an insert. More sophisticated methods, which examine the parametric output of the sequencing machines, are currently in development.
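The counting step of this heuristic might look roughly like the sketch below. It assumes that alignment and error correction have already produced a list of (position in read, was this base an error) observations; that input format, the function names, and the add-one pseudocount used to avoid taking the logarithm of zero are all assumptions made for illustration, not a description of any Arachne module.

```python
import math
from collections import defaultdict

def empirical_quality(errors: int, total: int) -> float:
    """Estimate a quality score from an observed error count.

    An add-one pseudocount (an assumption of this sketch) keeps the
    estimate finite when no errors were observed.
    """
    p_error = (errors + 1) / (total + 2)
    return -10 * math.log10(p_error)

def quality_by_position(observations):
    """Group (position, is_error) pairs by read position and score each group."""
    counts = defaultdict(lambda: [0, 0])  # position -> [errors, total]
    for position, is_error in observations:
        counts[position][0] += int(is_error)
        counts[position][1] += 1
    return {pos: empirical_quality(e, n) for pos, (e, n) in sorted(counts.items())}
```

Binning by position is one simple way to account for the fact that reliability varies along a read.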

For Sanger reads, the program Phred assigns a quality score to each base of a read, computed from the traces. Toward the end of a read, the trace signal tends to get weaker, creating a need for quality trimming.
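As a rough illustration of quality trimming, the sketch below drops trailing bases whose quality falls below a threshold. The threshold of 20 and the function name are assumptions; real trimmers, including whatever Phred or Arachne use, typically apply more careful windowed criteria.

```python
from typing import List, Tuple

def trim_low_quality_tail(bases: str, quals: List[int],
                          min_q: int = 20) -> Tuple[str, List[int]]:
    """Remove trailing bases whose quality score is below min_q."""
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    end = len(quals)
    # Walk backwards from the end of the read until a good base is found.
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return bases[:end], quals[:end]

trimmed_bases, trimmed_quals = trim_low_quality_tail(
    "ACGTAC", [30, 35, 40, 25, 10, 5]
)
print(trimmed_bases, trimmed_quals)
```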

Since quality scores assigned to 454 contigs are on a different scale from Phred scores, Arachne provides a mapping tool, RemapQuals, to translate these scores onto a comparable scale.

In Arachne

Arachne takes quality scores as an input. Read files in fasta format typically have companion files, in qual format, that give the corresponding quality score for each base in the read. Note that the quality score is a property of each individual base in a read, rather than of the read as a whole.
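For readers unfamiliar with qual files, the sketch below parses one under the assumption that it follows the common fasta-like layout: a header line beginning with ">" and the read name, followed by whitespace-separated integer scores that may span several lines. The function name and the sample data are hypothetical, and Arachne's own reader may differ.

```python
def parse_qual(text: str) -> dict:
    """Parse qual-format text into a {read_name: [scores]} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The read name is the first token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].extend(int(token) for token in line.split())
    return records

sample = """>read1
40 40 38 35
30 22
>read2
12 15 40
"""
quals = parse_qual(sample)
print(quals["read1"])
```

Each score lines up with the base at the same index in the matching fasta record, which is what makes the score a per-base property.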

In the source code, the qualvector and vecqualvector objects handle quality scores. They store quality scores as char objects; they must be re-cast as ints for human readability.
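The need for the cast can be illustrated with a loose Python analogy (this is not Arachne's C++ code): scores stored one per byte print as arbitrary characters until they are converted to integers. The helper name `as_ints` is hypothetical.

```python
def as_ints(raw: bytes) -> list:
    """Convert byte-packed quality scores into a readable list of ints."""
    return list(raw)

# Four quality scores packed one per byte, much as a char array would hold them.
packed = bytes([40, 30, 9, 65])

print(packed)           # shown as characters and escapes, hard to read
print(as_ints(packed))  # shown as plain integers
```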

Related terms

The aggregate quality of a sequence (such as a read or a contig) is defined as the average quality score of all bases in that sequence.
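Following this definition, the aggregate quality is simply an arithmetic mean of the per-base scores. A minimal sketch, with a hypothetical function name:

```python
def aggregate_quality(quals) -> float:
    """Average quality score over all bases in a sequence."""
    if not quals:
        raise ValueError("cannot average an empty sequence of scores")
    return sum(quals) / len(quals)

print(aggregate_quality([40, 30, 20]))
```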

The consensus quality at a location is the quality score of the consensus base; it represents the confidence of the consensus algorithm.
