In genomics, a homopolymer is a sequence of identical bases, like AAAA or TTTTTTTT. Homopolymers appear as subsequences in larger sequences; in this case the size of the homopolymer is referred to as the homopolymer length.
Very long homopolymers form repeats and are difficult to sequence. Fortunately they are rare, though they do appear in genomes more often than chance alone would predict, especially in junk DNA.
Homopolymers in 454 sequencing
The 454 sequencing method does not call bases directly. Instead it calls flows, each of which is indicated by a light signal. Each flow represents a homopolymer, and the brightness of the light indicates the length of that homopolymer. Hence the sequence TAAAAA would appear as a dim light marking the T, followed by a much brighter light marking the five A's. The danger in this process is that the brightness is easy to miscalibrate, especially for long homopolymers. As a result, 454 reads often contain homopolymer-length sequencing errors, such as calling AAAAA as AAAAAA or vice versa.
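The flow view of a read can be sketched as a run-length encoding of the sequence. This is only a simplified illustration: the function name `homopolymer_runs` is hypothetical, and the real instrument reports signal intensities rather than exact integer lengths.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Collapse a sequence into (base, run_length) pairs, one per homopolymer."""
    return [(base, sum(1 for _ in group)) for base, group in groupby(seq)]


def expand_runs(runs):
    """Rebuild the base sequence from (base, run_length) pairs."""
    return "".join(base * length for base, length in runs)


runs = homopolymer_runs("TAAAAA")
print(runs)  # [('T', 1), ('A', 5)]

# A homopolymer-length error: the A run is over-called by one base.
miscalled = [("T", 1), ("A", 6)]
print(expand_runs(miscalled))  # TAAAAAA
```

In this picture, a homopolymer-length error changes only the length of one pair, never the order of the bases.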
Weighted homopolymer rate
The weighted homopolymer rate (WHR) of a sequence is a measure of the frequency of homopolymers in the sequence. It is calculated as
- $\mathrm{WHR} = \frac{1}{N} \sum_{i=1}^{N} n_i^2$
where $N$ is the number of homopolymers in the sequence and $n_i$ is the length of the $i$-th homopolymer, so that the summation runs from 1 to $N$. Note that $N$ is NOT the total length of the sequence. For example:
- TGATTCAAGCATTCGATC: This homopolymer-poor sequence has a WHR of (1 + 1 + 1 + 4 + 1 + 4 + 1 + 1 + 1 + 4 + 1 + 1 + 1 + 1 + 1) / 15 = 1.6.
- GGGTGCCCCCAAAATATT: This homopolymer-rich sequence has a WHR of ( 9 + 1 + 1 + 25 + 16 + 1 + 1 + 4 ) / 8 = 7.25.
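The two examples above can be checked with a short script. This is a minimal sketch rather than a reference implementation; the function name `weighted_homopolymer_rate` is hypothetical, and it assumes that an isolated base counts as a homopolymer of length 1, as the examples do.

```python
from itertools import groupby


def weighted_homopolymer_rate(seq):
    """Return the sum of squared homopolymer lengths divided by the
    number of homopolymers (N), not by the sequence length."""
    run_lengths = [sum(1 for _ in group) for _, group in groupby(seq)]
    if not run_lengths:
        raise ValueError("sequence must be non-empty")
    return sum(n * n for n in run_lengths) / len(run_lengths)


print(weighted_homopolymer_rate("TGATTCAAGCATTCGATC"))  # 1.6
print(weighted_homopolymer_rate("GGGTGCCCCCAAAATATT"))  # 7.25
```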
The lowest possible WHR of a sequence is 1, reached when no two adjacent bases are identical; the highest possible is the square of the sequence length, reached when the whole sequence is a single homopolymer (N = 1). A randomly generated sequence in which all four bases are equally likely has an expected WHR of 20/9 ≈ 2.222. Most genomes have WHRs higher than the random value, due to imbalances in GC-content and the presence of junk DNA.
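The value 20/9 can be recovered with a short derivation. It assumes that each base is drawn independently and uniformly from the four bases, and that the sequence is long enough for end effects to be negligible. Under those assumptions a homopolymer ends at each position with probability $p = 3/4$, so its length $n$ follows a geometric distribution:

```latex
\begin{align}
P(n = k) &= p\,(1 - p)^{k-1}, \qquad k = 1, 2, \dots, \qquad p = \tfrac{3}{4} \\
\mathbb{E}[n^2] &= \frac{2 - p}{p^2}
                 = \frac{5/4}{9/16}
                 = \frac{20}{9} \approx 2.222
\end{align}
```

Since the WHR is the mean of $n_i^2$ over all homopolymers, it approaches $\mathbb{E}[n^2]$ as the sequence grows.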