A repeat is a sequence that appears multiple times, identically or near-identically, within a larger sequence. An assembly object (such as a genome) that contains repeats is called repetitive or redundant. The sequence that appears repeatedly may also be termed a repeat element; the number of times a repeat element appears is called its copy number.
Repeats in WGA
Repeats pose a serious problem for Whole Genome Shotgun Assembly, especially during the overlap phase. It is impossible to distinguish between shotgun reads sequenced from different but identical-looking regions. Such reads may align to create a false overlap despite being far away from each other in the genome; this is called a repeat overlap and can easily cause misassemblies. Some genomes are so repetitive as to prohibit finding true overlaps in them, making them virtually impossible to assemble using WGA.
If a genome contains a repeat that is longer than any input read, that repeat forms a barrier that overlaps alone cannot overcome, because no single read spans the repeat and anchors it in unique sequence on both sides. (This is why it is so hard to make an assembly out of Solexa reads!) In an assembly with good connectivity, it may be possible to use read pairing information to bridge repetitive regions and sort out the different versions of each repeat.
- TagRepeatReads flags repetitive reads, storing the flags in the file reads.is_repetitive. Note that "repetitiveness", in the context of a read, means containing kmer sequences that appear often in other reads, rather than containing the same sequence many times within the same read.
- Assemblator treads carefully around repetitive reads, keeping them out of contig consensus.
- TagRepeats finds repetitive kmers and merges them.
- RunMarkup provides some useful information about the repetitiveness of an assembly.
- DisplaySupercontig indicates repeats visually. Reads tagged by TagRepeatReads are drawn in dark gray. Furthermore, any short region with unexpectedly high coverage, as indicated by the height of the reads drawn in, is likely to be a repeat.
- FindRepeatFamilies identifies the common ancestor of repeat families by analyzing reads.
Consensus Repeat Tagging
Even if repeats are not assembled correctly, there is still a lot of information that can be found in the contig consensus sequence:
- the approximate k-mer repetitivity
- the degree of interspersion (i.e. local repeats vs ubiquitous repeats)
- the approximate structure
- the approximate identity of repeats
- sequence that is correct enough to identify and classify repeats
Ideally, what one would like to look at is a picture that annotates the genome where sequence is repetitive, but also captures where the other copies are located in the genome. For assembly purposes, we also want to only consider perfect or at least highly identical repeats that exceed a certain minimum length.
Note: we are interested in all repeats, so we specifically do not want to mask high-copy number k-mers!
- Module: TagRepeats
To find all relevant repeats, we first extract all overlapping k-mers (48-mers, by default) in contig consensus sequence and store them in a table. Sorting this table alphabetically now allows for quick binary searches to find the multiplicity of each k-mer (forward and reverse complement).
Note: k-mer tables can get large for large genomes. By removing all unique k-mers in place before tagging, even genomes of 3.6 Gbp can be processed.
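The table lookup described above can be sketched as follows. This is a minimal illustration, not the TagRepeats implementation: the function names (build_kmer_table, occurrences, multiplicity, drop_unique) are hypothetical, the table is a plain Python list of (k-mer, contig, position) tuples, and unique k-mers are filtered into a new list rather than removed in place.

```python
from bisect import bisect_left, bisect_right

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_table(contigs, k=48):
    """Collect every overlapping k-mer with its origin, sorted alphabetically."""
    table = []
    for contig_id, seq in enumerate(contigs):
        for pos in range(len(seq) - k + 1):
            table.append((seq[pos:pos + k], contig_id, pos))
    table.sort()
    return table


def occurrences(table, kmer):
    """Binary-search the sorted table for all entries of one k-mer."""
    lo = bisect_left(table, (kmer,))
    hi = bisect_right(table, (kmer, float("inf")))
    return table[lo:hi]


def multiplicity(table, kmer):
    """Count forward and reverse-complement occurrences of a k-mer."""
    count = len(occurrences(table, kmer))
    rc = reverse_complement(kmer)
    if rc != kmer:  # a palindromic k-mer must not be counted twice
        count += len(occurrences(table, rc))
    return count


def drop_unique(table):
    """Keep only entries whose k-mer occurs at least twice (either strand)."""
    return [entry for entry in table if multiplicity(table, entry[0]) >= 2]
```

For example, with contigs = ["AACCGGA", "TTGGTT"] and k=4, the k-mer AACC has multiplicity 2, because its reverse complement GGTT occurs in the second contig.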
Since we also store the origin of the k-mers in the table, we can now retrieve this information and store it in an array as we walk along a contig. The actual data structure that we record for each non-unique k-mer match now contains
- the start position on the current contig
- matching contig (the contig the k-mer in the table came from)
- starting position on the matching contig
- forward/reverse complement match
By sorting this list using the proper criteria, all perfect matches will be located in contiguous positions in the list, so that we can merge these to reconstruct the true length of the repeats.
Note: we do allow for a single-base mismatch between two perfect k-mer matches and merge the two into a single repeat. Since keeping track of all repeats can be costly, doing this also helps reduce the memory footprint.
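The sort-and-merge step can be sketched as follows, under stated assumptions. The record fields mirror the four items listed above, but the type names (KmerMatch, Repeat), the sort key, and the convention that reverse-complement matches are reported in forward-strand coordinates of the matching contig are all hypothetical choices made for this illustration. Consecutive k-mer hits of one perfect repeat share a diagonal (start minus match start for forward hits, start plus match start for reverse hits), so sorting by diagonal places them next to each other. A single mismatched base destroys k consecutive k-mers, so two runs separated by one mismatch have starts that differ by k + 1.

```python
from typing import NamedTuple


class KmerMatch(NamedTuple):
    start: int         # start position on the current contig
    match_contig: int  # contig the table k-mer came from
    match_start: int   # start position on the matching contig
    reverse: bool      # True for a reverse-complement match


class Repeat(NamedTuple):
    start: int
    length: int
    match_contig: int
    match_start: int   # lowest coordinate on the matching contig
    reverse: bool


def _diagonal(m):
    # Constant along a perfect match: the two coordinates move together for
    # forward hits and in opposite directions for reverse-complement hits.
    return m.start + m.match_start if m.reverse else m.start - m.match_start


def _close(first, last, k):
    length = last.start - first.start + k
    match_start = last.match_start if first.reverse else first.match_start
    return Repeat(first.start, length, first.match_contig, match_start, first.reverse)


def merge_matches(matches, k=48, allow_mismatch=True):
    """Merge k-mer hits on the same diagonal into maximal repeats."""
    max_step = k + 1 if allow_mismatch else 1
    ordered = sorted(
        matches, key=lambda m: (m.match_contig, m.reverse, _diagonal(m), m.start)
    )
    repeats = []
    run_key = run_first = run_last = None
    for m in ordered:
        key = (m.match_contig, m.reverse, _diagonal(m))
        if key == run_key and m.start - run_last.start <= max_step:
            run_last = m  # extends the current run
            continue
        if run_first is not None:
            repeats.append(_close(run_first, run_last, k))
        run_key, run_first, run_last = key, m, m
    if run_first is not None:
        repeats.append(_close(run_first, run_last, k))
    return repeats
```

With k=4, forward hits starting at 10, 11, 12 and 17 on the same diagonal merge into one repeat of length 11 when allow_mismatch is True, and into two repeats (lengths 6 and 4) when it is False.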
Now that we have all the information about all repeats, we also want to be able to look at it quickly. DisplaySupercontig can display repeats as bars under the contig consensus, each bar spanning one repeat and color-coded by matching scaffold, which makes it easier to spot longer, not completely contiguous repeats between scaffolds.
Repeat Read Tagging
- Module: TagRepeatReads
Another way to identify repeats, which does not rely on any assembly being correct, is to examine k-mers in the read set. If more than half of the k-mers in a read have a multiplicity exceeding 2.5 times the expected coverage, the read is declared "repetitive". In DisplaySupercontig, these reads appear in dark gray.
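A minimal sketch of this rule is given below. It is not the TagRepeatReads code: the function names are hypothetical, the threshold is read as "more than 2.5 times the expected coverage" (one interpretation of the description above), and counting each k-mer together with its reverse complement is an assumption made because reads come from both strands.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the alphabetically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def read_kmers(read, k):
    """Yield the canonical form of every overlapping k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield canonical(read[i:i + k])


def tag_repeat_reads(reads, k, expected_coverage, factor=2.5):
    """Return one boolean per read: True if the read looks repetitive."""
    counts = Counter(km for read in reads for km in read_kmers(read, k))
    threshold = factor * expected_coverage
    flags = []
    for read in reads:
        kms = list(read_kmers(read, k))
        high = sum(1 for km in kms if counts[km] > threshold)
        # Repetitive means more than half of the k-mers are over-represented;
        # reads shorter than k have no k-mers and are never flagged.
        flags.append(bool(kms) and 2 * high > len(kms))
    return flags
```

Holding the whole count table in a Counter is only practical for small read sets; it is used here to keep the rule itself visible.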
- Module: FindRepeatFamilies
It is sometimes of interest to classify repeats in a genome into families and create a common ancestor copy. This is again done by examining the read set rather than an assembly: first, we construct a sorted k-mer table from all k-mers found in repetitive reads (the set of reads used can be specified via command line option). Then, each k-mer that appears at a multiplicity higher than a threshold is extended to the left and right with the most popular k-mer on each side, until the popularity of extensions drops below a threshold. In order not to reconstruct the same family over and over again, we tag all used k-mers as off-limits for other families. NOTE: this approach will also find the rDNA and, in most cases, the mitochondrial DNA. Blasting the families against all known organisms can quickly identify homologues in other genomes.
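The greedy extension can be sketched as follows. This is an illustration under simplifying assumptions, not the FindRepeatFamilies code: the names and the two thresholds (min_seed, min_ext) are hypothetical, k-mers are counted with a Counter instead of a sorted table, reverse complements are ignored, and k is assumed to be at least 2 with min_ext at least 1.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads (forward strand only)."""
    return Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )


def _extend(seq, counts, used, k, min_ext, rightwards):
    """Grow seq one base at a time using the most popular neighboring k-mer."""
    while True:
        if rightwards:
            candidates = [seq[-(k - 1):] + base for base in "ACGT"]
        else:
            candidates = [base + seq[:k - 1] for base in "ACGT"]
        best = max(candidates, key=lambda km: counts[km])
        # Stop when popularity drops or the k-mer already belongs to a family.
        if counts[best] < min_ext or best in used:
            return seq
        used.add(best)
        seq = seq + best[-1] if rightwards else best[0] + seq


def find_repeat_families(reads, k, min_seed, min_ext):
    """Return one greedily extended consensus sequence per repeat family."""
    counts = count_kmers(reads, k)
    used = set()  # k-mers that are off-limits for later families
    families = []
    for seed, n in counts.most_common():
        if n < min_seed:
            break
        if seed in used:
            continue
        used.add(seed)
        seq = _extend(seed, counts, used, k, min_ext, rightwards=True)
        seq = _extend(seq, counts, used, k, min_ext, rightwards=False)
        families.append(seq)
    return families
```

Marking k-mers as used also guarantees termination on tandem or circular sequence, since an extension that loops back onto itself stops at the first k-mer it has already consumed.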
Why do repeats exist?
Certain nucleotide structures tend to recur in genomes because they serve biological functions. One class of biologically important repeats is sequence motifs, such as the TATA box (so called because its consensus sequence contains the bases TATA), a core promoter element found upstream of the transcription start site of many eukaryotic genes. Telomeres, which in most eukaryotes consist of many tandem copies of a short motif, are another example.
One common cause of repeats, found in a large variety of genomes, is transposable elements. They range from a few hundred bases (e.g. SINE) up to several thousand (e.g. transposons). Typically, isolated repeats, even those of 15 kbp or more, do not cause trouble for WGA as long as the flanking sequence is unique; however, clusters of mobile elements, which in some cases span several hundred thousand base pairs (longer than the longest insert size), can lead to situations in which the correct sequence cannot be resolved. Also, due to the high copy count of mobile elements, it is not guaranteed that Arachne can even detect all overlaps between reads from these regions (Arachne imposes a cutoff to mask extremely popular k-mers in reads when looking for overlaps between reads). In addition, it can be ambiguous which unique sequence flanks the repeat cluster on either side.
Another common form of repeat is the long duplication, typically with the copies placed relatively close to each other. Unlike with clusters of mobile elements, we have the advantage that, most likely, all correct (and incorrect) overlaps between the reads involved are known. This makes the assembly task easier, and even perfect repeats of ~30 kbp can be resolved.
Tandem repeats (i.e. many adjacent copies of a single motif) are another typical repeat structure. The size of the motif can vary dramatically (from 6 bp up to several thousand), as can the identity between the copies. Tandem repeats that span long stretches pose a major problem for Whole Genome Assembly.