From an assembler's perspective, each new genome is different from all others and holds its own surprises. So, when writing WGA software, one tries to anticipate as many future situations as possible - and this is a big part of the reason why Arachne is extremely modular, each module being its own executable with typically many knobs and levers: each module performs a (sometimes very narrowly defined) task, which is only a fraction of an algorithm, the sum (and sequence) of which makes up the art of Whole Genome Assembly.
In this section, we will group algorithms into certain tasks and discuss strategies for solving problems.
One of the most important parts in setting up the data for assembly is to get the data as clean as possible. Reads, as they come from the sequencing machines, may still include bases that stem from the vector as well as a tail containing only very low quality bases. Since these are two different problems, we distinguish between:
- Vector trimming: removing vector bases from the beginning of the read.
- Quality trimming: removing low quality bases from the end of the read.
Overlap finding is the problem of identifying which reads might have originated from the same part of the genome and hence align to each other. This is non-trivial, since a.) the number of reads can be very high, b.) repeats will add false overlaps, and c.) polymorphism as well as sequencing errors might hide true overlaps.
Since reads contain sequencing errors (whether labeled as being of low base quality or not), hence it is desirable to do error correction in order to identify and fix these errors, which in turn will make subsequent modules run more efficiently.
Contig layout is generally referred to as the phase in which reads get put together according their overlaps to form contigs. There are very different strategies to do this, as well as different stages in which it is required to take either the full set or a subset of reads and turn them into contigs.
The layout creates read locations, estimating the positions of reads in a contig relative to each other. Building consensus now means to take this information, examine all read bases of all reads and create consensus sequence, which is the final hypothesis for what the true DNA sequence of the orgnaism looks like.
Read libraries come with estimates for mean insert sizes and the standard deviation from the mean. In order to improve the accuracy, these numbers can be re-estimated based on the read locations in an assembly.
Scaffolding and breaking
Scaffolding is the process that ties contigs together based on insert linking information. The tricky parts are to ignore deficient inserts (chimeras), not to be fooled by reads that were placed into the wrong copy of a repeat, and to not aggrevate problems caused by wrong joins in contigs.
Contig gap stuffing
Contig gap stuffing is a technique to improve the connnectivity by a.) joining consecutive contigs if they overlap, and b.) building bridges between contigs based on reads that link off into a contig gap.
Read placement re-arrangement
Repeats as well as polymorphism can cause reads to either end up in the wrong place in the assembly or not be placed at all. Since accounting for as many reads in the data set is important (for example to estimate how much of the genome is really captured), read placemment re-arrangement (removing, adding, and moving of reads) is a tool to increase the number of reads assembled consistently with their partners.
Optical Map Alignments
Optical maps (or restriction maps) can help assigning scaffolds to chromosomes or linkage maps. Optical map alignments compare the position of restriction sites in a scaffold with the distance estimates provided by the map, thus finding full or partial overlaps that can be scored and ranked.
Looking at assemblies
Any assembly is a large pile of numbers: read ids, locations, link stretches, contig and scaffold sizes etc. To get any assessment, whether on a very local or a global level, nothing is as intuitive as visualizisation. Arachne provides a number of tools to inspect read locations, contigs, links, repeats - and all this from different distances away.
Identifying assembly problems
Automatically identifying assembly problems, even if those cannot be fixed, is an integral part of Arachne, since we need to caution anyone looking at a draft genome that certain regions might contain more or less severe assembly errors. Arachne has a whole suite of tools that look for pathologies and subsequently assign probabilities for the occurance of errors throughout the genome.
Finding repeats, especially in a large, repetitive genome, poses somewhat of a computational challenge. Arachne provides two different strategies to identify repeats, one by examining the consesus, another by analyzing k-mers in reads.
Estimating the polymorphism rate
Assuming that the consensus sequence reflects one haplotype (or frankentype) and that reads from both haplotypes were used and placed in the assembly, estimating the polymorphism rate can be done by identifying reads that - consistently with each other - disagree with consensus.
Aligning an assembly to a reference
If there is a reference (e.g. finished BACs, or a closely related strain), it might be helpful to align scaffolds to the reference. Arachne provides a technique to assign one unique placement (if there exists any alignment at all) to each scaffold in the assembly, thus resolving multiple possible alignments caused by repeat sequence.
Assuming that the telomeric motif is a short sequence that gets repeated in tandem (which is the case for most fungi), searching for this motif can be done by aligning reads to themselves. Subsequently, the problem of identifying telomeres is then reduced to examining reads that link off of the ends of scaffolds.
Mapping supercontigs to chromosomes
Mapping supercontigs to chromosomes makes the draft assembly usable for many purposes: convenient genome browsing, synteny etc. Arachne provides mechanisms to assign scaffolds via genetic markers, FISH, or an optical map.