Whole Genome Shotgun Assembly
Whole Genome Shotgun Assembly (often called simply Whole Genome Assembly, or WGA) is the task of reconstructing the sequence of an organism's entire genome by ordering and orienting reads. Reads are short snippets of DNA sequence that come from random locations in the genome; they are typically obtained through paired-end sequencing of inserts. The inserts range in size from a few thousand base pairs up to 200,000 bp and can be produced by any of several methods.
Arachne approaches the task of WGA by organizing reads into successively larger assembly objects. Reads are combined into contigs, so called because they consist of many reads aligned together to create a contiguous sequence of base pairs. These are then used to make supercontigs, which consist of many contigs separated by gaps of estimated sizes. In practice, genomes are always sequenced to high levels of coverage, allowing for overlaps and redundancy checks. It is not at all uncommon to amass tens of millions of reads before attempting whole genome assembly.
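The hierarchy of assembly objects described above can be pictured with a few small data types. The following is only an illustrative sketch: the class and field names (`Read`, `Contig`, `Supercontig`, `gaps`) are hypothetical and are not Arachne's actual data structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    """A single snippet of sequence (hypothetical representation)."""
    name: str
    bases: str


@dataclass
class Contig:
    """Many reads aligned together into one contiguous sequence."""
    reads: List[Read]
    consensus: str  # the contiguous sequence of base pairs

    def length(self) -> int:
        return len(self.consensus)


@dataclass
class Supercontig:
    """Ordered contigs separated by gaps of estimated size."""
    contigs: List[Contig]
    gaps: List[int]  # gaps[i] = estimated gap between contigs[i] and contigs[i + 1]

    def __post_init__(self) -> None:
        # There is exactly one gap between each pair of neighboring contigs.
        if len(self.gaps) != max(len(self.contigs) - 1, 0):
            raise ValueError("need exactly one gap per pair of adjacent contigs")

    def estimated_length(self) -> int:
        # Total span = sequenced bases plus estimated gap sizes.
        return sum(c.length() for c in self.contigs) + sum(self.gaps)
```

For example, a supercontig holding two contigs of 8 bp and 5 bp separated by an estimated 100 bp gap would report an estimated length of 113 bp.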
Challenges to WGA
Whole genome assembly is akin to solving an enormous one-dimensional jigsaw puzzle. The biological and stochastic nature of the problem leads to some special difficulties, listed below. In the jigsaw puzzle analogy, random chance represents missing pieces. Cloning bias represents a tendency of some regions of the puzzle to resist being cut. Sequencing errors represent miscut pieces. Redundancy represents a repetitive pattern in the puzzle, so that a piece can fit in multiple places and distinct pieces may appear identical. Polymorphism represents two slightly different puzzles superimposed on each other.
Because reads come from random locations, random chance alone guarantees that some regions will simply not be represented unless the coverage is effectively infinite. This leads to gaps in the assembly.
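The effect of random chance can be estimated with a simple calculation. The sketch below assumes an idealized model that the text does not spell out: reads land uniformly at random, so the number of reads covering a given base is approximately Poisson-distributed with mean equal to the coverage c, and the chance that a base is covered by no read at all is about e^(-c). The function names are illustrative only.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base of the genome."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(c: float) -> float:
    """Fraction of bases expected to be hit by no read at all.

    Assumes uniformly random read placement (Poisson approximation);
    cloning bias and repeats make real data worse than this.
    """
    return math.exp(-c)


if __name__ == "__main__":
    # Hypothetical numbers chosen purely for illustration.
    c = coverage(num_reads=10_000, read_length=800, genome_size=1_000_000)
    print(f"coverage: {c:.1f}x")
    print(f"expected uncovered fraction: {expected_uncovered_fraction(c):.6f}")
```

Under this model the uncovered fraction falls off exponentially with coverage but never reaches zero, which is why some gaps remain at any finite coverage.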
Cloning bias is a tendency of certain regions of the genome to be cloned less often than others during sequencing, which makes them less likely to be sequenced. This results in a lower expected read coverage in these regions, which magnifies the gap-producing effects of random chance. To get enough data to sequence these regions, a higher overall coverage is needed. The most prominent cause of cloning bias is AT-richness, which affects cloning rates in the Sanger sequencing process. In the 454 sequencing process, another kind of bias causes the lengths of homopolymers to be counted incorrectly.
There are many different sequencing methods, but none of them are perfect; they all call bases incorrectly from time to time. These mis-calls are called sequencing errors. Fortunately, the chance of a sequencing error is usually known and quantifiable. The quality score is a measure of confidence that a base has been called correctly.
A related problem is chimerism: the occurrence of paired production reads with an inaccurate link between them. Chimerism is a form of poor data quality; Arachne discards chimeric reads when it detects them.
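To make the notion of a quality score concrete, here is a minimal sketch that assumes the widely used Phred convention, in which a score Q corresponds to an error probability of 10^(-Q/10). The text above does not say which scale Arachne uses, so treat the scale and the function names as assumptions.

```python
import math
from typing import Iterable


def error_probability(quality: float) -> float:
    """Chance that a base call is wrong, assuming a Phred-scaled score."""
    return 10 ** (-quality / 10)


def quality_score(error_prob: float) -> float:
    """Phred-scaled quality for a given error probability (0 < p <= 1)."""
    return -10 * math.log10(error_prob)


def expected_errors(qualities: Iterable[float]) -> float:
    """Expected number of miscalled bases in a read with these scores."""
    return sum(error_probability(q) for q in qualities)
```

On this scale a score of 20 means roughly a 1-in-100 chance of a miscall, and a score of 30 roughly 1 in 1,000.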
- Main article: Repeat
Every genome has repeats: sequences that appear in more than one place. Repeats can differ dramatically from genome to genome in length, identity, and extent. While tame genomes contain almost no repeat sequences, in others repeats make up more than half of the genome, often as high-identity copies. One common form of redundancy is segmental duplication. Repeats are also likely to appear in highly AT-rich or GC-rich genomes, for the simple statistical reason that these genomes contain less variety.
Repeats (or clusters thereof) begin to pose a problem for the assembly process when they are at least as long as the smallest insert size and their copies are highly similar (99% identity and up). It is not clear whether some repetitive genomes can be assembled correctly at all; in any case, there are ways to at least ameliorate the problem.
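The two ideas above, sequences that occur in more than one place and the rule of thumb for when a repeat becomes troublesome, can be sketched as follows. This is an illustration only, not how Arachne detects repeats: it counts exact k-mers on the forward strand, and the function names and the `min_identity` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict


def repeated_kmers(sequence: str, k: int) -> Dict[str, int]:
    """Return every length-k substring that occurs more than once.

    Only exact matches on the forward strand are counted; reverse
    complements and near-identical copies are ignored in this sketch.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


def is_problematic_repeat(repeat_length: int,
                          identity: float,
                          smallest_insert_size: int,
                          min_identity: float = 0.99) -> bool:
    """Rule of thumb from the text: a repeat is troublesome when it is at
    least as long as the smallest insert and its copies match at 99% or more."""
    return repeat_length >= smallest_insert_size and identity >= min_identity
```

For instance, `repeated_kmers("ACGTTTACGT", 4)` reports that `ACGT` occurs twice, and a 5,000 bp repeat at 99.5% identity would be flagged as problematic when the smallest insert size is 4,000 bp.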
- Main article: Polymorphism
Most organisms are diploid, meaning they have two complete sets of chromosomes, one from the mother and one from the father. The two chromosomes in each pair are similar but, unless there is no variation in the gene pool, not identical. (Some species are effectively haploid; see mouse.) The reads in a library can come from either chromosome in a pair, and therefore they might not match up even if they are taken from the exact same spot in the genome. Such genomes are said to be polymorphic.
Polymorphic genomes can pose a major problem to genome assembly. Do two reads come from polymorphic versions of the same region, or do they come from different copies of a repeat? What should be done about indels (sequences that exist in one chromosome but not in the other)? What if some stretches are inverted relative to each other? When is it better to assemble both haplotypes together, and when should they be split apart into two separate assemblies?
Special Types of WGA
- Main article: Assisted assembly
Assisted assembly is the process of using an already-sequenced genome (a reference genome) as a guide for improving an assembly. It is a useful alternative to de novo assembly in the case of low-coverage assemblies. The main module that performs assisted assembly is AssistMain.
- Main article: Hybrid assembly
Sanger sequencing is just one method of obtaining snippets of DNA sequence. Other promising technologies are emerging, and you may find yourself with a mix of data from Sanger and other sequencing methods. Arachne allows you to build hybrid assemblies.