Arachne



Arachne is a toolkit developed for Whole Genome Shotgun Assembly. It consists of a comprehensive set of modules, including a central pipeline (Assemblez) that can be run on almost any genome to produce a draft assembly. Arachne's mandate explicitly includes accommodating difficult genomes with complications such as extreme size, repeats, and high polymorphism rates. To construct a reasonably well-connected assembly from such genomes, Arachne provides further tools that can be used after the main module pipeline.

The Arachne code package has been under continuous development since 2000. It began with the classic "overlap-layout-consensus" paradigm and has since developed into a vast collection of tools, implemented in numerous modules, to analyze, visualize and manipulate assemblies. New and improved algorithms are becoming available on a regular basis. [1]


Arachne's Paradigm

Arachne's approach to genome assembly is to maximize "consistency". Consistency is defined as

  • minimizing the base disagreements between overlapping reads, and
  • maximizing insert happiness: the two reads of an insert should be placed roughly as far apart as the library suggests, and should point toward each other
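
The two criteria above can be illustrated with a minimal sketch. This is not Arachne's code: the names PlacedRead, count_mismatches, and insert_is_happy, and the tolerance of three standard deviations, are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    """Hypothetical record of a read placed on a contig (illustration only)."""
    start: int      # leftmost coordinate of the read on the contig
    length: int     # number of bases in the read
    forward: bool   # True if the read lies on the forward strand


def count_mismatches(a: str, b: str, offset: int) -> int:
    """Count base disagreements where read b overlaps read a.

    Read b is assumed to start at position `offset` (>= 0) within read a.
    """
    overlap = min(len(a) - offset, len(b))
    return sum(1 for i in range(overlap) if a[offset + i] != b[i])


def insert_is_happy(r1: PlacedRead, r2: PlacedRead,
                    mean: float, sd: float, max_sds: float = 3.0) -> bool:
    """Return True if two paired reads are placed consistently with the library.

    `mean` and `sd` describe the library's insert size; `max_sds` is an
    assumed tolerance, not a value taken from Arachne.
    """
    left, right = (r1, r2) if r1.start <= r2.start else (r2, r1)
    # The reads must point toward each other: left forward, right reverse.
    if not (left.forward and not right.forward):
        return False
    # Distance from the start of the left read to the end of the right read.
    span = right.start + right.length - left.start
    return abs(span - mean) <= max_sds * sd
```

For instance, with a library of mean 4000 and standard deviation 400, a forward read at position 0 and a reverse read at position 3900 (both 100 bases long) span exactly 4000 bases and would count as happy under these assumptions.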

The number of reads, and of overlaps between them, rules out an exhaustive search; the computational requirements would be prohibitive. Instead, Arachne uses "greedy" algorithms:

  • combine reads, selecting overlaps very carefully;
  • use insert linking to disambiguate local branches when building contigs, considering only paths that are confirmed;
  • build scaffolds, requiring at least two links;
  • add reads that have not been used;
  • break contigs and scaffolds where there is evidence that there might be errors (some ambiguities can only be resolved by looking at the bigger picture);
  • remove and move reads and inserts.
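
As a rough illustration of the greedy idea, the sketch below accepts overlaps from best to worst score and keeps only contig pairs joined by at least two links. It is a toy under stated assumptions, not Arachne's algorithm: the function names, the (score, left, right) overlap tuples, and the decision to ignore read orientation and link distances are all simplifications introduced here.

```python
from collections import Counter


def greedy_layout(overlaps):
    """Chain reads into contigs by accepting the best-scoring overlaps first.

    `overlaps` is a list of (score, left_read, right_read) tuples, meaning the
    end of left_read overlaps the start of right_read (hypothetical format).
    Returns a list of contigs, each an ordered list of read ids.
    """
    reads = {r for _, a, b in overlaps for r in (a, b)}
    parent = {r: r for r in reads}  # union-find forest, used to avoid cycles
    right_of, left_of = {}, {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, a, b in sorted(overlaps, key=lambda o: o[0], reverse=True):
        if a in right_of or b in left_of:
            continue  # that end is already committed to a better overlap
        if find(a) == find(b):
            continue  # both reads are already in the same chain
        right_of[a], left_of[b] = b, a
        parent[find(a)] = find(b)

    contigs = []
    for r in sorted(reads):
        if r not in left_of:  # r is the leftmost read of its chain
            chain = [r]
            while chain[-1] in right_of:
                chain.append(right_of[chain[-1]])
            contigs.append(chain)
    return contigs


def supported_joins(links, min_links=2):
    """Return contig pairs connected by at least `min_links` insert links.

    `links` is a list of (contig, contig) pairs, one per linking insert.
    """
    counts = Counter(tuple(sorted(pair)) for pair in links)
    return sorted(pair for pair, n in counts.items() if n >= min_links)
```

In this toy, a choice once made is never revisited, which is why the later steps in the list (breaking contigs and scaffolds, moving reads and inserts) matter in practice.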

Arachne's algorithms are implemented in assembly modules. Each module performs one or more defined tasks: finding overlaps, building contigs, extending scaffolds, repairing consensus, closing contig gaps, and so forth. Each module tries to increase the overall consistency, but it is not guaranteed to find the global optimum. To get progressively closer, Arachne can run these steps in a set order and over several iterations, since each step may open up possibilities for other steps to find a better solution (e.g. scaffolding after breaking typically yields better connectivity). Script modules such as Assemblez run the assembly modules in a logical order.
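
The iteration described above can be sketched as a simple driver loop. This is a hypothetical illustration, not how Assemblez is written: run_pipeline, the module callables, the score function, and the max_rounds limit are all assumed names and parameters.

```python
def run_pipeline(assembly, modules, score, max_rounds=10):
    """Apply every module in order, repeating while the score improves.

    `modules` is a sequence of functions that take an assembly and return a
    new one; `score` maps an assembly to a number where higher means more
    consistent. Both are placeholders for this sketch.
    """
    best, best_score = assembly, score(assembly)
    for _ in range(max_rounds):
        candidate = best
        for module in modules:
            candidate = module(candidate)  # one pass through all steps
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break  # no further improvement; a local optimum, not a global one
        best, best_score = candidate, candidate_score
    return best
```

The loop stops at the first round that fails to improve the score, which mirrors the point made above: each pass may help, but nothing guarantees that the result is the global optimum.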

Arachne uses its own internal binary formats for data files. You cannot examine an assembly directly by looking at the files; you must instead use Arachne's text and visual output modules.

Download Arachne

Detailed instructions for download, installation, and compilation are available.


References

  1. Some of Arachne's algorithms are described in "ARACHNE: A Whole-Genome Shotgun Assembler", Genome Research, January 2002, and "Whole-Genome Sequence Assembly for Mammalian Genomes: ARACHNE 2", Genome Research, January 2003, though the current algorithms have diverged substantially from what was published.