Improving Assemblies

From ArachneWiki


After running Assemblez, you will have a decent draft assembly, but you are probably far from done. From here you can use Arachne's specialized modules to improve it.

Each genome is different and requires individual attention when being assembled. Rather than providing a "one size fits all" solution, Arachne lets you specifically address certain issues by running quite specialized modules -- and there are lots of modules. Chances are you will find something that comes very close to doing exactly what you want.

NOTE: After Assemblez has been run, the value of RUN must include the work subdirectory. If given RUN=run, Assemblez puts the assembly files in run/work, so any modules you call afterwards must specify RUN=run/work.
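A minimal shell sketch of this convention, assuming the original run name was run. The variable names are illustrative only, and any other arguments a module may require are omitted.

```shell
# Name that was passed as RUN= to Assemblez (taken from the note above).
ASSEMBLEZ_RUN=run

# After Assemblez, later modules must point at the work subdirectory.
POST_RUN="${ASSEMBLEZ_RUN}/work"

# Print the argument that would be given to subsequent modules.
echo "RUN=${POST_RUN}"
```

Keeping the value in one variable makes it harder to pass the pre-Assemblez run name to a later module by mistake.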


Improvement algorithms

This section lists only a few examples of sequences of operations that can be performed. For more details, please check the Algorithms section.

Repair false joins

Some genomes require more aggressive breaking than is done by default. The modules PositiveBreaker and NegativeBreaker do this. You can also run:

BreakIt COV_OFFSET=6000

Scaffolds do not extend

Hotspots of polymorphism and/or indels can prevent the scaffolder (Rebuilder) from making joins, even though there is enough evidence based on insert linking. To be more aggressive, run

Rebuilder use_sw_gap=True max_overlap_score=200000 max_errors_in_align=10000 end_stretch_in_align=24 
              max_gap_in_swgap=6000 max_indel_in_swgap=6000 (optional: keep_existing_supers=True)

Even more aggressive (especially with the option MIN_LINKS lowered) and only recommended for highly polymorphic genomes is


Specifically isolate contigs or supers

RecycleGarbage lets you manipulate assemblies in many ways, including removing contigs, reads or scaffolds, e.g.


will remove all but scaffold 6, whereas


removes scaffold 6 and keeps the rest.

Reads are still misplaced

Reads can still be in the wrong places and need to be re-arranged; this can be due to repeats or polymorphism. To first aggressively remove reads, run

IncompatiblePurge USE_REPEATS=False

which will remove both end reads if one of them has a high-quality difference with at least one other read.

CompatibleBinge or CompatibleBingeFast

will subsequently add reads if they can be placed consistently as pairs. The option NO_SNPS controls whether high-quality disagreements between reads are allowed (False) or not (True; for a haploid genome, you may want to disallow them).
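The purge-then-binge order can be sketched as a dry run that only prints the two commands. This assumes the assembly lives in run/work and uses only the options named on this page. Real invocations may need further arguments not shown here, and the variable names are hypothetical.

```shell
# Assumed run location, following the note on RUN after Assemblez.
RUN=run/work

# Step 1: aggressively remove suspicious read pairs.
PURGE_CMD="IncompatiblePurge RUN=${RUN} USE_REPEATS=False"

# Step 2: add reads back where pairs place consistently. NO_SNPS=True
# disallows high-quality disagreements (a choice suited to a haploid genome).
BINGE_CMD="CompatibleBinge RUN=${RUN} NO_SNPS=True"

# Dry run: print the commands in order instead of executing them.
printf '%s\n' "$PURGE_CMD" "$BINGE_CMD"
```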

In addition


will also place single reads (i.e. reads whose partner cannot be placed within the same scaffold) either at the ends of scaffolds or near contig gaps.

Consensus needs repair

Some modules introduce disagreements between the reads and the contig consensus, e.g. by moving reads between copies of a repeat. To update the consensus to reflect the correct read placements, run


followed by


Important note: make sure the input assembly contains the file mergedcontigs.superb. You can generate it by running FindGapDeviations.
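A small pre-flight check for this requirement might look as follows. It is a sketch only: the helper name has_superb and the assumption that the file sits directly in the run/work directory are illustrative, not taken from Arachne itself.

```shell
# Hypothetical helper: succeeds only if the given assembly directory
# contains the file mergedcontigs.superb.
has_superb() {
    [ -f "$1/mergedcontigs.superb" ]
}

# Assumed location of the assembly; override by setting ASSEMBLY_DIR.
ASSEMBLY_DIR="${ASSEMBLY_DIR:-run/work}"

if has_superb "$ASSEMBLY_DIR"; then
    echo "mergedcontigs.superb found in $ASSEMBLY_DIR"
else
    echo "mergedcontigs.superb missing in $ASSEMBLY_DIR; run FindGapDeviations first"
fi
```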

Alternatively, you can run


which is usually faster but still somewhat experimental.

Alignment tools

The Arachne code package includes several alignment tools, with varying goals and levels of refinement. Executable tools (available at the command line) include CmpSeq, QueryLookupTable, and PerfectLookup. Programmatic tools (available to C++ programs) include LocalAlign.

Generating Ace files

Ace files are the main input files for Consed, a tool for viewing assemblies by graphically showing the aligned reads on a contig-by-contig basis. The module CreateAceFile generates ace files from a finished assembly; you can also create them automatically by setting ACE=True and setting ACEDIR appropriately. See CreateAceFile.
