Hybrid assembly

From ArachneWiki

A hybrid assembly is an assembly in which the input reads come from different sequencing methods. The most common kind of read, ABI or Sanger reads, often forms a baseline to which other types of reads are added.

Arachne was designed as a Whole Genome Shotgun Assembler for a specific kind of data: paired end reads obtained from Sanger sequencing. With the arrival of new sequencing technologies, however, a number of its built-in assumptions no longer always hold. Vector trimming, for example, makes no sense for reads from a pyrosequencing machine, since no cloning is involved. Another factor is read pairing: Arachne relies heavily on linking information from paired end reads. If this information is not available, Arachne cannot be used to its full strength.


Sanger and 454

Early 454 reads were unpaired, shorter (100 or 200 bases), and less accurate (aggregate quality ~Q17) than Sanger reads. However, at high coverage (~20x) they could still yield reliable (~Q35) assemblies.

The Newbler module performs assembly of 454 reads. The short unpaired reads are intrinsically unable to disambiguate repeats, so the Newbler assembler attempts to detect repeat regions and output them as separate contigs. Thus assemblies of 454 data have poor connectivity for most genomes. It is natural to try to mix in ABI reads to improve connectivity.

We currently make hybrid Sanger/454 assemblies as follows:

  1. Assemble 454 reads with Newbler.
  2. Take the Newbler output and provide it to Arachne as additional "reads". It is necessary to modify quality scores for this step: Newbler gives unrealistic quality scores to its assemblies. Arachne must be set to accept reads longer than 1023 bases for this to work.
  3. Use Arachne to assemble the hybrid data set as normal.
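
The quality adjustment in step 2 can be sketched as follows. The source does not say how the scores are modified, so this is only an illustration under one assumption: that each per-base score of a Newbler contig is capped at a fixed ceiling. The function name `cap_quals` and the ceiling of 30 are hypothetical.

```python
def cap_quals(quals, cap=30):
    """Return a copy of per-base quality scores with each value limited to `cap`.

    Assumption: a simple ceiling is an acceptable way to tone down
    over-optimistic contig qualities; the real remapping may differ.
    """
    return [min(q, cap) for q in quals]


# Toy per-base scores for a Newbler contig used as a long "read".
contig_quals = [64, 64, 40, 12, 64]
print(cap_quals(contig_quals))  # [30, 30, 30, 12, 30]
```

Low scores pass through unchanged, so positions the foreign assembler already flagged as uncertain stay uncertain.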

Assembly strategy

"Foreign" contigs

How does one go about combining data sets? Pretending that non-Sanger reads are in fact Sanger reads, and stuffing them into the assembler, is not a viable option for a number of reasons: reads might be too short (Arachne likes long reads), base error rates might be too high (Arachne likes high quality), etc.

What's more, we can often assume that there is already some other assembler that knows better how to deal with these reads and their idiosyncrasies, and is in fact able to build better contigs. Instead of running contig layout and consensus from reads, we will start with the sequence produced by this other assembler and populate it with paired reads based on sequence alignments.

Populating contigs

We take our set of Sanger reads and try to assign a unique placement for each of them based on the foreign contigs. Taking into account repeats in the genome and the assumption that the contigs might not always be correct, we apply a number of different algorithms to place reads, interleaved with some scaffolding steps:

  1. Place both end reads in pairs within a contig based on sequence alignments between reads and contigs
  2. Place reads if they have a unique (and good) alignment
  3. Place reads at the ends of contigs
  4. Scaffolding, optionally break contigs.
  5. Recompute consensus.
  6. Patch gaps between contigs using Sanger reads.
  7. Repeat steps 1-4.
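
Step 2 of the list above (placing reads that have a unique, good alignment) can be illustrated with a small sketch. It is not Arachne's implementation. It assumes alignments arrive as `(read_id, contig_id, start, identity)` tuples, that "good" means identity at or above a hypothetical threshold, and that "unique" means exactly one alignment passes that threshold.

```python
from collections import defaultdict


def unique_placements(alignments, min_identity=0.98):
    """Map each read to its placement if exactly one good alignment exists.

    alignments: iterable of (read_id, contig_id, start, identity) tuples.
    Returns {read_id: (contig_id, start)}.
    """
    good = defaultdict(list)
    for read_id, contig_id, start, identity in alignments:
        if identity >= min_identity:  # keep only "good" alignments
            good[read_id].append((contig_id, start))
    # A read is placed only when its good alignment is unambiguous.
    return {r: hits[0] for r, hits in good.items() if len(hits) == 1}


alignments = [
    ("r1", "c1", 100, 0.99),  # one good hit: placed
    ("r2", "c1", 500, 0.99),  # two good hits: ambiguous, likely a repeat
    ("r2", "c2", 40, 0.99),
    ("r3", "c2", 10, 0.90),   # no good hit: left unplaced
]
print(unique_placements(alignments))  # {'r1': ('c1', 100)}
```

Reads left unplaced here remain candidates for the later steps, such as placement at contig ends and gap patching.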

One issue we have not addressed yet: the foreign contigs might have overcollapsed repeats (if the foreign assembler followed the "unitigging" principle, i.e. combining overlapping reads only up to the point where there is a branch point). A simple way around this is to cut up the foreign contigs into "fake readoids" and add them to the Sanger data set, pretending that they are unpartnered reads. This allows the patching steps to use these reads when trying to bridge gaps between contigs (doing this iteratively allows multiple placements of these reads, thus filling multiple gaps caused by different copies of the same repeat).
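
A minimal sketch of cutting a foreign contig into fake readoids is shown here. The source does not give the window length or overlap, so the `length` and `step` values are hypothetical, as is the function name `make_readoids`.

```python
def make_readoids(contig, length=800, step=400):
    """Cut a contig sequence into overlapping windows ("fake readoids").

    Assumption: fixed-size windows with a fixed step; the last window is
    shifted back so that it ends exactly at the end of the contig.
    """
    if not contig:
        return []
    if len(contig) <= length:
        return [contig]
    starts = list(range(0, len(contig) - length + 1, step))
    # Make sure the final window reaches the contig end.
    if starts[-1] + length < len(contig):
        starts.append(len(contig) - length)
    return [contig[s:s + length] for s in starts]


contig = "ACGT" * 5  # toy 20-base contig
for readoid in make_readoids(contig, length=8, step=5):
    print(readoid)
```

In practice the quality scores would be cut with the same window coordinates, so that each readoid keeps the scores of the bases it covers.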

Details: how to build hybrid assemblies


HybridAssemble is a script that performs the following steps:

  1. Takes foreign contig bases (and quality scores) and adds fake readoids to the Sanger data set. The module RemapQuals might be useful.
  2. Generates all pre-processed input by running Assemblez with big_reads=True correct1_passes=0 STOP=TagRepeatReads.
  3. Generates an assembly from the foreign contigs by running FormMarkers.
  4. Performs a run of placing reads and scaffolding by calling PlopDownReads, Rebuilder, CompatibleBinge, PlaceBacEnds and PlaceUnplaced.
  5. Runs regular assembly processing by calling Assemblez with big_reads=True SANITIZE_CONTIGS=False START=FindGapDeviations BREAK_AND_REBUILD=False POSITIVE_BREAK=False.

Setting the parameter FULL_EXTERNAL_COVERAGE=False tells HybridAssemble to assume that the foreign contigs only partially cover the genome. In that case, HybridAssemble runs layout on all reads not placed on foreign contigs and merges the resulting contigs into the assembly.

Life cycle of a hybrid contig

To illustrate the process, we took snapshots (using DisplaySupercontig) of a contig that undergoes read placements at different stages:

Life cycle of a hybrid contig.