HybridAssemble

From ArachneWiki

HybridAssemble
Function: Script
Phase: Assembly
Standard CLAs: PRE, DATA, RUN, GDB, NO_HEADER
Special CLAs: DATA_OUT, EXT_FASTB, EXT_FASTA, EXT_QUAL, EXT_QUALB, EXCLUDE_FILE, QUALB_SCALE, MIN_CONTIG, end_floor, explicit_trim, fakeReadLen, fakeReadShift, fakeReadInsert, fakeReadHead, ADD_FAKE_READS, PAIR_FAKE_READS, PRE_ASSEMBLE, MID_ASSEMBLE, POST_ASSEMBLE, FULL_EXTERNAL_COVERAGE, NON_OVERLAPPING, recycle_bad_contigs, ACE, VERBOSE_LOAD
Source location: ARACHNE_DIR/assemble

HybridAssemble is a module that runs a pipeline for hybrid assembly: it combines data sets from different sources, namely paired Sanger reads and an assembly built using alternate sequencing technologies. HybridAssemble calls Assemblez to perform all necessary pre-processing steps (read trimming, read overlaps, etc.) but bypasses the contig layout stage entirely. Instead, the alternate assembly serves as a template for contigs, which are then populated with the paired Sanger reads.

Input

In order to compute correct consensus quality scores, HybridAssemble generates non-overlapping, "fake" reads (readoids) from the alternate assembly and adds data files to the traceinfo, fasta and qual directories. These reads will be used in the assembly process to cover regions that are not represented in the Sanger data set, so that the consensus quality scores are based on the original alternate assembly quality scores.

NOTE: Given the alternate assembly and the Sanger reads, HybridAssemble runs the full assembly process from beginning to end. There is no need to cut the alternate assembly into fake reads manually.
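
HybridAssemble performs the readoid generation itself, so the following Python sketch is purely illustrative of the tiling idea described above. It uses the documented defaults for fakeReadLen, fakeReadShift, NON_OVERLAPPING and MIN_CONTIG, but the function name, the handling of the final partial readoid, and the treatment of short contigs are assumptions, not the actual implementation.

```python
def tile_readoids(contig, read_len=800, shift=100,
                  non_overlapping=True, min_contig=1000):
    """Cut one alternate-assembly contig into (start, sequence) readoids."""
    # Assumption: contigs shorter than MIN_CONTIG are skipped entirely.
    if len(contig) < min_contig:
        return []
    # NON_OVERLAPPING=True sets fakeReadShift to fakeReadLen.
    step = read_len if non_overlapping else shift
    readoids = []
    for start in range(0, len(contig), step):
        end = min(start + read_len, len(contig))
        # Assumption: the last readoid is simply truncated at the contig end.
        readoids.append((start, contig[start:end]))
        if end == len(contig):
            break
    return readoids


contig = "ACGT" * 500  # hypothetical 2,000-base contig
adjacent = tile_readoids(contig)
overlapping = tile_readoids(contig, non_overlapping=False)
print([start for start, _ in adjacent])
print(len(overlapping))
```

With the defaults, the readoids sit side by side and together reproduce the contig exactly; with NON_OVERLAPPING=False, consecutive readoids start fakeReadShift bases apart and overlap.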


Command-Line Arguments

Argument name | Argument type | Default value | Meaning
DATA_OUT | String | [None] | Optional output directory for readoids.
EXT_FASTB | Filename | [None] | Alternate assembly (fastb).
EXT_FASTA | Filename | [None] | Alternate assembly (fasta). Either EXT_FASTA or EXT_FASTB needs to be specified.
EXT_QUAL | Filename | [None] | Alternate assembly quality scores.
EXT_QUALB | Filename | [None] | Alternate assembly quality scores as qualb.
EXCLUDE_FILE | Filename | [None] | Reads to exclude. This argument is passed directly to Assemblez when set; otherwise Assemblez uses its default, "reads.to_exclude".
QUALB_SCALE | Double | 1 | Scale down quality scores.
MIN_CONTIG | Unsigned Integer | 1000 | Minimum contig length to use from alternate assembly.
end_floor | String | [None] | Reads are trimmed until no quality values below this number exist within 12 bases of either end. This argument is passed directly to Assemblez when set; otherwise Assemblez uses its default of 10.
explicit_trim | Bool | False | Use the clip_vector_left and clip_vector_right fields from the XML_ancillary_files for read trimming.
fakeReadLen | Unsigned Integer | 800 | If ADD_FAKE_READS=True, this determines the readoid length.
fakeReadShift | Unsigned Integer | 100 | The distance between the start of each readoid along the alternate assembly. If NON_OVERLAPPING=True, this argument is set to the same value as fakeReadLen.
fakeReadInsert | Unsigned Integer | 2000 | If PAIR_FAKE_READS=True, this determines the insert size of the readoid pairs.
fakeReadHead | String | 454 | The readoid name and file headers.
ADD_FAKE_READS | Bool | True | Add artificial readoids.
PAIR_FAKE_READS | Bool | False | If True, HybridAssemble creates readoid pairs. The paired readoid library is named "<fakeReadHead>_lib.p" and the unpaired library "<fakeReadHead>_lib.u". Even when True, an unpaired readoid library is still generated, both from alternate assembly sequences shorter than fakeReadInsert and from artifacts at contig ends. Pairing tends to vastly improve the hybrid assembly, so setting this to True is recommended. Note that it changes the meaning of NON_OVERLAPPING. If no other readoid options are modified, the defaults should give ~2X coverage.
PRE_ASSEMBLE | Bool | True | Run all pre-processing.
MID_ASSEMBLE | Bool | True | Populate alternate assembly with Sanger reads.
POST_ASSEMBLE | Bool | True | Post-processing: patch gaps, rebuild consensus, etc.
FULL_EXTERNAL_COVERAGE | Bool | True | If False, contig layout is run to recover regions not present in the alternate assembly.
NON_OVERLAPPING | Bool | True | Sets fakeReadShift to fakeReadLen when True. When PAIR_FAKE_READS=False, the effect is the generation of adjacent, non-overlapping readoids.
recycle_bad_contigs | Bool | True | Keep unpopulated stand-alone contigs.
ACE | Bool | False | Generate ACE files.
VERBOSE_LOAD | Bool | True | Stub, currently does nothing.
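
The end_floor entry above states a condition on the trimmed read rather than a procedure. As a rough illustration, the Python sketch below shows one way to reach that condition by removing one base at a time from each end. The function name and the base-by-base strategy are assumptions; this is not Assemblez's actual trimming code.

```python
def trim_to_end_floor(quals, end_floor=10, window=12):
    """Return (left, right) so that quals[left:right] has no value below
    end_floor within `window` bases of either end."""
    left, right = 0, len(quals)
    # Trim the left end while its first `window` bases contain a low score.
    while left < right and min(quals[left:min(left + window, right)]) < end_floor:
        left += 1
    # Trim the right end while its last `window` bases contain a low score.
    while left < right and min(quals[max(right - window, left):right]) < end_floor:
        right -= 1
    return left, right


# Hypothetical quality scores: low values near both ends, good in the middle.
quals = [3, 30, 4] + [30] * 20 + [8, 30]
left, right = trim_to_end_floor(quals)
print(left, right, quals[left:right])
```

A read whose scores are all below the floor is trimmed to nothing, in which case the function returns an empty interval (left equals right).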


How to run a hybrid assembly

  1. Generate alternate assembly.
  2. Prepare standard Assemblez input files. Any desired Assemblez command-line arguments not settable within HybridAssemble may be set in defaults.Assemblez.
  3. Assemble. Set up the usual project from the Sanger data and run HybridAssemble, providing the alternate assembly via EXT_FASTA or EXT_FASTB. The rest runs like a regular assembly.
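
As a hedged illustration of step 3, the Python sketch below assembles a HybridAssemble command line from the arguments documented above. It assumes arguments are passed as NAME=value tokens (the notation used in the argument table) and that the executable is named HybridAssemble and is on the PATH. All paths, the project name, and the run name are hypothetical placeholders.

```python
import shlex


def hybrid_assemble_command(pre, data, run, ext_fasta, ext_qual=None, **extra):
    """Build the argument list for a HybridAssemble run (not executed here)."""
    args = {"PRE": pre, "DATA": data, "RUN": run, "EXT_FASTA": ext_fasta}
    if ext_qual is not None:
        args["EXT_QUAL"] = ext_qual
    # Any further documented arguments, e.g. PAIR_FAKE_READS=True.
    args.update(extra)
    return ["HybridAssemble"] + [f"{name}={value}" for name, value in args.items()]


cmd = hybrid_assemble_command(
    pre="/data/arachne",             # hypothetical PRE directory
    data="projects/mygenome",        # hypothetical DATA directory
    run="hybrid1",                   # hypothetical RUN name
    ext_fasta="alt_assembly.fasta",  # hypothetical alternate assembly
    ext_qual="alt_assembly.qual",    # hypothetical quality scores
    PAIR_FAKE_READS=True,
)
print(" ".join(shlex.quote(token) for token in cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` once the paths point at real data. Whether EXT_FASTA and EXT_QUAL are resolved relative to PRE/DATA is not stated here, so check that before relying on relative paths.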

How to generate readoids only

  1. Generate alternate assembly.
  2. Set PRE_ASSEMBLE, MID_ASSEMBLE, and POST_ASSEMBLE to False.
  3. Set DATA_OUT to prevent the readoids from ending up in the standard assembly directories.
  4. (Optional) Customize fakeReadHead.
  5. Set other options as necessary to obtain the desired coverage and readoid sizes. An easy rule of thumb with paired readoids is to approximate coverage as fakeReadLen × 2 ÷ fakeReadShift, though this breaks down if the average sequence length of the alternate assembly is ≤ 2 × fakeReadInsert.
  6. Run.
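
The rule of thumb in step 5 can be checked with a few lines of Python before launching a run. This is a minimal sketch: the function names are made up for illustration, and the validity check simply encodes the stated caveat about average sequence length.

```python
def estimate_paired_coverage(fake_read_len=800, fake_read_shift=100,
                             non_overlapping=True):
    """Approximate paired-readoid coverage as fakeReadLen * 2 / fakeReadShift."""
    # NON_OVERLAPPING=True sets fakeReadShift to fakeReadLen.
    shift = fake_read_len if non_overlapping else fake_read_shift
    return 2 * fake_read_len / shift


def rule_of_thumb_applies(contig_lengths, fake_read_insert=2000):
    """The estimate breaks down if the mean length is <= 2 * fakeReadInsert."""
    mean_length = sum(contig_lengths) / len(contig_lengths)
    return mean_length > 2 * fake_read_insert


print(estimate_paired_coverage())                       # documented defaults
print(estimate_paired_coverage(non_overlapping=False))  # shift of 100 bases
print(rule_of_thumb_applies([10000, 6000]))             # hypothetical contig lengths
```

With the documented defaults the estimate comes out at 2, which matches the ~2X coverage quoted for PAIR_FAKE_READS.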
