HybridAssemble is a module that runs a hybrid assembly pipeline. It assembles data sets from different sources, namely a combination of paired Sanger reads and an assembly built with an alternate sequencing technology. HybridAssemble calls Assemblez to perform all necessary pre-processing steps (read trimming, read overlaps, etc.) but bypasses the contig layout stage entirely. Instead, the alternate assembly serves as a template for the contigs, which are then populated with the paired Sanger reads.
- Standard Arachne Input, including Sanger read data (fasta, qual, xml files, etc.).
- An assembly from an alternate sequencing technology (e.g. 454, Illumina), generated by an appropriate assembler (e.g. Newbler, ABySS), supplied as fasta and qual files.
To compute correct consensus quality scores, HybridAssemble generates non-overlapping "fake" reads (readoids) from the alternate assembly and adds their data files to the traceinfo, fasta and qual directories. The readoids are used during assembly to cover regions that are not represented in the Sanger data set, so that the consensus quality scores in those regions are based on the original quality scores of the alternate assembly.
NOTE: Given the alternate assembly and the Sanger reads, HybridAssemble runs the full assembly process from beginning to end. There is no need to cut the alternate assembly into fake reads manually.
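HybridAssemble does this cutting itself, so the following Python sketch is only meant to illustrate what adjacent, non-overlapping readoids look like. It is not HybridAssemble's code: the function names are hypothetical, coordinates are 0-based half-open intervals, and truncating the last readoid at the contig end is an assumption, since the exact end handling is not documented here.

```python
from typing import Dict, List, Tuple


def tile_readoids(contig_len: int, read_len: int = 800,
                  shift: int = 800) -> List[Tuple[int, int]]:
    """Return half-open (start, end) readoid intervals along one contig.

    Assumption: the final readoid is simply truncated at the contig end.
    """
    if read_len <= 0 or shift <= 0:
        raise ValueError("read_len and shift must be positive")
    intervals = []
    start = 0
    while start < contig_len:
        end = min(start + read_len, contig_len)
        intervals.append((start, end))
        if end == contig_len:
            break  # the contig end has been reached
        start += shift
    return intervals


def readoids_for_assembly(contig_lengths: List[int], min_contig: int = 1000,
                          **kwargs) -> Dict[int, List[Tuple[int, int]]]:
    """Tile every contig of at least min_contig bases (mirrors MIN_CONTIG)."""
    return {i: tile_readoids(n, **kwargs)
            for i, n in enumerate(contig_lengths) if n >= min_contig}


# shift == read_len gives adjacent, non-overlapping readoids
print(tile_readoids(2000))
```

With `shift` smaller than `read_len`, the same loop produces overlapping readoids, which corresponds to setting fakeReadShift below fakeReadLen.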
|Argument name||Argument type||Default value||Meaning|
|DATA_OUT||String||[None]||Optional output directory for readoids.|
|EXT_FASTB||Filename||[None]||Alternate assembly (fastb).|
|EXT_FASTA||Filename||[None]||Alternate assembly (fasta). Either EXT_FASTA or EXT_FASTB needs to be specified.|
|EXT_QUAL||Filename||[None]||Alternate assembly quality scores.|
|EXT_QUALB||Filename||[None]||Alternate assembly quality scores as qualb.|
|EXCLUDE_FILE||Filename||[None]||Reads to exclude. This argument is passed directly to Assemblez when set, otherwise Assemblez will use its default: "reads.to_exclude".|
|QUALB_SCALE||Double||1||Scale down quality scores.|
|MIN_CONTIG||Unsigned Integer||1000||Minimum contig length to use from alternate assembly.|
|end_floor||String||[None]||Reads are trimmed until no quality values below this number exist within 12 bases of either end. This argument is passed directly to Assemblez when set, otherwise Assemblez will use its default of 10.|
|explicit_trim||Bool||False||Use the clip_vector_left and clip_vector_right fields from the XML_ancillary_files for read trimming.|
|fakeReadLen||Unsigned Integer||800||If ADD_FAKE_READS=True, this determines the readoid length.|
|fakeReadShift||Unsigned Integer||100||The distance between the start of each readoid along the alternate assembly. If NON_OVERLAPPING=True, this argument is set to the same value as fakeReadLen.|
|fakeReadInsert||Unsigned Integer||2000||If PAIR_FAKE_READS=True, this determines the insert size of the readoid pairs.|
|fakeReadHead||String||454||The readoid name and file headers.|
|ADD_FAKE_READS||Bool||True||Add artificial readoids.|
|PAIR_FAKE_READS||Bool||False||If True, HybridAssemble creates readoid pairs. The paired readoid library is named "<fakeReadHead>_lib.p" and the unpaired library is named "<fakeReadHead>_lib.u". An unpaired library is still generated when this is True, because alternate assembly sequences shorter than fakeReadInsert cannot hold a pair and because unpaired readoids arise as artifacts at contig ends. Setting this to True tends to improve the hybrid assembly substantially and is recommended. Note that it also changes the meaning of NON_OVERLAPPING. If no other readoid options are modified, the defaults should give ~2X coverage.|
|PRE_ASSEMBLE||Bool||True||Run all pre-processing.|
|MID_ASSEMBLE||Bool||True||Populate alternate assembly with Sanger reads.|
|POST_ASSEMBLE||Bool||True||Post-processing: Patch gaps, rebuild consensus, etc.|
|FULL_EXTERNAL_COVERAGE||Bool||True||If False, contig layout is run to recover regions not present in the alternate assembly.|
|NON_OVERLAPPING||Bool||True||Sets fakeReadShift to fakeReadLen when True. When PAIR_FAKE_READS=False, the effect is the generation of adjacent, non-overlapping readoids.|
|recycle_bad_contigs||Bool||True||Keep unpopulated stand-alone contigs.|
|ACE||Bool||False||Generate ACE files.|
|VERBOSE_LOAD||Bool||True||Stub, currently does nothing.|
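Two interactions in the table are easy to miss: one of EXT_FASTA or EXT_FASTB must be given, and NON_OVERLAPPING=True overrides fakeReadShift with fakeReadLen. The Python sketch below restates those two rules using the defaults listed above. The function `resolve_readoid_options` is hypothetical and only models the documented behaviour; it is not part of Arachne.

```python
# Defaults copied from the argument table; None stands for "[None]".
DEFAULTS = {
    "EXT_FASTA": None,
    "EXT_FASTB": None,
    "MIN_CONTIG": 1000,
    "fakeReadLen": 800,
    "fakeReadShift": 100,
    "fakeReadInsert": 2000,
    "fakeReadHead": "454",
    "ADD_FAKE_READS": True,
    "PAIR_FAKE_READS": False,
    "NON_OVERLAPPING": True,
}


def resolve_readoid_options(user_args: dict) -> dict:
    """Merge user arguments with the defaults and apply the documented rules."""
    opts = {**DEFAULTS, **user_args}
    # Rule 1: the alternate assembly must be supplied as fasta or fastb.
    if opts["EXT_FASTA"] is None and opts["EXT_FASTB"] is None:
        raise ValueError("Either EXT_FASTA or EXT_FASTB needs to be specified")
    # Rule 2: NON_OVERLAPPING=True sets fakeReadShift to fakeReadLen.
    if opts["NON_OVERLAPPING"]:
        opts["fakeReadShift"] = opts["fakeReadLen"]
    return opts


# "alt.fasta" is a placeholder file name.
opts = resolve_readoid_options({"EXT_FASTA": "alt.fasta"})
print(opts["fakeReadShift"])
```

Note that with the default NON_OVERLAPPING=True, the default fakeReadShift of 100 only takes effect once NON_OVERLAPPING is set to False.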
How to run a hybrid assembly
- Generate alternate assembly.
- Prepare standard Assemblez input files. Any desired Assemblez command-line arguments not settable within HybridAssemble may be set in defaults.Assemblez.
- Assemble. Set up the usual project from the Sanger data and run HybridAssemble, providing the alternate assembly. The rest runs like a regular assembly.
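As a rough illustration of the last step, the Python sketch below assembles a HybridAssemble command line. It rests on several assumptions: arguments are passed in the `NAME=value` form used in the table descriptions, the executable is called `HybridAssemble` and is on the PATH, and the file names are placeholders. The standard project arguments that any Arachne run needs are deliberately left out, and the helper `hybrid_assemble_command` is hypothetical.

```python
import shlex
from typing import List


def hybrid_assemble_command(ext_fasta: str, ext_qual: str, **extra) -> List[str]:
    """Build an argument vector; every option becomes one NAME=value token."""
    args = {"EXT_FASTA": ext_fasta, "EXT_QUAL": ext_qual, **extra}
    return ["HybridAssemble"] + [f"{name}={value}" for name, value in args.items()]


# Placeholder file names; PAIR_FAKE_READS=True is the setting the table recommends.
cmd = hybrid_assemble_command("alt.fasta", "alt.qual", PAIR_FAKE_READS=True)
print(shlex.join(cmd))

# To actually launch the run (assumes the executable is installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.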
How to generate readoids only
- Generate alternate assembly.
- Set PRE_ASSEMBLE, MID_ASSEMBLE, and POST_ASSEMBLE to False.
- Set DATA_OUT to prevent the readoids from ending up in the standard assembly directories.
- (Optional) Customize fakeReadHead.
- Set other options as necessary to obtain the desired coverage and sizes. An easy rule of thumb with paired readoids is to approximate coverage as fakeReadLen × 2 ÷ fakeReadShift, though this breaks down if the average sequence size of the alternate assembly is ≤ 2X fakeReadInsert.
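The readoid-only settings and the coverage rule of thumb can be written down as a short Python sketch. All three helper names are hypothetical, the `NAME=value` form follows the notation used in the argument table, and the output directory is a placeholder. The default shift of 800 reflects NON_OVERLAPPING=True, which sets fakeReadShift to fakeReadLen.

```python
from typing import List


def readoid_only_args(data_out: str, head: str = "454") -> List[str]:
    """Arguments that switch off all assembly stages and redirect the readoids."""
    return [
        "PRE_ASSEMBLE=False",
        "MID_ASSEMBLE=False",
        "POST_ASSEMBLE=False",
        f"DATA_OUT={data_out}",
        f"fakeReadHead={head}",
    ]


def estimate_paired_coverage(fake_read_len: int = 800,
                             fake_read_shift: int = 800) -> float:
    """Rule of thumb for paired readoids: fakeReadLen * 2 / fakeReadShift."""
    return 2 * fake_read_len / fake_read_shift


def rule_of_thumb_applies(mean_contig_len: float,
                          fake_read_insert: int = 2000) -> bool:
    """The estimate breaks down when the mean contig is <= 2 * fakeReadInsert."""
    return mean_contig_len > 2 * fake_read_insert


print(readoid_only_args("readoids_out"))
print(estimate_paired_coverage())          # defaults, shift == length
print(estimate_paired_coverage(800, 100))  # NON_OVERLAPPING=False, default shift
print(rule_of_thumb_applies(3500))
```

The first estimate reproduces the ~2X coverage quoted for the defaults. Lowering fakeReadShift raises the estimated coverage proportionally.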