Assemblez

From ArachneWiki

Revision as of 17:44, 2 May 2008 by JoshuaBurton (Talk | contribs)
For the deprecated module, see Assemble.

Assemblez is a script module that performs the central tasks of creating a de novo assembly. Assemblez calls a long pipeline of modules, which may be split into four phases:

  1. Data verification: Checking the input data for integrity. If there are problems (mostly caused by missing or misplaced information in the configuration files), Assemblez will give informative error messages with instructions on how to correct the problems.
  2. Pre-processing: Preparing the data for use. This includes setting up binary files for faster access (e.g. the read bases and quality scores) as well as error correction, computing overlaps between the reads, etc.
  3. Assembly process: The arrangement of reads into contigs and supercontigs. This starts with the layout (the first attempt to create contigs from reads) and entails various rounds of scaffolding (including scaffold and contig breaking), read moving, and gap closing.
  4. Output: A small number of output modules, which create files describing the draft final assembly.

Assemblez provides a convenient way to automatically turn an input data set into a draft final assembly. This assembly is a first effort; depending on the difficulty the genome poses, it might not be the optimal assembly, or even close to it. However, it always provides an excellent starting point for analysis and improvement.

A note to the monolingual: Assemblez is French for "assemble", in the imperative mood. When you type Assemblez, you are commanding your computer: "Assemblez!" It is pronounced approximately "ah-sahm-blay".

Syntax

To run Assemblez, type

Assemblez PRE=pre DATA=data RUN=run

with appropriate values for the PRE, DATA, and RUN directories. All intermediate data and assemblies will be located in

PRE/DATA/RUN/work

and subdirectories thereof.

To preview all modules that will be executed, run Assemblez with the option NOGO=True.
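
As a rough illustration of the syntax above, the following Python sketch assembles an Assemblez argument list and computes the matching work directory. The helper names (work_dir, assemblez_command) and the example paths are hypothetical; only the PRE/DATA/RUN arguments, the NOGO option, and the PRE/DATA/RUN/work layout come from this page.

```python
from pathlib import PurePosixPath


def work_dir(pre, data, run):
    """Return PRE/DATA/RUN/work, where intermediate data and assemblies live."""
    return PurePosixPath(pre) / data / run / "work"


def assemblez_command(pre, data, run, **options):
    """Build an Assemblez argument list in NAME=value form.

    Python booleans are rendered as True/False, matching the page's notation.
    """
    args = ["Assemblez", f"PRE={pre}", f"DATA={data}", f"RUN={run}"]
    for name, value in options.items():
        args.append(f"{name}={value}")
    return args


# Hypothetical directories, used for illustration only.
preview = assemblez_command("/seq/assemblies", "projects/mygenome", "run1", NOGO=True)
print(" ".join(preview))
print(work_dir("/seq/assemblies", "projects/mygenome", "run1"))
```

The resulting list could be handed to subprocess.run on a machine where Assemblez is installed; with NOGO=True it should only list the modules rather than execute them.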

Assemblez can also run only a subset of the assembly process: specify the options START=module1 and STOP=module2, e.g. "Assemblez START=Assemblator STOP=Rebuilder.2". Because many modules are executed several times, appending ".x" to a module name identifies the x-th instance in which that module is called.
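
The ".x" convention can be pictured with a small Python sketch that picks a START/STOP range out of a module list such as the one printed by NOGO=True. This is not Assemblez's own code. It assumes that a bare module name means the first instance and that the STOP module itself is still run; the module order in the example is invented for illustration.

```python
def resolve_instance(modules, spec):
    """Return the index in `modules` that `spec` refers to.

    `spec` is "Name" (assumed to mean the first call) or "Name.x"
    (the x-th call of that module, counting from 1).
    """
    name, dot, suffix = spec.rpartition(".")
    if dot and suffix.isdigit():
        wanted = int(suffix)
    else:
        name, wanted = spec, 1
    seen = 0
    for index, module in enumerate(modules):
        if module == name:
            seen += 1
            if seen == wanted:
                return index
    raise ValueError(f"instance {wanted} of module {name!r} not found")


def select_range(modules, start=None, stop=None):
    """Return the slice of `modules` from START to STOP (assumed inclusive)."""
    first = resolve_instance(modules, start) if start else 0
    last = resolve_instance(modules, stop) if stop else len(modules) - 1
    return modules[first:last + 1]


# Illustrative order only; the real pipeline is longer and differs.
plan = ["Assemblator", "Rebuilder", "FindGapDeviations", "Rebuilder", "FindGapDeviations"]
print(select_range(plan, start="Assemblator", stop="Rebuilder.2"))
```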

After the initial layout (i.e. contig build), each module takes a full assembly in Arachne-internal format as input and creates a full assembly as output. The individual assemblies are stored in separate subdirectories in your RUN/work directory, whose names by default consist of the prefix "stage", a number, and the name of the module that generated the assembly (e.g. "stage01.00_Rebuilder"). Since Assemblez runs some modules in big loops, you will also see directories called "stage_plus_plus04.50_Rebuilder", meaning that this is the third iteration of running this particular module in this particular context. You can analyze, visualize, and convert each assembly into human-readable format. Also, from any stage on, you can create your own sequence of modules for specific improvements.
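
To make the naming scheme concrete, here is a minimal Python sketch that splits a stage directory name into its parts. The regular expression is inferred from the two example names given above and is an assumption, not a documented format. In particular, it treats everything before the number (including any "_plus" suffixes) as the prefix.

```python
import re

# Assumed pattern: <prefix><NN.NN>_<module>, inferred from the examples on this page.
STAGE_DIR = re.compile(r"^(?P<prefix>.+?)(?P<number>\d+\.\d+)_(?P<module>.+)$")


def parse_stage_dir(name):
    """Return a dict with prefix, number, and module, or None if `name` does not match."""
    match = STAGE_DIR.match(name)
    return match.groupdict() if match else None


for name in ("stage01.00_Rebuilder", "stage_plus_plus04.50_Rebuilder", "stage_final_v1"):
    print(name, "->", parse_stage_dir(name))
```

Names that do not follow the numbered pattern, such as the default final subdirectory stage_final_v1, yield None and would need to be handled separately.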

Many modules in Arachne are not included in Assemblez because they are not part of the standard assembly process. These modules tend to perform more specialized functions. They typically require a DATA, RUN/work, SUBDIR, and possibly an OUTDIR.

Tip: some modules do not generate the file mergedcontigs.superb, but some other modules require it to be there. If this file is missing, you can generate it by running the module FindGapDeviations on the subdirectory.
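
Before running a module that needs mergedcontigs.superb, it may help to check which assembly subdirectories lack the file. The Python sketch below does only that check. The function name is hypothetical, and the demo builds a throwaway directory with made-up stage names rather than touching a real RUN/work directory.

```python
import tempfile
from pathlib import Path

SUPERB = "mergedcontigs.superb"


def subdirs_missing_superb(work_dir):
    """Return the sorted names of subdirectories of work_dir lacking mergedcontigs.superb."""
    return sorted(
        entry.name
        for entry in Path(work_dir).iterdir()
        if entry.is_dir() and not (entry / SUPERB).is_file()
    )


# Demo on a temporary directory; the stage names are invented.
with tempfile.TemporaryDirectory() as work:
    (Path(work) / "stage01.00_Rebuilder").mkdir()
    (Path(work) / "stage02.00_Rebuilder").mkdir()
    (Path(work) / "stage01.00_Rebuilder" / SUPERB).touch()
    demo_missing = subdirs_missing_superb(work)
    print(demo_missing)
```

Each subdirectory reported this way is a candidate for running FindGapDeviations, as described in the tip above.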


Parameters

Assemblez provides the following command-line arguments (and defaults):

use this defaults file.

  • string START

start with this module.

  • string STOP

stop at that module.

  • string OMIT1, OMIT2, OMIT3

omit running these modules.

  • string TO
  • string FROM
  • True/False FORCE False
  • True/False PROMPT False
  • True/False NOGO False

Instead of executing anything, prints the list of modules to be executed to stdout.

  • True/False SOURCE_ARCHIVE True
  • True/False FINAL_BACKUP False
  • True/False FORCE_VERSION False

Forces it to run even on unsupported versions of gcc/the OS.

  • True/False ACE False
  • True/False CONFIRM_READ_LOCATIONS True
  • True/False DO_NQS_ALIGNS True

NQSAligns is a required step for standard assemblies, but can be skipped for highly customized assemblies, since it takes a long time to generate.

  • string LOGFILE assemblez.log

Name of log file (useful if multiple instances of Assemblez are run on one project)

  • True/False REQUIRE_TAG False
  • string TAG
  • string ASSERTED_TAG
  • True/False BUILD True
  • unsigned int MAX_SECONDS_ALLOWED 0
  • unsigned int MAX_MEMORY_ALLOWED 0
  • True/False BREAK_AND_REBUILD True

Breaks scaffolds and rebuilds them using long links first and short links second.

  • True/False REMOVE_EXCESS_ALIGNS False

This step removes overlaps not confirmed by insert linking. Experimental.

  • string SUBDIR_PREFIX stage

Prefix for intermediate assemblies.

  • string FINAL_SUBDIR stage_final_v1

Location of the final assembly.

  • True/False ADD_MISSING True
  • True/False COMP_LIB_STATS True

Re-estimate insert sizes and deviations based on the assembly.

  • True/False SW_GAP_STEP False

Use a more aggressive step to extend scaffolds (useful for polymorphic genomes).

  • True/False FINISH_ASSEMBLY True

Whether to run all the way through to stage_final_v1.

  • True/False SANITIZE_CONTIGS True

Break contigs at positions left without read coverage after reads were removed.

...and many more that we haven't had the time to document yet...

  • True/False POSITIVE_BREAK_AT_END False
  • True/False POSITIVE_BREAK True
  • True/False FIRST_STAGES_ONLY False
  • True/False KMER_KONSENSUS False
  • True/False CLEAN_ASSEMBLY False
  • True/False THREE_KILLER_PATCHES True
  • True/False REBUILD_AFTER_PATCHING True
  • True/False REMOVE_INTERMEDIATES False
  • True/False BINGE_AND_PURGE False
  • True/False BINGE_AND_PURGE_2HAP False
  • True/False FAST_BINGE True
  • True/False STREAMLINE False
  • True/False USE_CLASSIC_LOCFIX False
  • True/False EXTERNAL_OVERLAPS False
  • string maxcliq1 300
  • string aggressive_correction False
  • string filter_cutoff 18.5
  • string high_qual_cutoff 19
  • string very_high_qual_cutoff 30
  • string high_qual_pbms_rate 75
  • string min_overlap 10
  • string use_finishing True
  • string use_transposon_links True
  • string parse_wibr_read_names False
  • string config_file reads_config.xml
  • string exclusion_file reads.to_exclude
  • string filter_evaluation True
  • string correct1_passes 3
  • string dump_to_orig False
  • string reads_to_aligns_piles 3
  • string num_cpus 4
  • string num_cpus_pi 1
  • string save_subrun_files False
  • string max_read_frac 15
  • string max_read_frac_trim 10
  • string max_percent_invalid 1
  • string remove_duplicate_reads False
  • string eval_dir
  • string known_contigs contigs.fasta
  • string alignment_file nobbits
  • string max_bad_look 100000
  • string poly_score_look 0
  • string initial_max_score 0
  • string mouse_filter False
  • string end_floor 10
  • string left_sub_end_look 12
  • string parse_last True
  • string parse_first False
  • string one_ace_file False
  • string correct_errors_max_aligns 100
  • string try_if_no_proper False
  • string perfect_proper_only False
  • string align_kmer_size 48
  • string trim_bad_ends_max_aligns 100
  • string trim_bad_ends_minbad2 0.4
  • string locate_chimeras_last_grab 9
  • string merge_transposons False
  • string mc_max_nodes 70000
  • string mc_min_overlap 128
  • string assemblator_min_overlap 56
  • string n_haplotypes 1
  • string mc_strict True
  • string mc_score_thresh 0.7
  • string recycle_bad_contigs True
  • string break_link_dist 20000
  • string break_offset 0
  • string mc_quit_if_no_links True
  • string mc_min_read_len 300
  • string rebuilder_min_links 2
  • string rebuilder_check_coverage False
  • string use_plates_in_lib_stats True
  • string tilings_batch_size 1500
  • string fast_draft_consensus True
  • string big_reads False
  • string maxcliq2 300
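
The NAME=value convention and the documented defaults can be modeled with a short Python sketch. It is only an illustration of how overrides layer on top of defaults. The DEFAULTS table copies a handful of entries from the list above, and the parsing rules (rejecting unknown names, accepting only the literal strings True and False as booleans) are assumptions rather than a description of how Assemblez itself parses its arguments.

```python
# A few defaults copied from the parameter list above; the real list is much longer.
DEFAULTS = {
    "NOGO": "False",
    "LOGFILE": "assemblez.log",
    "SUBDIR_PREFIX": "stage",
    "FINAL_SUBDIR": "stage_final_v1",
    "num_cpus": "4",
}


def parse_args(argv, defaults=DEFAULTS):
    """Overlay NAME=value arguments on a copy of the defaults."""
    settings = dict(defaults)
    for arg in argv:
        name, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"expected NAME=value, got {arg!r}")
        if name not in settings:
            raise KeyError(f"unknown parameter {name!r}")
        settings[name] = value
    return settings


def as_bool(value):
    """Interpret the True/False strings used on this page."""
    if value not in ("True", "False"):
        raise ValueError(f"expected True or False, got {value!r}")
    return value == "True"


settings = parse_args(["NOGO=True", "LOGFILE=second_run.log"])
print(as_bool(settings["NOGO"]), settings["LOGFILE"], settings["SUBDIR_PREFIX"])
```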


Series of modules

As of 2007-12-12, Assemblez runs the following set of modules in order:

Data verification

  1. CheckConfigurationFile
  2. RequiredDataCheck
  3. TraceArchiveParser
  4. ConfigurationParser

Pre-processing

  1. PartitionInput
  2. FormPairings (steps 1-3)
  3. BackupOriginalData
  4. ExtractMetainfo
  5. ReadStatisticsTables
  6. PreassembleInserts
  7. ReadsToAligns (steps 1-3)
  8. ReorderReads* (two passes)
  9. WellSlop
  10. EraseImproperAligns*
  11. TrimBadEnds2*
  12. VerifyReadLengths
  13. TidyAligns
  14. AnalyzeDepth (steps 1-2)
  15. CorrectErrors* (steps 1-2)
  16. EraseDeadAligns
  17. LocateChimeras (steps 1-5)
  18. CleanAlignments
  19. FilterAlignments*
  20. RemoveExcessAligns (if REMOVE_EXCESS_ALIGNS is set to True)
  21. TagRepeatReads

* These modules do not run if EXTERNAL_OVERLAPS is set to True.
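
The effect of the two flags on the pre-processing list can be sketched in Python as follows. The module names and the starred entries are taken from the list above. The function itself is a hypothetical model, and it ignores the per-module steps and passes.

```python
# (module name, starred) pairs; starred modules are skipped when EXTERNAL_OVERLAPS=True.
PREPROCESSING = [
    ("PartitionInput", False), ("FormPairings", False), ("BackupOriginalData", False),
    ("ExtractMetainfo", False), ("ReadStatisticsTables", False), ("PreassembleInserts", False),
    ("ReadsToAligns", False), ("ReorderReads", True), ("WellSlop", False),
    ("EraseImproperAligns", True), ("TrimBadEnds2", True), ("VerifyReadLengths", False),
    ("TidyAligns", False), ("AnalyzeDepth", False), ("CorrectErrors", True),
    ("EraseDeadAligns", False), ("LocateChimeras", False), ("CleanAlignments", False),
    ("FilterAlignments", True), ("RemoveExcessAligns", False), ("TagRepeatReads", False),
]


def preprocessing_plan(external_overlaps=False, remove_excess_aligns=False):
    """Return the pre-processing modules that would run for the given flag values."""
    plan = []
    for name, starred in PREPROCESSING:
        if starred and external_overlaps:
            continue  # starred modules do not run with external overlaps
        if name == "RemoveExcessAligns" and not remove_excess_aligns:
            continue  # only runs if REMOVE_EXCESS_ALIGNS is True
        plan.append(name)
    return plan


print(len(preprocessing_plan()))
print(len(preprocessing_plan(external_overlaps=True)))
```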

Assembly process

The modules run during the assembly process are highly dependent on flags, including BINGE_AND_PURGE, BINGE_AND_PURGE_2HAP, BREAK_AND_REBUILD, CLEAN_ASSEMBLY, COMP_LIB_STATS, DO_NQS_ALIGNS, FAST_BINGE, FIRST_STAGES_ONLY, KMER_KONSENSUS, REMOVE_INTERMEDIATES, STREAMLINE, SW_GAP_STEP, THREE_KILLER_PATCHES. They include, in approximate run order:

* These modules may not run, depending on flags.

Output

  1. CopySubdir
  2. RestoreMergedContigs*
  3. BackupMergedContigs*
  4. CreateAceFile*
  5. MakeReadableOutput
  6. EvaluateConsensus
  7. RefineConsensusAlignments
  8. FinishOldReport
  9. FinishNewReport

* These modules may not run, depending on flags.
