AssistMain

From ArachneWiki

AssistMain
Function: Script
Phase: Assembly
Standard CLAs: PRE, DATA, RUN, SUBDIR, OUTDIR, GDB, NO_HEADER
Special CLAs: ASSIST, TARGET_FASTB, MAX_COV, START, STOP, SQUASH, NOGO
Source location: ARACHNE_DIR/assisted

AssistMain is a script module that performs assisted assembly. It can be used to improve an existing assembly, provided a close “reference” exists, either as finished sequence or as a draft assembly. For example, suppose we have a draft assembly of Drosophila pseudoobscura and shotgun reads of the closely related species Drosophila persimilis. We can use Arachne to generate an initial draft assembly of persimilis, and then use the module AssistMain to improve that draft, with pseudoobscura as the reference genome.

Input

AssistMain requires the following input:

  1. A draft initial assembly of the project we want to improve
  2. The sequence of the reference genome, in fastb format
  3. The alignments of the shotgun reads of our project against the reference genome

Initial draft assembly

AssistMain assumes the initial draft assembly was generated with Arachne (so all the files describing the assembly are in Arachne binary format). You can use either CopySubdir or BackupMergedContigs to save the assembly in a SUBDIR.

Reference genome

If the reference genome is available as a fasta file, you can use the module Fastb to convert it to fastb.

Alignments of the shotgun reads against the reference genome

The alignments can be generated with any tool, but they need to be saved as look_align objects, such as QueryLookupTable produces; see LookAlign for details of the format. Important remark: both the project’s shotgun reads and the reference sequence fragments are identified by their numeric ID, i.e. by their 0-based position in the fastb vectors (the reads’ fastb is the file reads.fastb in the RUN directory). The alignments file must be saved as

DATA/Aligns/ASSIST/all_aligns.qlt

where ASSIST is a code name for the reference genome. For example, when assisting persimilis against pseudoobscura we can select “pseudoobscura” for ASSIST, and save the alignments as /PRE/DATA/Aligns/pseudoobscura/all_aligns.qlt.
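
The naming convention above can be sketched in a few lines of Python. This is only an illustration of how the path is put together: the helper name assist_aligns_path and the example values for PRE and DATA are hypothetical and are not part of Arachne.

```python
from pathlib import PurePosixPath


def assist_aligns_path(pre: str, data: str, assist: str) -> PurePosixPath:
    """Build PRE/DATA/Aligns/ASSIST/all_aligns.qlt from its three components."""
    return PurePosixPath(pre) / data / "Aligns" / assist / "all_aligns.qlt"


# Hypothetical PRE and DATA values; "pseudoobscura" is the ASSIST code name
# used in the example above.
print(assist_aligns_path("/wga/projects", "persimilis", "pseudoobscura"))
```

The ASSIST component used here must be the same string that is later passed to AssistMain as the ASSIST argument.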

Procedures for generating all_aligns.qlt include the module chain of BsubBlastz, FinalizeAllEcfAligns, and ConvertEcfBuild35ToLookAligns; and the module chain of MakeLookupTable, QueryLookupTable, FilterAlignments, ValidateLookAligns, and TagRepeatReads.

Arguments

The command-line arguments GDB, PRE, DATA, RUN, SUBDIR, and OUTDIR behave as in the other Arachne modules; the only difference is that when using AssistMain both SUBDIR and OUTDIR must be non-empty (i.e. real Arachne SUBDIRs).

NOGO

Setting the argument NOGO to True will cause AssistMain to list the modules that would be run, without actually running them.

START and STOP

These arguments can be used to run only part of the AssistMain pipeline, from and/or up to a specific point. For example, setting START=ParallelFixConsensus.2 and STOP=AnchorSupersToTarget.1 will result in AssistMain running only the modules at and after the second instance of ParallelFixConsensus and at and before the first instance of AnchorSupersToTarget. Note that the SUBDIR specified by the START argument must exist.
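
The Module.N labels can be read as "the N-th occurrence of Module in the pipeline". The following Python sketch shows that selection logic under this reading. It is not the AssistMain implementation: the function select_steps and the pipeline list are hypothetical, and the HypotheticalStep entries stand in for modules that this page does not name.

```python
def select_steps(modules, start=None, stop=None):
    """Label each module as Name.k (k-th occurrence, counted from 1) and
    return the labels from start to stop, both inclusive."""
    counts = {}
    labels = []
    for name in modules:
        counts[name] = counts.get(name, 0) + 1
        labels.append(f"{name}.{counts[name]}")
    # list.index raises ValueError if a label does not exist in the pipeline.
    first = labels.index(start) if start else 0
    last = labels.index(stop) if stop else len(labels) - 1
    if first > last:
        raise ValueError("START must not come after STOP")
    return labels[first:last + 1]


# Hypothetical pipeline order, for illustration only.
pipeline = [
    "ParallelFixConsensus",
    "HypotheticalStepA",
    "ParallelFixConsensus",
    "HypotheticalStepB",
    "AnchorSupersToTarget",
    "AnchorSupersToTarget",
]
print(select_steps(pipeline, "ParallelFixConsensus.2", "AnchorSupersToTarget.1"))
```

Running AssistMain with NOGO=True is the reliable way to see the actual module list and its instance labels before choosing START and STOP.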

ASSIST

This argument must match exactly the code name used for the file with the alignments of the reads. A draft assembly can be assisted by several reference genomes, which are uniquely identified by the ASSIST argument.

TARGET_FASTB

This is the file with the fastb of the reference genome (full path name).

MAX_COV

The argument MAX_COV provides a "cap" on how deeply reads may pile up on the reference genome; some of the modules use it to stay away from repetitive regions. A good rule of thumb is to set MAX_COV to about 8 times the expected coverage of the project. If, for example, the project has been sequenced at 4X coverage, then the suggested value for MAX_COV is 32.
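
The rule of thumb is simple arithmetic, written out here as a small Python helper. The function name suggested_max_cov is hypothetical, and rounding to the nearest integer is an assumption made for the case of fractional coverage.

```python
def suggested_max_cov(expected_coverage: float, factor: float = 8.0) -> int:
    """Return about `factor` times the expected coverage, rounded to an integer."""
    return round(expected_coverage * factor)


# The 4X example from the text.
print(suggested_max_cov(4))
```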

SQUASH

This is an experimental argument used to assist a draft assembly with a higher rate of polymorphism. The effect of setting SQUASH=True is to call some extra modules that first identify and then merge supercontigs suspected to be the two haplotypes of the same genomic region. Warning! At present this is not well tested. We recommend leaving it set to False.
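
Putting the arguments together, a call could be assembled as in the Python sketch below. It assumes that the module is invoked as AssistMain followed by KEY=value pairs, in the same form as the START= and SQUASH= examples on this page. Every value shown (paths, directory names, the run name) is hypothetical, and the helper assist_main_command is not part of Arachne.

```python
import shlex


def assist_main_command(**args):
    """Assemble an AssistMain command line from KEY=value arguments."""
    # The Arguments section requires SUBDIR and OUTDIR to be non-empty.
    for key in ("SUBDIR", "OUTDIR"):
        if not args.get(key):
            raise ValueError(f"{key} must be a non-empty Arachne SUBDIR")
    parts = ["AssistMain"] + [f"{key}={value}" for key, value in args.items()]
    # Quote each part so unusual characters in paths survive the shell.
    return " ".join(shlex.quote(part) for part in parts)


# All values below are placeholders chosen for illustration.
cmd = assist_main_command(
    PRE="/wga/projects",
    DATA="persimilis",
    RUN="run1",
    SUBDIR="draft",
    OUTDIR="assisted",
    ASSIST="pseudoobscura",
    TARGET_FASTB="/wga/projects/pseudoobscura/genome.fastb",
    MAX_COV=32,
    NOGO=True,  # list the modules that would run, without running them
)
print(cmd)
```

Starting with NOGO=True and dropping it once the listed modules look right keeps the first run cheap.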
