Input
From ArachneWiki
Before running Arachne, you must set up the input. This highly non-trivial task will require you to gather information about your sequencing project and the organism you are trying to sequence and convert it into appropriate formats. This page will walk you through the process of setting up these files.
You can find examples of each of these input files by examining the sample projects, which are available at ftp://ftp.broad.mit.edu/pub/wga/sample_projects/. Also, large amounts of real-life input data for the fasta, qual, and traceinfo directories can be found, sorted by species, at the NCBI website: ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB/.
Directory tree
Each run of Arachne will have its own set of directories and subdirectories on your filesystem. (See Directory tree.) Most of these directories will be created by Arachne itself and will contain output, but some of them must be created manually and filled with input files. In particular, the PRE and DATA directories must contain certain files.
PRE
You must add the following subdirectories to the PRE directory:
- dtds: Some documentation.
- e_coli and e_coli_transposons: The E. coli genome contigs and the E. coli transposon contigs, respectively, in fasta format.
- vector: The vector genome contigs, in fasta format.
Each of these directories (other than dtds) must contain a file contigs.fasta.
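The PRE layout described above can be checked before a run with a short script. The following is a minimal sketch, not part of Arachne: the function name missing_pre_inputs is hypothetical, and the only things it relies on are the subdirectory names and the contigs.fasta requirement stated on this page.

```python
from pathlib import Path

# Subdirectories of PRE named on this page. dtds holds documentation only.
PRE_SUBDIRS = ["dtds", "e_coli", "e_coli_transposons", "vector"]
# Every subdirectory other than dtds must contain a file contigs.fasta.
NEEDS_CONTIGS = [name for name in PRE_SUBDIRS if name != "dtds"]


def missing_pre_inputs(pre_dir):
    """Return a sorted list of expected PRE entries that do not exist.

    Hypothetical helper: entries are reported relative to pre_dir,
    using '/' as the separator.
    """
    pre = Path(pre_dir)
    missing = []
    for name in PRE_SUBDIRS:
        if not (pre / name).is_dir():
            missing.append(name)
    for name in NEEDS_CONTIGS:
        if not (pre / name / "contigs.fasta").is_file():
            missing.append(f"{name}/contigs.fasta")
    return sorted(missing)


if __name__ == "__main__":
    import sys

    # Usage: python check_pre.py /path/to/PRE
    for entry in missing_pre_inputs(sys.argv[1]):
        print("missing:", entry)
```

An empty result means only that the expected files exist. It does not check that each contigs.fasta is valid fasta.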
DATA
Subdirectories
The DATA directory must contain three subdirectories. Each of these directories contains essential information about the input reads.
- fasta/: read sequence files in fasta format. Any and all files of the form reads.fasta, reads.fasta.gz, fasta/fasta.*, fasta/*.fasta, or fasta/*.fasta.gz will be used.
- qual/: read quality score files in qual format. Any and all files of the form reads.qual, reads.qual.gz, qual/qual.*, qual/*.qual, or qual/*.qual.gz will be used. The quality score files must match the read sequence files on a file-by-file basis.
- traceinfo/: XML ancillary files describing the reads.
Note that every input read must appear exactly once in each of these directories. Reads' identities are defined by read names. For XML ancillary files, read names are defined by the trace_name field. Mis-labeling of read and library names in the XML ancillary files is a common problem and will complicate the assembly process if unfixed.
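The "exactly once" rule can be checked for the fasta and qual directories with a short script. The following is a minimal sketch under stated assumptions, not an Arachne tool: it assumes that the read name is the first whitespace-delimited token after ">" on each header line, it looks only at files matching fasta/*.fasta, fasta/*.fasta.gz, qual/*.qual, and qual/*.qual.gz, and it does not examine the XML ancillary files. The function names read_names and check_reads are hypothetical.

```python
import gzip
from collections import Counter
from pathlib import Path


def read_names(path):
    """Yield the read name from each '>' header line of a fasta or qual file.

    Assumption: the read name is the first whitespace-delimited token
    after '>'.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def count_names(directory, patterns):
    """Count how often each read name occurs across the matching files."""
    counts = Counter()
    for pattern in patterns:
        for path in sorted(Path(directory).glob(pattern)):
            counts.update(read_names(path))
    return counts


def check_reads(data_dir):
    """Report read names that break the 'exactly once in each' rule."""
    data = Path(data_dir)
    fasta = count_names(data / "fasta", ["*.fasta", "*.fasta.gz"])
    qual = count_names(data / "qual", ["*.qual", "*.qual.gz"])
    duplicated = {
        name
        for counts in (fasta, qual)
        for name, seen in counts.items()
        if seen > 1
    }
    return {
        "duplicated": sorted(duplicated),
        "fasta_only": sorted(set(fasta) - set(qual)),
        "qual_only": sorted(set(qual) - set(fasta)),
    }
```

If all three lists in the result are empty, every read name found appears once in fasta/ and once in qual/.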
Individual files
Several other files can appear in the DATA directory. The following files are required:
The following files are optional in the DATA directory:
- organism: a one-line file containing an appropriate name for this organism (e.g., "Guinea pig" or "Candida albicans"). Used in assembly.ps.
- nhaplotypes: a one-line file containing the number of haplotypes this organism has. Used in polymorphism algorithms. If no file is given, Arachne assumes nhaplotypes = 2.
- insert.sites: A list of insert locations. Required to perform exact trimming.
- reads.to_exclude: the read exclusion file
- mitochondrial.fasta: a fasta file containing sequence contigs for the mitochondrial genome of the organism being sequenced. Sequence reads matching these contigs are not used in the assembly.
- contaminants.fasta: a fasta file containing sequence contigs for known contaminants of the organism being sequenced. Sequence reads matching these contigs are not used in the assembly.
- vector.fasta: a fasta file containing supplementary vector sequence for the organism being sequenced. Matching sequence is trimmed from the containing reads.
- contigs.fasta (highly recommended)
- defaults.Assemblez: a set of command-line arguments that are used implicitly whenever Assemblez is run. Any default value may be overridden by explicitly stating a different value for that argument. The format of this file is:
ARG1=val1 ARG2=val2 ...
- defaults: identical to defaults.Assemblez, except it is used for Assemble. It is NOT used for Assemblez.
- labnotes: This file, which takes the name <species>.labnotes, may appear in sample projects. It is a simple shell script indicating the modules that should be run (or have already been run) in this assembly project.
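The defaults file format and the nhaplotypes fallback described in the list above can be illustrated with a short script. The following is a minimal sketch, not Arachne's own parser: it assumes that values contain no whitespace or quoting, and the function names parse_defaults, effective_args, and read_nhaplotypes are hypothetical.

```python
from pathlib import Path


def parse_defaults(text):
    """Parse 'ARG1=val1 ARG2=val2 ...' into a dict of strings.

    Assumption: arguments are separated by whitespace and values
    contain no whitespace or quoting.
    """
    args = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        if not sep or not name:
            raise ValueError(f"not an ARG=value token: {token!r}")
        args[name] = value
    return args


def effective_args(defaults_text, explicit):
    """Merge file defaults with explicit arguments; explicit values win."""
    merged = parse_defaults(defaults_text)
    merged.update(explicit)
    return merged


def read_nhaplotypes(data_dir):
    """Read the one-line nhaplotypes file, falling back to 2 if absent."""
    path = Path(data_dir) / "nhaplotypes"
    if not path.is_file():
        return 2
    return int(path.read_text().strip())
```

For example, effective_args("ARG1=val1 ARG2=val2", {"ARG2": "other"}) keeps ARG1 from the file and takes ARG2 from the explicit arguments, mirroring the override rule stated for defaults.Assemblez.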
