Input

From ArachneWiki

Revision as of 18:43, 12 August 2008 by Dheiman (Talk | contribs)

Before running Arachne, you must set up the input. This is a non-trivial task: you must gather information about your sequencing project and the organism you are sequencing, then convert it into the appropriate formats. This page walks you through setting up these files.

You can find examples of each of these input files by examining the sample projects, which are available at ftp://ftp.broad.mit.edu/pub/wga/sample_projects/. Also, large amounts of real-life input data for the fasta, qual, and traceinfo directories can be found, sorted by species, at the NCBI website: ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB/.

Directory tree

Each run of Arachne will have its own set of directories and subdirectories on your filesystem. (See Directory tree.) Most of these directories will be created by Arachne itself and will contain output, but some of them must be created manually and filled with input files. In particular, the PRE and DATA directories must contain certain files.

PRE

You must add the following subdirectories to the PRE directory:

Each of these directories (other than dtds) must contain a file contigs.fasta.

DATA

Subdirectories

The DATA directory must contain three subdirectories. Each of these directories contains essential information about the input reads.

  • fasta/: read sequence files in fasta format. Any and all files of the form reads.fasta, reads.fasta.gz, fasta/fasta.*, fasta/*.fasta, or fasta/*.fasta.gz will be used.
  • qual/: read quality score files in qual format. Any and all files of the form reads.qual, reads.qual.gz, qual/qual.*, qual/*.qual, or qual/*.qual.gz will be used. The quality score files must match the read sequence files on a file-by-file basis.
  • traceinfo/: XML ancillary files describing the reads, including each read's trace_name.

Note that every input read must appear exactly once in each of these directories. Reads' identities are defined by read names. For XML ancillary files, read names are defined by the trace_name field. Mis-labeling of read and library names in the XML ancillary files is a common problem and will complicate the assembly process if unfixed.
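The "exactly once" rule can be checked before starting an assembly. The following Python sketch is not part of Arachne; it compares the read names found in fasta and qual files and reports duplicates and mismatches. It assumes that the read name is the first whitespace-delimited token after the ">" on each header line, which may not hold for every data source. The function names are hypothetical.

```python
import gzip
from collections import Counter
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file as text."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def header_names(lines):
    """Yield the read name from each '>' header line.

    Assumption: the read name is the first whitespace-delimited token
    after '>'. Adjust this if your headers are laid out differently.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def names_in_file(path):
    """Return the list of read names in one fasta or qual file."""
    with open_text(path) as handle:
        return list(header_names(handle))


def compare_read_names(fasta_names, qual_names):
    """Return (duplicates, only_in_fasta, only_in_qual) as sorted lists."""
    fasta_counts = Counter(fasta_names)
    qual_counts = Counter(qual_names)
    # A name is a duplicate if it occurs more than once on either side.
    duplicates = sorted({
        name
        for counts in (fasta_counts, qual_counts)
        for name, n in counts.items()
        if n > 1
    })
    only_in_fasta = sorted(set(fasta_counts) - set(qual_counts))
    only_in_qual = sorted(set(qual_counts) - set(fasta_counts))
    return duplicates, only_in_fasta, only_in_qual
```

To use it, collect the names from every file in fasta/ and qual/ with names_in_file, pass the two combined lists to compare_read_names, and fix the input if any of the three returned lists is non-empty. The same comparison could be extended to the trace_name values in the XML ancillary files.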

Individual files

Several other files can appear in the DATA directory. The following files are required:

The following files are optional in the DATA directory:

  • nhaplotypes: a one-line file containing the number of haplotypes this organism has. Used in polymorphism algorithms. If no file is given, Arachne assumes nhaplotypes = 2.
  • mitochondrial.fasta: a fasta file containing sequence contigs for the mitochondrial genome of the organism being sequenced. Sequence reads matching these contigs are not used in the assembly.
  • contaminants.fasta: a fasta file containing sequence contigs for known contaminants of the organism being sequenced. Sequence reads matching these contigs are not used in the assembly.
  • vector.fasta: a fasta file containing supplementary vector sequence for the organism being sequenced. Matching sequence is trimmed from the containing reads.
ARG1=val1
ARG2=val2
...
  • defaults: identical to defaults.Assemblez, except it is used for Assemble. It is NOT used for Assemblez.
  • labnotes: This file, which takes the name <species>.labnotes, may appear in sample projects. It is a simple shell script indicating the modules that should be run (or have already been run) in this assembly project.
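Two of the optional files are simple enough to inspect with a short script. The Python sketch below is not part of Arachne. It reads an nhaplotypes file, falling back to 2 when the file is absent as described above, and parses the ARG=val lines of a defaults file into a dictionary. It assumes one ARG=val pair per line and treats blank lines as ignorable; the source does not say whether Arachne itself accepts blank lines or comments. The function names are hypothetical.

```python
from pathlib import Path

# Documented fallback when no nhaplotypes file is present.
DEFAULT_NHAPLOTYPES = 2


def parse_nhaplotypes(text):
    """Parse the contents of a one-line nhaplotypes file into an int."""
    return int(text.strip())


def read_nhaplotypes(data_dir):
    """Return the haplotype count for a DATA directory, or the default."""
    path = Path(data_dir) / "nhaplotypes"
    if not path.is_file():
        return DEFAULT_NHAPLOTYPES
    return parse_nhaplotypes(path.read_text())


def parse_defaults(text):
    """Parse 'ARG=val' lines into a dict, keeping values as strings."""
    args = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        # Split on the first '=' only, so values may contain '='.
        name, sep, value = line.partition("=")
        if not sep or not name:
            raise ValueError(f"line {lineno}: expected ARG=val, got {raw!r}")
        args[name] = value
    return args
```

Running parse_defaults over a defaults file before an assembly is a cheap way to catch a malformed line early, rather than partway through a long run.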