Fasta

From ArachneWiki

Fasta is a standard file format for sets of sequences, such as reads and contigs. The corresponding format for quality scores is qual. Arachne supports the fasta format, as does most base-calling software, such as PHRED. Files in this format include contigs.fasta and assembly_supers.fasta.

Here is an example of two fasta-format reads:

>L1000ABCDEFG.readname
ACGCATCGACTGACGTACTCGATCGA
TGCTGGTCATGATGCTGACTGACTAG
ACGTTGGGACATCACCCGCTAGGTAA
TGTCTGATGCCCATG ...
>gnl|ti|3 G10P69425RH3.T0
GACGACTGACTGACTGACTACGACGC
AAGTGACTACGAGATAGATGACATCG
CTGACTAGCATCGTTGACGTACGCCG
ACC ...

Each read has a name, which is important for input and output files. Read names are determined in fasta (and qual) files as follows: take the rightmost whitespace-free string on a line beginning with ">". In the above example, the read names are L1000ABCDEFG.readname and G10P69425RH3.T0.
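
The naming rule can be illustrated with a short Python sketch. This is not Arachne code: the function names read_name and read_names are hypothetical, and the sketch assumes that the leading ">" is not part of the name, as the example above suggests.

```python
def read_name(header_line):
    """Return the read name from a fasta/qual header line, or None.

    Hypothetical helper: takes the rightmost whitespace-free string on a
    line beginning with ">". Assumes the ">" itself is not part of the name.
    """
    if not header_line.startswith(">"):
        return None  # sequence or quality line, not a header
    tokens = header_line[1:].split()
    return tokens[-1] if tokens else None


def read_names(lines):
    """Collect the read names from an iterable of fasta lines."""
    names = []
    for line in lines:
        name = read_name(line)
        if name is not None:
            names.append(name)
    return names


example = [
    ">L1000ABCDEFG.readname",
    "ACGCATCGACTGACGTACTCGATCGA",
    ">gnl|ti|3 G10P69425RH3.T0",
    "GACGACTGACTGACTGACTACGACGC",
]
print(read_names(example))
```

For the two headers shown earlier, this yields L1000ABCDEFG.readname and G10P69425RH3.T0; the prefix gnl|ti|3 is dropped because it is not the rightmost string on its line.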

Directory

fasta is also the name of an input directory. It is a subdirectory of DATA and contains the raw data for all reads, in fasta format.


Fasta
Function: Conversion
Phase: Conversion
Standard CLAs: PRE, DATA, RUN, GDB, NO_HEADER
Special CLAs: SUBSET, HEAD, CLEAN, NAMES, MAXREADS, GZIPPED, FOLD_SIZE, MAX_BASES
Source location: ARACHNE_DIR/util

Module

Lastly, there is an executable module called Fasta. It converts a file from fastb format into fasta format.

Special Command-line arguments

Argument name | Argument type | Default value | Meaning
SUBSET | Index list | "" | If SUBSET is not empty, it is parsed as an IntSet (cf. ParseSet.h) and only those entries are printed.
HEAD | String | "reads" | HEAD.fastb will be converted to HEAD.fasta.
CLEAN | Bool | False | If True, generate a blastable file (no blank lines, no empty reads).
NAMES | Bool | False | If True, use original read names. HEAD.ids must exist.
MAXREADS | UnsignedInt | 0 | If MAXREADS > 0, print no more than MAXREADS reads.
GZIPPED | Bool | False | If True, save the output file in gzip format.
FOLD_SIZE | UnsignedInt | 0 | If specified, fold fasta sequences into chunks of size FOLD_SIZE. This is needed for some external programs.
MAX_BASES | UnsignedInt | 0 | If specified, output only sequences of length <= MAX_BASES.
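
To make the FOLD_SIZE description concrete, here is a minimal Python sketch of folding a sequence into fixed-size lines. It is an illustration only, not the Fasta module's implementation: the names fold and format_record are hypothetical, and treating a fold size of 0 as "do not fold" is an assumption based on the default value listed above.

```python
def fold(sequence, fold_size):
    """Split a sequence into chunks of at most fold_size characters.

    Assumption: fold_size == 0 means "no folding", so the sequence is
    returned as a single chunk.
    """
    if fold_size <= 0:
        return [sequence]
    return [sequence[i:i + fold_size]
            for i in range(0, len(sequence), fold_size)]


def format_record(name, sequence, fold_size=0):
    """Format one fasta record: a ">" header line, then the folded sequence."""
    lines = [">" + name] + fold(sequence, fold_size)
    return "\n".join(lines) + "\n"


# Hypothetical read name and sequence, folded to 4 bases per line.
print(format_record("example.read", "ACGTACGTAC", fold_size=4), end="")
```

The last chunk may be shorter than the fold size, as in the truncated final lines of the example reads near the top of this page.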