Fasta
From ArachneWiki
Fasta is a standard file format for reads and contigs; a fasta file holds a set of named sequences. The corresponding format for quality scores is qual. The fasta format is supported by Arachne and also by most base-calling software, such as PHRED. Files in this format include contigs.fasta and assembly_supers.fasta.
Here is an example of two fasta-format reads:
    >L1000ABCDEFG.readname
    ACGCATCGACTGACGTACTCGATCGA
    TGCTGGTCATGATGCTGACTGACTAG
    ACGTTGGGACATCACCCGCTAGGTAA
    TGTCTGATGCCCATG
    ...
    >gnl|ti|3 G10P69425RH3.T0
    GACGACTGACTGACTGACTACGACGC
    AAGTGACTACGAGATAGATGACATCG
    CTGACTAGCATCGTTGACGTACGCCG
    ACC
    ...
Each read has a name, which is important for input and output files. Read names are determined in fasta (and qual) files as follows: take the rightmost white-space-free string on each line beginning with ">". In the above example, the read names are L1000ABCDEFG.readname and G10P69425RH3.T0.
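The naming rule above can be illustrated with a short Python sketch. This is not Arachne code; the function name `read_names` is hypothetical, and the sketch assumes that a header line containing only ">" has no name and is skipped.

```python
def read_names(lines):
    """Yield the read name from each header line (a line starting with '>')."""
    for line in lines:
        if line.startswith(">"):
            # split() with no argument splits on any run of white space
            tokens = line[1:].split()
            if tokens:  # assumption: a bare ">" carries no name, so skip it
                yield tokens[-1]  # rightmost white-space-free string


example = """\
>L1000ABCDEFG.readname
ACGCATCGACTGACGTACTCGATCGA
>gnl|ti|3 G10P69425RH3.T0
GACGACTGACTGACTGACTACGACGC
"""

names = list(read_names(example.splitlines()))
print(names)  # ['L1000ABCDEFG.readname', 'G10P69425RH3.T0']
```

Note that for the second header the name is G10P69425RH3.T0, not gnl|ti|3, because the rule takes the last token on the line rather than the first.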
Directory
fasta is also the name of an input directory. It is a subdirectory of DATA and contains the raw data for all reads, in fasta format.
| Fasta | |
|---|---|
| Function | Conversion |
| Phase | Conversion |
| Standard CLAs | PRE, DATA, RUN, GDB, NO_HEADER |
| Special CLAs | SUBSET, HEAD, CLEAN, NAMES, MAXREADS, GZIPPED, FOLD_SIZE, MAX_BASES |
| Source location | ARACHNE_DIR/util |
Module
Lastly, there is an executable module called Fasta. It converts a file from fastb into fasta format.
Special Command-line arguments
| Argument name | Argument type | Default value | Meaning |
|---|---|---|---|
| SUBSET | Index list | "" | If SUBSET is not empty, it is parsed as an IntSet (cf. ParseSet.h) and only those entries are printed. |
| HEAD | String | "reads" | HEAD.fastb will be converted to HEAD.fasta. |
| CLEAN | Bool | False | If True, generate a blastable file (no blank lines, no empty reads). |
| NAMES | Bool | False | If True, use original read names. HEAD.ids must exist. |
| MAXREADS | UnsignedInt | 0 | If MAXREADS > 0, print no more than MAXREADS reads. |
| GZIPPED | Bool | False | If True, save the output file in gzip format. |
| FOLD_SIZE | UnsignedInt | 0 | If specified, fold fasta sequences into chunks of size FOLD_SIZE. This is needed for some external programs. |
| MAX_BASES | UnsignedInt | 0 | If specified, output only sequences of length <= MAX_BASES. |
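
To make the effect of FOLD_SIZE, MAX_BASES, MAXREADS and CLEAN concrete, here is a rough Python sketch of the filtering and folding they describe. It is not the Fasta module's source: the function names `fold` and `format_fasta` are hypothetical, a value of 0 is assumed to mean "not specified", and the order in which the filters apply (in particular, that skipped reads do not count toward MAXREADS) is an assumption.

```python
def fold(sequence, fold_size=0):
    """Split a sequence into chunks of at most fold_size bases (0 = no folding)."""
    if not sequence:
        return []
    if fold_size <= 0:
        return [sequence]
    return [sequence[i:i + fold_size] for i in range(0, len(sequence), fold_size)]


def format_fasta(records, fold_size=0, max_bases=0, max_reads=0, clean=False):
    """Render (name, sequence) pairs as fasta text, applying the options."""
    out = []
    printed = 0
    for name, seq in records:
        if max_reads > 0 and printed >= max_reads:
            break  # MAXREADS: print no more than max_reads reads
        if max_bases > 0 and len(seq) > max_bases:
            continue  # MAX_BASES: keep only sequences of length <= max_bases
        if clean and not seq:
            continue  # CLEAN: no empty reads
        out.append(">" + name)
        out.extend(fold(seq, fold_size))  # FOLD_SIZE: one chunk per line
        printed += 1
    return "".join(line + "\n" for line in out)


records = [("read0", "ACGTACGTAC"), ("read1", ""), ("read2", "ACGTACGTACGTACGT")]
text = format_fasta(records, fold_size=4, max_bases=12, clean=True)
print(text, end="")
```

With these inputs only read0 survives: read1 is empty and dropped by the CLEAN rule, read2 has 16 bases and exceeds the limit of 12, and the 10 bases of read0 are written as lines of 4, 4 and 2 bases.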
