A base is the fundamental unit of all DNA sequences. There are four DNA bases - adenine, cytosine, guanine, and thymine - commonly represented as A, C, G, and T. Hence any known DNA sequence may be represented by a string in a four-letter alphabet: ACCTGAACTT...
All DNA objects are composed of bases (or, in the case of double-stranded DNA, base pairs or bp). Length in base pairs is the canonical measure of an object's size. Sanger reads are around 700 bp long, while Solexa reads are typically about 35 bp. The human genome is about 3 billion base pairs (3 Gbp).
In text files, symbols other than A, C, G, and T may also appear:
- acgt: Lowercase bases indicate a lack of confidence in the base call. This may denote regions of a read or consensus with low quality scores.
- N: In input reads, this represents a failure of the reading machinery to determine a base's identity. In certain fasta files (including assembly_supers.fasta, assembly_supers.quals, supercontigs.fasta, mergedcontigs.fasta), it indicates gaps between contigs in a supercontig.
- *: In consensus files (generated by GenerateTilings) and in ace files, this represents a pad in the consensus.
- #: In consensus files (generated by GenerateTilings), this represents a pad in a low-quality region, i.e., where a known base would appear in lowercase, as described above.
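The symbol conventions listed above can be illustrated with a small tally function. This is only a sketch in Python, not part of Arachne: the function name `classify_symbols` and the category labels are hypothetical, chosen here for illustration.

```python
from collections import Counter

def classify_symbols(seq: str) -> Counter:
    """Tally the characters of a text sequence by the categories listed above.

    The category names are illustrative labels, not Arachne terminology.
    """
    counts = Counter()
    for ch in seq:
        if ch in "ACGT":
            counts["confident_base"] += 1       # uppercase: ordinary base
        elif ch in "acgt":
            counts["low_confidence_base"] += 1  # lowercase: low confidence
        elif ch == "N":
            counts["unknown_or_gap"] += 1       # unread base, or gap between contigs
        elif ch == "*":
            counts["pad"] += 1                  # pad in the consensus
        elif ch == "#":
            counts["low_quality_pad"] += 1      # pad in a low-quality region
        else:
            counts["other"] += 1                # anything not covered above
    return counts
```

Whether an N means an undetermined base or a gap depends on the file it appears in, so the function reports a single combined count for it.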
Internally, Arachne represents bases in a compressed binary format. Since there are four different nucleotides, each individual base requires two bits to encode; hence a sequence of n bases can be described most efficiently by 2n bits, or n/4 bytes. The basevector module and the fastb format handle sequences in this way. Note that, unlike text files, this format cannot represent ambiguity with symbols such as acgt and N.
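The two-bits-per-base idea can be sketched in a few lines of Python. This is not the actual basevector or fastb layout: the A=0, C=1, G=2, T=3 mapping, the bit order within each byte, and the function names `pack_bases` and `unpack_bases` are all assumptions made for illustration.

```python
# Assumed 2-bit codes; the real fastb encoding may differ.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"

def pack_bases(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)  # round up to whole bytes
    for i, base in enumerate(seq):
        # A KeyError here means the symbol has no 2-bit code (e.g. N or lowercase).
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)

def unpack_bases(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed data."""
    return "".join(
        LETTERS[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n_bases)
    )
```

Note that the number of bases has to be stored separately, since the last byte may be only partly used. Symbols such as N or lowercase letters raise a `KeyError` in this sketch, which reflects the limitation described above: with only four codes available, there is no room left to mark ambiguity.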
Base lengths are often represented in abbreviated units according to SI naming and abbreviating conventions:
- kb: kilobase (1,000 bases)
- Mb: megabase (1,000,000 bases)
- Gb: gigabase (1,000,000,000 bases)
All of these abbreviations may also be followed by "p" for pairs (e.g., Mbp for megabase pairs).
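As a rough illustration of these units, the following Python sketch formats a base count with an SI prefix (k, M, or G) and an optional "p" for pairs. The function name `format_length` and its `pairs` parameter are hypothetical and do not correspond to anything in Arachne.

```python
def format_length(n_bases: int, pairs: bool = False) -> str:
    """Format a length in bases (or base pairs) using SI prefixes k, M, G."""
    suffix = "bp" if pairs else "b"
    # Try the largest unit first so that 3,000,000,000 becomes G, not M.
    for factor, prefix in ((10**9, "G"), (10**6, "M"), (10**3, "k")):
        if n_bases >= factor:
            return f"{n_bases / factor:g} {prefix}{suffix}"
    return f"{n_bases} {suffix}"
```

For example, `format_length(3_000_000_000, pairs=True)` gives `3 Gbp`, and `format_length(700, pairs=True)` gives `700 bp`.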