From ArachneWiki

Revision as of 04:19, 5 May 2008 by JoshuaBurton (Talk | contribs)
DNA double-helix structure, showing the four types of bases.

A base is the fundamental unit of all DNA sequences. There are four DNA bases - adenine, cytosine, guanine, and thymine - commonly represented as A, C, G, and T. Hence any known DNA sequence may be represented by a string in a four-letter alphabet: ACCTGAACTT...

All DNA objects are composed of bases (or, in the case of double-stranded DNA, base pairs or bp). Length in base pairs is the canonical measure of an object's size. Sanger reads are around 700 base pairs long, while Solexa reads are typically about 35 base pairs. The human genome is roughly 3 billion base pairs (3 Gbp).

In Arachne

All human-readable Arachne input and output files represent bases with their letters, ACGT. In certain circumstances, though, the exact representation of a base may take on other forms:

Internally, Arachne represents bases in a compressed binary format. Since there are four different nucleotides, each individual base requires two bits to encode; hence a sequence of N bases can be described most efficiently by 2N bits, or N/4 bytes. The basevector module and the fastb format handle sequences in this way. Note that this format cannot represent ambiguity the way text files can, with symbols such as lowercase acgt and N: every two-bit value already stands for one of the four bases, so no code is left over for "unknown".
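
The two-bit idea can be illustrated with a minimal Python sketch. This is not Arachne's code: the mapping A=0, C=1, G=2, T=3, the bit order within each byte, and the names pack_bases and unpack_bases are all assumptions made for illustration, and the real basevector and fastb layouts may differ.

```python
# Hypothetical 2-bit code for illustration only; the mapping and bit order
# actually used by basevector/fastb are not stated on this page.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack a string over ACGT into 2 bits per base, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)  # N/4 bytes, rounded up
    for i, base in enumerate(seq):
        # KeyError here for N or lowercase letters: there is no spare code.
        code = BASE_TO_CODE[base]
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_bases(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed buffer."""
    return "".join(
        CODE_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(length)
    )
```

The length has to be stored separately from the packed bytes, because the last byte may be only partly used. Passing a sequence containing N raises an error in this sketch, which mirrors the limitation described above.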


Base lengths are often represented in abbreviated units according to SI naming and abbreviating conventions:

Number  Term      Abbreviation
10^3    Kilobase  Kb
10^6    Megabase  Mb
10^9    Gigabase  Gb

All of these abbreviations may also be followed by "p" for pairs (e.g., Mbp for megabase pairs).
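
As a rough sketch of how these units might be applied when printing sizes, the hypothetical Python helper below picks the largest unit that fits. The function name format_bases and the fallback labels "bases" and "bp" for lengths under one kilobase are assumptions for this example, not Arachne conventions.

```python
def format_bases(n: int, pairs: bool = False) -> str:
    """Render a base count using the Kb/Mb/Gb abbreviations.

    Hypothetical helper for illustration; not part of Arachne.
    """
    suffix = "p" if pairs else ""
    for factor, unit in ((10**9, "Gb"), (10**6, "Mb"), (10**3, "Kb")):
        if n >= factor:
            # :g drops trailing zeros, so 3000000000 becomes "3", not "3.0"
            return f"{n / factor:g} {unit}{suffix}"
    # Below one kilobase, print the plain count (labels are an assumption).
    return f"{n} bp" if pairs else f"{n} bases"
```

For example, format_bases(3_000_000_000, pairs=True) gives "3 Gbp", matching the human genome size quoted earlier.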
