In genomics, an alignment is a pairing of two sequences along a stretch of identical or similar bases. Alignments are rarely perfect matches; they tend to contain mismatches and indels (insertions and deletions) and may leave hanging ends, i.e., unaligned bases at either end of a sequence.
The term "alignment" also refers to the process of sequence alignment, i.e., finding alignments. Sequence alignment is an important problem in computational biology, and many algorithms have been developed for it. Arachne uses these algorithms to create alignments, and then uses alignments to assemble reads into contigs through the process of overlap and consensus. Alignments also play a role in assisted assembly.
Arachne features many different types of data structures to represent alignments. In a particular situation, the choice of data structure depends on the type of sequences being aligned and the time and memory requirements. A simple representation of an alignment may include pointers to each sequence and integers representing the indices on each sequence at which the alignment begins and ends. (The aligned stretch may not be the same length on the two sequences, due to indels.) More detailed data structures include the precise locations of indels and mismatches, as well as information about the aligned sequences.
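The simple representation described above can be sketched as follows. This is a minimal illustration in Python, not Arachne's own code: the class name SimpleAlign, its field names, and the use of 0-based, end-exclusive indices are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleAlign:
    """Hypothetical minimal alignment record (not an Arachne class)."""
    seq1: str    # reference to the first sequence
    seq2: str    # reference to the second sequence
    start1: int  # 0-based start of the aligned stretch on seq1
    end1: int    # exclusive end of the aligned stretch on seq1
    start2: int  # 0-based start of the aligned stretch on seq2
    end2: int    # exclusive end of the aligned stretch on seq2

    def __post_init__(self):
        # Reject intervals that fall outside their sequence.
        if not (0 <= self.start1 <= self.end1 <= len(self.seq1)):
            raise ValueError("interval on seq1 is out of range")
        if not (0 <= self.start2 <= self.end2 <= len(self.seq2)):
            raise ValueError("interval on seq2 is out of range")

    def length1(self) -> int:
        return self.end1 - self.start1

    def length2(self) -> int:
        return self.end2 - self.start2

    def aligned1(self) -> str:
        return self.seq1[self.start1:self.end1]

    def aligned2(self) -> str:
        return self.seq2[self.start2:self.end2]


# Two made-up reads: read_b lacks one base that read_a has (an indel),
# and each read has two unaligned bases hanging off one end.
read_a = "TTACGTACGTAA"
read_b = "ACGTCGTAAGG"
aln = SimpleAlign(read_a, read_b, start1=2, end1=12, start2=0, end2=9)
print(aln.aligned1(), aln.length1())
print(aln.aligned2(), aln.length2())
```

Because the record stores only endpoints, it cannot say where the indel lies inside the aligned stretch; that is the extra information the more detailed data structures carry.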
Read-read alignments are calculated in pre-processing, in the ReadsToAligns module series. They are stored in the file aligns.total2, which is then read in by FilterAlignments in VecAlignmentPlus and CAlignmentAccess format. Read-read alignments are necessary for the overlap and consensus stages of the assembly process. Some of them, however, are spurious: they arise from overlaps between copies of a repeat rather than from true overlaps in the genome.
A lookup table can be used to describe a set of alignments between two large sequences, typically between a draft assembly and a reference assembly. Creating a lookup table is the first step in assisted assembly, in which alignments with the known genome of a related species are used to assemble a new genome.
Other data structures that represent alignments include ReadLocation, LookAlign, PackAlign, Tiling, t_align, and nobbits. File formats that describe alignments include look_align and ECF. Other modules to find alignments include NqsAligns, UniquifyAligns, and AlignTwoBasevectors.
Alignment algorithms may be global or local, depending on whether the sequences are aligned over their full length or only over their best-matching subsequences. They may also be gapped or ungapped, depending on whether their scoring procedures allow for gaps such as indels. The Needleman-Wunsch algorithm is a gapped global alignment algorithm. The Smith-Waterman algorithm (used in Rebuilder and QueryLookupTable) is a gapped local alignment algorithm. The BLAST algorithm, available online at the NCBI website (http://www.ncbi.nlm.nih.gov/blast/Blast.cgi), is a heuristic local alignment algorithm; its original version was ungapped, while later versions also produce gapped alignments.
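The difference between global and local gapped alignment can be seen by comparing the two dynamic-programming recurrences side by side. The sketch below is a simplified illustration in Python, not the code used in Rebuilder or QueryLookupTable: it computes scores only (no traceback), and the function names, the linear gap penalty, and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of a against all of b (global)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # match or mismatch
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings of a and b (local)."""
    best = 0
    prev = [0] * (len(b) + 1)  # borders are 0: an alignment may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,                   # restart the alignment here
                       prev[j - 1] + sub,
                       prev[j] + gap,
                       curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)          # an alignment may end anywhere
        prev = curr
    return best


# Made-up sequences that share only the core "ACGT".
x = "TTTACGTGGG"
y = "CCACGTCC"
print(needleman_wunsch_score(x, y))
print(smith_waterman_score(x, y))
```

The only differences are the zero floor, the zero borders, and taking the maximum over all cells instead of the final cell. On the example pair the local score rewards the shared core alone, while the global score is dragged down by the unrelated flanking bases.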
Another type of alignment is a multiple sequence alignment (MSA), a combined alignment of three or more sequences. The Clustal algorithm, available online at http://align.genome.jp/, performs MSA.