In the context of whole genome shotgun assembly, a gap is a stretch of unknown base pairs between two known sequences. The gaps examined most often are those between adjacent contigs in a supercontig. Gaps may also occur within a contig (see assembly_supers.fasta).

The most important feature of a gap is its size (the gap size), which is usually not known exactly and is therefore expressed as an expected value with a standard deviation. Note that gap sizes may be negative: a negative gap size means that the two contigs overlap but have not been merged. This case must be admissible because of the possibility of polymorphism.

Gaps are found by examining paired production reads whose two reads fall on different contigs. Comparing the reads' locations in their contigs against the pair's insert size gives a statistical estimate of the gap size. If a pair of contigs is joined by several read pairs, the gap size can be estimated more precisely.
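
The estimate described above can be sketched in Python. This is only an illustration under stated assumptions, not Arachne's actual method: each read pair is assumed to imply a gap equal to the library's mean insert size minus the parts of the insert that lie on the two contigs, and the per-pair estimates are combined by inverse-variance weighting. The names Link, estimate_gap, dist_to_end_a and dist_from_start_b are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Link:
    """One read pair joining contig A to contig B (hypothetical structure)."""
    insert_mean: float       # expected insert size of the pair's library
    insert_sd: float         # standard deviation of that insert size
    dist_to_end_a: int       # bases from the insert's start on A to the end of A
    dist_from_start_b: int   # bases from the start of B to the insert's end on B


def estimate_gap(links):
    """Return (expected gap size, standard deviation) for one contig pair.

    Assumption: the per-link estimates are independent, so they can be
    combined with inverse-variance weights.
    """
    if not links:
        raise ValueError("at least one linking read pair is required")
    weight_sum = 0.0
    weighted_gap_sum = 0.0
    for link in links:
        # Whatever part of the insert is not covered by the contigs is gap.
        implied_gap = link.insert_mean - link.dist_to_end_a - link.dist_from_start_b
        weight = 1.0 / (link.insert_sd ** 2)
        weight_sum += weight
        weighted_gap_sum += weight * implied_gap
    # More links give a larger weight_sum and therefore a smaller deviation.
    return weighted_gap_sum / weight_sum, math.sqrt(1.0 / weight_sum)


links = [Link(4000, 400, 1500, 2000), Link(4000, 400, 1800, 1900)]
gap, sd = estimate_gap(links)
print(round(gap), round(sd))
```

In this sketch the two links imply gaps of 500 and 300 bases, so the combined estimate lies between them, and its standard deviation is smaller than that of either link alone. A link whose insert covers more contig sequence than the insert size allows, for example Link(4000, 400, 2500, 1700), yields a negative gap, that is, an implied overlap.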

In Arachne, gaps are usually represented by a pair of integers: the estimated size and the standard deviation. See, for example, the superb module. When gaps appear in fasta files, they are represented by N; note that each N is translated into a random base when the fasta is converted to fastb.
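
As a rough illustration of the fasta representation, the following Python sketch locates runs of N in a sequence string. It is not part of Arachne, and the function name find_gaps is hypothetical. It also assumes that each run of N corresponds to one gap. A run of N can convey at most a size, so the standard deviation is not recoverable this way, and the sketch makes no claim about how negative gaps are written.

```python
import re


def find_gaps(sequence):
    """Return (start, length) for every run of N in a sequence string.

    Positions are 0-based. Lower-case n is treated as N, which is an
    assumption about the input rather than a documented convention.
    """
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[Nn]+", sequence)]


print(find_gaps("ACGTNNNNACGTNNAC"))
```

Because the N characters become random bases in fastb, a scan like this is only meaningful on the fasta form of the data.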

Alternate definitions

The word "gap" may also be used to refer to a deletion in an alignment, especially in the context of alignment algorithms.

