An optical map or restriction map is a particular type of visualization of a completed assembly. In practical terms, an optical map estimates the distances between restriction sites (DNA molecules are digested with sequence-specific restriction endonucleases) which can be easily found in the assembled scaffolds, allowing for alignments between these scaffolds and the map.
Optical maps have been proven useful in finding misassemblies, assigning scaffolds to linkage groups, and improving the connectivity of highly repetitive genomes (see also: the assembly of Rhizopus oryzae). However, there are some challenges:
- If two or more restriction sites are too close to each other (less than 2 kbp), the sites will show up in the assembly but tend to not be present in the map
- Contig gaps mask possible restriction sites
- If scaffolds are wrongly joined, one needs to allow for partial (i.e. non full-length) alignments
- Repeat clusters that were shuffled during assembly but reflect the overall structure correctly might not align very well with the map
First, we look for all restriction sites in the consensus sequence on a per-scaffold basis and create a list of distances between them (the scaffold map), analogous to the optical map. For each scaffold map, we then look at overlapping windows of 12 (or less for smaller scaffolds) sites and align these to the entire optical map using a dynamic programming algorithm. The scoring function used is
||optical distance - scaffold distance|| / (optical distance + scaffold distance)
plus we allow for skipping n sites on both the scaffold and optical map by adding a penalty of n * 0.25. We then compare the total scores for all alignments and pick the one with the lowest score. If, however, the best alignment does not contain at least 6 distances that have a score of less than 0.2 and are contiguous in groups of at least 3, we discard the alignment.
Once all the partially overlapping alignments are computed, we merge alignments by overlaying them; each alignment has one vote for each restriction site on the scaffold that it aligned, the majority vote determines the place in the optical map to which the scaffold gets mapped. Finally, now that we defined the boundaries for the alignments, we re-align the scaffold to the map within these boundaries to get a better score and to confirm that we indeed found the correct placement.
Modules to map scaffolds to linkage groups
OptiMapToAgp takes an assembly, the optical map (in Arachne format - maps from OpGen can be converted by running OpgenMap2Arachne) and produces an AGP file as output, which orderes and orients alignable scaffolds to the linkage group specified by the map. Optionally, OptiMapToAgp writes out a new assembly where scaffolds are joined if the map suggests so (so, there will be spots that are not linked by inserts). For a more detailed alignment, run OptiDetailedAligns.
Recommended procedure (our experience)
- Run OpgenMap2Arachne with the original map as input and generate a file /PRE/DATA/RUN/opticalMap/optimap.simple.
- Note: in most cases, the linkage group containing the rDNA will be present as two separate groups, e.g. chr02a and chr02b. You might want to combine this into one single linkage group, since the assembly might extend across the rDNA (which will be overcollapsed).
- IMPORTANT: find out which way to join and if one group needs to be reversed.
- Run OptiDetailedAligns with the SUBDIR and cutting site DNA sequence as input. Save the log file (stdout).
- Check for misjoins in the assembly (these will appear as incomplete alignments) and fix them. Back to step 2 until all issues are resolved.
- Find all scaffolds that have alignments with '_rc' (or really reversed, since the cutting site sequence is typically a palindrom) and use SuperFlipper to reverse these scaffolds so that only forward alignments remain. Trust us, doing this extra flipping will save you from a lot of pain and suffering further down the road!
- Run OptiDetailedAligns to generate the final alignments.