Polymorphism is a phenomenon in which a genome contains two or more versions of the same sequence or set of sequences. Polymorphism arises due to the multiplicity of chromosomes in a genome: the chromosomes are similar but not identical, and a particular read may be sequenced from any of the similar chromosomes. The different versions are called haplotypes.
The extent of polymorphism in a genome depends on the ploidy. Most eukaryotic species are diploid, meaning that there are two copies of each chromosome; the chromosome pairs split during meiosis and are then recombined during sexual reproduction. Prokaryotic species are generally haploid, with only one copy of each chromosome, and therefore exhibit no polymorphism. Inbred populations (such as the laboratory mouse) are also effectively haploid due to the lack of genetic diversity.
Types of polymorphism
Before getting into the question of how to actually assemble polymorphic genomes, we want to draw attention to the fact that there are very different incarnations:
In addition, DNA sequence can be very different in some cases, e.g. in some sexual fungi (like C.albicans), one allele contains the male version of a gene, and the other one the female version.
Also, the extent of polymorphic events -- the polymorphism rate -- must be taken into account: high SNP density and/or short indels poses one kind of challenge, larger indels and re-arrangements another (since then certain regions are only represented at half coverage in the data set and half the inserts linking across this region will appear stretched or illogical).
Lastly, heterozygosity is not typically distributed evenly in most genomes. In the fish Stickleback, for example, the overall polymorphism rate is an estimated 0.8%, but long stretches appear to be homozygous, so that the remaining parts are polymorphic at a much higher rate.
Assembling polymorphic genomes
When assembling (and representing) a polymorphic genome, there is a major question. Should the haplotypes be represented as two completely distinct sequences? Or should the genome be represented as a single unit, with short branches and splits for the haplotypes? In the former case, connectivity is a hindrance. More read coverage means larger contigs and scaffolds; this suggests that the assembly should be squeezed together, both haplotypes assembled into one, even though this is (quite literally) sometimes a stretch. On the other hand, experience (e.g. C. albicans, L. elongisporus) showed that it can pose enormous difficulties to gene calling when the consensus sequence is derived from a mishmash of both haplotypes: you get everything from false frameshifts to false premature stop codons, even looking for real ORFs becomes impossible.
Ideally, one would like to represent both haplotypes seperately, but this is in most cases not possible. Therefore, in Arachne, we tried to find a compromise.
In our experience, it is a good idea to start with an initial assembly, run with default parameters. Depending on the genome, Assemblez's built-in tolerance for polymorphism might yield an acceptable draft assembly (which might need some attention for other reasons). Even in the worst case, it will provide a good starting point for examining the genome and decide which subsequent operations to perform. Perhaps the first thing to look at is the repeat structure: if, due to high polymorphism, the haplotypes ended up assembled in separate contigs or scaffolds, these will appear as repeats (which in some sense, they are), at a multiplicity of 2.
In order to extend scaffolds, scaffolding requires contigs to align if insert links suggest that there is an overlap before making joins. If sequence is present in one allele but missing in the other, this can prevent joins from being made. Also, if the haplotypes are two far diverged in certain hotspots, alignments may exist but are rejected as not being good enough. However, having good scaffolds is prerequisite to having good contigs, since contig merges rely on the fact that they are well ordered relative to each other. So, tuning the various parameters to be more lenient should be a good start.
To be even more aggressive, the module MergeHaplotypes will join scaffolds not based on insert linking but strictly on contig-to-contig alignments.
After the scaffold structure is in place, the obvious thing to do is to merge sequence from both haplotypes; As a first step, the module SquashOverlaps takes care of contigs that are entirely subsumed in another by removing this contig and aligning its reads to the subsuming contig with very lenient parameters. A subsequent run of contig gap stuffing (i.e. various forms of merging or bridging gaps between contigs, see for details in that section) will succeed in increasing the connectivity of the assembly.
However, when doing this, there are two things that need to be monitored very carefully:
- Read usage
- Clusters of SNPs and indels
If, in the process, overall read usage drops dramatically when going from an unmerged assembly to a merged one, then this might be a sign that there are too few reads to hold scaffolds together; running BreakUnlinked will break scaffolds in spots that are not spanned by insert linking.
Clusters of disagreements can have bad side effects when computing contig consensus, since there might be many crosses between the two haplotypes too close together, which will have negative effects on e.g. gene calling (see Frankentyping).
When merging haplotypes, it is easily possible to end up with consensus sequence that is too artificial a mix: when making transitions within genes, false frame shifts or stop codons might apper. In C. tropicalis, for example, a diploid fungus, there were 994 genes affected that could not be identified correctly as functional genes. After "separating" the haplotypes (a process we call "frankentyping" since we still have a mix of both haplotypes), this number went down to 116.
So, how to separate the haplotypes? As a first step, we remove all read pairs in which at least one end read has high-quality disagreements with other overlapping reads. We then use a greedy algorithm to place back reads avoiding high-quality disagreements, thus populating the contigs but increasing the distance between spots in which we cross the haplotypes. Re-computing the consensus then will be a better reflection of what one haploid genome could look like.
As a last step, all reads that were not placed back can be assembled again, allowing for some estimate of the second haplotype relative to the 'main' assembly.
Estimating Overall Polymorphism
The polymorphism rate (and nature) between the two haplotypes of a diploid genome can vary quite dramatically, so it is desirable to have a tool which can assess and quantify overall statistics for the entire genome. The relevant information to be used is:
- Where are the repeats (output of TagRepeats)
- Where are the polymorphic events (output of HQDAnnotator)
PolymorphismEstimator combines this information: since we do not want to have an artificially high rate arising from misassembled repetitive regions, we exclude all regions that are marked as repeats or 48 bases away from repeats. Next, we require at least three reads to (consistently with each other) disagree with consensus, where the disagreement can be a single nucleotide polymorphism (SNP) or a short indel (less than 42 bases).
Note: even for haploid genomes, there is some background noise due to either systematic [[sequencing error]s, falsely corrected reads, or other artifacts. This background is typically around 0.001% (or one in 100,000 bases), consistent with an average base quality of about 50.
This metric, however, only captures polymorphism on a very local scale, ignoring larger indels or inversions. In addition, if the polymorphism rate is very high in certain regions of the genome (~3% and up), haplotypes tend to be assembled into separate scaffolds. To examine these events, we recommend looking at the repeat structure).
Note: polymorphism tends to not be evenly distributed throughout the genome. There might be long homozygous stretches and clusters of SNPs and indels in other regions.