Parallel processing of samples can result in low-level cross-contamination. Sometimes there is enough contaminating sequence to assemble, especially when the samples are sequenced at high coverage. We’ve added a program,
CrossOut, that can remove most of this contamination from parallel DISCOVAR de novo assemblies by looking for improbable molarity differences. It takes a single argument,
DIR, the parent directory for the assemblies, and creates a new assembly directory,
a.clean, within each.
We’ve added some new assembly statistics to DISCOVAR de novo. These are in the file stats in
a.final and are mirrored in standard output. Together with the file
frags.dist.png, they are often diagnostic.
DISCOVAR de novo produces several output files, including a file of scaffolds,
a.lines.fasta, in which a single path through each genomic locus is shown, even when multiple paths are possible (for one of several reasons, including polymorphism). (See “Edges, lines and scaffolds”.) This ‘flattened’ representation of the assembly loses information, but it has the advantage of being FASTA, so it can be processed by standard tools. As of revision 51386, the paths used are those having the highest coverage. This choice is completely arbitrary in cases of bona fide polymorphism, but it is helpful where an assembly bubble occurs because of sequencing difficulty, making it uncertain which bubble branch is correct. In such cases, and in cases of ‘minor alleles’ in bacterial cultures, choosing the highest-coverage branch makes sense.
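The selection rule can be illustrated with a minimal sketch. This is not DISCOVAR de novo's code: the `Branch` type, its fields, and the functions `pick_branch` and `flatten` are hypothetical names, and coverage is assumed to be available as a single number per branch.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """One path through a bubble (hypothetical representation)."""
    sequence: str
    coverage: float  # assumed: mean read coverage of this branch


def pick_branch(branches):
    """Return the branch with the highest coverage.

    Ties go to the first branch listed, which is as arbitrary as the
    choice between two alleles of a true polymorphism.
    """
    if not branches:
        raise ValueError("a bubble needs at least one branch")
    return max(branches, key=lambda b: b.coverage)


def flatten(segments):
    """Join the chosen branch of each segment into one sequence string.

    Each segment is a list of branches; an unambiguous stretch is a
    list of length one.
    """
    return "".join(pick_branch(s).sequence for s in segments)
```

Under these assumptions, a locus with a low-coverage branch (say, a sequencing artifact or a minor allele) and a high-coverage branch is flattened to the high-coverage one, while a heterozygous site with near-equal coverage is resolved essentially at random.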
Each DISCOVAR de novo assembly now comes with a plot, in the file
frags.dist.png, showing the observed size distribution for the fragments defined by the input read pairs. These plots can be highly diagnostic. They are available from revision 51298 onwards. The raw data are in the file
We found a serious bug in DISCOVAR de novo, revision 50862, resulting in data corruption in some cases. We will correct this bug as soon as possible. In the meantime we are rolling back to revision 50693. And needless to say, we are beefing up our release testing.
DISCOVAR and DISCOVAR de novo take as input read pairs from fragments of size 400-500 bp, with some larger and some smaller. The blog and manual contained references to fragments of size 700 bp, which were outdated, and have now been removed. Note that the protocol yields a wide size distribution, including some large fragments.
We’ve added a key purification step: “A second 0.7x SPRI clean up was performed following adapter ligation to remove adapter dimers and library fragments below ~150 bp in size.” This improves performance.
Revision 46631 contains a number of algorithmic improvements. In particular, there are now fewer ‘false’ bubbles in assemblies. These arise particularly in assemblies at high (> 60x) coverage, as one might have for a bacterium, for example. Such bubbles have substantial support on both branches, but examining the quality score distributions of the read bases associated with each branch allows DISCOVAR to kill off one of them.
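To illustrate the idea only, here is a minimal sketch of pruning a two-branch bubble from base qualities. The actual test used by DISCOVAR is not described here; the mean-quality comparison, the `min_gap` threshold, and all names below are assumptions made for the example.

```python
from statistics import mean


def prune_bubble(quals_a, quals_b, min_gap=10.0):
    """Decide which branch of a two-branch bubble to drop, if any.

    quals_a, quals_b: Phred quality scores of the read bases that
    support each branch (assumed inputs). Returns "a" or "b" for the
    branch to drop, or None to keep both.
    """
    if not quals_a or not quals_b:
        return None  # no evidence to compare
    gap = mean(quals_a) - mean(quals_b)
    if gap >= min_gap:
        return "b"  # branch b rests on much lower-quality bases
    if gap <= -min_gap:
        return "a"  # branch a rests on much lower-quality bases
    return None  # similar qualities: plausibly a real difference
```

The point of the sketch is that read count alone cannot separate the branches at high coverage, since both are well supported; a systematic difference in base quality is what marks one branch as a likely artifact.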