Highest coverage paths now used in scaffolds

DISCOVAR de novo produces several output files, including a file of scaffolds, a.lines.fasta, in which a single path through each genomic locus is shown even when multiple paths are possible (for one of several reasons, including polymorphism). (See “Edges, lines and scaffolds”.) This ‘flattened’ representation of the assembly loses information but has the advantage that it is FASTA and so can be processed by standard tools. As of revision 51386, the paths chosen are those with the highest coverage. The choice is completely arbitrary in cases of bona fide polymorphism, but it helps where an assembly bubble arises from sequencing difficulty, leaving it uncertain which bubble branch is correct. In such cases, and in cases of ‘minor alleles’ in bacterial cultures, choosing the highest coverage branch makes sense.
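To illustrate the idea of flattening by coverage, here is a minimal Python sketch. It is not DISCOVAR's implementation: the Branch class, the pick_branch and flatten functions, and the toy data are all hypothetical, and the sketch simply concatenates sequences, ignoring the overlaps between adjacent edges that a real assembly graph would have.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """One alternative path through a locus (hypothetical structure)."""
    sequence: str
    coverage: float


def pick_branch(branches):
    """Return the branch with the highest coverage.

    Ties go to the first branch listed, since max() keeps the first maximum.
    """
    if not branches:
        raise ValueError("a locus needs at least one branch")
    return max(branches, key=lambda b: b.coverage)


def flatten(line):
    """Concatenate one chosen sequence per locus.

    `line` is a list of loci; each locus is a list of alternative branches.
    A locus with a single branch is unambiguous; one with several is a bubble.
    """
    return "".join(pick_branch(locus).sequence for locus in line)


# Toy example: the middle locus is a bubble with two branches.
line = [
    [Branch("ACGT", 30.0)],
    [Branch("A", 18.0), Branch("G", 12.0)],
    [Branch("TTGA", 29.5)],
]
print(flatten(line))  # prints ACGTATTGA
```

In the toy data the bubble resolves to the branch with coverage 18 rather than 12. For a true heterozygous site the two coverages would be close, which is why the choice there is effectively arbitrary.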

Fragment library size distribution plots

Each DISCOVAR de novo assembly now comes with a plot, in the file frags.dist.png, showing the observed size distribution for the fragments defined by the input read pairs. These plots can be highly diagnostic. They are available from revision 51298 onwards. The raw data are in the file frags.dist.

DISCOVAR paper published

Our manuscript “Comprehensive variation discovery in single human genomes” is now available as an advance online publication from Nature Genetics. This paper describes our assembly and variant calling algorithm DISCOVAR, which is able to find many novel variants missing from standard call sets. DISCOVAR is available for download now, and we encourage you to try it out. For de novo assembly without variant calling, see our other package: DISCOVAR de novo.

Thread control for DISCOVAR de novo

You can now limit the maximum number of threads DISCOVAR de novo uses with the new option NUM_THREADS (release 51183). This is useful if you have to share your hardware, or if your system administrator has limited the number of threads a single process can use. It can also be a good idea to restrict the number of threads if your hardware has many cores (more than 50), as the parallelization efficiency can start to drop due to locking and cache coherency issues.
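As a rough sketch of how a wrapper script might choose a value for this option, the Python helper below caps the thread count at the smaller of a user-chosen limit and the number of cores the machine reports. The helper name num_threads_arg is hypothetical, and the KEY=VALUE form of the argument is an assumption about the command-line syntax; only the option name NUM_THREADS comes from the release.

```python
import os


def num_threads_arg(cap, available=None):
    """Build a NUM_THREADS argument string (KEY=VALUE form is assumed).

    cap:       the most threads you are willing to use (must be >= 1)
    available: cores on the machine; detected with os.cpu_count() if omitted
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    if available is None:
        # os.cpu_count() may return None when the count cannot be determined
        available = os.cpu_count() or 1
    return f"NUM_THREADS={max(1, min(cap, available))}"


# On a 64-core machine shared with other users, request at most 16 threads.
print(num_threads_arg(16, available=64))  # prints NUM_THREADS=16
```

The resulting string would be appended to the usual DISCOVAR de novo command line alongside the other arguments.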

Native support added for BAMs in DISCOVAR de novo

The latest release (50893) of DISCOVAR de novo supports BAM files directly and no longer requires SAMtools to be installed. This change has the added benefit of halving the time required to import data from a BAM, potentially saving hours on a human-sized genome. Note that the original variant calling version of DISCOVAR still requires SAMtools in order to work.

Peak memory improvements

Thanks to new data compression methods, our latest DISCOVAR de novo release (50963) reduces the peak memory required for large genomes. The exact peak memory required depends on a number of factors, including the nature of the genome and the quality of the data, but is mostly determined by coverage and genome size.