Memory limit added

DISCOVAR de novo now has an argument MAX_MEM_GB that can be used to limit memory usage to roughly the given amount. This can be useful on very large shared-memory systems.
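
As a rough illustration of how a limit might be chosen and passed on a shared machine, the sketch below derives a value from the machine's total memory and builds a command line. Only MAX_MEM_GB comes from this post. The executable name DiscovarDeNovo, the READS and OUT_DIR arguments, the NAME=value form, and both helper functions are assumptions made for the example, so check them against the manual for your installed version.

```python
def choose_max_mem_gb(total_gb, fraction=0.5):
    """Return a whole-number memory limit equal to `fraction` of `total_gb`.

    The fraction is an arbitrary illustrative policy, not a DISCOVAR default.
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Round down, but never go below 1 GB.
    return max(1, int(total_gb * fraction))


def build_command(reads, out_dir, max_mem_gb):
    """Build an argument list suitable for subprocess.run (not executed here)."""
    return [
        "DiscovarDeNovo",            # assumed executable name
        f"READS={reads}",            # assumed argument
        f"OUT_DIR={out_dir}",        # assumed argument
        f"MAX_MEM_GB={max_mem_gb}",  # argument described in this post
    ]


if __name__ == "__main__":
    # Example: use a quarter of a 1024 GB shared-memory machine.
    limit = choose_max_mem_gb(total_gb=1024, fraction=0.25)
    print(" ".join(build_command("reads.bam", "assembly_out", limit)))
```

Because the limit is only approximate, it may be wise to leave some headroom below what is actually free on the machine.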

New DISCOVAR de novo stats

We’ve added some new assembly statistics to DISCOVAR de novo. They are written to the file stats and mirrored in standard output. Together with the file frags.dist.png, they are often diagnostic.

Highest coverage paths now used in scaffolds

DISCOVAR de novo produces several output files, including a file of scaffolds, a.lines.fasta, in which a single path through a genomic locus is shown even when multiple paths are possible (for one of several reasons, including polymorphism). (See “Edges, lines and scaffolds”.) This ‘flattened’ representation of the assembly loses information but has the advantage that it is FASTA and so can be processed by standard tools. As of revision 51386, the paths chosen are those with the highest coverage. This choice is completely arbitrary in cases of bona fide polymorphism, but it is helpful where an assembly bubble arises from sequencing difficulty, making it uncertain which branch of the bubble is correct. In such cases, and in cases of ‘minor alleles’ in bacterial cultures, choosing the highest-coverage branch makes sense.
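
The selection rule can be sketched as follows. This is a toy model, not DISCOVAR's code: a line is represented as a list of cells, each cell holding one or more alternative branches, and the `Branch` type, its fields, and the function names are all hypothetical. Flattening keeps the highest-coverage branch in each cell and concatenates the results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Branch:
    """One alternative path through a cell (hypothetical structure)."""
    sequence: str
    coverage: float


def pick_branch(cell: List[Branch]) -> Branch:
    """Return the branch with the highest coverage; ties go to the first one."""
    if not cell:
        raise ValueError("cell has no branches")
    return max(cell, key=lambda b: b.coverage)


def flatten_line(line: List[List[Branch]]) -> str:
    """Flatten a line into one sequence by choosing one branch per cell.

    Simplification: sequences are joined end to end, so any overlap between
    adjacent sequences in a real assembly graph is ignored here.
    """
    return "".join(pick_branch(cell).sequence for cell in line)


if __name__ == "__main__":
    line = [
        [Branch("ACGT", 30.0)],                      # unambiguous stretch
        [Branch("A", 18.0), Branch("G", 12.0)],      # bubble with two branches
        [Branch("TTCA", 29.0)],                      # unambiguous stretch
    ]
    print(flatten_line(line))
```

With the example data the bubble resolves to the branch with coverage 18, so the script prints ACGTATTCA. The branch that was not chosen is simply dropped, which is the information loss described above.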

Clarification of DISCOVAR input requirements

DISCOVAR and DISCOVAR de novo take as input read pairs from fragments of size 400-500 bp, with some larger and some smaller. The blog and manual contained outdated references to fragments of size 700 bp, which have now been removed. Note that the protocol yields a wide size distribution, including some large fragments.

Understanding DISCOVAR output

A DISCOVAR de novo assembly is a graph. A typical assembly consists almost entirely of linear stretches, which we call ‘lines’. These provide a rich data type that captures polymorphism and other important features. Further, with some loss of information, these lines may be ‘flattened’ into standard contigs. We have added a tutorial explaining how these data types are made available as part of the DISCOVAR output. We are also interested in hearing your thoughts on the utility of these output types and on others that might be useful to you.

HiSeq 2500 data quality

We have been asked what our DISCOVAR input data looks like, and the best way to answer this question is with some examples. We don’t claim that these data are necessarily representative, but they do illustrate what we are able to generate here at the Broad Institute.

DISCOVAR performance tips

Are you getting the most out of your hardware when running DISCOVAR?
Take a look at our Computational Performance tips – they could help you get more bang for your computational buck!

DISCOVAR is a heavily multithreaded and memory intensive tool that will push your machines hard. Configuring your hardware to get the best performance isn’t straightforward, but with the right settings you may see significant improvements. After much experimentation and investigation, and with help from fellow DISCOVAR users, we have prepared a set of tips. We’ll continue to add and update them as we learn more, and we would like to hear about your experiences via our forum.