Clarification of DISCOVAR input requirements

DISCOVAR takes as input read pairs from fragments of size 400-500 bp, with some larger and some smaller. The blog and manual contained references to fragments of size 700 bp, which were outdated, and have now been removed. Note that the protocol yields a wide size distribution, including some large fragments.

Understanding DISCOVAR output

A DISCOVAR de novo assembly is a graph. A typical assembly consists almost entirely of linear stretches, which we call ‘lines’. Lines provide a rich data type that captures polymorphism and other important features. Further, with some loss of information, these lines may be ‘flattened’ into standard contigs. We have added a tutorial explaining how these data types are made available as part of the DISCOVAR output. We are also interested in hearing your thoughts on the utility of these output types and on others that might be useful to you.
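
To make the idea of flattening concrete, here is a minimal sketch in Python. It is not DISCOVAR's file format or code: the representation of a line as a list of cells, and the names `Line`, `flatten` and `count_bubbles`, are assumptions made purely for illustration. Each cell holds one or more alternative sequences, and flattening keeps a single alternative per cell, which is where the loss of information comes from.

```python
from typing import List

# Hypothetical model of a 'line': a list of cells, each holding one or more
# alternative sequences. One alternative = unambiguous stretch; several = a bubble.
# This layout is an assumption for illustration, not DISCOVAR's actual output format.
Line = List[List[str]]


def flatten(line: Line) -> str:
    """Collapse a line into one contig by keeping the first alternative of each cell."""
    return "".join(cell[0] for cell in line)


def count_bubbles(line: Line) -> int:
    """Count cells that carry more than one alternative (information lost on flattening)."""
    return sum(1 for cell in line if len(cell) > 1)


# Toy line with two bubbles: a SNP (A/G) and a one-base indel (CC/CCC).
example: Line = [["ACGT"], ["A", "G"], ["TTGA"], ["CC", "CCC"], ["GG"]]

contig = flatten(example)
print(contig)                  # the flattened contig
print(count_bubbles(example))  # number of bubbles collapsed
```

In this toy model the flattened contig records only one branch of each bubble, so the alternative alleles are no longer visible. A real assembly graph also has to deal with overlaps between adjacent edges, which this sketch ignores.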

HiSeq 2500 data quality

We have been asked what our DISCOVAR input data look like, and the best way to answer this question is with some examples. We don’t claim that these data are necessarily representative, but they do illustrate what we are able to generate here at the Broad Institute.

DISCOVAR performance tips

Are you getting the most out of your hardware when running DISCOVAR?
Take a look at our Computational Performance tips – they could help you get more bang for your computational buck!

DISCOVAR is a heavily multithreaded and memory intensive tool that will push your machines hard. Configuring your hardware to get the best performance isn’t straightforward, but with the right settings you may see significant improvements. After much experimentation and investigation, and with help from fellow DISCOVAR users, we have prepared a set of tips. We’ll continue to add and update them as we learn more, and we would like to hear about your experiences via our forum.

Cleaner assemblies

Revision 46631 contains a number of algorithmic improvements. In particular, there are now fewer ‘false’ bubbles in assemblies. These arise particularly in assemblies at high (> 60x) coverage, as one might have for a bacterium, for example. Such bubbles have substantial read support on both branches, but examining the quality score distributions of the read bases associated with each branch allows DISCOVAR to eliminate one of them.
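
As a rough illustration of the idea (not the actual DISCOVAR algorithm), the sketch below compares the base quality scores supporting each branch of a bubble and flags the weaker branch when the two distributions differ clearly. The function name `weaker_branch`, the use of the median as a summary statistic, and the `min_gap` threshold of 10 are all assumptions chosen for this example.

```python
from statistics import median
from typing import List, Optional


def weaker_branch(quals_a: List[int], quals_b: List[int],
                  min_gap: int = 10) -> Optional[str]:
    """Return "a" or "b" for the branch to drop, or None to keep both.

    quals_a / quals_b: quality scores of the read bases supporting each branch.
    min_gap: hypothetical threshold on the difference between median qualities.
    """
    if not quals_a or not quals_b:
        return None  # no evidence on one side: make no call in this sketch
    med_a, med_b = median(quals_a), median(quals_b)
    if med_a - med_b >= min_gap:
        return "b"  # branch b is supported by clearly lower-quality bases
    if med_b - med_a >= min_gap:
        return "a"  # branch a is supported by clearly lower-quality bases
    return None     # distributions too similar: keep both (a 'true' bubble)


# Both branches are well covered, but branch b rests on low-quality bases.
print(weaker_branch([38, 37, 39, 40], [12, 15, 9, 20]))
# Comparable quality on both branches: nothing is removed.
print(weaker_branch([35, 36, 37], [34, 36, 38]))
```

The point of the sketch is that coverage alone cannot separate the two cases at high depth, whereas the quality of the supporting bases can.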

Choosing the right hardware

Would you like to help us benchmark servers?

We are contemplating server purchases and would like to get the most bang for our buck. We imagine that some of you are in the same situation. Therefore, to share intelligence, we are creating a table that shows DISCOVAR performance stats, along with server configuration information. Please take a look at the current benchmark table, which we will continue to update as we get more results. Better yet – why not participate by benchmarking your systems and sharing the results with us?

Assessing assembly methods

For purposes of assessing our assemblies and variant calls, we generated some NA12878 clone reference sequences.  We believe that these data will be of interest to the community and have therefore decided to make them available to all. These clone sequences and the raw data used to generate them can be found on our FTP site.

The sequences were obtained by randomly selecting ~100 clones from an NA12878 Fosmid library.  Two pools of ~50 each were created, then sequenced by MiSeq (250 bases) and PacBio (~3000 bases).  There are also some jumps.

We completely assembled 103 clones, without ambiguity, in some cases with manual intervention.  Cloning vector has been removed.  There are a small number of additional clones in the pools, not included in the assemblies, including a few that had low coverage, some EBV, and some centromeric sequence.

This is version 1.0 of the set. We believe that the error rate on the clones is very low; however, we are carrying out laboratory validation and will roll out updated versions as the results come back.

This work is supported by NHGRI grants.