DISCOVAR can now be freely used without restriction in both non-academic and academic settings under the terms of our new license. We still encourage users to register with us if they find DISCOVAR useful.
We’ve just added some examples to our online demo to help you explore our DISCOVAR de novo assembly of NA12878. You can select a region of interest from the new drop-down menu and it will be displayed below. Alternatively, enter the coordinates of your favorite region of the genome if you want to explore on your own.
Want a sneak preview of what we’ve been working on lately? Then check out this online demo that lets you explore a de novo human assembly produced by our new assembler DISCOVAR de novo.
Developed over the past 6 months, the new DISCOVAR de novo algorithm will be released later this summer. Unlike DISCOVAR, it can assemble large genomes de novo. It is also much faster, but still takes the same low-cost single-library input data that DISCOVAR does.
Whilst we prepare DISCOVAR de novo for release, take a look at the online demo we’ve set up. Here you can explore and visualize an assembly of the human cell line NA12878. You can enter any coordinates on the human reference sequence GRCh38, and the demo will show you the part of the assembly that aligns there. Using this tool, large structural variation events can be directly visualized, and simple SNPs appear as short bubbles.
Please check it out and let us know what you think via the forum.
We have been asked what our DISCOVAR input data looks like, and the best way to answer this question is with some examples. We don’t claim that these data are necessarily representative, but they do illustrate what we are able to generate here at the Broad Institute.
DISCOVAR now generates variant lists using the Variant Call Format (VCF). This is the standard used by the community and is supported by many tools. Whilst the VCF file contains all events found by DISCOVAR, the complementary .variant file may contain additional information not easily represented in the VCF format. We encourage our users to look at both. The VCF should be filtered prior to use, and we have provided a tool and instructions on how to do this.
To facilitate calling variants using DISCOVAR on large genomes, we have created a tool to merge VCF files generated for overlapping regions. Simply run DISCOVAR on each region in turn (or in parallel to speed things up), then merge the VCF files that are produced. We currently recommend using a 50 kb region size, with a 10 kb overlap.
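As a rough illustration of the tiling described above, the following Python sketch enumerates 50 kb windows that overlap by 10 kb. It is not part of DISCOVAR: the function name `tile_regions`, the 0-based half-open coordinates, and the reading that the 50 kb window size includes the overlap (so consecutive windows start 40 kb apart) are all assumptions made for this example. How each window is passed to DISCOVAR and to the merge tool is covered in the manual and is not shown here.

```python
def tile_regions(chrom, start, end, size=50_000, overlap=10_000):
    """Yield (chrom, win_start, win_end) windows covering [start, end).

    Hypothetical helper, not a DISCOVAR API. Coordinates are assumed to be
    0-based and half-open; adjacent windows share `overlap` bases.
    """
    if size <= 0 or overlap < 0 or overlap >= size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if end <= start:
        return
    step = size - overlap  # distance between consecutive window starts
    pos = start
    while True:
        stop = min(pos + size, end)  # clip the last window to the interval end
        yield (chrom, pos, stop)
        if stop >= end:
            break
        pos += step


if __name__ == "__main__":
    # Example: tile the first 120 kb of a chromosome.
    for region in tile_regions("chr1", 0, 120_000):
        print(region)
```

Each window can then be processed independently, in turn or in parallel, and the resulting VCF files merged.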
For more information on the VCF output, filtering and merging, please refer to our manual.
We’ve added a key purification step: “A second 0.7x SPRI clean up was performed following adapter ligation to remove adapter dimers and library fragments below ~150 bp in size.” This improves performance.
DISCOVAR is designed to use 250 base reads. In response to numerous queries, we have now posted instructions for generating such data on the HiSeq 2500, as this is the most cost-effective approach.
We are now posting detailed instructions for generating libraries appropriate for use with DISCOVAR. Instructions for generating 250 base reads on the HiSeq 2500 will be posted as soon as we have a version that we’re sure is portable.
Are you getting the most out of your hardware when running DISCOVAR?
Take a look at our Computational Performance tips – they could help you get more bang for your computational buck!
DISCOVAR is a heavily multithreaded and memory intensive tool that will push your machines hard. Configuring your hardware to get the best performance isn’t straightforward, but with the right settings you may see significant improvements. After much experimentation and investigation, and with help from fellow DISCOVAR users, we have prepared a set of tips. We’ll continue to add and update them as we learn more, and we would like to hear about your experiences via our forum.
Revision 46631 contains a number of algorithmic improvements. In particular, there are now fewer ‘false’ bubbles in assemblies. These arise particularly in assemblies at high (> 60x) coverage, as one might have, for example, for a bacterium. Such bubbles have substantial support on both branches, but examining the quality score distributions of the read bases supporting each branch allows DISCOVAR to eliminate one of them.