Explore a de novo human assembly online now

Want a sneak preview of what we’ve been working on lately? Then check out this online demo that lets you explore a de novo human assembly produced by our new assembler DISCOVAR de novo.

Developed over the past 6 months, the new DISCOVAR de novo algorithm will be released later this summer. Unlike DISCOVAR, it can assemble large genomes de novo. It is also much faster, but still takes the same low-cost single-library input data that DISCOVAR does.

Whilst we prepare DISCOVAR de novo for release, take a look at the online demo we’ve set up. Here you can explore and visualize an assembly of the human cell line NA12878. You can enter any coordinates on the human reference sequence GRCh38, and the demo will show you the part of the assembly that aligns there. Using this tool, large structural variation events can be directly visualized, and simple SNPs appear as short bubbles.

Please check it out and let us know what you think via the forum.

HiSeq 2500 data quality

We have been asked what our DISCOVAR input data looks like, and the best way to answer this question is with some examples. We don’t claim that these data are necessarily representative, but they do illustrate what we are able to generate here at the Broad Institute.

VCF format now supported

DISCOVAR now generates variant lists in the Variant Call Format (VCF). This is the standard used by the community and is supported by many tools. Whilst the VCF file contains all events found by DISCOVAR, the complementary .variant file may contain additional information not easily represented in the VCF format. We encourage our users to look at both. The VCF should be filtered prior to use, and we have provided a tool and instructions on how to do this.

To facilitate calling variants using DISCOVAR on large genomes, we have created a tool to merge VCF files generated for overlapping regions. Simply run DISCOVAR on each region in turn (or in parallel to speed things up), then merge the VCF files that are produced. We currently recommend using a 50 kb region size, with a 10 kb overlap.
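As a rough illustration of the tiling described above, the following Python sketch splits a chromosome into 50 kb windows in which each window shares 10 kb with the previous one. It is not part of DISCOVAR: the function names, the 1-based inclusive coordinates, and the `chrom:start-end` string form are assumptions made for this example, so please check the manual for the region syntax DISCOVAR actually expects.

```python
def tile_regions(chrom, length, size=50_000, overlap=10_000):
    """Return 1-based inclusive (chrom, start, end) windows covering 1..length.

    Consecutive windows share `overlap` bases; the last window is
    truncated at the end of the chromosome.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap  # distance between consecutive window starts
    regions = []
    start = 1
    while True:
        end = min(start + size - 1, length)
        regions.append((chrom, start, end))
        if end == length:
            break
        start += step
    return regions


def format_region(chrom, start, end):
    """Render a window as 'chrom:start-end' (hypothetical string form)."""
    return f"{chrom}:{start}-{end}"


if __name__ == "__main__":
    # Example: tile a 120 kb stretch of a chromosome named "chr1".
    for region in tile_regions("chr1", 120_000):
        print(format_region(*region))
```

Each printed region could then be passed to a separate DISCOVAR run, with the resulting VCF files merged afterwards.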

For more information on the VCF output, filtering and merging, please refer to our manual.

DISCOVAR performance tips

Are you getting the most out of your hardware when running DISCOVAR?
Take a look at our Computational Performance tips – they could help you get more bang for your computational buck!

DISCOVAR is a heavily multithreaded and memory intensive tool that will push your machines hard. Configuring your hardware to get the best performance isn’t straightforward, but with the right settings you may see significant improvements. After much experimentation and investigation, and with help from fellow DISCOVAR users, we have prepared a set of tips. We’ll continue to add and update them as we learn more, and we would like to hear about your experiences via our forum.

Using shorter Illumina reads

Many people have asked if they can use their existing Illumina datasets with DISCOVAR – datasets that don’t meet the recommendations of ~60x coverage by 250 base paired reads from a ~700 bp PCR-free fragment library. We investigated and made some minor changes to the algorithm, embodied in release 46382 onwards, and it is now possible to use shorter reads from PCR libraries – with some caveats. We have successfully tested DISCOVAR on 100 base reads from a ~180 bp PCR fragment library, obtaining reasonable results, though inferior to those generated from the recommended data. For more information, please see our FAQ.
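To check how an existing dataset compares with the ~60x recommendation, mean coverage can be estimated as the total number of sequenced bases divided by the genome size. The Python sketch below shows that arithmetic only; the function names are hypothetical, and the 3.1 Gb genome size is an assumed round figure for a human genome, not a DISCOVAR parameter.

```python
import math


def mean_coverage(num_pairs, read_length, genome_size):
    """Mean coverage from paired reads: two reads per pair."""
    return 2 * num_pairs * read_length / genome_size


def pairs_needed(target_coverage, read_length, genome_size):
    """Smallest number of read pairs that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / (2 * read_length))


if __name__ == "__main__":
    GENOME_SIZE = 3_100_000_000  # assumed approximate human genome size

    # Read pairs needed for ~60x with 250 base reads versus 100 base reads.
    for read_length in (250, 100):
        pairs = pairs_needed(60, read_length, GENOME_SIZE)
        print(f"{read_length} base reads: {pairs:,} pairs for 60x")
```

This estimate ignores duplicates, low-quality reads and uneven coverage, all of which reduce the usable depth, so treat it as an upper bound on what a dataset provides.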

Choosing the right hardware

Would you like to help us benchmark servers?

We are contemplating server purchases and would like to get the most bang for our buck. We imagine that some of you are in the same situation. Therefore, to share intelligence, we are creating a table that shows DISCOVAR performance stats, along with server configuration information. Please take a look at the current benchmark table, which we will continue to update as we get more results. Better yet – why not participate by benchmarking your systems and sharing the results with us?

Assessing assembly methods

For purposes of assessing our assemblies and variant calls, we generated some NA12878 clone reference sequences.  We believe that these data will be of interest to the community and have therefore decided to make them available to all. These clone sequences and the raw data used to generate them can be found on our FTP site.

The sequences were obtained by randomly selecting ~100 clones from an NA12878 Fosmid library.  Two pools of ~50 each were created, then sequenced by MiSeq (250 bases) and PacBio (~3000 bases).  There are also some jumps.

We completely assembled 103 clones, without ambiguity, in some cases with manual intervention.  Cloning vector has been removed.  There are a small number of additional clones in the pools, not included in the assemblies, including a few that had low coverage, some EBV, and some centromeric sequence.

This is version 1.0 of the set.  We believe that the error rate on the clones is very low; however, we are carrying out laboratory validation and will roll out updated versions as the results come back.

This work is supported by NHGRI grants.