Peak memory improvements

Thanks to new data compression methods, our latest release (50963) reduces the peak memory required for large genomes. The exact peak memory required depends on a number of factors, including the nature of the genome and the quality of the data, but is mostly determined by coverage and genome size.

Assembly graph visualization tool released

The assembly graph can be large, complex and unwieldy, so DISCOVAR de novo does not generate a viewable graph directly. Instead we have developed an interactive tool that allows you to explore your assembly by creating smaller viewable graphs of the regions you are interested in. This new tool, called NhoodInfo, is part of the DISCOVAR package as of release 50612. It is also the engine behind our online demo, so you can try it out right now without having to create an assembly of your own. Full instructions on using NhoodInfo are included in the DISCOVAR package.

Large genome de novo assembler released

We are pleased to announce the release of our new de novo assembler suitable for large genomes up to human size. This is an early release and should be considered experimental, but is fully functioning. Download it now.

Our new assembler, called DISCOVAR de novo (experimental), uses the same cheap data that the original DISCOVAR release does: 250 base paired-end PCR-free Illumina reads. No other libraries are required. Assembling a human genome on a 48-core server with 0.5 TB of memory takes only 36 hours and produces an assembly with a contig N50 of ~100 kb.
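
For readers unfamiliar with the contig N50 statistic quoted above, here is a minimal sketch of how it is usually computed: the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly length. The function name contig_n50 is our own illustration and is not part of the DISCOVAR package.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    Illustrative helper only; not part of the DISCOVAR package.
    """
    total = sum(lengths)
    running = 0
    # Walk the contigs from longest to shortest until at least half of
    # the total assembled length has been covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy contig lengths in bases, chosen purely for illustration.
    print(contig_n50([2, 3, 4, 5, 6]))
```

An N50 of ~100 kb therefore means that half of the assembled bases sit in contigs of roughly 100 kb or longer.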

We are actively developing DISCOVAR de novo, so check back often for updates.

Explore a de novo human assembly online now

Want a sneak preview of what we’ve been working on lately? Then check out this online demo that lets you explore a de novo human assembly produced by our new assembler DISCOVAR de novo.

Developed over the past 6 months, the new DISCOVAR de novo algorithm will be released later this summer. Unlike DISCOVAR, it can assemble large genomes de novo. It is also much faster, but still takes the same low-cost single-library input data that DISCOVAR does.

Whilst we prepare DISCOVAR de novo for release, take a look at the online demo we’ve set up. Here you can explore and visualize an assembly of the human cell line NA12878. You can enter any coordinates on the human reference sequence GRCh38, and the demo will show you the part of the assembly that aligns there. Using this tool, large structural variation events can be directly visualized, and simple SNPs appear as short bubbles.

Please check it out and let us know what you think via the forum.

VCF format now supported

DISCOVAR now generates variant lists in the Variant Call Format (VCF). This is the standard used by the community and is supported by many tools. Whilst the VCF file contains all events found by DISCOVAR, the complementary .variant file may contain additional information not easily represented in the VCF format. We encourage our users to look at both. The VCF should be filtered prior to use, and we have provided a tool and instructions on how to do this.

To facilitate calling variants using DISCOVAR on large genomes, we have created a tool to merge VCF files generated for overlapping regions. Simply run DISCOVAR on each region in turn (or in parallel to speed things up), then merge the VCF files that are produced. We currently recommend using a 50 kb region size, with a 10 kb overlap.
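
As a rough illustration of the region scheme described above, the following sketch splits a sequence into 50 kb regions in which consecutive regions share 10 kb. It only computes coordinates; it does not run DISCOVAR or the merging tool. The function name tile_regions, the 0-based half-open coordinates, and the reading of "overlap" as the shared stretch between neighbouring regions are all our assumptions, so please check the manual for the exact region syntax DISCOVAR expects.

```python
def tile_regions(seq_length, region_size=50_000, overlap=10_000):
    """Split [0, seq_length) into overlapping regions.

    Returns a list of (start, end) tuples, 0-based and half-open.
    Hypothetical helper for illustration; the coordinate convention
    is an assumption, not DISCOVAR's documented region format.
    """
    if region_size <= 0 or overlap < 0 or overlap >= region_size:
        raise ValueError("need 0 <= overlap < region_size")
    if seq_length <= 0:
        return []

    step = region_size - overlap  # distance between region starts
    regions = []
    start = 0
    while True:
        end = min(start + region_size, seq_length)
        regions.append((start, end))
        if end >= seq_length:
            break
        start += step
    return regions


if __name__ == "__main__":
    # A 120 kb sequence, chosen purely for illustration.
    for start, end in tile_regions(120_000):
        print(start, end)
```

Each region can then be processed independently, and the resulting VCF files merged afterwards as described above.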

For more information on the VCF output, filtering and merging, please refer to our manual.

Using shorter Illumina reads

Many people have asked if they can use their existing Illumina datasets with DISCOVAR: datasets that don’t meet the recommendations of ~60x coverage by 250 base paired reads from a ~450 bp PCR-free fragment library. We investigated and made some minor changes to the algorithm, embodied in release 46382 onwards, and it is now possible to use shorter reads from PCR libraries, with some caveats. We have successfully tested DISCOVAR on 100 base reads from a ~180 bp PCR fragment library, obtaining reasonable results, though inferior to those generated from the recommended data. For more information, please see our FAQ.
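
If you are unsure how an existing dataset compares with the ~60x recommendation, the usual back-of-the-envelope estimate is coverage = number of reads × read length / genome size. The sketch below applies that formula; the function names are ours, and the 3 Gb genome size in the example is a round illustrative figure rather than a value taken from DISCOVAR.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Estimate mean coverage as total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_for_coverage(coverage, read_length, genome_size):
    """Number of reads needed to reach a target mean coverage."""
    return math.ceil(coverage * genome_size / read_length)


if __name__ == "__main__":
    genome_size = 3_000_000_000  # assumed round figure, for illustration
    # Reads needed for the recommended data versus shorter 100 base reads.
    print(reads_for_coverage(60, 250, genome_size))
    print(reads_for_coverage(60, 100, genome_size))
```

This estimate ignores read quality, duplicates and unmappable sequence, so treat it as a rough guide only.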

New DISCOVAR release

A new release (r46399) of DISCOVAR is now available. It contains the following changes:

- More robust SAMtools version checking.
- Improvements to .variant file format.
- The MALLOC_PER_THREAD = 1 environment setting is no longer mandatory. However, setting it may give a significant performance boost.
- Various bug fixes.

Thanks to all the users who have brought these problems to our attention.