FAQ

Is DISCOVAR a genome assembler or a variant caller?

DISCOVAR is both! The heart of DISCOVAR is a de novo genome assembler, one that is accurate enough to produce assemblies that can also be used for variant calling. If you supply a reference sequence, DISCOVAR will generate variants in addition to an assembly.

NEW! - Our new release DISCOVAR de novo (experimental) is a large genome de novo assembler. Please continue to use the original DISCOVAR release for variant calling.

What are the inputs required to run DISCOVAR?

DISCOVAR has specific requirements for input data.

DISCOVAR requires a single Illumina fragment (paired end) library. We strongly recommend using a PCR-free protocol. From the library, 250 base paired reads can be created using either Illumina MiSeq or HiSeq2500 genome sequencers. The recommended coverage is about 60x. Somewhat higher or lower coverage is fine. Longer Illumina reads also work.
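As a rough guide to how much sequencing the 60x recommendation implies, the sketch below estimates the number of read pairs needed. It assumes the usual approximation coverage = (read pairs × 2 × read length) / genome size and ignores reads lost to filtering. The function name and the example genome sizes are illustrative assumptions, not part of DISCOVAR.

```python
def read_pairs_needed(genome_size_bp, coverage=60, read_length=250):
    """Estimate the read pairs needed to reach a target coverage.

    Assumes coverage = (pairs * 2 * read_length) / genome_size_bp.
    """
    bases_needed = genome_size_bp * coverage
    bases_per_pair = 2 * read_length
    # Integer ceiling division, so a partial pair counts as a whole one.
    return (bases_needed + bases_per_pair - 1) // bases_per_pair


# Illustrative genome sizes (assumptions, not DISCOVAR parameters).
print(read_pairs_needed(5_000_000))      # a ~5 Mb microbial genome
print(read_pairs_needed(3_100_000_000))  # a ~3.1 Gb human genome
```

Because "somewhat higher or lower coverage is fine", the result is best read as a target, not a strict requirement.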

For variant calling you must also supply a reference for your genome – in FASTA format.

Can you tell me more about the PCR-free library?

For a human genome, this can be made from 0.5 µg of DNA. Please see the protocol, which typically yields fragments of size ~450 bp. This is achieved by size selection using SPRI beads, rather than a gel. This method yields a wide size distribution, including some longer fragments, which is advantageous.

Do I need a jumping library?

No. DISCOVAR does not require a jumping* library and cannot currently use one.
*sometimes referred to as a mate pair library.

Can I use other types of Illumina data?

Possibly, although the results will likely not be as good as those obtained from the recommended DISCOVAR data. Here are some tips:

  • Reads longer than 250 bases can be used.
  • PCR libraries can be used, but there will be some degradation in the quality of assemblies and variant calls.
  • Reads between 100 and 250 bases in length may work with DISCOVAR, depending on the fragment size and other factors.
  • Short reads made from long fragments cannot be used. DISCOVAR closes fragments by extending into the gap with other reads, which then must overlap – see the diagram below. For this to work, the fragment length must be substantially smaller than four times the read length. For example, 100 base reads from ~400 bp fragments will not work.
       ---------->            <----------     original read pair
               ----------                     extending read
                       ----------             extending read
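The fragment-length rule above can be turned into a quick sanity check. This is only a sketch: the FAQ says the fragment must be "substantially smaller" than four times the read length without giving a number, so the `safety_factor` parameter below is a hypothetical stand-in for that margin, and the function is not part of DISCOVAR.

```python
def fragment_may_close(fragment_length, read_length, safety_factor=0.8):
    """Return True if the fragment is short enough that closing it is plausible.

    The rule from the FAQ is fragment_length << 4 * read_length.
    safety_factor is a hypothetical way to express "substantially smaller";
    the value 0.8 is an assumption, not a DISCOVAR threshold.
    """
    return fragment_length < safety_factor * 4 * read_length


# 100 base reads from ~400 bp fragments: the case the FAQ says will not work.
print(fragment_may_close(400, 100))
# 250 base reads from ~450 bp fragments: the recommended data type.
print(fragment_may_close(450, 250))
```

The check only captures the geometric constraint. Whether a given data set actually assembles well also depends on "the fragment size and other factors" mentioned above.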

Can I use reads from another sequencing technology?

No. However, we intend to support promising new technologies. See our roadmap.

Can DISCOVAR carry out a de novo assembly of a human-sized genome?

Not yet. We are in the process of scaling up the algorithms. However, you can still call variants on such a genome by running DISCOVAR on smaller regions of the genome.

NEW! – Our new release DISCOVAR de novo (experimental) can de novo assemble human-sized genomes.

Can DISCOVAR carry out de novo assemblies of microbial-sized genomes?

Yes.

NEW! - Our new release DISCOVAR de novo (experimental) can de novo assemble human-sized genomes.

Can I call variants on a large genome (e.g. human)?

Absolutely! Although DISCOVAR can currently only assemble small genomes, it is possible to instead assemble smaller portions of a larger genome. You simply specify the region of the genome you are interested in, and DISCOVAR will do the rest. This requires that you first align your reads to a reference, and provide DISCOVAR with a resulting BAM file. The alignments are used to localize reads to the region of interest, and are not used in the assembly process. Our goal is to further optimise DISCOVAR so that it will be able to operate on the entire genome at once, allowing the discovery of completely novel sequence as well as variants.

What size of region should I use when calling variants?

We recommend using small regions, for example ~100 kb.  It is possible to run on larger regions, but DISCOVAR will not at present scale to the entire human genome, and it is often easier to interpret the assemblies of smaller regions.
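To cover a larger interval with regions of roughly this size, one option is to tile it into consecutive windows and run each window separately. The sketch below assumes 1-based inclusive coordinates and a `chrom:start-end` region string. Both are assumptions for illustration, so please check the manual for the region syntax DISCOVAR actually expects.

```python
def tile_regions(chrom, start, end, window=100_000):
    """Yield consecutive, non-overlapping regions of at most `window` bases.

    Coordinates are treated as 1-based and inclusive (an assumption).
    """
    pos = start
    while pos <= end:
        stop = min(pos + window - 1, end)
        yield f"{chrom}:{pos}-{stop}"
        pos = stop + 1


# Hypothetical example: tile the first 250 kb of a chromosome named "chr1".
for region in tile_regions("chr1", 1, 250_000):
    print(region)
```

The final window is simply truncated at the end of the interval, so it may be shorter than the others.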

How does DISCOVAR represent genome assemblies?

DISCOVAR genome assemblies are graphs, with edges representing sequence. Each edge is given as a record in a FASTA file, with graph connectivity information recorded in the header (>…) lines. DISCOVAR also generates a graphical representation using the dot file format. For more information, please see the manual.

We plan to add FASTG support soon.
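Since each edge is an ordinary FASTA record, the edge sequences can be read with a generic FASTA parser. The sketch below returns each header line unparsed, because the layout of the connectivity information is described in the manual and is not assumed here. The example records and their header text are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned as-is, without the leading '>'.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records; real DISCOVAR headers will look different.
example = [
    ">edge_0 hypothetical header",
    "ACGT",
    "TTGA",
    ">edge_1 hypothetical header",
    "GGCC",
]
for header, sequence in read_fasta(example):
    print(header, len(sequence))
```

For a file on disk, an open file handle can be passed directly, for example `read_fasta(open(path))`, where `path` points at the assembly FASTA file.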

How can one view the assembly graph?

The dot files generated by DISCOVAR may be viewed with Graphviz.
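One way to do this from a script is to call the Graphviz `dot` program, which converts a dot file to an image with `dot -T<format> <input> -o <output>`. The sketch below wraps that call. The file names are hypothetical, and the helper functions are illustrative and not part of DISCOVAR.

```python
import shutil
import subprocess


def dot_command(dot_path, out_path, fmt="svg"):
    """Build the Graphviz command line that renders dot_path to out_path."""
    return ["dot", f"-T{fmt}", dot_path, "-o", out_path]


def render(dot_path, out_path, fmt="svg"):
    """Render a dot file, failing early if Graphviz is not installed."""
    if shutil.which("dot") is None:
        raise RuntimeError("Graphviz 'dot' was not found on PATH")
    subprocess.run(dot_command(dot_path, out_path, fmt), check=True)


# Hypothetical file names; substitute the dot file written by DISCOVAR.
print(" ".join(dot_command("assembly.dot", "assembly.svg")))
```

SVG output is used here because it stays legible when zooming into a large graph. Other formats that Graphviz supports, such as `pdf` or `png`, can be selected with the `fmt` argument.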

How does DISCOVAR represent variants?

Variants are listed in a human-readable plain text file, in a transitional format that is specific to DISCOVAR. This format will be expanded in the near future. We are also working to translate the format to VCF; however, enhancements to VCF will be needed to accommodate complex variation features.

Can I use DISCOVAR to call variants in a population?

No. DISCOVAR is designed to work with single samples, not populations. Try using a tool like GATK instead.

In the Illumina pipeline, should I turn off the EAMSS filtering that generates bases having quality score 2?

We recommend leaving it turned on, but we are continuing to investigate the impact of this setting.

I wish to use DISCOVAR in my research. Do I need a commercial license?

No. DISCOVAR is released under the terms of this license, but we do encourage you to register with us. To register, simply email us, stating your name, organization name and details.

My company would like to use DISCOVAR. Do I need to purchase a license?

No. DISCOVAR is released under the terms of this license, but we do encourage you to register with us. To register, simply email us, stating your name, organization name and details.

How does DISCOVAR relate to ALLPATHS-LG?

Currently the application spaces of ALLPATHS-LG and DISCOVAR are complementary. Notably, ALLPATHS-LG can be used to assemble 100 base Illumina reads, and it has capabilities not yet available in DISCOVAR, including the ability to work with multiple libraries and to create de novo assemblies of large genomes.