Is DISCOVAR a genome assembler or a variant caller?
DISCOVAR is both! The heart of DISCOVAR is a de novo genome assembler, one that is accurate enough to produce assemblies that can also be used for variant calling. If you supply a reference sequence, DISCOVAR will call variants in addition to generating an assembly.
What are the inputs required to run DISCOVAR?
DISCOVAR has specific requirements for input data.
DISCOVAR requires a single Illumina fragment (paired-end) library. We strongly recommend using a PCR-free protocol. From the library, paired 250-base reads can be generated on either an Illumina MiSeq or a HiSeq 2500 sequencer. The recommended coverage is about 60x, although somewhat higher or lower coverage is fine. Longer Illumina reads also work.
For variant calling, you must also supply a reference sequence for your genome in FASTA format.
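As a rough guide to how much sequencing the recommended coverage implies, the sketch below estimates the number of read pairs needed. It is only a back-of-the-envelope calculation (coverage times genome size, divided by the bases in one read pair); the function name and the 5 Mb example genome are illustrative assumptions, not part of DISCOVAR.

```python
def read_pairs_for_coverage(genome_size_bp, coverage=60, read_length=250):
    """Approximate number of read pairs needed to reach a target coverage.

    Hypothetical helper for planning only; it ignores read filtering,
    duplicates, and uneven coverage.
    """
    bases_needed = genome_size_bp * coverage
    bases_per_pair = 2 * read_length
    # Ceiling division, so the target coverage is met or slightly exceeded.
    return -(-bases_needed // bases_per_pair)


# Example: a 5 Mb genome at 60x with paired 250-base reads.
print(read_pairs_for_coverage(5_000_000))  # 600000
```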
Can you tell me more about the PCR-free library?
For a human genome, the library can be made from 1 µg of DNA. As per the Illumina protocol, one should not use a gel to size select. A good target size for fragments is 700 bp; however, in our experience the actual fragment size after sequencing is closer to 450 bp.
Do I need a jumping library?
No. DISCOVAR does not require a jumping* library and cannot currently use one.
*sometimes referred to as a mate pair library.
Can I use other types of Illumina data?
Possibly, although the results will likely not be as good as those obtained from the recommended DISCOVAR data. Here are some tips:
- Reads longer than 250 bases can be used.
- PCR libraries can be used, but there will be some degradation in the quality of assemblies and variant calls.
- Reads between 100 and 250 bases in length may work with DISCOVAR, depending on the fragment size and other factors.
- Short reads made from long fragments cannot be used. DISCOVAR closes fragments by extending into the gap with other reads, which then must overlap – see the diagram below. For this to work, the fragment length must be substantially smaller than four times the read length. For example, 100 base reads from ~400 bp fragments will not work.
    ---------->             <----------    original read pair
            ----------                     extending read
                    ----------             extending read
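The fragment-length rule above can be turned into a quick sanity check before sequencing. This is a minimal sketch, not DISCOVAR's own logic: the FAQ does not quantify "substantially smaller", so the `safety_margin` parameter (default 0.8) is a hypothetical stand-in that you should adjust to taste.

```python
def fragments_may_close(fragment_length, read_length, safety_margin=0.8):
    """Return True if the fragment is comfortably shorter than 4x the read length.

    safety_margin is an assumed value standing in for "substantially
    smaller"; it is not a DISCOVAR parameter.
    """
    limit = 4 * read_length
    return fragment_length < safety_margin * limit


# 250-base reads from ~450 bp fragments (the recommended data): passes.
print(fragments_may_close(450, 250))
# 100-base reads from ~400 bp fragments (the example that will not work): fails.
print(fragments_may_close(400, 100))
```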
Can I use reads from another sequencing technology?
No. However, we intend to support promising new technologies. See our roadmap.
Can DISCOVAR carry out a de novo assembly of a human-sized genome?
Not yet. We are in the process of scaling up the algorithms. However, you can still call variants on such a genome, by running on smaller regions of the genome.
Can DISCOVAR carry out de novo assemblies of microbial-sized genomes?
Can I call variants on a large genome (e.g. human)?
Absolutely! Although DISCOVAR can currently only assemble small genomes, it is possible to instead assemble smaller portions of a larger genome. You simply specify the region of the genome you are interested in, and DISCOVAR will do the rest. This requires that you first align your reads to a reference, and provide DISCOVAR with a resulting BAM file. The alignments are used to localize reads to the region of interest, and are not used in the assembly process. Our goal is to further optimise DISCOVAR so that it will be able to operate on the entire genome at once, allowing the discovery of completely novel sequence as well as variants.
What size of region should I use when calling variants?
We recommend using small regions, for example ~100 kb. It is possible to run on larger regions, but DISCOVAR will not at present scale to the entire human genome, and it is often easier to interpret the assemblies of smaller regions.
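If you want to cover a longer stretch of the genome with ~100 kb regions, a small script can generate the list of windows. The sketch below is illustrative only: the `chrom:start-end` string form is an assumption borrowed from common alignment tools, so check the manual for the region syntax DISCOVAR actually expects. The windows here do not overlap, which is a simplification.

```python
def tile_regions(chrom, start, end, window=100_000):
    """Split the 1-based inclusive interval [start, end] into fixed-size windows.

    Returns strings of the assumed form 'chrom:start-end'.
    """
    regions = []
    pos = start
    while pos <= end:
        stop = min(pos + window - 1, end)
        regions.append(f"{chrom}:{pos}-{stop}")
        pos = stop + 1
    return regions


# Example: tile the first 250 kb of a chromosome named "chr1".
for region in tile_regions("chr1", 1, 250_000):
    print(region)
```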
How does DISCOVAR represent genome assemblies?
DISCOVAR genome assemblies are graphs, with edges representing sequence. Each edge is given as a record in a FASTA file, with graph connectivity information recorded in the header (>…) lines. DISCOVAR also generates a graphical representation using the dot file format. For more information, please see the manual.
We plan to add FASTG support soon.
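Because each edge is an ordinary FASTA record, the assembly can be loaded with a few lines of code. The sketch below reads records into (header, sequence) pairs and reports edge lengths. It deliberately does not interpret the connectivity information in the header, since that layout is described in the manual and not here; the headers in the example are made up.

```python
def read_fasta(lines):
    """Parse FASTA text (an iterable of lines) into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-edge assembly; real DISCOVAR headers will differ.
example = [">edge_a (hypothetical header)", "ACGT", "ACG",
           ">edge_b (hypothetical header)", "TTGA"]
for header, sequence in read_fasta(example):
    print(header, len(sequence))
```

To read a real file, pass an open file handle, for example `read_fasta(open(path))`, where `path` is the location of your assembly FASTA file.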
How can one view the assembly graph?
The dot files generated by DISCOVAR may be viewed with Graphviz.
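For example, Graphviz's `dot` program can render a dot file to an image. The sketch below wraps that call in Python; it assumes Graphviz is installed and on your PATH, and the file names `assembly.dot` and `assembly.svg` are placeholders, not names DISCOVAR is known to produce.

```python
import shutil
import subprocess


def dot_command(dot_path, out_path, fmt="svg"):
    """Build the Graphviz command line: dot -T<fmt> <input> -o <output>."""
    return ["dot", f"-T{fmt}", dot_path, "-o", out_path]


def render_dot(dot_path, out_path, fmt="svg"):
    """Render a dot file with Graphviz, failing clearly if dot is missing."""
    if shutil.which("dot") is None:
        raise RuntimeError("Graphviz 'dot' executable not found on PATH")
    subprocess.run(dot_command(dot_path, out_path, fmt), check=True)


# Show the command that would be run for a placeholder file name.
print(" ".join(dot_command("assembly.dot", "assembly.svg")))
```

Large assembly graphs can take a long time to lay out, so rendering the graph of a small region is usually more practical.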
How does DISCOVAR represent variants?
Variants are listed in a human-readable plain text file, in a transitional format that is specific to DISCOVAR. This format will be expanded in the near future. We are also working to translate the format to VCF; however, enhancements to VCF will be needed to accommodate complex variation features.
Can I use DISCOVAR to call variants in a population?
No. DISCOVAR is designed to work with single samples, not populations. Try using a tool like GATK instead.
In the Illumina pipeline, should I turn off the EAMSS filtering that generates bases having quality score 2?
We recommend leaving it turned on, but we are continuing to investigate the impact of this change.
I wish to use DISCOVAR in my research, do I need a commercial license?
In a non-profit academic research setting you are free to use DISCOVAR without obtaining a commercial license. However, if the funding for your research is from a commercial source, you should contact us first to make sure.
My company would like to use DISCOVAR, do I need to purchase a license?
The answer is likely to be yes. Please contact us to find out for sure and to negotiate terms. Note that you are free to try out DISCOVAR for a 60-day period before seeking a license, but please register first.
How does DISCOVAR relate to ALLPATHS-LG?
Currently the application spaces of ALLPATHS-LG and DISCOVAR are complementary. Notably, ALLPATHS-LG can be used to assemble 100 base Illumina reads, and it has capabilities not yet available in DISCOVAR, including the ability to work with multiple libraries and to create de novo assemblies of large genomes.