DISCOVAR de novo now has an argument MAX_MEM_GB that can be used to limit memory usage to roughly the given amount. This can be useful on very large shared-memory systems.
We have added support for assembly of multiple related samples to DISCOVAR de novo. For example,
READS="sample:T :: t.bam + sample:N :: n.bam"
will assemble together data from two BAM files, t.bam and n.bam, and in so doing keep track of their sample identities as “T” and “N”. These sample identities are carried forward and may be seen during visualization, for example to show the number of reads from each sample supporting a given edge. This feature is compatible with the FASTQ support described in the previous post. We will add full documentation later.
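For runs with more than a couple of samples, it may be easier to generate the READS value than to type it. The following Python sketch only mirrors the syntax of the example above; the helper name build_reads_arg is hypothetical and not part of DISCOVAR de novo, and the full READS syntax may allow forms not covered here.

```python
def build_reads_arg(samples):
    """Join (sample_name, file_path) pairs into a READS value.

    Follows the pattern of the example in the post:
    sample:NAME :: FILE, with entries joined by " + ".
    """
    parts = [f"sample:{name} :: {path}" for name, path in samples]
    return " + ".join(parts)


# Hypothetical helper applied to the two files from the post's example.
reads = build_reads_arg([("T", "t.bam"), ("N", "n.bam")])
print(f'READS="{reads}"')
```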
We’ve added FASTQ support to DISCOVAR de novo. Pretty much any reasonable syntax for READS="...", including “globbable” wildcard characters, should be interpreted correctly. The interlaced and non-interlaced cases should be correctly distinguished. Allowed suffixes are
.fq.gz. We will provide detailed documentation later.
DISCOVAR de novo requires roughly 2 bytes of memory for each base of input data. The program now reports whether a given run is likely to have enough memory. For example, if you have only 1 byte of memory per base, a warning will be issued. We have also fixed bugs associated with having more than 2^31 (about two billion) reads.
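The 2-bytes-per-base rule of thumb can be turned into a quick estimate before starting a run. This is a rough Python sketch, not something DISCOVAR de novo provides: the function names are hypothetical, and treating 1 GB as 10^9 bytes is an assumption.

```python
BYTES_PER_BASE = 2  # rough requirement stated in the post


def estimate_memory_gb(num_reads, read_length):
    """Approximate memory needed, in GB (assumed here to be 10^9 bytes)."""
    total_bases = num_reads * read_length
    return total_bases * BYTES_PER_BASE / 1e9


def available_bytes_per_base(available_gb, num_reads, read_length):
    """Memory available per input base, for comparison with BYTES_PER_BASE."""
    return available_gb * 1e9 / (num_reads * read_length)


# Illustrative data set: 800 million reads of 250 bases each.
print(estimate_memory_gb(800_000_000, 250))
print(available_bytes_per_base(200, 800_000_000, 250))
```

In this illustrative case the estimate is about 400 GB, so a machine with 200 GB would have only about 1 byte per base, which is the situation in which the post says a warning is issued.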
DISCOVAR de novo now reports the fraction of read pairs that appear to be chimeric. Fractions of around 1% are expected and probably due to artifacts of read mapping within the assembly. Fractions much higher than this are indicative of a serious problem, most likely a computational scrambling of the read pairs defined as input to DISCOVAR de novo.
Sometimes parallel processing of samples can result in low-level cross-contamination, and sometimes there can be enough of it to assemble, especially when the samples are sequenced at high coverage. We’ve added a program, CrossOut, that can remove most of this contamination from parallel DISCOVAR de novo assemblies by looking for improbable molarity differences. It has a single argument, DIR, the parent directory for the assemblies, and creates new assembly directories a.clean within each.
We’ve added some new assembly statistics to DISCOVAR de novo. These are in the file stats in a.final and are mirrored in standard output. These, along with the file frags.dist.png, are often diagnostic.
DISCOVAR de novo produces several output files, including a file of scaffolds, a.lines.fasta, in which a single path through a genomic locus is shown, even when multiple paths are possible (for one of several reasons, including polymorphism). (See “Edges, lines and scaffolds”.) This ‘flattened’ representation of the assembly loses information but has the advantage that it is FASTA and so can be processed by standard tools. With revision 51386, we now choose the paths with the highest coverage. This is completely arbitrary in cases of bona fide polymorphism, but is helpful in cases where an assembly bubble occurs because of sequencing difficulty, making it uncertain which bubble branch is correct. In such cases, and in cases of ‘minor alleles’ in bacterial cultures, choosing the highest coverage branch makes sense.
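The idea of flattening a bubble by coverage can be illustrated with a few lines of Python. This is only a sketch of the selection rule described above: the dictionary layout and the function name flatten_bubble are hypothetical and do not reflect how DISCOVAR de novo represents assemblies internally.

```python
def flatten_bubble(branches):
    """Return the sequence of the branch with the highest coverage.

    Each branch is a dict with hypothetical keys "seq" and "coverage".
    On a tie, max() keeps the first branch encountered.
    """
    best = max(branches, key=lambda branch: branch["coverage"])
    return best["seq"]


# Made-up bubble with two alternative branches.
bubble = [
    {"seq": "ACGTTA", "coverage": 31.5},
    {"seq": "ACGCTA", "coverage": 4.0},
]
print(flatten_bubble(bubble))
```

For a true heterozygous site the two coverages would be similar, so the choice is effectively arbitrary, as the post notes.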
Each DISCOVAR de novo assembly will now come with a plot, in the file frags.dist.png, showing the observed size distribution for the fragments defined by the input read pairs. These plots can be highly diagnostic. They are available from revision 51298 onwards. The raw data are in the file