DISCOVAR de novo now reports the fraction of read pairs that appear to be chimeric. Fractions of around 1% are expected and are probably due to artifacts of read mapping within the assembly. Fractions much higher than this indicate a serious problem, most likely a computational scrambling of the read pairs supplied as input to DISCOVAR de novo.
Parallel processing of samples can sometimes result in low-level cross-contamination, and sometimes there can be enough to assemble, especially when the samples are sequenced at high coverage. We’ve added a program, CrossOut, that can remove most of this contamination from parallel DISCOVAR de novo assemblies by looking for improbable molarity differences. It has a single argument, DIR, the parent directory for the assemblies, and creates new assembly directories a.clean within each.
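As a quick check after a CrossOut run, one might list which a.clean directories now exist under the parent directory. The sketch below is not part of DISCOVAR de novo. The helper name `cleaned_assemblies` is hypothetical, and because the exact nesting of a.clean inside each assembly is not assumed, the search is recursive.

```python
from pathlib import Path
from typing import List


def cleaned_assemblies(parent_dir: str) -> List[Path]:
    """Return every directory named 'a.clean' found anywhere below parent_dir.

    The search is recursive because the exact depth at which a.clean
    sits inside each assembly directory is not assumed here.
    """
    return sorted(p for p in Path(parent_dir).rglob("a.clean") if p.is_dir())
```

Comparing the length of the returned list with the number of assemblies under DIR gives a simple sanity check that every assembly was processed.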
We’ve added some new assembly statistics to DISCOVAR de novo. These are in the file stats in a.final and are mirrored in standard output. Together with the file frags.dist.png, they are often diagnostic.
DISCOVAR de novo produces several output files, including a file of scaffolds, a.lines.fasta, in which a single path through a genomic locus is shown even when multiple paths are possible (for one of several reasons, including polymorphism). (See “Edges, lines and scaffolds”.) This ‘flattened’ representation of the assembly loses information but has the advantage that it is FASTA and so can be processed by standard tools. As of revision 51386, the path we pick is the one with the highest coverage. This choice is completely arbitrary in cases of bona fide polymorphism, but is helpful where an assembly bubble occurs because of sequencing difficulty, making it uncertain which bubble branch is correct. In such cases, and in cases of ‘minor alleles’ in bacterial cultures, choosing the highest-coverage branch makes sense.
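The idea of flattening by coverage can be illustrated with a small sketch. This is not DISCOVAR de novo’s code. The `Branch` type, the representation of a line as a list of loci, and the coverage values are all invented for illustration. Each locus holds one or more alternative branches, and the branch with the highest coverage is kept.

```python
from typing import List, NamedTuple


class Branch(NamedTuple):
    sequence: str    # bases along this branch
    coverage: float  # read coverage supporting this branch


def flatten(line: List[List[Branch]]) -> str:
    """Concatenate the highest-coverage branch from each locus.

    On a tie, max() keeps the first branch listed, which is as
    arbitrary as any other choice.
    """
    return "".join(
        max(branches, key=lambda b: b.coverage).sequence for branches in line
    )


# A toy line: unambiguous sequence, then a two-branch bubble, then more sequence.
line = [
    [Branch("ACGT", 30.0)],
    [Branch("A", 12.0), Branch("G", 18.0)],
    [Branch("TTGA", 29.0)],
]
flattened = flatten(line)
print(flattened)  # ACGTGTTGA
```

Note that the lower-coverage branch ("A" here) is simply dropped, which is the information loss mentioned above.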
Each DISCOVAR de novo assembly will now come with a plot, in the file frags.dist.png, showing the observed size distribution for the fragments defined by the input read pairs. These plots can be highly diagnostic. They are available from revision 51298 onwards. The raw data are in the file
Our manuscript “Comprehensive variation discovery in single human genomes” is now available as an advance online publication from Nature Genetics. This paper describes our assembly and variant calling algorithm DISCOVAR, which is able to find many novel variants missing from standard call sets. DISCOVAR is available for download now, and we encourage you to try it out. For de novo assembly without variant calling, see our other package: DISCOVAR de novo.
You can now limit the maximum number of threads DISCOVAR de novo uses with the new option NUM_THREADS (release 51183). This is useful if you have to share your hardware, or if your system administrator has limited the number of threads a single process can use. It can also be a good idea to restrict the number of threads if your hardware has many cores (more than 50), as the parallelization efficiency can start to drop due to locking and cache coherency issues.
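For scripted runs, a wrapper along the following lines could add the thread limit only when one is wanted. Only NUM_THREADS comes from this post. The executable name `DiscovarDeNovo`, the `READS` and `OUT_DIR` arguments, and the KEY=value syntax are assumptions to be checked against the documentation for your release, and the helper `discovar_command` is hypothetical.

```python
from typing import List, Optional


def discovar_command(
    reads: str, out_dir: str, max_threads: Optional[int] = None
) -> List[str]:
    """Build an argument list for a DISCOVAR de novo run.

    Assumed, not taken from the post: the executable name and the
    READS / OUT_DIR arguments. NUM_THREADS is the option described above.
    """
    cmd = ["DiscovarDeNovo", f"READS={reads}", f"OUT_DIR={out_dir}"]
    if max_threads is not None:
        if max_threads < 1:
            raise ValueError("max_threads must be at least 1")
        cmd.append(f"NUM_THREADS={max_threads}")
    return cmd


# Example: cap the run at 16 threads on a shared machine.
cmd = discovar_command("reads.bam", "assembly_out", max_threads=16)
print(" ".join(cmd))
```

The function only builds the argument list. Passing it to `subprocess.run(cmd, check=True)` would launch the assembly.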
We have fixed the bug in DISCOVAR de novo cited in the last blog message. Please download and use the new version (50964) from our ftp site.
We found a serious bug in DISCOVAR de novo, revision 50862, resulting in data corruption in some cases. We will correct this bug as soon as possible. In the meantime we are rolling back to revision 50693. And needless to say, we are beefing up our release testing.