What to do about contamination

Suppose that you’re trying to sequence species X, but you’ve got contamination from other species mixed in. What can be done about this?

Let’s suppose that the other species are similar to species in GenBank. Then you have two main options: filter the reads or filter the scaffolds, in either case by aligning them to sequences in GenBank. Depending on circumstances, one or the other of the two approaches may be better:

- Filtering out reads has the advantage that it prevents possible problems with assembly. We know of two specific cases: (1) ALLPATHS-LG currently calibrates copy number using the largest unipaths, and hence can be “tricked” by contaminant sequence. This isn’t likely to happen but it’s possible. We’re working on a fix. (2) There may be computational performance issues if, let’s say, 95% of your data is contaminant.

- Because scaffolds are much longer than reads, filtering by scaffolds has far greater power than filtering by reads: an alignment spanning most of a long scaffold is much stronger evidence than an alignment of a single short read. So filtering the assembly is much more likely to find all the contamination and much less likely to accidentally remove sequence from the species you’re trying to assemble.
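
As a rough illustration of the scaffold-filtering option, here is a minimal sketch in Python. It is not part of ALLPATHS-LG. It assumes you have already aligned your scaffolds against a set of suspected contaminant sequences and saved the hits in the standard 12-column BLAST tabular format (`-outfmt 6`), with the scaffold as the query. The function names, the scaffold names, and the thresholds (95% identity, 50% of the scaffold covered) are all hypothetical choices made for the example, and you would want to tune them for your own data.

```python
from collections import defaultdict


def covered_bases(intervals):
    """Return the number of bases in the union of closed, 1-based intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close out the previous merged run.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def flag_contaminant_scaffolds(hit_lines, scaffold_lengths,
                               min_identity=95.0, min_fraction=0.5):
    """Flag scaffolds that are mostly covered by hits to contaminant sequences.

    hit_lines: lines of BLAST tabular output (outfmt 6), scaffold as query.
    scaffold_lengths: dict mapping scaffold name to its length in bases.
    The two thresholds are assumptions for this sketch, not recommendations.
    Returns a dict mapping each flagged scaffold to its covered fraction.
    """
    intervals = defaultdict(list)
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        scaffold = fields[0]
        identity = float(fields[2])                  # pident column
        qstart, qend = int(fields[6]), int(fields[7])  # query coordinates
        if identity >= min_identity and scaffold in scaffold_lengths:
            intervals[scaffold].append((min(qstart, qend), max(qstart, qend)))

    flagged = {}
    for scaffold, ivs in intervals.items():
        fraction = covered_bases(ivs) / scaffold_lengths[scaffold]
        if fraction >= min_fraction:
            flagged[scaffold] = fraction
    return flagged
```

You would call `flag_contaminant_scaffolds(open("hits.tsv"), lengths)` (the file name is a placeholder) and then inspect the flagged scaffolds before removing anything. Requiring a large covered fraction, rather than any single hit, is what uses the extra power of long scaffolds: a short conserved region shared with a GenBank sequence will not by itself cause a scaffold from your target species to be discarded.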

