Over the last two months, since we wrote the ALLPATHS-LG paper, the accuracy of contigs generated by the algorithm has been substantially improved. Here’s a table of results for a test assembly consisting of the mouse chr 1 region, from 10 to 30 Mb:
| date | revision | base errors (per 10^4) | misassembly rate (%) | ambiguous bases (%) |
|---|---|---|---|---|
| 11-1-2010 | 35038 | 1.74 | 2.69 | 0.050 |
| 1-7-2011 | 35511 | 0.79 | 1.41 | 0.078 |
This is based on an analysis of ~1 kb chunks, as in the paper. It’s an all-in analysis: every error in the contigs is captured and categorized as either a base error or a misassembly. Some errors are eliminated because the assembly explicitly encodes that we don’t know the exact answer. For example, …CCTAAAAAAAAAA{,A,AA}GTC… has a run of between 10 and 12 As, and this is counted as right if the true answer is one of those. However, there is no free lunch: we also count these ‘ambiguous’ bases, as shown in the last column of the table. Their share has increased from 0.050% to 0.078%, but is still under 1 per 1000, and most are concentrated in simple sequence repeats.
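To make the ambiguity notation concrete, here is a minimal Python sketch of how a string like the one above could be expanded into the sequences it allows, and how a "counted as right" check might then work. This is only an illustration, not the ALLPATHS-LG code: the function names `expand_ambiguity` and `counts_as_right` are made up for this example, and it assumes that alternatives are comma-separated inside braces and that braces are not nested.

```python
import itertools
import re

# Assumed notation: {alt1,alt2,...} lists the alternatives at one position,
# and an empty alternative means "nothing inserted here". No nested braces.
_GROUP = re.compile(r"\{([^{}]*)\}")


def expand_ambiguity(seq):
    """Return every concrete sequence encoded by a string such as 'AAA{,A,AA}GTC'."""
    # With one capture group, re.split alternates between fixed text (even
    # indices) and the body of a brace group (odd indices).
    parts = _GROUP.split(seq)
    choices = [
        part.split(",") if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


def counts_as_right(assembled, truth):
    """True if the true sequence is one of the alternatives the assembly allows."""
    return truth in expand_ambiguity(assembled)


# The example from the text: CCT, ten As, then zero to two further As, then GTC.
contig = "CCT" + "A" * 10 + "{,A,AA}GTC"

for candidate in expand_ambiguity(contig):
    print(candidate.count("A"), candidate)

print(counts_as_right(contig, "CCT" + "A" * 11 + "GTC"))
print(counts_as_right(contig, "CCT" + "A" * 13 + "GTC"))
```

The sketch only decides whether a given true sequence is compatible with the encoded alternatives. It does not try to reproduce how ambiguous bases are tallied for the last column of the table.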