A. HELP SOURCES
A1. Something went wrong, what should I do?
Try assembling our test genome data, and look at its metadata files for hints on how to create your own.
Take a look at the k-mer spectrum for your data and see if it reveals anything unusual.
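For readers unfamiliar with the term, a k-mer spectrum is a histogram of how many distinct k-mers occur once, twice, three times, and so on in the reads. The sketch below is a minimal illustration for small inputs, not the tool ALLPATHS-LG itself uses. The function names and the choice to merge each k-mer with its reverse complement are assumptions made for this example.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_spectrum(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that many times.

    Hypothetical helper for illustration; it holds every k-mer in memory,
    so it is only suitable for small test inputs.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    # Histogram of the counts: multiplicity -> number of distinct k-mers.
    return dict(sorted(Counter(counts.values()).items()))

if __name__ == "__main__":
    for multiplicity, n_kmers in kmer_spectrum(["ACGTACGT", "TTACGTAA"], 4).items():
        print(multiplicity, n_kmers)
```

In a typical spectrum from real data, k-mers containing sequencing errors pile up at low multiplicity, while genomic k-mers form a peak near the sequencing coverage. A missing or misplaced peak is the kind of unusual feature worth investigating.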
If you are still stuck, try asking for help on the ALLPATHS user forum. Search it first and you might even find that your question has already been answered.
B. INPUT DATA
B1. Can I assemble data from ONE library using ALLPATHS-LG?
No, but we understand the need for programs that can do this, and there are some, including Velvet and ABySS. Multiple libraries enable higher assembly quality but entail more lab work.
B2. Do I need paired reads from an “overlapping” fragment library?
Yes. We use paired reads of length ~100 bases from fragments of size ~180 bp. For longer reads, somewhat longer fragments could be used.
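The word "overlapping" follows from simple arithmetic: when two reads are sequenced inward from opposite ends of a fragment that is shorter than their combined length, they share bases in the middle. The sketch below works through the numbers quoted above. The function name is hypothetical, and the calculation says nothing about how much overlap ALLPATHS-LG actually requires.

```python
def pair_overlap(read_length, fragment_size):
    """Number of bases shared by two reads sequenced inward from each end
    of a fragment. A result of zero or less means the reads do not overlap.

    Hypothetical helper for illustration only.
    """
    return 2 * read_length - fragment_size

if __name__ == "__main__":
    # Figures from the answer above: ~100-base reads, ~180 bp fragments.
    print(pair_overlap(100, 180))
```

With 100-base reads and 180 bp fragments, the two reads of a pair share about 20 bases. By the same arithmetic, longer reads leave room for somewhat longer fragments while still overlapping.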
B3. Do I need data from a jumping library?
Yes. ALLPATHS-LG works by “walking” from one end of a long-fragment read pair to the other.
B4. We’re having trouble making jumping libraries, what would you suggest?
If you use Illumina’s protocol, try writing to them to ask for help. We have successfully made many libraries in the 2-3 kb range. At present libraries from much longer fragments are hit and miss. In general jumping libraries are hard to make because there are many steps in the process and some DNA is lost at each step. Thus at the end of the process, the number of surviving unique molecules may be small.
B5. Can I use PacBio data?
Yes, so long as you have the basic data types described above, and so long as your genome is small (we have only tested on bacterial genomes).
B6. Can I use 454 data?
Not at present, as the ALLPATHS-LG algorithm is tuned for data having a very low indel rate. The per-base cost of 454 data is about 100 times higher than that of Illumina data, so we are currently not very motivated to accommodate 454. However, if 454 costs fall or Ion Torrent succeeds in generating long cheap reads (which is possible), then we will write the required code, which should work for both 454 and Ion Torrent.
B7. What test data sets are available?
C. COMPUTE NEEDS
C1. Can I run ALLPATHS-LG on a cluster?
You can, but it will only use one machine, not the entire cluster. That machine would need to have enough memory to fit the entire assembly. ALLPATHS-LG does not support distributed computing using MPI; instead, it uses shared-memory parallelization.
C2. How much memory does ALLPATHS-LG require?
Peak usage is roughly 1.7 bytes per read base, at least for mammalian-size genomes. If you find that memory usage is much higher than this, we will try to help. There are many places in the algorithm where memory usage could be reduced.
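To turn this rule of thumb into a number for planning purposes, multiply the total number of read bases by 1.7. The sketch below does this under stated assumptions: the function name is hypothetical, and the genome size and 100x coverage in the example are illustrative values, not a recommendation from this FAQ.

```python
BYTES_PER_READ_BASE = 1.7  # rule of thumb quoted above

def estimate_peak_memory_gb(genome_size_bp, coverage):
    """Rough peak memory in GB (10**9 bytes) for a given genome size and
    sequencing coverage. Hypothetical helper; actual usage will vary.
    """
    total_read_bases = genome_size_bp * coverage
    return BYTES_PER_READ_BASE * total_read_bases / 1e9

if __name__ == "__main__":
    # Assumed example: a 3 Gb genome sequenced to 100x total coverage.
    print(round(estimate_peak_memory_gb(3e9, 100)))
```

For the assumed 3 Gb genome at 100x coverage, the estimate comes to roughly 510 GB. Treat the result as an order-of-magnitude guide, since the rule was stated for mammalian-size genomes only.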
C3. Who sells large-memory servers?
We are aware of HP, Dell, and SuperMicro. SuperMicro seems to have the best prices at the moment (June 2011), roughly $32K for a 512 GB server with 48 processors.
D. RUNNING ALLPATHS-LG
D1. Can I change K?
There are dozens of heuristic parameters (such as K) that in principle could be adjusted. We do not do this ourselves for individual datasets and recommend that you don’t either. Rather, as part of controlled experiments where the genome is known, we try to choose the ‘ideal’ values.