QueryLookupTable

From ArachneWiki

QueryLookupTable
Function: Alignment
Phase: Post-processing
Standard CLAs: GDB, NO_HEADER
Special CLAs: SEQS, K, SEQ_NAMES, QUALS, QUALS_G, LOOKUP_TABLE, MAX_OFFSET_DIFF, MIN_OVERLAP, MAX_FREQ, WINNING_EDGE, MIN_COVERAGE, MIN_HITS_TO_OVERRIDE, PROGRESSION_RATIO, MAX_NQS_PERCENT, MAX_QUAL_SCORE, MAX_RMR_PERCENT, MIN_MM_PERCENT, KEEP_BEST, MIN_MUTMER_LENGTH, SINGLETON_HITS, END_STRETCH, HEURISTICS, SHOW_PREMUTMERS, SHOW_MUTMERS, SMITH_WAT, BW_ADD, LIST_UNPLACED, LIST_UNPLACED_BY_PASS, REQUIRE_PROPER, REQUIRE_FULL1, REQUIRE_FULL2, REQUIRE_POS2_ZERO, REMOVE_DOMINATED, UNPLACED_FILE, ANNOUNCE_ITERATIONS, DUMP_WINDOW, OUTPUT, PARSEABLE, PARSEABLE_BRIEF, VISUAL, VISUAL_ABBR, READABLE_BRIEF, RMR_BY_BLOCK, GLOBAL_STATS, TARGET_NAMING, QUERY_NAMING, PRINT_MEMORY_USAGE, SEQS_TO_PROCESS, TARGETS_TO_PROCESS, SEQS_TARGETS_TO_PROCESS, SEQS_TO_EXCLUDE, TMP_DIR, CHUNKS, IMPERFECT_EXTENSION, FILTER, PRINT_NQS, PRINT_RMR, PRINT_MM, PRINT_QUAL_SCORE, SW_GAP_VERBOSE, TRACEBACK_ON_INTERRUPT, MAX_PLACEMENTS, nameParser, ALIGN_UNALIGNED_BITS, UNPLACED_SEQUENCE_FILE, MAX_MISMATCHES, MAX_INDELS, MIN_MATCHES_PERCENT, MAX_ERROR_PERCENT, MIN_BASES_COVERED, MIN_QUERY_LENGTH, SYNC_ALIGNS_TO_TACG, QUIET, SEQS_IS_FASTB, QUALS_IS_QUALB, FILTER_ADD, OUTFILE, FW_ONLY, TRUNCATE_TO, MIN_TO_PRINT
Source location: ARACHNE_DIR/lookup

QueryLookupTable (QLT) is a fast heuristic aligner that aligns short sequences (queries) to long ones (reference contigs) by means of a lookup table. A similar module, ShortQueryLookup, performs the same function but is optimized for shorter reads, such as Solexa reads.

QueryLookupTable is commonly used in post-processing, to compare a test assembly against a known genome that the assembly should resemble. Evaluating the resulting nobbits provides a metric of the quality of the new assembly, as reported in assembly.ps.

QueryLookupTable is also crucial to assisted assembly. It is used to align the input reads to the reference genome, thereby validating some reads and links.

Algorithm

QueryLookupTable runs several algorithmic steps, each controlled by several command-line arguments. It has an enormous number of command-line arguments and, unlike many modules, it rewards investigating the meaning of every argument rather than simply accepting the default values.

  1. The reference is encoded as a k-mer based lookup table. Typically k=12, so the reference is encoded as the set of 12-mers that it contains, together with their positions in the reference contigs. (This step occurs in MakeLookupTable rather than in QueryLookupTable.)
  2. Each query sequence is similarly converted to k-mers.
  3. All matches ("hits") between the k-mers of the query and those of the reference are noted.
  4. Candidate alignments are found. A candidate alignment is identified by a cluster of hits.
  5. Candidate alignments are filtered heuristically. Candidates that pass the filters are converted into alignments.
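
The steps above can be illustrated with a minimal Python sketch. It is not the QueryLookupTable implementation: the function names are hypothetical, hits are clustered by exact diagonal offset (target position minus query position) as a simplifying assumption, and the heuristic filters of step 5 are reduced to a single minimum hit count. A small k is used so that the toy sequences produce hits.

```python
from collections import defaultdict

def build_lookup_table(contigs, k):
    """Step 1: map each k-mer to the (contig_id, position) pairs where it occurs."""
    table = defaultdict(list)
    for contig_id, seq in enumerate(contigs):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((contig_id, pos))
    return table

def find_hits(query, table, k):
    """Steps 2-3: split the query into k-mers and record every match in the table."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for contig_id, tpos in table.get(query[qpos:qpos + k], ()):
            hits.append((qpos, contig_id, tpos))
    return hits

def candidate_alignments(hits, min_hits):
    """Steps 4-5 (simplified): cluster hits by (contig, offset) and keep
    only clusters that contain at least min_hits hits."""
    clusters = defaultdict(list)
    for qpos, contig_id, tpos in hits:
        clusters[(contig_id, tpos - qpos)].append(qpos)
    return {key: sorted(qs) for key, qs in clusters.items() if len(qs) >= min_hits}

# Toy data: the query occurs at 0-based position 7 in contig 0 and 3 in contig 1.
contigs = ["ACGTACGTTTGACCA", "GGGTTTGACCAGGG"]
query = "TTTGACCA"

table = build_lookup_table(contigs, k=4)
hits = find_hits(query, table, k=4)
candidates = candidate_alignments(hits, min_hits=3)

for (contig_id, offset), query_positions in sorted(candidates.items()):
    print(f"contig {contig_id}, offset {offset}: {len(query_positions)} hits")
```

Each surviving key is a (contig id, offset) pair; in a real aligner such a cluster would then be extended and scored before being reported as an alignment.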

Output Format

In the output, by default, the queries and the reference contigs are identified by their numeric id, i.e. by their 0-based position in the input data.

The default output prints the alignments in brief human-readable form, stating the query and target of each alignment and the starting and ending bases on each. Those base numbers are also 0-based, and the end base is one past the last aligned base, so each range is a half-open interval [start, end).
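
As a small illustration of this coordinate convention, the Python sketch below treats a reported start and end as a 0-based, half-open range. The helper names and the example sequence are hypothetical and are not part of QueryLookupTable.

```python
def aligned_length(start, end):
    """Number of bases covered by a 0-based range whose end is one past the last base."""
    return end - start

def aligned_bases(seq, start, end):
    """Extract the covered bases; Python slices use the same half-open convention."""
    return seq[start:end]

# Hypothetical example: an alignment reported as covering bases 2 to 6 of a query.
query = "ACGTACGT"
print(aligned_length(2, 6))
print(aligned_bases(query, 2, 6))
```

Under this convention the last aligned base is at index end - 1, and two adjacent ranges [a, b) and [b, c) do not overlap.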

Output options:

  • READABLE_BRIEF=False: Turns off the default human-readable output.
  • PARSEABLE=True: Prints out the alignments in full machine-readable detail. This option outputs the alignments in LookAlign format, as described by the LookAlign module. If the output (both STDOUT and STDERR) is redirected to a QLT file (e.g., QueryLookupTable PARSEABLE=True ... >& alignments.qlt), the QLT file can then be used by further analysis modules such as PlotHitsCoverage and EvaluateConsensus.
  • VISUAL=True: Prints out the alignments in a visual 3-line format, as defined by the module PrintAlignments. The format consists of two aligned sequences, with spaces for gaps, and a line of |/* markers of indels/mismatches above them.

These options do not interfere with each other. You may set any or all of them in a given run.
