Gene Finding Methods



This document provides a general description of our automated genome annotation for prokaryotic genomes. Gene annotation at Broad is a multi-step process.

Evidence collection

  • Blast evidence:

    A blast homology search against GenBank's NR database produces a set of raw blast output. Individual blast alignments are then grouped into blast clusters by linking the alignments derived from the same blast hit. Several such overlapping blast clusters in a genomic region make up what we call a blast locus on the genome assembly. Currently, all blast hits with e-values better than 1e-10 are used as blast evidence.

  • Pfam domains:

    We run Hmmer searches with the Pfam and TIGRfam libraries to find Pfam/TIGRfam domains on six-frame translations of the genomic sequence.
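
The blast evidence step above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Broad's implementation: the Alignment record, the function names, and the choice to ignore strand and contig are all hypothetical, and "better than 1e-10" is read as strictly smaller.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one blast alignment on the genome."""
    hit_id: str    # identifier of the NR sequence that was hit
    start: int     # genomic start coordinate (inclusive)
    end: int       # genomic end coordinate (inclusive)
    evalue: float


EVALUE_CUTOFF = 1e-10  # cutoff stated in the text


def build_clusters(alignments):
    """Link alignments from the same hit into one (start, end, hit_id) span."""
    by_hit = defaultdict(list)
    for aln in alignments:
        if aln.evalue < EVALUE_CUTOFF:  # keep only hits better than the cutoff
            by_hit[aln.hit_id].append(aln)
    return sorted(
        (min(a.start for a in alns), max(a.end for a in alns), hit_id)
        for hit_id, alns in by_hit.items()
    )


def build_loci(clusters):
    """Merge overlapping cluster spans (sorted by start) into blast loci."""
    loci = []
    for start, end, hit_id in clusters:
        if loci and start <= loci[-1]["end"]:
            # Overlaps the current locus: extend it and record the hit.
            loci[-1]["end"] = max(loci[-1]["end"], end)
            loci[-1]["hits"].append(hit_id)
        else:
            loci.append({"start": start, "end": end, "hits": [hit_id]})
    return loci


example = [
    Alignment("hitA", 100, 300, 1e-50),
    Alignment("hitA", 350, 600, 1e-40),
    Alignment("hitB", 500, 900, 1e-20),
    Alignment("hitC", 2000, 2400, 1e-5),   # fails the e-value cutoff
    Alignment("hitD", 3000, 3500, 1e-30),
]
example_loci = build_loci(build_clusters(example))
```

In this toy input, the two hitA alignments form one cluster, which overlaps the hitB cluster to give a single locus, while hitD forms a second locus and hitC is discarded.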

Computational Prediction of Non-Coding RNA Features

Ribosomal RNAs (rRNAs) are identified with RNAmmer (Lagesen et al., Nucleic Acids Res., 2007, Apr 22). The tRNA features are identified using tRNAscan-SE (Lowe & Eddy, Nucleic Acids Res., 1997, 25, 955-964). Other common RNA features are identified with Rfam (Griffiths-Jones et al., Nucleic Acids Res., 2005, 33, D121-D124).

Ab initio Gene Models

Ab initio gene models are predicted using computational gene prediction programs such as GeneMark (Borodovsky & McIninch, Comp. Chem., 1993, 17, 123-133), Glimmer3 (Delcher et al., Nucleic Acids Res., 1999, 27, 4636-4641), and MetaGene (Noguchi et al., Nucleic Acids Res., 2006, 34, 5623-5630).

Gene Models Based on Annotation Transfer from Reference Genome(s)

Well-curated annotations from reference genomes, if available, are transferred to the current genome assembly to improve our automated annotation. Broad's in-house synteny-based gene transfer process has two main steps. First, we find collinear blocks between the two genomes by creating pair-wise alignments between them, and then generate a global alignment for the entire region that the collinear blocks cover. Second, we use an in-house gene mapping program to transfer genes from the reference onto the target genome within the specific syntenic blocks, and we use GeneWise to further refine the gene model at each locus.

Gene Models Based on Blast Evidence

Broad's in-house program, "findBlastOrfs", leverages BLASTX alignments to build a complete gene model from the hit. It is particularly useful in low-coverage genomes with frame shifts or gaps in coverage. Ab initio gene predictors generally produce truncated predictions at best, and no prediction at worst, when they encounter an incorrect stop produced by a frame shift. Furthermore, ab initio tools generally produce wildly different results when confronted with a sequence with gaps in it.
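
To make the frame shift problem concrete, the toy example below counts codons up to the first in-frame stop, before and after a single-base insertion. It is only an illustration of why a reading-frame scan gets truncated; it is not findBlastOrfs, and the sequence and function name are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_before_stop(seq, start=0):
    """Count codons read from `start` until the first in-frame stop codon
    (or until the sequence runs out of complete codons)."""
    count = 0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return count
        count += 1
    return count


# Intact toy gene: ATG GCT GAT AGC GCT GCT TAA
intact = "ATGGCTGATAGCGCTGCTTAA"

# Simulated sequencing error: one extra base inserted after the start codon.
# The reading frame becomes ATG CGC TGA ..., so a stop appears early.
shifted = intact[:3] + "C" + intact[3:]

intact_codons = codons_before_stop(intact)    # 6
shifted_codons = codons_before_stop(shifted)  # 2
```

A protein-level alignment to a homolog can span such a break, which is the kind of evidence a blast-based model builder can exploit where a frame-by-frame scan cannot.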

Selection of Consensus Gene Models

Broad's automated gene calling process uses a rule-based selection process to evaluate the evidence and build consensus gene models.

Ab initio predictions, models generated using blast hits against NR, transferred reference gene models, and manual gene models are clustered into potential gene loci. At each locus, we select the most likely non-conflicting gene models based on the best evidence available, e.g., Pfam hits, length agreement with the BLAST hits, and overlap with non-coding RNA features. Despite all the progress in the field of gene finding, accurate gene finding on draft genomes is still a challenge. We therefore track likely problematic gene models and tag them with curation flags and notes in the gene report to alert users to the nature of the problems. Manual annotators also use these tags to target bad gene models for manual editing and fine-tuning.
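
A rule-based selection of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the GeneModel fields, the scoring weights, and the simplification of keeping exactly one model per locus are hypothetical and do not reproduce Broad's actual rules.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Hypothetical candidate gene model with its supporting evidence."""
    name: str
    source: str                       # e.g. "ab_initio", "blast", "transfer", "manual"
    start: int
    end: int
    has_pfam_hit: bool = False
    blast_length_ratio: float = 0.0   # model length / BLAST hit length; 0.0 = no hit
    overlaps_ncrna: bool = False


def score(model):
    """Illustrative evidence score; the weights are arbitrary assumptions."""
    s = 0.0
    if model.has_pfam_hit:
        s += 2.0
    if model.blast_length_ratio > 0:
        # Reward length agreement with the BLAST hit (1.0 = same length).
        s += 1.0 - min(abs(1.0 - model.blast_length_ratio), 1.0)
    if model.overlaps_ncrna:
        s -= 3.0   # penalize models that collide with non-coding RNA features
    return s


def cluster_into_loci(models):
    """Group models whose coordinates overlap into candidate loci."""
    loci = []
    current_end = None
    for m in sorted(models, key=lambda m: (m.start, m.end)):
        if loci and m.start <= current_end:
            loci[-1].append(m)
            current_end = max(current_end, m.end)
        else:
            loci.append([m])
            current_end = m.end
    return loci


def select_consensus(models):
    """Keep the best-scoring model at each locus (a simplification)."""
    return [max(locus, key=score) for locus in cluster_into_loci(models)]
```

For example, given an ab initio model with poor length agreement and an overlapping blast-derived model with a Pfam hit and full-length agreement, select_consensus keeps the blast-derived model for that locus.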

Assigning Gene Product Names

Our current gene naming protocol relies on high-confidence blast homology and, in some cases, community input to assign gene product names.

Gene Numbering

Every annotated gene is given a Locus Number of the form xxxG_#####, which is guaranteed to identify a gene uniquely, both at the Broad Institute web site and in GenBank.
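
For scripts that consume these identifiers, a small validator can be handy. The sketch below reads the pattern xxxG_##### as three uppercase letters or digits, a literal "G", an underscore, and exactly five digits; that reading is an assumption, as are the function names.

```python
import re

# Assumed interpretation of xxxG_#####: 3 alphanumerics + "G_" + 5 digits.
LOCUS_NUMBER = re.compile(r"[A-Z0-9]{3}G_\d{5}")


def is_locus_number(text):
    """Return True if `text` matches the assumed Locus Number format."""
    return LOCUS_NUMBER.fullmatch(text) is not None


def format_locus_number(prefix, number):
    """Build a Locus Number from a 3-character prefix and a gene number."""
    candidate = f"{prefix}G_{number:05d}"
    if not is_locus_number(candidate):
        raise ValueError(f"cannot form a valid Locus Number from {prefix!r}, {number!r}")
    return candidate
```

With these definitions, format_locus_number("ABC", 42) gives "ABCG_00042", and is_locus_number("ABCG_42") is False because the numeric part is not five digits long.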