Gene Finding Methods
- Evidence collection
- Building EST-based gene models
- Computational prediction of tRNAs and RNAs
- Training Gene Prediction Programs
- Computational Gene Models
- Transfer Annotation
- Selection of Consensus Gene Models
- Predicting Alternatively Spliced Transcript Models
- Predicting UTR
- Assigning Gene Product Names
- Reporting Annotation Accuracy
- Gene Numbering
This document provides a general description of our automated genome annotation for eukaryotic genomes.
Gene annotation at Broad is a multi-step process.
Evidence collection
- Blast evidence: A BLAST homology search against GenBank's NR database produces a set of raw BLAST output. Individual BLAST alignments are then clustered into single BLAST clusters by linking the alignments derived from the same BLAST hit. Several such overlapping BLAST clusters on the genomic axis represent what we call BLAST loci on the genome assembly. Currently, all BLAST data with e-values lower (more significant) than 1e-10 are considered usable BLAST evidence.
- EST evidence: For most of our fungal and other eukaryotic genomes, we use species-specific ESTs sequenced here at Broad as well as publicly available EST sequences from GenBank to produce a set of high confidence gene models.
- Pfam domains: We run HMMER searches using the Pfam library to find Pfam domains on six-frame translations of the genomic sequence.
- EST alignments: First, we align ESTs to the genome using BLAT and then collapse them into distinct transcript clusters using a Broad-developed program called CallReferenceGenes (described below). EST alignments with at least 90% identity over at least 50% of the EST length and with canonical splice junctions are considered valid EST alignments suitable for building gene models.
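The two filtering ideas above (merging overlapping BLAST clusters into loci, and keeping only EST alignments that pass the identity, coverage and splice-site criteria) can be illustrated with a minimal Python sketch. This is not the Broad pipeline code: the `EstAlignment` fields, the function names, and the choice of GT..AG as the only canonical splice pair are assumptions made for illustration, and intervals are assumed to be half-open `(start, end)` coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: only GT..AG introns are treated as canonical in this sketch.
CANONICAL_SPLICE = {("GT", "AG")}


@dataclass
class EstAlignment:
    """Hypothetical summary of one EST-to-genome alignment."""
    est_length: int                       # full length of the EST in bases
    aligned_bases: int                    # EST bases covered by the alignment
    matches: int                          # identical bases in the aligned region
    splice_sites: List[Tuple[str, str]]   # (donor, acceptor) dinucleotides per intron


def is_valid_est_alignment(aln: EstAlignment,
                           min_identity: float = 0.90,
                           min_coverage: float = 0.50) -> bool:
    """Apply the 90% identity / 50% coverage / canonical splice criteria."""
    if aln.est_length == 0 or aln.aligned_bases == 0:
        return False
    identity = aln.matches / aln.aligned_bases
    coverage = aln.aligned_bases / aln.est_length
    canonical = all(site in CANONICAL_SPLICE for site in aln.splice_sites)
    return identity >= min_identity and coverage >= min_coverage and canonical


def merge_into_loci(clusters: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping half-open (start, end) clusters into loci."""
    loci: List[Tuple[int, int]] = []
    for start, end in sorted(clusters):
        if loci and start < loci[-1][1]:
            # Overlaps the current locus: extend it.
            loci[-1] = (loci[-1][0], max(loci[-1][1], end))
        else:
            loci.append((start, end))
    return loci
```

Here identity is measured within the aligned region and coverage as the aligned fraction of the EST; the source text does not spell out these definitions, so other readings are possible.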
Building EST-based gene models
EST clusters produced in the previously described section are used as inputs for building high confidence gene models for the purposes of training and evaluating gene finding tools.
FindEstOrf is an internal tool used for constructing CDS from EST transcripts. This tool assigns start and stop codons to create the longest ORF for each transcript. We then compare these constructed ORFs to overlapping BLAST evidence and pick the ones whose reading frame and protein length are comparable to the best BLAST hits from closely related species. FL_EST gene models created by this process are fully covered by EST alignments and are within ±20% of the length of homologous proteins.
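As a rough illustration of the two steps described above, the sketch below finds the longest ATG-to-stop ORF on the forward strand of a transcript and checks a protein length against a homolog within a ±20% tolerance. It is not FindEstOrf itself: the function names are hypothetical, only the forward strand and the standard stop codons are considered, and partial ORFs without a start or stop codon are ignored.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript: str) -> Optional[Tuple[int, int]]:
    """Return half-open (start, end) of the longest ATG..stop ORF, or None.

    The end coordinate includes the stop codon. Forward strand only.
    """
    seq = transcript.upper()
    best: Optional[Tuple[int, int]] = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this open stretch
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None                   # look for the next ORF in this frame
    return best


def within_length_tolerance(orf_protein_len: int,
                            homolog_protein_len: int,
                            tolerance: float = 0.20) -> bool:
    """True if the ORF protein length is within +/- tolerance of the homolog."""
    return abs(orf_protein_len - homolog_protein_len) <= tolerance * homolog_protein_len
```

For example, `longest_orf("CCATGAAATTTTAGGG")` locates the ORF that starts at the ATG at offset 2 and ends after the TAG stop codon.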
In addition, we build EST-based gene models by a manual process. By combining BLAST, EST and ab initio predictions produced using a parameter file from a related genome, manual annotators carefully build gene models that are otherwise missed by our highly conservative automated findORF script.
Computational prediction of tRNAs and RNAs
tRNAscan-SE is used for detecting tRNAs on the genome assembly. We use RNA-finding programs such as RNAmmer and RFamSearch to detect other common RNA features.
Training Gene Prediction Programs
Commonly used gene finding programs such as Augustus, GENEID, GeneMark, FGENESH and SNAP are trained, either in-house or by the developers of these programs, using the high confidence EST gene sets.
Computational Gene Models
Gene structures are predicted using a combination of gene models from computational gene prediction programs such as FGENESH, GENEID and GeneMark, together with automated and manual EST-based gene models. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL. GeneMark is another gene finding tool, developed by Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993). Both GENEID and FGENESH are usually trained using a set of high confidence EST-based gene models generated by clustering BLAT-aligned species-specific ESTs. We also use other common gene finding programs such as SNAP (Ian Korf) and Augustus (Mario Stanke) when genome-specific parameter files are available.
After training the gene prediction programs, we run each of them on the assembly and evaluate their performance by comparing the gene models with EST and blast evidence. Those that perform adequately are used in the automated gene calling pipeline. The modular architecture of our automated pipeline makes it easy to incorporate new gene prediction programs and customize the pipeline to suit the genomes annotated.
Transfer Annotation
We use well-curated annotations from other sources and reference genomes to improve our automated annotations. Broad's in-house synteny-based gene transfer process has two main steps:
First, we generate collinear block alignments by creating pairwise alignments between the two genomes. A global alignment is then generated for the entire region that each collinear block covers.
Second, an in-house gene mapping program transfers genes from the reference genome onto the target genome within the specific syntenic blocks. We use GeneWise to further refine the gene model at each locus.
Selection of Consensus Gene Models
Broad's automated gene calling process uses a rule-based selection process to evaluate the evidence and build consensus gene models.
Ab initio predictions, blast and EST alignments, reference gene models, and manual and automated EST-gene models are clustered into potential gene loci. We select the most likely non-conflicting gene models based on the best evidence available at each locus. Our method uses heuristics such as splice agreement with ESTs and relative overlap with the BLAST hits to choose the prediction most in accord with the evidence. It does not have an internal model of gene structure and thus runs on a wide variety of eukaryotic and prokaryotic organisms without training.
Our gene selection process offers several useful options to exclude calling genes at certain loci. For example, we can exclude genes in regions with tRNA, rRNA and known repeat elements using a conservative overlap criterion.
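One way to picture such an exclusion option is an overlap filter like the Python sketch below. The source does not state the actual overlap criterion, so the `max_overlap` threshold, the function names, and the use of half-open `(start, end)` intervals are assumptions chosen purely for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on one sequence


def overlap_fraction(gene: Interval, feature: Interval) -> float:
    """Fraction of the gene interval that is covered by the feature."""
    g_start, g_end = gene
    f_start, f_end = feature
    shared = max(0, min(g_end, f_end) - max(g_start, f_start))
    return shared / (g_end - g_start)


def exclude_masked_genes(genes: List[Interval],
                         masked_features: List[Interval],
                         max_overlap: float = 0.5) -> List[Interval]:
    """Keep genes whose overlap with every masked feature stays below max_overlap.

    masked_features would hold tRNA, rRNA and repeat intervals;
    max_overlap is a hypothetical threshold, not the pipeline's value.
    """
    return [
        gene for gene in genes
        if all(overlap_fraction(gene, feat) < max_overlap for feat in masked_features)
    ]
```

A stricter or looser criterion only requires changing `max_overlap`, which is the kind of per-genome customization the text describes.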
Despite all the progress in the field of gene finding, accurate gene finding on draft genomes is still a challenge. We make an effort to track easily identifiable problematic gene models and tag them with curation flags and notes in the gene report to alert users to the nature of the problems. Manual annotators also use these tags to target problematic gene models for manual editing and fine-tuning.
Predicting Alternatively Spliced Transcript Models
We do not predict alternatively spliced transcripts unless there is manual or full-length EST evidence in support of their existence. Transcript models that differ from the reference gene models only in their un-spliced 3' and 5' ends are not considered evidence for alternative splicing. We accept only canonically spliced and uniquely aligned ESTs with alternate splice junctions as valid alternatively spliced transcripts.
Predicting UTR
When an EST alignment uniquely aligns with >= 95% identity and overlaps a gene prediction, and the region of overlap has absolute labeling agreement (i.e., every nucleotide in the region of overlap is exonic in both the prediction and the alignment, or is intronic in both), we consider the prediction and alignment to be compatible. UTR predictions are generated by walking out along these compatible EST alignments from the end(s) of each prediction. A chain of one or more overlapping, compatible EST alignments that begins at the 5' or 3' end forms each UTR extension. When an EST aligns to more than one location on the genome, or touches more than one gene prediction (on either strand), it is ignored.
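The labeling agreement test can be sketched in Python as follows. This is an illustrative reading of the rule, not the pipeline's implementation: exons are assumed to be sorted, half-open `(start, end)` intervals on the same strand, any position inside a model's span that is not exonic is treated as intronic, and the identity and uniqueness checks are left out. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # half-open (start, end), sorted by start


def exonic_positions(exons: List[Exon], lo: int, hi: int) -> Set[int]:
    """Positions in [lo, hi) that fall inside any exon."""
    return {p for start, end in exons for p in range(max(start, lo), min(end, hi))}


def is_compatible(pred_exons: List[Exon], est_exons: List[Exon]) -> bool:
    """True if prediction and EST alignment agree at every overlapping position.

    The region of overlap is the intersection of the two spans. Within it,
    each position must be exonic in both models or intronic in both.
    """
    lo = max(pred_exons[0][0], est_exons[0][0])
    hi = min(pred_exons[-1][1], est_exons[-1][1])
    if lo >= hi:
        return False  # the two spans do not overlap at all
    return exonic_positions(pred_exons, lo, hi) == exonic_positions(est_exons, lo, hi)
```

In this sketch an EST whose final exon runs past the last exon of a compatible prediction would be a candidate for extending the 3' UTR; per-position sets keep the code short but an interval-based comparison would scale better on long genes.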
Assigning Gene Product Names
We do not assign gene symbols. Instead, our gene naming protocol currently relies on high confidence blast homology and in some cases, community inputs to assign gene product names. We hope to improve the gene naming process in the future based on other functional annotation protocols and tools.
Reporting Annotation Accuracy
We run in-depth reports on each annotation we produce to measure annotation accuracy. Following Burset and Guigo's "Evaluation of gene structure prediction programs" (Genomics, 1996 Jun 15; 34(3):353-67), a set of accuracy statistics is compiled for each prediction that touches an EST. We calculate specificity, sensitivity, correlation coefficient and simple matching coefficient at the level of nucleotides, splice sites, introns and exons.
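At the nucleotide level, these four statistics can be computed from the counts of true and false positives and negatives. The Python sketch below assumes the convention common in gene prediction evaluation, where specificity is TP / (TP + FP); the function name and the representation of models as sets of coding positions are illustrative choices, not the reporting code itself.

```python
import math
from typing import Dict, Set


def nucleotide_accuracy(predicted: Set[int],
                        reference: Set[int],
                        length: int) -> Dict[str, float]:
    """Sensitivity, specificity, correlation and simple matching coefficient.

    predicted / reference: sets of positions labeled as coding.
    length: total number of positions evaluated.
    """
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    tn = length - tp - fp - fn

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0   # gene-finding convention
    smc = (tp + tn) / length if length else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0  # undefined cases reported as 0

    return {"sensitivity": sensitivity, "specificity": specificity,
            "correlation": cc, "simple_matching": smc}
```

The same counting scheme carries over to splice sites, introns and exons by replacing positions with the corresponding features, although a true-negative count is usually not meaningful at those levels.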
Gene Numbering
Every annotated gene is given a locus number of the form MGYG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even across different assemblies. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure.