Gene Finding Methods
- Evidence Collection
- Ab Initio Gene Predictions
- Transfer Annotations
- Building Consensus Gene Models
- Predicting Alternatively Spliced Transcript Models
- Predicting UTR
- Predicting Non-coding RNA Genes
- Filtering spurious gene calls and identifying problematic gene models
- Assigning Gene Product Names
- Gene Numbering
- Reporting Annotation Accuracy
This document gives a general description of our automated annotation pipeline for eukaryotic genomes. Gene annotation at Broad is a multi-step process that starts with sequence assembly insertion and proceeds with automated annotation and manual review. Once complete, the annotations are released and submitted to NCBI.
Evidence Collection
We collect three types of evidence before running our in-house gene caller to build consensus gene models: BLAST and Pfam hits, EST reference genes, and ab initio gene predictions.
- Blast and Pfam Evidence:
We use blastx (E-value cutoff 1e-10) to search the GenBank NR database, which produces a set of raw BLAST outputs. Individual BLAST alignments are then collapsed into single BLAST clusters, and we refer to these collapsed BLAST hits (each overlapping a unique genomic region) as "single blast hit loci" on the genome assembly.
We also run HMMER searches (E-value cutoff 0.01) against a Pfam library to find Pfam domains in six-frame translations of the genomic sequence.
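The collapsing step can be pictured as a standard interval merge. The sketch below is illustrative only: the `(contig, start, end)` tuple layout and the function name `collapse_hits` are assumptions, strand is ignored, and the actual in-house clustering rules may differ.

```python
from typing import List, Tuple

# Hypothetical hit layout: (contig, start, end), 1-based inclusive coordinates.
Hit = Tuple[str, int, int]

def collapse_hits(hits: List[Hit]) -> List[Hit]:
    """Merge overlapping hits on the same contig into single loci."""
    loci: List[Hit] = []
    for contig, start, end in sorted(hits):
        if loci and loci[-1][0] == contig and start <= loci[-1][2]:
            # Overlaps the previous locus: extend it instead of starting a new one.
            prev = loci[-1]
            loci[-1] = (contig, prev[1], max(prev[2], end))
        else:
            loci.append((contig, start, end))
    return loci

raw_hits = [("c1", 100, 200), ("c1", 150, 300), ("c1", 400, 500), ("c2", 120, 180)]
loci = collapse_hits(raw_hits)
```

Sorting first means each hit only needs to be compared with the most recent locus, so the merge is a single pass over the sorted hits.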
- Building EST Reference Genes:
For most of our fungal and other eukaryotic genomes, we build a set of EST reference gene models from species-specific ESTs generated here at the Broad Institute, together with EST collections that are publicly available from other sequencing centers and from GenBank.
First, we align ESTs to the genome using BLAT and then collapse them into distinct transcript clusters using a Broad-developed program called CallReferenceGenes. EST alignments with at least 90% identity over at least 50% of the EST length, and with canonical splice junctions, are considered valid EST alignments suitable for building gene models.
For each EST-derived transcript cluster we also identify the best possible ORF. From these transcripts, we pull out a subset of high confidence reference genes whose ORF length is within 80% to 120% of the length of the top BLAST hits on the corresponding loci. This set is used to train ab initio gene prediction programs.
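As a rough illustration of the two thresholds described above, the following sketch encodes the alignment filter and the ORF length test. The `EstAlignment` fields, the function names, and the choice of GT-AG as the only canonical splice pair are assumptions made for the example; CallReferenceGenes itself is not shown here. The two lengths passed to `is_reference_gene` are assumed to be in the same unit.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: only GT-AG introns count as canonical in this sketch.
CANONICAL = {("GT", "AG")}

@dataclass
class EstAlignment:
    identity: float                       # fraction of identical bases, 0.0 to 1.0
    aligned_len: int                      # EST bases covered by the alignment
    est_len: int                          # full length of the EST
    splice_sites: List[Tuple[str, str]]   # (donor, acceptor) dinucleotides per intron

def is_valid_alignment(a: EstAlignment) -> bool:
    """At least 90% identity over at least 50% of the EST, canonical splices only."""
    return (a.identity >= 0.90
            and 2 * a.aligned_len >= a.est_len
            and all(site in CANONICAL for site in a.splice_sites))

def is_reference_gene(orf_len: int, top_hit_len: int) -> bool:
    """ORF length within 80% to 120% of the top hit length (integer arithmetic)."""
    return 80 * top_hit_len <= 100 * orf_len <= 120 * top_hit_len
```

Integer comparisons are used for the length tests so that values sitting exactly on a threshold are not affected by floating-point rounding.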
Ab Initio Gene Predictions
Commonly used gene finding programs include GeneMark-ES, GENEID, Augustus, FGENESH and SNAP. GeneMark-ES is a self-training eukaryotic gene finding tool developed by Mark Borodovsky's group (Lomsadze & Borodovsky, NAR, 2005; Ter-Hovhannisyan & Borodovsky, Genome Research, 2008); because it uses unsupervised training, no reference genes are required. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL. Both GENEID and FGENESH are usually trained using a set of high confidence EST-based gene models. We run SNAP (Ian Korf) and Augustus (Mario Stanke) when genome-specific parameter files are available.
Programs that require training are trained, either in-house or by their developers, using high confidence EST reference genes. If fewer than 150 EST reference genes are available, we review a subset of loci with good EST coverage and manually build gene structures using all of the available evidence. These manually curated genes are then added to the reference set.
Trained ab initio prediction programs are run on assembled genomes and gene model predictions are evaluated in comparison to whole EST-based gene sets via detailed gene reports. The diagnostic gene reports are used to set an order of priority for the trained ab initio prediction programs which is followed when building consensus gene models.
Transfer Annotations
High quality annotations from external sources and from reference genomes are also used to improve our automated annotations. Broad's in-house synteny-based gene transfer process has two main steps:
- Pairwise alignments between the target genome and a reference genome are used to generate collinear block alignments. A global alignment is then generated for the entire region that each collinear block covers.
- An in-house gene mapping program is then used to transfer genes from the reference genome onto the target genome within the specific syntenic blocks. We use GeneWise to further refine the gene model at each locus.
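The coordinate-mapping idea behind the transfer step can be sketched as follows. This is not the in-house mapping program: the segment layout `(ref_start, target_start, length)`, the function names, and the restriction to forward-strand, gapless aligned segments are all simplifying assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: one gapless aligned segment of a collinear block,
# given as (ref_start, target_start, length), both on the forward strand.
Segment = Tuple[int, int, int]
Exon = Tuple[int, int]

def map_position(pos: int, block: List[Segment]) -> Optional[int]:
    """Map a reference coordinate to the target, or None if it falls in a gap."""
    for ref_start, tgt_start, length in block:
        if ref_start <= pos < ref_start + length:
            return tgt_start + (pos - ref_start)
    return None

def transfer_exons(exons: List[Exon], block: List[Segment]) -> Optional[List[Exon]]:
    """Transfer every exon boundary; give up if any boundary is unaligned."""
    mapped: List[Exon] = []
    for start, end in exons:
        s, e = map_position(start, block), map_position(end, block)
        if s is None or e is None:
            return None
        mapped.append((s, e))
    return mapped

block = [(100, 1100, 50), (160, 1150, 40)]
transferred = transfer_exons([(100, 140), (165, 190)], block)
```

A model transferred this way is only a starting point; in the pipeline described above, the gene structure at each locus is further refined afterwards.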
Building Consensus Gene Models
The automated process of our annotation pipeline predicts consensus gene models using a rule-based selection process to evaluate the evidence from a combination of the following methods:
- Ab initio predictions such as GeneMark, FGENESH and GENEID;
- EST-based reference gene models;
- Gene transfers and manually curated gene models from reference genomes or external sources.
The automated predictions are then reviewed, and we select the most likely non-conflicting gene models based on the best evidence available at each locus. Our in-house method uses heuristics such as splice agreement with ESTs and relative overlap with the BLAST hits to choose the prediction in highest agreement with the evidence. Because it has no internal model of gene structure, it runs on a wide variety of eukaryotic and prokaryotic organisms without training.
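A minimal sketch of this kind of rule-based selection is shown below. The scoring tuple (EST-supported introns, then BLAST coverage, then predictor priority), the greedy non-overlap rule, and all names are assumptions chosen for illustration; the actual heuristics and their weights are not specified in this document. BLAST loci are assumed to be already collapsed, so they do not overlap each other.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

Interval = Tuple[int, int]  # inclusive coordinates

@dataclass(frozen=True)
class Candidate:
    source: str                    # name of the method that produced the model
    span: Interval                 # gene start and end
    introns: FrozenSet[Interval]   # intron coordinates of the model

def overlap(a: Interval, b: Interval) -> int:
    """Number of shared positions between two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def evidence_score(c: Candidate, est_introns: Set[Interval],
                   blast_loci: List[Interval], priority: List[str]):
    splice_agreement = len(c.introns & est_introns)
    span_len = c.span[1] - c.span[0] + 1
    blast_cover = sum(overlap(c.span, b) for b in blast_loci) / span_len
    # Earlier entries in the priority list win ties.
    return (splice_agreement, blast_cover, -priority.index(c.source))

def select_consensus(cands: List[Candidate], est_introns: Set[Interval],
                     blast_loci: List[Interval], priority: List[str]) -> List[Candidate]:
    ranked = sorted(cands, reverse=True,
                    key=lambda c: evidence_score(c, est_introns, blast_loci, priority))
    chosen: List[Candidate] = []
    for c in ranked:
        # Keep a model only if it does not conflict with a better-scoring one.
        if all(overlap(c.span, k.span) == 0 for k in chosen):
            chosen.append(c)
    return sorted(chosen, key=lambda c: c.span)

a = Candidate("GeneMark", (100, 900), frozenset({(300, 400)}))
b = Candidate("GENEID", (150, 950), frozenset({(300, 420)}))
c = Candidate("FGENESH", (2000, 2500), frozenset())
consensus = select_consensus([a, b, c], {(300, 400)},
                             [(100, 900), (2000, 2400)],
                             ["FGENESH", "GeneMark", "GENEID"])
```

Here `a` wins its locus because its intron matches the EST evidence, and `c` is kept because nothing better overlaps it. Nothing in the scoring depends on a trained gene model, which matches the point made above about running without training.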
Predicting Alternatively Spliced Transcript Models
We do not predict alternatively spliced transcripts unless there is manual or full-length EST evidence in support of their existence. Transcript models that differ from the reference gene models only in their unspliced 5' and 3' ends are not considered evidence for alternative splicing. Only canonically spliced, uniquely aligned ESTs with alternate splice junctions are accepted as valid support for alternatively spliced transcripts.
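The rule above can be expressed as a comparison of intron sets. In this sketch the function name, its arguments, and the representation of introns as coordinate pairs are assumptions; the EST is assumed to overlap the gene in question.

```python
from typing import Iterable, Tuple

Intron = Tuple[int, int]  # (first intronic base, last intronic base)

def supports_alternative_transcript(est_introns: Iterable[Intron],
                                    ref_introns: Iterable[Intron],
                                    uniquely_aligned: bool,
                                    canonical_splices: bool) -> bool:
    """True if the EST carries at least one splice junction the reference lacks."""
    if not (uniquely_aligned and canonical_splices):
        return False
    novel = set(est_introns) - set(ref_introns)
    return len(novel) > 0

ref = [(300, 400), (600, 700)]
```

An EST whose introns are all present in the reference model, or one that has no introns at all, differs at most in its unspliced ends and therefore returns `False`.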
Predicting UTR
When an EST alignment uniquely aligns with >= 95% identity and overlaps a gene prediction, and the region of overlap has absolute labeling agreement (i.e., every nucleotide in the region of overlap is exonic in both the prediction and the alignment, or is intronic in both), we consider the prediction and alignment to be compatible. UTR predictions are generated by walking out along these compatible EST alignments from the end(s) of each prediction. A chain of one or more overlapping, compatible EST alignments that begins at the 5' or 3' end forms each UTR extension. An EST that aligns to more than one location on the genome, or that touches more than one gene prediction (on either strand), is ignored.
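The compatibility test and the walk-out can be sketched as below. The exon-list representation, the function names, and the simplification of tracking only the outer boundary of the extension (on the forward strand, at the downstream end) are assumptions for the example, not the pipeline's actual code.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # inclusive coordinates, exons sorted by position

def exonic_positions(exons: List[Exon], lo: int, hi: int) -> Set[int]:
    """Positions inside [lo, hi] that are labeled exonic."""
    return {p for s, e in exons for p in range(max(s, lo), min(e, hi) + 1)}

def compatible(pred_exons: List[Exon], est_exons: List[Exon],
               identity: float, unique: bool) -> bool:
    """Unique alignment, >= 95% identity, identical labels over the overlap."""
    if not unique or identity < 0.95:
        return False
    lo = max(pred_exons[0][0], est_exons[0][0])
    hi = min(pred_exons[-1][1], est_exons[-1][1])
    if lo > hi:
        return False  # the two do not overlap at all
    # Equal exonic sets over the overlap imply equal intronic sets as well.
    return exonic_positions(pred_exons, lo, hi) == exonic_positions(est_exons, lo, hi)

def extend_end(pred_end: int, est_spans: List[Tuple[int, int]]) -> int:
    """Walk outward along overlapping compatible ESTs from the prediction end."""
    end = pred_end
    extended = True
    while extended:
        extended = False
        for start, stop in est_spans:
            if start <= end < stop:
                end = stop
                extended = True
    return end

pred = [(100, 200), (300, 400)]
```

For instance, `extend_end(1000, [(900, 1100), (1050, 1300), (1500, 1600)])` chains the first two alignments and stops before the third, which does not overlap the chain.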
Predicting Non-coding RNA Genes
We use RNA finding programs such as RNAmmer and RFamSearch to detect common RNA features such as ribosomal RNAs. tRNAscan-SE is used to find tRNAs on the genome assembly. Our gene selection process offers several options to exclude gene calls in regions containing tRNAs, rRNAs and known repeat elements, using a conservative overlap criterion.
Filtering spurious gene calls and identifying problematic gene models
Despite all the progress in the field of gene finding, accurate gene finding on draft genomes is still a challenge. We make an effort to track easily identifiable problematic gene models and tag them with curation flags and notes in the gene report, to alert users to the nature of the problems. Manual annotators also use these tags to target manual editing and fine-tuning of bad gene models.
Following our automated and manual annotation review process, we filter out any low confidence genes (e.g., genes that are predicted by only one prediction method, lack BLAST and HMMER evidence support, and have a CDS/gene length ratio of less than one third). We also filter out predicted transposons before generating the final consensus gene set.
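The example criterion given above can be written as a simple predicate. The `GeneCall` fields and the function name are hypothetical, and the real filter may combine further rules that are not described here.

```python
from dataclasses import dataclass

@dataclass
class GeneCall:
    n_predictors: int   # number of prediction methods supporting the model
    has_blast: bool     # any BLAST evidence at the locus
    has_pfam: bool      # any HMMER/Pfam evidence at the locus
    cds_len: int        # total coding length
    gene_len: int       # genomic length of the gene, including introns

def is_low_confidence(g: GeneCall) -> bool:
    """Single-method call, no homology evidence, CDS under one third of the gene."""
    return (g.n_predictors <= 1
            and not g.has_blast
            and not g.has_pfam
            and 3 * g.cds_len < g.gene_len)
```

All three conditions must hold together, so a single-method prediction with any BLAST or Pfam support is retained.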
Assigning Gene Product Names
Our gene naming protocol currently relies on high confidence BLAST homology and, in some cases, community input to assign gene product names. We hope to improve the gene naming process in the future based on other functional annotation protocols and tools. We use four types of names depending on the available evidence:
- known protein name: supported by a BLAST hit that is a known protein;
- conserved hypothetical protein: supported by BLAST evidence that is not a known protein (e.g., conserved hypothetical protein, predicted protein);
- hypothetical protein: supported by reference EST evidence only;
- predicted protein: no BLAST or EST evidence.
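The four naming cases form a small decision ladder, sketched below. How a hit is judged to be a "known protein" is not specified in this document; matching the hit description against the `UNINFORMATIVE` markers is an assumption made for the example, as are the function and parameter names.

```python
from typing import Optional

# Assumed markers for hits that are not known proteins.
UNINFORMATIVE = ("hypothetical protein", "predicted protein")

def product_name(top_hit: Optional[str], has_est: bool) -> str:
    """Pick one of the four name types from the available evidence."""
    if top_hit is not None:
        description = top_hit.lower()
        if any(marker in description for marker in UNINFORMATIVE):
            return "conserved hypothetical protein"
        return top_hit  # known protein: reuse the name of the hit
    return "hypothetical protein" if has_est else "predicted protein"
```

For example, `product_name("chitin synthase 1", False)` keeps the hit name, while `product_name(None, True)` falls through to "hypothetical protein".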
Gene Numbering
Every annotated gene is given a Locus Tag of the form xxxG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene, even across different assemblies. Locus tags are simply identifiers and are not guaranteed to have any particular order or internal structure.
Reporting Annotation Accuracy
We run in-depth reports on each annotation we produce to get a measure of our annotation accuracy. A host of accuracy statistics is compiled for each prediction that touches an EST, following Burset and Guigó, "Evaluation of gene structure prediction programs" (Genomics, 1996 Jun 15; 34(3):353-67). We calculate specificity, sensitivity, correlation coefficient and simple matching coefficient at the levels of nucleotides, splice sites, introns and exons.
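At the nucleotide level, the four statistics can be computed from a confusion matrix as in the sketch below. The function names and the set-based input are assumptions for the example. In this gene-finding convention, specificity is TP / (TP + FP), which other fields call precision. The code does not guard against empty classes, so degenerate inputs would raise a division error.

```python
from math import sqrt
from typing import Dict, Set, Tuple

def confusion(pred: Set[int], truth: Set[int], region_len: int) -> Tuple[int, int, int, int]:
    """Count TP, FP, FN, TN from sets of coding positions within one region."""
    tp = len(pred & truth)
    fp = len(pred - truth)
    fn = len(truth - pred)
    return tp, fp, fn, region_len - tp - fp - fn

def accuracy(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Sensitivity, specificity, correlation and simple matching coefficients."""
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fn * fp) / sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    smc = (tp + tn) / (tp + fp + fn + tn)
    return {"Sn": sn, "Sp": sp, "CC": cc, "SMC": smc}

stats = accuracy(tp=80, fp=20, fn=20, tn=880)
```

The same formulas apply at the splice site, intron and exon levels once true and false positives are counted over those features instead of single nucleotides.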