Gene Finding Methods

Overview

This document gives a general description of our automated re-annotation of a new chromosomal assembly of Cryptococcus neoformans var. grubii H99, provided by Dr. Fred Dietrich and coworkers at the Duke Center for Genome Technology.

Gene annotation at Broad is a multi-step process.

Evidence collection

  • Blast evidence:

    A blast homology search against GenBank's NR database produces a set of raw blast alignments. Alignments derived from the same blast hit are linked into a single blast cluster. Overlapping blast clusters on the genomic axis together form what we call a blast locus on the genome assembly. Currently, all blast data with e-values lower than 1e-10 are considered usable blast evidence.

  • EST evidence:

    For most of our fungal and other eukaryotic genomes, we use species-specific ESTs sequenced here at the Broad, as well as publicly available EST sequences from GenBank, to produce a set of high confidence gene models.

  • pFAM domains:

    We run HMMER searches with a pFAM library to find pFAM domains in six-frame translations of the genomic sequence.

  • EST alignments:

    We align ESTs to the genome using BLAT and then collapse them into distinct clusters of transcripts using a Broad-developed program called CallReferenceGenes (described below). EST alignments with at least 90% identity over at least 50% of the EST length, and with canonical splice junctions, are considered valid and suitable for building gene models.
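
The blast clustering and EST filtering rules in the list above can be sketched as follows. This is a minimal illustration, not the Broad pipeline itself: the class and function names are hypothetical, alignments are assumed to lie on a single genomic sequence, the 1e-10 cutoff is treated as an upper bound on the e-value (lower e-values are more significant), and "canonical" is assumed to mean GT..AG introns only.

```python
from collections import defaultdict
from dataclasses import dataclass

EVALUE_CUTOFF = 1e-10                 # cutoff quoted in the text
CANONICAL_SPLICE = {("GT", "AG")}     # assumption: GT..AG introns only


@dataclass
class BlastAlignment:
    hit_id: str      # database sequence that produced the alignment
    start: int       # genomic start, inclusive
    end: int         # genomic end, inclusive
    evalue: float


def blast_clusters(alignments, cutoff=EVALUE_CUTOFF):
    """Link usable alignments from the same hit into one (start, end, hit_id) cluster."""
    by_hit = defaultdict(list)
    for aln in alignments:
        if aln.evalue <= cutoff:      # keep only significant alignments
            by_hit[aln.hit_id].append(aln)
    return sorted(
        (min(a.start for a in group), max(a.end for a in group), hit)
        for hit, group in by_hit.items()
    )


def blast_loci(clusters):
    """Merge overlapping clusters (sorted by start) into (start, end, [hit_ids]) loci."""
    loci = []
    for start, end, hit in clusters:
        if loci and start <= loci[-1][1]:
            # Overlaps the current locus: extend it and record the hit.
            loci[-1][1] = max(loci[-1][1], end)
            loci[-1][2].append(hit)
        else:
            loci.append([start, end, [hit]])
    return [(s, e, hits) for s, e, hits in loci]


def is_valid_est_alignment(identity, aligned_len, est_len, intron_ends):
    """Apply the 90% identity / 50% coverage / canonical splice rule.

    intron_ends is a list of (donor, acceptor) dinucleotide pairs; an
    unspliced alignment (empty list) passes the splice check by assumption.
    """
    return (
        identity >= 0.90
        and aligned_len / est_len >= 0.50
        and all(pair in CANONICAL_SPLICE for pair in intron_ends)
    )
```

For example, two hits whose clusters overlap on the genome collapse into one locus, while a hit with an e-value of 1e-3 is dropped before clustering.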

Building EST-based gene models

The EST clusters described in the previous section are used as input for building gene models, which in turn serve to train and evaluate gene finding tools. Start and stop codons are used to define the longest ORF in each cluster, which is then compared with overlapping blast evidence. Models whose reading frame and protein length are comparable to those of the best blast hits from closely related species constitute the high confidence set.
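
The ORF step can be illustrated with a short sketch. It assumes a spliced transcript sequence on the forward strand and the standard start (ATG) and stop (TAA, TAG, TGA) codons; the function name is hypothetical, and the comparison against blast hits is left out because the text does not give its thresholds.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF, or None if there is none.

    Coordinates are 0-based with an exclusive end, forward strand only.
    The stop codon is included in the returned span.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None                   # look for the next ORF in this frame
    return best
```

Calling `longest_orf("CCATGAAATTTTAGGG")` returns `(2, 14)`, the span of `ATGAAATTTTAG`.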

Computational prediction of tRNAs and other RNAs

tRNAscan-SE is used to detect tRNAs on the genome assembly. We use RNA finding programs such as RNAmmer and RFamSearch to detect other common RNA features. For Cryptococcus neoformans var. grubii H99 we also used blast to accurately delineate ribosomal RNA features, and we include the results as an annotation track.

Training Gene Prediction Programs

The gene finding programs AUGUSTUS, GENEID and GeneMark are trained in house or by the developers of these programs using the high confidence EST gene sets.

Computational Gene Models

Gene structures are predicted using a combination of gene models from computational gene prediction programs such as AUGUSTUS, GENEID, GeneMark, GLEAN, TWINSCAN and EST-based automated and manual gene models. GeneMark was run in collaboration with its developer, Dr. Mark Borodovsky at the Georgia Institute of Technology. AUGUSTUS, TWINSCAN, and GLEAN predictions were performed by Dr. Jason Stajich at the University of California, Berkeley.

Transfer of Information from Previous Release

We used annotations from our previous release of Cryptococcus neoformans var. grubii H99 to inform our annotation of the current assembly, using Broad's in-house synteny-based gene transfer process.

Collinear block alignments between the two assemblies were generated by creating pairwise alignments, and a global alignment was generated for the entire region that the collinear blocks cover. An in-house gene mapping program then transferred genes from the old assembly onto the new one within the specific syntenic blocks.
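
As a rough illustration of coordinate transfer within a syntenic block, the sketch below maps a gene from the old assembly to the new one. It is a deliberate simplification and not the in-house mapping program: blocks are assumed to be gap-free, on the same strand, and of equal length in both assemblies, so a single offset per block suffices. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SyntenicBlock:
    old_start: int   # block start on the old assembly, inclusive
    old_end: int     # block end on the old assembly, inclusive
    new_start: int   # position on the new assembly aligned to old_start


def transfer_gene(gene_start, gene_end, blocks):
    """Map a gene span to the new assembly if one block fully contains it.

    Returns (new_start, new_end), or None when no single block covers the
    gene, in which case the gene is not transferred.
    """
    for block in blocks:
        if block.old_start <= gene_start and gene_end <= block.old_end:
            offset = block.new_start - block.old_start
            return gene_start + offset, gene_end + offset
    return None
```

A real alignment contains insertions and deletions, so a production mapper would translate each exon boundary through the alignment columns rather than apply one offset.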

Selection of Consensus Gene Models

Broad's automated gene calling process uses a rule-based selection process to evaluate the evidence and build consensus gene models.

Ab initio predictions, blast and EST alignments, reference gene models, and manual and automated EST-gene models are clustered into potential gene loci. We select the most likely non-conflicting gene models based on the best evidence available at each locus. Our method uses heuristics such as splice agreement with ESTs and relative overlap with the blast hits to choose the prediction most in accord with the evidence. It does not have an internal model of gene structure and thus runs on a wide variety of eukaryotic and prokaryotic organisms without training.
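
The two heuristics named above, splice agreement with ESTs and relative overlap with blast hits, can be sketched as a simple ranking. The exact rules and their priority are not given in the text, so the ordering used here (splice agreement first, blast overlap as tie-breaker) and all names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    name: str
    exons: tuple     # sorted ((start, end), ...) with inclusive coordinates


def introns(model):
    """Return the set of (start, end) introns implied by consecutive exons."""
    return {
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(model.exons, model.exons[1:])
    }


def splice_agreement(model, est_introns):
    """Fraction of EST-supported introns that the model reproduces exactly."""
    if not est_introns:
        return 0.0
    return len(introns(model) & est_introns) / len(est_introns)


def blast_overlap(model, blast_span):
    """Fraction of the blast span covered by the model's exons."""
    b_start, b_end = blast_span
    covered = sum(
        max(0, min(end, b_end) - max(start, b_start) + 1)
        for start, end in model.exons
    )
    return covered / (b_end - b_start + 1)


def select_model(models, est_introns, blast_span):
    """Pick the candidate most in accord with the evidence at one locus."""
    return max(
        models,
        key=lambda m: (splice_agreement(m, est_introns), blast_overlap(m, blast_span)),
    )
```

Because the score depends only on agreement with external evidence, nothing in it needs to be trained per organism, which mirrors the point made in the paragraph above.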

Our gene selection process offers several useful options for excluding gene calls at certain loci. For example, we can exclude genes in regions with tRNA, rRNA and known repeat elements using a conservative overlap criterion.

Despite all the progress in the field of gene finding, accurate gene finding on draft genomes is still a challenge. We therefore make an effort to track easily identifiable problematic gene models and tag them with curation flags and notes in the gene report, alerting users to the nature of each problem. Manual annotators also use these tags to target bad gene models for manual editing and fine-tuning.

Filtering Repeating Elements and False Positives

Gene predictions within repeating elements and low complexity sequence regions were identified using RepeatMasker with a fungal repeat library, together with in-house methods. Predictions occurring within such regions were not included in the final gene set. Predictions supported by only a single ab initio model, and by no blast, EST or pFAM evidence, were also removed from the final gene set. The ab initio models underlying the filtered gene calls are still visible in their respective evidence tracks.
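
The two filters described above can be expressed as a small predicate. This is a sketch under stated assumptions: "within" a masked region is taken to mean fully contained in it, the masked regions are supplied as precomputed intervals (for example parsed from RepeatMasker output), and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    start: int              # genomic start, inclusive
    end: int                # genomic end, inclusive
    n_ab_initio: int        # number of ab initio programs that predicted it
    blast: bool = False     # supported by blast evidence
    est: bool = False       # supported by EST evidence
    pfam: bool = False      # supported by a pFAM domain


def within_masked(pred, masked_regions):
    """True if the prediction lies entirely inside one masked interval."""
    return any(s <= pred.start and pred.end <= e for s, e in masked_regions)


def keep_prediction(pred, masked_regions):
    """Apply the repeat filter, then the unsupported single-model filter."""
    if within_masked(pred, masked_regions):
        return False
    unsupported = not (pred.blast or pred.est or pred.pfam)
    if pred.n_ab_initio == 1 and unsupported:
        return False
    return True
```

A prediction made by one ab initio program is kept as soon as any one of the three evidence types supports it.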

Assigning Gene Product Names

We do not assign gene symbols. Instead, our gene naming protocol currently relies on high confidence blast homology and, in some cases, community input to assign gene product names. In addition, we have named protein kinase superfamily genes by using blast to screen them against a curated set of protein kinases. We hope to improve the gene naming process in the future using other functional annotation protocols and tools.

Annotation of Mitochondrial Genes

Mitochondrial genes were constructed manually based on blast and pFAM homology. RNA features were identified using tRNAscan-SE and RFamSearch.

Gene Numbering

Every annotated gene is given a Locus Number of the form CNAG_#####, which should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene, even across different assemblies. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. A gene correspondence table for this assembly may be downloaded here.