Gene Finding Methods

Overview

The Broad Institute has prepared for publication automated gene calls on A. nidulans and A. terreus. In addition, we have prepared gene calls provisionally on A. flavus for the purposes of this website. This document describes how automated gene models are built and verified.

Annotations are created in five steps:

  1. Genes are predicted by combining manual annotations with output from multiple gene prediction programs. See "Gene Structure Prediction".
  2. Names are assigned to gene predictions based on homology to annotated genes in other genomes. See "Gene Naming".
  3. UTRs (untranslated regions) are predicted. See "UTR Prediction".
  4. Locus numbers are assigned. See "Gene Locus Numbers".
  5. When species-specific EST libraries are available, we compare our gene models to splice models built from aligned ESTs. Since these EST data are also used in step one, this comparison is not statistically rigorous, but does provide a rough estimate of our annotation accuracy.


Gene Structure Prediction

Gene structures are predicted using a combination of manual annotation, ab initio gene predictors, and homology-based gene predictors. Currently, the programs we use include AUGUSTUS, CONRAD, FGENESH, FINDORFS, GENEID, and GENEWISE. The tools used on each genome vary, as each program's performance can vary widely from genome to genome. Each tool is evaluated on the genome for suitability before being used in the final gene build.

Some of these programs require training before use, so that they can accommodate genome-specific features such as GC content and intron length distribution in their predictions. Generally, when available, output from FINDORFS is used as training data for the other programs, although in some cases a training file from a closely related species may be used.

When multiple programs produce overlapping predictions, we use BLAST and EST evidence to choose the prediction most in accord with the evidence. Targeted manual annotation is carried out in loci where all predictions clash with EST or BLAST evidence. Predictions that overlap tRNA/RNA features are removed from the final gene set, as are those that are too short (shorter than 30 aa, or shorter than 50 aa without BLAST evidence).
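The final filtering rule can be illustrated with a minimal sketch. This is not the pipeline's actual code: the `Prediction` class and its field names are hypothetical, and the strict "<" comparisons are an assumption based on the "< 30 aa" and "< 50 aa" rows of the Possible Problems table.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    # All field names are hypothetical, chosen only for this sketch.
    name: str
    protein_length_aa: int
    has_blast_evidence: bool
    overlaps_rna_feature: bool


def keep_prediction(p: Prediction) -> bool:
    """Return True if the prediction survives the final filters."""
    if p.overlaps_rna_feature:
        return False  # overlaps a tRNA/RNA feature
    if p.protein_length_aa < 30:
        return False  # too short regardless of evidence
    if p.protein_length_aa < 50 and not p.has_blast_evidence:
        return False  # short and not supported by BLAST
    return True
```

Under these assumptions a 45 aa prediction is kept only when it has BLAST evidence, while a 29 aa prediction is always removed.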

Every model we predict is tagged with the program from which it came. In some cases, two programs may agree exactly on a gene model. In our reports these genes are marked as coming from multiple sources.
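One way to picture the "multiple sources" tagging is to group predictions by their exact structure and collect the programs that produced each one. The sketch below assumes each prediction is a hypothetical tuple of (program, contig, strand, exon list); the real data model is not described here.

```python
from collections import defaultdict


def merge_identical(predictions):
    """Map each distinct gene model to the sorted list of programs that predicted it.

    predictions: iterable of (program, contig, strand, exons), where exons is a
    list of (start, end) pairs. This tuple layout is an assumption of the sketch.
    """
    sources = defaultdict(set)
    for program, contig, strand, exons in predictions:
        # Two models count as the same only if every exon boundary matches.
        sources[(contig, strand, tuple(exons))].add(program)
    return {model: sorted(programs) for model, programs in sources.items()}
```

A model whose source list has more than one entry corresponds to a gene marked as coming from multiple sources.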

The gene prediction programs used for A. nidulans are: FGENESH, FGENESH+ and GENEID. This gene set was then extensively modified by TIGR using its PASA software pipeline as well as manual curators at the Broad.

The gene prediction programs used for A. terreus are: FGENESH, GENEID and GENEWISE.

The gene prediction programs used for A. flavus are: FGENESH and GENEID. Also, gene models from A. terreus were mapped onto A. flavus via whole-genome alignments for use as evidence.


Gene Naming

Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we try to avoid transferring unverified functional names for genes in one species to predictions in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

A gene will be named as follows:

  1. "NAME": The gene has at least one significant BLASTP hit to a known protein in Swiss-Prot (complexity filtering off, -F F, expect = 1e-10, 50%+ identity and 70%+ coverage).
  2. "hypothetical protein similar to NAME": The gene has at least one significant BLASTP hit to a protein in NCBI's protein set NR.
  3. "conserved hypothetical protein": The gene has at least one significant BLASTP hit, but the homologous protein(s) contain words indicating the name(s) are unverified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed, etc.}
  4. "hypothetical protein": The gene has no significant BLAST evidence but does have overlapping EST evidence.
  5. "predicted protein": The gene has no significant BLAST evidence and no overlapping EST evidence.

In all cases names are filtered to remove spurious identifiers, extra spaces, etc.
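The naming rules can be read as a decision cascade. The sketch below is one possible reading, not the actual naming code. It assumes that hits have already passed the significance thresholds, that hits whose names contain an "unverified" word are never used as a name source, and that Swiss-Prot takes precedence over NR. The `Hit` type and the database labels are hypothetical, and only the words listed explicitly in rule 3 are checked.

```python
import re
from typing import List, NamedTuple

# Words from rule 3; the original list ends in "etc.", so this set is incomplete.
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


class Hit(NamedTuple):
    database: str  # "swissprot" or "nr" (labels assumed for this sketch)
    name: str      # name of the homologous protein


def is_unverified(name: str) -> bool:
    """True if any word in the name suggests the name is unverified."""
    words = re.findall(r"[a-z]+", name.lower())
    return any(word in UNVERIFIED_WORDS for word in words)


def assign_name(hits: List[Hit], has_est_overlap: bool) -> str:
    """Assign a gene name from significant BLASTP hits and EST overlap."""
    verified = [h for h in hits if not is_unverified(h.name)]
    for h in verified:
        if h.database == "swissprot":
            return h.name
    for h in verified:
        if h.database == "nr":
            return f"hypothetical protein similar to {h.name}"
    if hits:
        return "conserved hypothetical protein"
    return "hypothetical protein" if has_est_overlap else "predicted protein"
```

For example, a gene whose only hit is an NR protein named "putative kinase-like protein" would be called "conserved hypothetical protein" under this reading.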


UTR Prediction

When an EST aligns uniquely with > 95% identity and overlaps a gene prediction, and the region of overlap has 100% labeling agreement (i.e., every nucleotide in the region of overlap is exonic in both the prediction and the alignment, or is intronic in both), we consider the prediction and alignment to be compatible. UTR predictions are generated by walking out along compatible EST alignments from the end(s) of the prediction. A chain of one or more overlapping, compatible EST alignments that begins at the 3' or 5' end forms each UTR extension. When an EST aligns to more than one location on the genome, or touches more than one gene prediction (on either strand), it is ignored.
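The compatibility test can be sketched as an interval comparison. The code below is illustrative only. It assumes exons are given as half-open (start, end) genomic intervals that do not overlap or abut within one model, and that the identity and hit counts have been computed elsewhere. The function and parameter names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval (start, end)


def clip(exons: List[Interval], lo: int, hi: int) -> List[Interval]:
    """Restrict exons to the window [lo, hi), dropping empty pieces."""
    clipped = []
    for start, end in sorted(exons):
        start, end = max(start, lo), min(end, hi)
        if start < end:
            clipped.append((start, end))
    return clipped


def compatible(pred_exons: List[Interval], est_exons: List[Interval],
               identity_pct: float, genomic_hits: int,
               predictions_touched: int) -> bool:
    """True if the EST alignment and the prediction are compatible."""
    # The EST must align uniquely, with > 95% identity, and touch one prediction.
    if identity_pct <= 95.0 or genomic_hits != 1 or predictions_touched != 1:
        return False
    # Region of overlap between the two models' overall spans.
    lo = max(min(s for s, _ in pred_exons), min(s for s, _ in est_exons))
    hi = min(max(e for _, e in pred_exons), max(e for _, e in est_exons))
    if lo >= hi:
        return False  # the models do not overlap at all
    # 100% labeling agreement: identical exonic (hence intronic) positions.
    return clip(pred_exons, lo, hi) == clip(est_exons, lo, hi)
```

In this sketch an EST that extends past the last exon of a compatible prediction would contribute the extra bases as a candidate UTR extension; chaining several ESTs is not shown.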


Gene Locus Numbers

Each annotated gene is given a stable locus number of the form XXXG_#####. This is the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies and annotation releases. These numbers are opaque identifiers and have no particular order or internal structure. In particular, it is unsafe to infer anything about the position of two genes based on their locus numbers. Position is an attribute of a gene that can be retrieved via its locus.

When we reannotate a genome, we map all genes from the previous assembly to the new one. Any gene that cannot be mapped due to assembly changes is retired along with its locus number. New genes that cannot be associated 1:1 with a gene in the previous release receive new locus numbers. If a gene is unchanged, its locus number remains unchanged. If a gene changes, but a 1:1 mapping to a previous gene can be established, its locus number also remains unchanged. In this way, researchers can find out whether a specific gene is still present in the gene set by searching for its locus. If two genes merge, both locus numbers are retired and a new one is assigned to the combined gene. A CSV-formatted spreadsheet is provided for each reannotation describing in detail what happened at each locus and why.
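The locus bookkeeping during reannotation might be sketched as follows. This is an assumption-laden illustration, not the real mapping code: the input format (a dictionary from old locus to the new gene IDs it maps onto), the `next_number` counter, and the "XXXG" prefix are all hypothetical, and every mapped gene ID is assumed to appear in `new_genes`.

```python
from typing import Dict, List, Tuple


def assign_loci(old_to_new: Dict[str, List[str]], new_genes: List[str],
                next_number: int, prefix: str = "XXXG"
                ) -> Tuple[Dict[str, str], List[str]]:
    """Return (locus assigned to each new gene, sorted list of retired loci)."""
    # Invert the mapping: which old loci land on each new gene?
    new_to_old: Dict[str, List[str]] = {gene: [] for gene in new_genes}
    for old_locus, targets in old_to_new.items():
        for gene in targets:
            new_to_old[gene].append(old_locus)

    assigned: Dict[str, str] = {}
    kept = set()
    for gene in new_genes:
        olds = new_to_old[gene]
        if len(olds) == 1 and len(old_to_new[olds[0]]) == 1:
            # 1:1 mapping: the locus number carries over unchanged.
            assigned[gene] = olds[0]
            kept.add(olds[0])
        else:
            # New gene, merge, or any other non-1:1 case: issue a new number.
            assigned[gene] = f"{prefix}_{next_number:05d}"
            next_number += 1

    # Old loci that were not carried over are retired.
    retired = sorted(set(old_to_new) - kept)
    return assigned, retired
```

With two old genes merging into one new gene, both old loci end up in the retired list and the merged gene receives a fresh number, matching the rule described above.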

Each gene also has a version attribute (so loci can be fully specified as XXXG_#####.V). When genes are mapped from one assembly to another or when we release a new set of gene calls, we increment this version. All the loci in a particular release will have the same version number.

Transcripts are assigned accession numbers of the form XXXT_#####.V. These numbers are also opaque and have no particular ordering. In addition, because there are generally more transcripts predicted than genes, there can be *no guarantee* that a gene and its transcript(s) will have the same or similar numerical portions of their identifiers.
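Identifiers of the forms XXXG_#####.V and XXXT_#####.V can be split into their parts with a regular expression. The sketch below assumes an uppercase prefix, exactly five digits, and an optional numeric version; the example ID "ATEG_00001.1" is made up for illustration. The numeric part is kept as a string because the numbers are opaque.

```python
import re
from typing import NamedTuple, Optional

ID_PATTERN = re.compile(
    r"^(?P<prefix>[A-Z]+)(?P<kind>[GT])_(?P<number>\d{5})(?:\.(?P<version>\d+))?$"
)


class ParsedId(NamedTuple):
    prefix: str             # genome-specific letters before G/T
    kind: str               # "gene" or "transcript"
    number: str             # opaque five-digit string, not a position
    version: Optional[int]  # None when no ".V" suffix is present


def parse_identifier(text: str) -> Optional[ParsedId]:
    """Parse a locus or transcript identifier; return None if it does not match."""
    match = ID_PATTERN.match(text)
    if match is None:
        return None
    kind = "gene" if match.group("kind") == "G" else "transcript"
    version = match.group("version")
    return ParsedId(match.group("prefix"), kind, match.group("number"),
                    int(version) if version is not None else None)
```

Calling `parse_identifier("ATEG_00001.1")` yields a gene record with version 1. Nothing about genomic position, or about which transcript belongs to which gene, should be read from the number.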

The locus prefixes for these genomes are:

A. nidulans: AN
A. terreus: ATEG
The gene calls for A. flavus are provisional and have not been assigned locus numbers.


Structure Prediction Validation

To evaluate the accuracy of our gene predictions, we create a set of partial reference models exclusively from EST data. We then compare the two gene sets using a variety of metrics. In the tables presented for each annotation, we refer to the final, published gene set as the query and the EST-based set of partial genes as the reference.

The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap references. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to train, generate and improve our gene calls, we expect accuracy to be lower, on the order of 5-10%, in regions that lack supporting EST evidence. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in those two sections do not apply evenly to all predicted genes.


Overview of Query Genes

A. terreus (top) followed by A. nidulans

10406 genes (5241 on '+' strand, 5165 on '–')
10406 transcripts (8866 spliced, 1540 unspliced)
33116 exons, 22710 introns

                     len   %cov    %gc    %at
genic           17474515  59.58  55.75  44.25
intergenic      11856680  40.42  48.65  51.35
exonic          15617415  53.25  56.57  43.43
intronic         1857100   6.33  48.83  51.17
coding          15617415  53.25  56.57  43.43
5′ UTR                 0   0.00      -      -
3′ UTR                 0   0.00      -      -
alt. spliced           0   0.00      -      -
genomic         29197939  99.55  52.90  47.10

                                      min  median   mean    n50    max
total length (incl. UTR + introns)    150    1427   1679   1961  18048
coding length                         150    1263   1501   1773  17529
exons per transcript                    1       3   3.18      4     26
exons per spliced transcript            2       3   3.56      4     26
nt per exon                             1     258    472    916  13746
nt per intron                          14      58     82     74   1884
intergenic nt                           7     772   1137   1507  32146
5′ UTR nt                               -       -      -      -      -
3′ UTR nt                               -       -      -      -      -

Overview of Reference genes

A. terreus (top) followed by A. nidulans

5321 genes (2737 on '+' strand, 2584 on '–')
5739 transcripts (3221 spliced, 2518 unspliced)
12072 exons, 6333 introns

                     len   %cov    %gc    %at
genic            4999403  17.04  54.94  45.06
intergenic      24331792  82.96  52.48  47.52
exonic           4581778  15.62  55.41  44.59
intronic          396161   1.35  49.69  50.31
coding           3775903  12.87  56.85  43.15
5′ UTR            328406   1.12  54.32  45.68
3′ UTR            569352   1.94  46.11  53.89
alt. spliced       21464   0.07  50.97  49.03
genomic         29197939  99.55  52.90  47.10

                                      min  median   mean    n50     max
total length (incl. UTR + introns)     64     823    953    954    4666
coding length                          64     654    704    765    3762
exons per transcript                    1       2   2.10      5      12
exons per spliced transcript            2       3   2.97      3      12
nt per exon                             4     326    414    657    3677
nt per intron                          21      58     74     64    1148
intergenic nt                           3    2012   4597  10325  168294
5′ UTR nt                               1      43     83    181    1662
3′ UTR nt                               1     108    144    244    1693

Splice Analysis

A. terreus (top) followed by A. nidulans

8805 splice agreements, 2718 disagreements.
9319 ignored: 186 due to EST misalignment, 9133 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 1885/4173
0 query transcripts contained noncanonical splices.

transcripts with no splice problems          2848  74.2%
   ... with complete reference coverage       571  14.9%
explainable by alternate splicing             522  13.6%
   ... with spliced reference                 397  10.3%
clashes                                       468  12.2%
   ... with spliced reference                 367   9.6%

Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 13.6% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
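The percentages in the splice table appear to be taken relative to the transcripts that fall into the three top-level categories. That denominator is an inference from the numbers, not something stated in the report. A short check of the arithmetic:

```python
# Counts copied from the three top-level rows of the splice table.
counts = {
    "no splice problems": 2848,
    "explainable by alternate splicing": 522,
    "clashes": 468,
}

# Assumed denominator: all transcripts placed in one of the three categories.
total = sum(counts.values())

# Percentage of each category, rounded to one decimal place as in the table.
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}
```

The rounded values reproduce the 74.2%, 13.6% and 12.2% figures reported in the table.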

in ref.in query
cassette exons472
retained introns294119
early 3′ splices6881
late 5′ splices3650

  cassette exon: an exon that falls completely within an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction.
  retained intron: an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns.
  early 3′ splices: two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon.
  late 5′ splices: two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron.

Most terminology is from Matlin AJ, et al. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.
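The four categories defined above can be expressed as simple interval tests. The sketch below is illustrative and assumes plus-strand features given as (start, end) pairs, so that an intron's start is its 5′ splice site and its end is its 3′ splice site; minus-strand features would need their coordinates flipped first. The function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the plus strand, start < end


def is_cassette_exon(exon: Interval, variant_introns: List[Interval]) -> bool:
    """True if the exon lies completely within an intron of the variant transcript."""
    return any(i_start <= exon[0] and exon[1] <= i_end
               for i_start, i_end in variant_introns)


def is_retained_intron(intron: Interval, variant_exons: List[Interval]) -> bool:
    """True if the intron lies completely within an exon of the variant transcript."""
    return any(e_start <= intron[0] and intron[1] <= e_end
               for e_start, e_end in variant_exons)


def classify_intron_pair(a: Interval, b: Interval) -> str:
    """Compare two introns from different transcripts of the same locus."""
    if a == b:
        return "agreement"
    if a[0] == b[0]:
        return "early 3' splice"  # same 5' site, different 3' site
    if a[1] == b[1]:
        return "late 5' splice"   # same 3' site, different 5' site
    return "other"
```

Whether an event is counted "in ref." or "in query" would depend on which transcript carries the extra exon, retained intron, or shifted splice site; that bookkeeping is not shown here.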


Possible Problems

A. terreus (top) followed by A. nidulans

                                           AT1_FGENESH_3  AT1_GENEID_1  AT1_GENEWISE_2
                                                    4008          5744             654
short protein, < 50 aa               0+0               -             -               -  ← not tallied in problems
shorter protein, < 30 aa             0+0               -             -               -
very short protein, < 10 aa          0+0               -             -               -
initial exon ≤ 6 nt                141+4              63            82               0
internal exon ≤ 6 nt                42+1              42             0               1
terminal exon ≤ 6 nt                40+0              14            26               0
≥ 15 exons                          15+0              14             1               0
intron ≥ 1000 nt                     2+2               4             0               0
intron ≤ 20 nt                       9+2               0            10               1
first codon not Met                  0+0               -             -               -  ← not tallied in problems
first codon not xTG                  0+0               -             -               -
first codon not known START          0+0               -             -               -
last codon not known STOP            1+3               4             0               0
contains in-frame STOP               0+0               -             -               -
coding length not modulo 3           0+2               2             0               0
non-canonical splicing               0+0               -             -               -
has ≥1 tagged BLAST hit         6713+170            2841          3392             650  ← not tallied in problems
≤1/3 as long as BLAST tags          60+1              25            36               0
≥3× longer than BLAST tags         161+5              67            92               7
contains ≥1 N in exon                0+1               0             0               1
low-quality exonic sequence        0+253             117           131               5  ← not tallied in problems
touches gap(s)                     36+21              45            10               2
spans contigs                      36+21              45            10               2
within 1 kb of contig edge        234+57             143           137              11  ← not tallied in problems
any overlap (UTR or CDS)             0+0               -             -               -
CDS overlap only                     0+0               -             -               -
CDS overlap > 50 nt                  0+0               -             -               -
CDS overlap > 100 nt                 0+0               -             -               -
CDS overlap > 200 nt                 0+0               -             -               -
has predicted UTR                    0+0               -             -               -  ← not tallied in problems
UTR ≥ CDS length                     0+0               -             -               -  ← not tallied in problems
UTR is spliced                       0+0               -             -               -  ← not tallied in problems
one or more problems              488+36             260           253              11