Gene Finding Methods
Outline
- Overview
- Gene Structure Prediction
- Gene Naming
- UTR Prediction
- Gene Locus Numbers
- Structure Prediction Validation
- Overview of Query Genes
- Overview of Reference genes
- Splice Analysis
- Possible Problems
Overview
The Broad Institute has prepared for publication automated gene calls on A. nidulans and A. terreus. In addition, we have prepared gene calls provisionally on A. flavus for the purposes of this website. This document describes how automated gene models are built and verified.
Annotations are created in five steps:
- Genes are predicted by combining manual annotations with output from multiple gene prediction programs. This process is described in Gene Structure Prediction.
- Names are assigned to gene predictions based on homology to annotated genes on other genomes. This process is described in Gene Naming.
- UTRs (untranslated regions) are predicted. See "UTR Prediction".
- Locus numbers are assigned. See "Gene Locus Numbers".
- When species-specific EST libraries are available, we compare our gene models to splice models built from aligned ESTs. Since these EST data are also used in step one, this comparison is not statistically rigorous, but does provide a rough estimate of our annotation accuracy.
Gene Structure Prediction
Gene structures are predicted using a combination of manual annotation, ab initio gene predictors, and homology-based gene predictors. Currently, the programs we use include AUGUSTUS, CONRAD, FGENESH, FINDORFS, GENEID, and GENEWISE. The tools used on each genome vary, as each program's performance can vary widely from genome to genome. Each tool is evaluated on the genome for suitability before being used in the final gene build.
Some of these programs require training before use, so that they can accommodate genome-specific features such as GC content and intron length distribution in their predictions. Generally, when available, output from FINDORFS is used as training for the other programs, although in some cases a training file from a closely-related species may be used.
When multiple programs produce overlapping predictions, we use BLAST and EST evidence to choose the prediction most in accord with the evidence. Targeted manual annotation is carried out in loci where all predictions clash with EST or BLAST evidence. Predictions that overlap tRNA/RNA features are removed from the final gene set, as are those that are too short (shorter than 30 aa, or shorter than 50 aa without BLAST evidence).
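The removal rules in the paragraph above can be sketched as a small filter. This is only an illustration under stated assumptions: the `Prediction` record and its field names are hypothetical, and the thresholds are read as strict "shorter than" limits, matching the "< 30 aa" and "< 50 aa" rows reported under Possible Problems.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical record for one gene prediction (not the pipeline's own type)."""
    name: str
    protein_length: int         # length of the predicted protein, in amino acids
    has_blast_hit: bool         # True if BLAST evidence supports the prediction
    overlaps_rna_feature: bool  # True if it overlaps a tRNA/RNA feature


def keep_prediction(p: Prediction) -> bool:
    """Return True if the prediction survives the filters described in the text."""
    if p.overlaps_rna_feature:
        return False  # overlapping tRNA/RNA features are removed
    if p.protein_length < 30:
        return False  # too short regardless of evidence
    if p.protein_length < 50 and not p.has_blast_hit:
        return False  # short and unsupported by BLAST
    return True


candidates = [
    Prediction("long_unsupported", 120, False, False),
    Prediction("short_supported", 40, True, False),
    Prediction("short_unsupported", 40, False, False),
]
kept = [p.name for p in candidates if keep_prediction(p)]
```

Here `kept` contains the first two names only; the 40 aa prediction without BLAST evidence is dropped.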
Every model we predict is tagged with the program from which it came. In some cases, two programs may agree exactly on a gene model. In our reports these genes are marked as coming from multiple sources.
The gene prediction programs used for A. nidulans are: FGENESH, FGENESH+ and GENEID. This gene set was then extensively modified by TIGR using its PASA software pipeline, as well as by manual curators at the Broad.
The gene prediction programs used for A. terreus are: FGENESH, GENEID and GENEWISE.
The gene prediction programs used for A. flavus are: FGENESH and GENEID. Also, gene models from A. terreus were mapped onto A. flavus via whole-genome alignments for use as evidence.
Gene Naming
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we try to avoid transferring unverified functional names for genes in one species to predictions in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
A gene will be named as follows:
- "NAME": The gene has at least one significant BLASTP hit to a known protein in Swiss-Prot (complexity filtering off, -F F, expect = 1e-10, 50%+ identity and 70%+ coverage).
- "hypothetical protein similar to NAME": The gene has at least one significant BLASTP hit to a protein in NCBI's protein set NR.
- "conserved hypothetical protein": The gene has at least one significant BLASTP hit, but the homologous protein(s) contain words indicating the name(s) are unverified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed, etc.}
- "hypothetical protein": The gene has no significant BLAST evidence but does have overlapping EST evidence.
- "predicted protein": The gene has no significant BLAST evidence and no overlapping EST evidence.
In all cases names are filtered to remove spurious identifiers, extra spaces, etc.
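The naming rules above can be read as a decision procedure. The sketch below is one possible reading, not the pipeline's actual code: the function and parameter names are hypothetical, the hits passed in are assumed to have already met the significance thresholds, and it assumes the unverified-word check is applied before a name is transferred from either database.

```python
import re
from typing import Optional

# Words listed in the text as marking an unverified name (the "etc." is not expanded).
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


def is_unverified(name: str) -> bool:
    """True if any whole word of the hit name is in the unverified list."""
    return any(w in UNVERIFIED_WORDS for w in re.findall(r"[a-z]+", name.lower()))


def assign_name(swissprot_hit: Optional[str], nr_hit: Optional[str], has_est: bool) -> str:
    """Pick a gene name from the best significant hit names (None if no hit)."""
    hit = swissprot_hit or nr_hit
    if not hit:
        # No significant BLAST evidence: fall back on EST support.
        return "hypothetical protein" if has_est else "predicted protein"
    if is_unverified(hit):
        return "conserved hypothetical protein"
    cleaned = " ".join(hit.split())  # collapse extra spaces
    if swissprot_hit:
        return cleaned
    return "hypothetical protein similar to " + cleaned
```

For example, `assign_name(None, "putative transporter", False)` yields "conserved hypothetical protein", while `assign_name(None, None, True)` yields "hypothetical protein".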
UTR Prediction
When an EST alignment uniquely aligns with > 95% identity and overlaps a gene prediction, and the region of overlap has 100% labeling agreement (i.e., every nucleotide in the region of overlap is exonic in both the prediction and the alignment, or is intronic in both), we consider the prediction and alignment to be compatible. UTR predictions are generated by walking out along compatible EST alignments from the end(s) of the prediction. Each UTR extension is formed by a chain of one or more overlapping, compatible EST alignments that begins at the 3' or 5' end. When an EST aligns to more than one location on the genome, or touches more than one gene prediction (on either strand), it is ignored.
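The compatibility test can be illustrated with a per-nucleotide comparison. This is a minimal sketch under assumptions not stated in the source: exons are given as half-open `(start, end)` genomic intervals, identity is a fraction between 0 and 1, and the function names are hypothetical. A production implementation would compare interval boundaries rather than enumerate positions.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates


def exonic_positions(exons: List[Interval]) -> Set[int]:
    """All genomic positions covered by the given exons."""
    return {pos for start, end in exons for pos in range(start, end)}


def span(exons: List[Interval]) -> Interval:
    """Outer boundaries of a model: first exon start to last exon end."""
    return min(s for s, _ in exons), max(e for _, e in exons)


def is_compatible(pred_exons: List[Interval], est_exons: List[Interval],
                  identity: float, unique: bool) -> bool:
    """True if the EST alignment is unique, > 95% identical, overlaps the
    prediction, and agrees with it (exonic vs. intronic) at every shared position."""
    if not unique or identity <= 0.95:
        return False
    p_start, p_end = span(pred_exons)
    e_start, e_end = span(est_exons)
    lo, hi = max(p_start, e_start), min(p_end, e_end)
    if lo >= hi:
        return False  # no overlap at all
    pred, est = exonic_positions(pred_exons), exonic_positions(est_exons)
    return all((pos in pred) == (pos in est) for pos in range(lo, hi))


prediction = [(100, 200), (260, 400)]
est_agrees = [(150, 200), (260, 450)]  # same intron, extends past the last exon
est_clashes = [(150, 210), (260, 450)]  # exon runs 10 nt into the predicted intron
```

With these inputs, `est_agrees` is compatible and could contribute a UTR extension beyond position 400, whereas `est_clashes` is not.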
Gene Locus Numbers
Each annotated gene is given a stable locus number of the form XXXG_#####. This is the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies and annotation releases. These numbers are opaque identifiers and have no particular order or internal structure. In particular, it is unsafe to infer anything about the position of two genes based on their locus numbers. Position is an attribute of a gene that can be retrieved via its locus.
When we reannotate a genome, we map all genes from the previous assembly to the new one. Any gene that cannot be mapped due to assembly changes is retired along with its locus number. New genes that cannot be associated 1:1 with a gene in the previous release receive new locus numbers. If a gene is unchanged, its locus number remains unchanged. If a gene changes, but a 1:1 mapping to a previous gene can be established, its locus number remains unchanged. In this way, researchers can find out whether a specific gene is still present in the gene set by searching for its locus. If two genes merge, both locus numbers are retired and a new one is assigned to the combined gene. A csv-formatted spreadsheet is provided for each reannotation describing in detail what happened at each locus and why.
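The locus rules in this paragraph can be summarized as a lookup from the kind of change to the fate of the locus numbers involved. The sketch below is illustrative only: the change labels, the function name, and the example locus numbers are hypothetical, and cases the text does not describe (such as a gene splitting in two) are deliberately left out.

```python
from typing import List, Optional, Tuple


def resolve_loci(change: str, old_loci: List[str],
                 new_locus: str) -> Tuple[Optional[str], List[str]]:
    """Return (locus used in the new release, or None; retired locus numbers).

    change    -- one of "unchanged", "modified_1to1", "unmappable", "merged", "new"
    old_loci  -- locus numbers of the previous-release gene(s) involved
    new_locus -- a fresh, never-used locus number, consumed only when needed
    """
    if change in ("unchanged", "modified_1to1"):
        return old_loci[0], []          # locus number is preserved
    if change == "unmappable":
        return None, list(old_loci)     # gene and locus are retired
    if change == "merged":
        return new_locus, list(old_loci)  # both old loci retired, new one assigned
    if change == "new":
        return new_locus, []            # no 1:1 predecessor
    raise ValueError(f"unknown change type: {change}")


kept, retired = resolve_loci("merged", ["ATEG_00010", "ATEG_00011"], "ATEG_10500")
```

In the merge example, `kept` is the newly assigned number and `retired` lists both previous loci.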
Each gene also has a version attribute (so loci can be fully specified as XXXG_#####.V). When genes are mapped from one assembly to another or when we release a new set of gene calls, we increment this version. All the loci in a particular release will have the same version number.
Transcripts are assigned accession numbers of the form XXXT_#####.V. These numbers are also opaque and have no particular ordering. In addition, because there are generally more transcripts predicted than genes, there can be *no guarantee* that a gene and its transcript(s) will have the same or similar numerical portions of their identifiers.
The locus prefixes for these genomes are:
| A. nidulans | AN |
| A. terreus | ATEG |
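Identifiers of the form XXXG_#####.V and XXXT_#####.V can be split into their parts with a regular expression. The pattern below is a sketch that assumes exactly the form given above (uppercase prefix, `G` or `T`, underscore, five digits, optional version); the example accession is made up, and identifiers that follow a different convention will not match. The numeric part is kept as a string because the numbers are opaque.

```python
import re
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    prefix: str             # genome-specific letters, e.g. "ATE"
    kind: str               # "G" for a gene locus, "T" for a transcript
    number: str             # opaque; kept as text, never used for ordering
    version: Optional[int]  # None if no ".V" suffix is present


_PATTERN = re.compile(r"^([A-Z]+)([GT])_(\d{5})(?:\.(\d+))?$")


def parse_accession(text: str) -> Accession:
    """Split an identifier such as 'ATEG_01234.2' into its components."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not in XXXG_#####.V / XXXT_#####.V form: {text!r}")
    prefix, kind, number, version = match.groups()
    return Accession(prefix, kind, number,
                     int(version) if version is not None else None)


locus = parse_accession("ATEG_01234.2")
```

Comparing `locus.number` values across genes says nothing about their relative positions; position has to be looked up through the locus itself.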
Structure Prediction Validation
To evaluate the accuracy of our gene predictions, we create a set of partial reference models exclusively from EST data. We then compare the two gene sets using a variety of metrics. In the tables presented for each annotation, we refer to the final, published gene set as the query and the EST-based set of partial genes as the reference.
The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap references. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to train, generate and improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5-10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in those two sections do not apply evenly to all predicted genes.
Overview of Query Genes
A. terreus (top) followed by A. nidulans
10406 genes (5241 on '+' strand, 5165 on '–')
10406 transcripts
(8866 spliced, 1540 unspliced)
33116 exons, 22710 introns
| | len | %cov | %gc | %at | |
|---|---|---|---|---|---|
| genic | 17474515 | 59.58 | 55.75 | 44.25 | |
| intergenic | 11856680 | 40.42 | 48.65 | 51.35 | |
| exonic | 15617415 | 53.25 | 56.57 | 43.43 | |
| intronic | 1857100 | 6.33 | 48.83 | 51.17 | |
| coding | 15617415 | 53.25 | 56.57 | 43.43 | |
| 5′ UTR | 0 | 0.00 | - | - | |
| 3′ UTR | 0 | 0.00 | - | - | |
| alt. spliced | 0 | 0.00 | - | - | |
| genomic | 29197939 | 99.55 | 52.90 | 47.10 | |

| | min | median | mean | n50 | max |
|---|---|---|---|---|---|
| total length (incl. UTR + introns) | 150 | 1427 | 1679 | 1961 | 18048 |
| coding length | 150 | 1263 | 1501 | 1773 | 17529 |
| exons per transcript | 1 | 3 | 3.18 | 4 | 26 |
| exons per spliced transcript | 2 | 3 | 3.56 | 4 | 26 |
| nt per exon | 1 | 258 | 472 | 916 | 13746 |
| nt per intron | 14 | 58 | 82 | 74 | 1884 |
| intergenic nt | 7 | 772 | 1137 | 1507 | 32146 |
| 5′ UTR nt | - | - | - | - | - |
| 3′ UTR nt | - | - | - | - | - |
Overview of Reference genes
A. terreus (top) followed by A. nidulans
5321 genes (2737 on '+' strand, 2584 on '–')
5739 transcripts
(3221 spliced, 2518 unspliced)
12072 exons, 6333 introns
| | len | %cov | %gc | %at | |
|---|---|---|---|---|---|
| genic | 4999403 | 17.04 | 54.94 | 45.06 | |
| intergenic | 24331792 | 82.96 | 52.48 | 47.52 | |
| exonic | 4581778 | 15.62 | 55.41 | 44.59 | |
| intronic | 396161 | 1.35 | 49.69 | 50.31 | |
| coding | 3775903 | 12.87 | 56.85 | 43.15 | |
| 5′ UTR | 328406 | 1.12 | 54.32 | 45.68 | |
| 3′ UTR | 569352 | 1.94 | 46.11 | 53.89 | |
| alt. spliced | 21464 | 0.07 | 50.97 | 49.03 | |
| genomic | 29197939 | 99.55 | 52.90 | 47.10 | |

| | min | median | mean | n50 | max |
|---|---|---|---|---|---|
| total length (incl. UTR + introns) | 64 | 823 | 953 | 954 | 4666 |
| coding length | 64 | 654 | 704 | 765 | 3762 |
| exons per transcript | 1 | 2 | 2.10 | 5 | 12 |
| exons per spliced transcript | 2 | 3 | 2.97 | 3 | 12 |
| nt per exon | 4 | 326 | 414 | 657 | 3677 |
| nt per intron | 21 | 58 | 74 | 64 | 1148 |
| intergenic nt | 3 | 2012 | 4597 | 10325 | 168294 |
| 5′ UTR nt | 1 | 43 | 83 | 181 | 1662 |
| 3′ UTR nt | 1 | 108 | 144 | 244 | 1693 |
Splice Analysis
A. terreus (top) followed by A. nidulans
8805 splice agreements, 2718 disagreements.
9319 ignored:
186 due to EST misalignment,
9133 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 1885/4173
0 query transcripts contained noncanonical splices.
| transcripts with no splice problems | 2848 | 74.2% |
| ... with complete reference coverage | 571 | 14.9% |
| explainable by alternate splicing | 522 | 13.6% |
| ... with spliced reference | 397 | 10.3% |
| clashes | 468 | 12.2% |
| ... with spliced reference | 367 | 9.6% |
Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 13.6% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
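The percentages in the splice table appear to be fractions of the query transcripts that fall into the three top-level categories (2848 + 522 + 468 = 3838). The source does not state the denominator explicitly, so the short check below treats that as an assumption; the function name is hypothetical.

```python
from typing import Dict


def share_of_total(counts: Dict[str, int]) -> Dict[str, float]:
    """Express each count as a percentage of the sum, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Top-level rows of the splice table above.
categories = {
    "no splice problems": 2848,
    "explainable by alternate splicing": 522,
    "clashes": 468,
}
percentages = share_of_total(categories)
```

Under this assumption the computed values reproduce the 74.2%, 13.6% and 12.2% shown in the table.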
| | in ref. | in query |
|---|---|---|
| cassette exons | 47 | 2 |
| retained introns | 294 | 119 |
| early 3′ splices | 68 | 81 |
| late 5′ splices | 36 | 50 |
- cassette exon: an exon that falls completely within an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction.
- retained intron: an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns.
- early 3′ splices: two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon.
- late 5′ splices: two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron.
Most terminology is from Matlin AJ, et al. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.
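The two splice-site categories defined above can be told apart by comparing intron boundaries. The following is a sketch under stated assumptions: introns are `(start, end)` genomic coordinates with `start < end`, the 5′ splice site is `start` on the '+' strand and `end` on the '-' strand, and the function names and return labels are hypothetical.

```python
from typing import Tuple

Intron = Tuple[int, int]  # (start, end) in genomic coordinates, start < end


def splice_sites(intron: Intron, strand: str) -> Tuple[int, int]:
    """Return (5' splice site, 3' splice site) for the given strand."""
    start, end = intron
    return (start, end) if strand == "+" else (end, start)


def classify_intron_pair(a: Intron, b: Intron, strand: str = "+") -> str:
    """Compare a query intron with an overlapping reference intron."""
    a5, a3 = splice_sites(a, strand)
    b5, b3 = splice_sites(b, strand)
    if a5 == b5 and a3 == b3:
        return "agreement"
    if a5 == b5:
        return "early 3' splice"  # same 5' site, different 3' site
    if a3 == b3:
        return "late 5' splice"   # same 3' site, different 5' site
    return "other"                # neither site shared


same_donor = classify_intron_pair((100, 160), (100, 175))
same_acceptor = classify_intron_pair((100, 160), (90, 160))
```

Note that the same coordinates give the opposite label on the '-' strand, because the 5′ and 3′ ends of the intron swap.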
Possible Problems
A. terreus (top) followed by A. nidulans
| | count | AT1_FGENESH_3 (4008) | AT1_GENEID_1 (5744) | AT1_GENEWISE_2 (654) | | |
|---|---|---|---|---|---|---|
| short protein, < 50 aa | 0+0 | - | - | - | ← not tallied in problems | |
| shorter protein, < 30 aa | 0+0 | - | - | - | ||
| very short protein, < 10 aa | 0+0 | - | - | - | ||
| initial exon ≤ 6 nt | 141+4 | 63 | 82 | 0 | ||
| internal exon ≤ 6 nt | 42+1 | 42 | 0 | 1 | ||
| terminal exon ≤ 6 nt | 40+0 | 14 | 26 | 0 | ||
| ≥ 15 exons | 15+0 | 14 | 1 | 0 | ||
| intron ≥ 1000 nt | 2+2 | 4 | 0 | 0 | ||
| intron ≤ 20 nt | 9+2 | 0 | 10 | 1 | ||
| first codon not Met | 0+0 | - | - | - | ← not tallied in problems | |
| first codon not xTG | 0+0 | - | - | - | ||
| first codon not known START | 0+0 | - | - | - | ||
| last codon not known STOP | 1+3 | 4 | 0 | 0 | ||
| contains in-frame STOP | 0+0 | - | - | - | ||
| coding length not modulo 3 | 0+2 | 2 | 0 | 0 | ||
| non-canonical splicing | 0+0 | - | - | - | ||
| has ≥1 tagged BLAST hit | 6713+170 | 2841 | 3392 | 650 | ← not tallied in problems | |
| ≤1/3 as long as BLAST tags | 60+1 | 25 | 36 | 0 | ||
| ≥3× longer than BLAST tags | 161+5 | 67 | 92 | 7 | ||
| contains ≥1 N in exon | 0+1 | 0 | 0 | 1 | ||
| low-quality exonic sequence | 0+253 | 117 | 131 | 5 | ← not tallied in problems | |
| touches gap(s) | 36+21 | 45 | 10 | 2 | ||
| spans contigs | 36+21 | 45 | 10 | 2 | ||
| within 1 kb of contig edge | 234+57 | 143 | 137 | 11 | ← not tallied in problems | |
| any overlap (UTR or CDS) | 0+0 | - | - | - | ||
| CDS overlap only | 0+0 | - | - | - | ||
| CDS overlap > 50 nt | 0+0 | - | - | - | ||
| CDS overlap > 100 nt | 0+0 | - | - | - | ||
| CDS overlap > 200 nt | 0+0 | - | - | - | ||
| has predicted UTR | 0+0 | - | - | - | ← not tallied in problems | |
| UTR ≥ CDS length | 0+0 | - | - | - | ← not tallied in problems | |
| UTR is spliced | 0+0 | - | - | - | ← not tallied in problems | |
| one or more problems | 488+36 | 260 | 253 | 11 | ||
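Several of the structural rows in the table above are simple threshold tests on exon and intron lengths. The sketch below shows how such checks might be written; it is not the pipeline's code. It assumes coding exons are given in transcript order as half-open `(start, end)` intervals on the '+' strand, and that the initial/terminal exon checks apply only to multi-exon transcripts. The function name is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), '+' strand, transcript order


def structural_problems(exons: List[Interval]) -> List[str]:
    """Return the labels of the structural checks that this transcript fails."""
    lengths = [end - start for start, end in exons]
    introns = [exons[i + 1][0] - exons[i][1] for i in range(len(exons) - 1)]
    problems = []
    if len(exons) > 1:
        if lengths[0] <= 6:
            problems.append("initial exon ≤ 6 nt")
        if any(n <= 6 for n in lengths[1:-1]):
            problems.append("internal exon ≤ 6 nt")
        if lengths[-1] <= 6:
            problems.append("terminal exon ≤ 6 nt")
    if len(exons) >= 15:
        problems.append("≥ 15 exons")
    if any(n >= 1000 for n in introns):
        problems.append("intron ≥ 1000 nt")
    if any(n <= 20 for n in introns):
        problems.append("intron ≤ 20 nt")
    if sum(lengths) % 3 != 0:
        problems.append("coding length not modulo 3")
    return problems


flagged = structural_problems([(0, 5), (105, 305), (1405, 1506)])
```

In this example the transcript has a 5 nt initial exon and an 1100 nt intron, so `flagged` holds those two labels; its total coding length of 306 nt is a multiple of 3 and is not flagged.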




