Gene Finding Methods
Outline
- Overview
- Gene Structure Prediction
- Gene Naming
- Name counts
- Gene Numbering
- Structure Prediction Validation
- Overview of Query Genes
- Overview of Reference genes
- Feature Mapping
- Feature Comparisons
- Splice Analysis
- Possible Problems
Overview
This document explains how 8794 automated gene models were produced for the Batrachochytrium dendrobatidis genome.
The annotation was created in two steps:
- Final gene structures were predicted by combining predictions from GENEID, FGENESH, EST_GENES and manual annotation. This process is described in Gene Structure Prediction.
- Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in Gene Naming.
Gene Structure Prediction
Gene structures were predicted using a combination of manual annotation, FGENESH, GENEID and EST-based genes called FindORFs. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL. We used two versions of GENEID predictions. Both GENEID and FGENESH were trained using 600 EST-based gene models. GENEID_1 predictions were obtained using the default parameter for the trained GENEID while GENEID_2 predictions were produced by using a modified parameter file with adjusted intron-lengths to reduce concatenation of adjacent gene models. Where multiple predictions overlap each other and EST evidence, we choose the one most in accord with the EST splice sites. If we had sufficient EST coverage to build complete ORFs, we replaced any overlapping predicted gene model with the ORF predicted purely from ESTs.
FindORFs gene models are those clustered EST gene models with a valid start and stop codon. They were built as follows. First, ESTs are aligned to the genome and grouped into loci consisting of overlapping ESTs. Then, each locus is examined for compatible splicing. If two ESTs in the same locus have identical splice sites where they overlap, they are considered fragments of a larger transcript. Putative transcripts are incrementally built out by adding additional ESTs to either end. Each putative transcript is built from one or more ESTs, but may not represent the full biological transcript if the EST coverage is incomplete. We search each putative transcript for ORFs beginning with ATG and ending with a stop codon, with no frame shifts. If a putative transcript contains an ORF longer than 180 bases that covers 1/3 or more of its spliced length, we considered it a valid gene prediction. Further, we select only a subset of putative full-length gene models from these EST-based findORF transcripts that fully overlap the best ab inito gene prediction. An independent blast-based analysis was also carried out to validate both the length and the correctness of the reading frame by comparing them to the best hits known proteins in the NR database.
In the final step, we ran our automated gene caller to select the best gene model for each locus. Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence or blast evidence. Automated gene models overlapping tRNA/RNA features were removed from the final gene set. The resulting final gene set contains 8794 genes and 8819 transcripts.
Gene Naming
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene name that fall into 3 categories:
- NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein
- NAME is assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
- At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
- A minimum of 50% identity and 70% coverage of both the query and subject sequence.
- The name will follow one of these three formats:
- conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
- NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise:
- hypothetical protein similar to NAME
- Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits we choose the one with the highest average identity to the amount of overlap to the target gene.
- In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
- Hypothetical protein. Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
- BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
- Predicted protein. Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
Name counts
| "conserved hypothetical protein" | 2491 |
| "hypothetical protein" | 1107 |
| "predicted protein" | 3928 |
| hypothetical protein similar to... | 406 |
| other non-empty name | 887 |
Gene Numbering
Every annotated gene is given a Locus Number of the form BDEG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
Structure Prediction Validation
To evaluate the accuracy of our gene predictions, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.
The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5–10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in those two sections do not apply evenly to all predicted genes.
Overview of Query Genes
8794 genes:
8819 transcripts
(7801 spliced, 1018 unspliced;
4438/4381 +/-)
38552 exons, 29733 introns
| len | %cov | %gc | %at | |
|---|---|---|---|---|
| genic | 16053925 | 68.60 | 40.61 | 59.35 |
| intergenic | 7349693 | 31.40 | 36.41 | 63.68 |
| exonic | 12704410 | 54.28 | 42.43 | 57.57 |
| intronic | 3354268 | 14.33 | 33.70 | 66.10 |
| coding | 12441261 | 53.16 | 42.57 | 57.43 |
| 5′ UTR | 79125 | 0.34 | 40.76 | 59.24 |
| 3′ UTR | 197660 | 0.84 | 34.51 | 65.49 |
| alt. spliced | 4753 | 0.02 | 36.08 | 63.92 |
| genomic | 23403618 | 100.00 | 39.29 | 60.71 |
min | median | mean | max | |
| overall length (incl. UTR) | 90 | 1113 | 1444 | 14343 |
| coding length | 90 | 1077 | 1412 | 14265 |
| exons per transcript | 1 | 3 | 4.37 | 44 |
| exons per spliced transcript | 2 | 4 | 4.81 | 44 |
| bp per exon | 1 | 169 | 330 | 11725 |
| bp per intron | 27 | 90 | 113 | 997 |
| 5′ UTR bp | 1 | 46 | 94 | 801 |
| 3′ UTR bp | 1 | 83 | 114 | 2005 |
Overview of Reference genes
3678 genes:
4026 transcripts
(2748 spliced, 1278 unspliced;
2017/2009 +/-)
11893 exons, 7867 introns
| len | %cov | %gc | %at | |
|---|---|---|---|---|
| genic | 3686789 | 15.75 | 40.48 | 59.52 |
| intergenic | 19716829 | 84.25 | 39.07 | 60.93 |
| exonic | 2869567 | 12.26 | 42.38 | 57.62 |
| intronic | 837838 | 3.58 | 33.82 | 66.18 |
| coding | 0 | 0.00 | - | - |
| 5′ UTR | 0 | 0.00 | - | - |
| 3′ UTR | 0 | 0.00 | - | - |
| alt. spliced | 20616 | 0.09 | 35.39 | 64.61 |
| genomic | 23403618 | 100.00 | 39.29 | 60.71 |
min | median | mean | max | |
| overall length (incl. UTR) | 99 | 726 | 772 | 2908 |
| coding length | - | - | - | - |
| exons per transcript | 1 | 2 | 2.95 | 15 |
| exons per spliced transcript | 2 | 3 | 3.86 | 15 |
| bp per exon | 4 | 167 | 261 | 2908 |
| bp per intron | 27 | 88 | 112 | 993 |
| 5′ UTR bp | - | - | - | - |
| 3′ UTR bp | - | - | - | - |
Feature Mapping
Searched for mappings from 8794 query features to 3678 reference features.
Reference genes farther than 200bp from any query gene: 49
Containing 53 transcripts:
43 unspliced,
10 with canonical splices.
These may indicate missed genes.
72 query genes (0.8%) hit 66 reference genes (1.8%) on the opposing strand. Hits to opposing strands are considered misses in the following comparisions.
Strand mismatches such as these may be orientation problems in the reference set, or incorrect predictions in the query set. Unspliced reference transcripts are difficult to orient correctly. 46 reference gene(s) did not contain a single canonical splice site. These may easily have been placed on the wrong strand; queries hitting these genes are likely to be oriented correctly. 20 reference gene(s) had at least one canonical splice site; everything else being equal, their orientation is more likely to be correct than the queries hitting them.
The table below shows how query genes (rows, roman type) map to reference genes (columns, italic type). We cluster genes into overlapping groups on the same strand and record eight types of relationship. To the left of each arrow is the number of query genes in each type of cluster; the number of reference genes is to the right.
| none | reference one | many | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| none | - | 0 | ↔ | 140 | 0 | ↔ | 0 | |||
| query | one | 5646 | ↔ | 0 | 2580 | ↔ | 2580 | 421 | ↔ | 869 |
| many | 0 | ↔ | 0 | 117 | ↔ | 58 | 30 | ↔ | 31 | |
| counts per cluster type | ||||||||||
| none | reference one | many | ||||||||
| none | - | 0 | ↔ | 3.8 | 0 | ↔ | 0 | |||
| query | one | 64.2 | ↔ | 0 | 29.3 | ↔ | 70.1 | 4.8 | ↔ | 23.6 |
| many | 0 | ↔ | 0 | 1.3 | ↔ | 1.6 | 0.3 | ↔ | 0.8 | |
| percentages per cluster type | ||||||||||
one↔one clear, unambiguous mappings from one query gene to one reference gene. none↔one reference genes that do not map to any query gene (misses). none↔many reference genes that overlap other reference genes but do not map to query genes. None↔X mappings may represent underpredictions. one↔none query genes that do not map to any reference gene. many↔none query genes that overlap other query genes but do not map to reference genes. X↔none mappings may represent overpredictions or deficiencies in the reference gene set. one↔many single query genes overlapped by multiple reference genes (merges). These may occur if the reference set contains partial genes. many↔one single reference genes spanning multiple query genes (splits). These are often annotation errors. many↔many complex clusters where multiple query genes map to multiple reference genes.
Feature Comparisons
Predicted genes that hit reference genes on the wrong strand, or hit reference genes containing only non–canonically-spliced transcripts, are ignored in the following calculations. When scoring genes with multiple transcripts, we select the single best transcript↔transcript comparison. For clusters in the one↔many and many↔many categories above, each query is compared to as many references as possible. The best non-overlapping comparisons are used to score these loci.
| #comp | TP | TN | FP | FN | ||
|---|---|---|---|---|---|---|
| nucleotides | 3448660 | 2615258 | 778535 | 8409 | 46458 | |
| splice sites | 17897 | 16195 | - | 1065 | 637 | |
sn | sp | smc | acp | ac | cc | |
| nucleotides | 0.9825 | 0.9968 | 0.9841 | 0.9781 | 0.9562 | 0.9560 |
| splice sites | 0.9622 | 0.9383 | 0.9049 | - | - | - |
| #comp | TP | TN | FP | FN | ||
|---|---|---|---|---|---|---|
| introns | 7346 | 6621 | - | 235 | 15 | |
| exons | 10306 | 9254 | - | 47 | 53 | |
sn | sp | sn_sp | wrong | miss | ||
| introns | 0.9312 | 0.9013 | 0.9163 | 0.0021 | 0.0320 | |
| exons | 0.9042 | 0.8979 | 0.9011 | 0.0052 | 0.0046 |
#comp the number of subfeatures compared.
TP true positives: nucleotides predicted as exonic in both
query and reference; or, splice junctions with exact agreement, in
position and type (donor:donor, acceptor:acceptor), between query and
reference; or, introns or exons with both splice sites in exact
agreement with the reference.
TN true negatives: nucleotides predicted as intronic in
both query and reference; not defined for splice sites, introns or
exons.
FP false positives (overpredictions): nucleotides marked
as exonic in the query but intronic in the reference; or, splice
junctions identified only in the query*; or query introns/exons that
do not overlap a reference intron/exon.
FN false negatives (underpredictions): nucleotides marked as
intronic in the query but exonic in the reference; or, splice
junctions identified only in the reference*; or reference
introns/exons that do not overlap a query intron/exon.
sn sensitivity,
The table below briefly summarizes the relative coverage of
overlapping query and reference genes. The first column shows the
amount of sequence (in bp and percent) in each gene set that is
overlapped by sequence in the other gene set. The next three columns
show the overhanging sequence from transcripts that overlap but are
offset. Overhangs are categorized as 5′ (one transcript begins
upstream of the other), 3′ (one transcript extends downstream
past the other), and mixed (indeterminate). Finally, the last column,
"complete miss", counts the number of bp in queries that did not
hit a reference at all, and vice versa.
A high degree of overlap can indicate a good match between the query
and reference sets. A high "missed" score may simply indicate a small
reference set.
| overlap | overhang, 5′ | overhang, 3′ | overhang, mixed | complete miss | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| query | 3448660 bp | / | 21.5% | 2210073 bp | / | 13.8% | 682301 bp | / | 4.2% | 768965 bp | / | 4.8% | 8958001 bp | / | 55.8% |
| reference | 3448660 bp | / | 91.9% | 63305 bp | / | 1.7% | 108584 bp | / | 2.9% | 21037 bp | / | 0.6% | 111292 bp | / | 3.0% |
Splice Analysis
16195 splice agreements, 1702 disagreements.
4477 ignored:
170 due to EST misalignment,
4307 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 6376/6621
5 query transcripts contained noncanonical splices.
| transcripts with no splice problems | 2472 | 78.5% |
| ... with complete reference coverage | 609 | 19.3% |
| explainable by alternate splicing | 394 | 12.5% |
| ... with spliced reference | 312 | 9.9% |
| clashes | 282 | 9.0% |
| ... with spliced reference | 227 | 7.2% |
Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 12.5% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
| in ref. | in query | ||
|---|---|---|---|
| cassette exons | ![]() | 24 | 3 |
| retained introns | ![]() | 190 | 15 |
| early 3′ splices | ![]() | 39 | 71 |
| late 5′ splices | ![]() | 70 | 70 |
cassette exon an exon that falls completely with an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction. retained intron an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns. early 3′ splices two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon. late 5′ splices two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron. Most terminology is from Matlin AJ, et. al. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.
Possible Problems
| {multiple sources} | 3849 | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| BD_JEL423_FINDORFS_5 | 266 | ||||||||
| BD_JEL423_FGENESH_6 | 1991 | ||||||||
| BD_JEL423_GENEID_1 | 397 | ||||||||
| BD_JEL423_MANUAL_1 | 56 | ||||||||
| BD_JEL423_GENEID_2 | 2260 | ||||||||
| short proteins < 50aa | 24 | 7 | 0 | 4 | 3 | 1 | 9 | ← not tallied in problems | |
| shorter proteins < 30aa | 0 | - | - | - | - | - | - | ||
| very short proteins < 10aa | 0 | - | - | - | - | - | - | ||
| initial exon ≤ 6bp | 107 | 29 | 0 | 39 | 6 | 0 | 33 | ||
| internal exon ≤ 6bp | 17 | 1 | 2 | 14 | 0 | 0 | 0 | ||
| terminal exon ≤ 6bp | 102 | 37 | 0 | 3 | 8 | 0 | 54 | ||
| ≥ 15 exons | 201 | 28 | 0 | 81 | 22 | 8 | 62 | ||
| intron ≥ 1000bp | 0 | - | - | - | - | - | - | ||
| intron ≤ 20bp | 0 | - | - | - | - | - | - | ||
| first codon not Met | 4 | 2 | 0 | 0 | 0 | 2 | 0 | ← not tallied in problems | |
| first codon not xTG | 3 | 2 | 0 | 0 | 0 | 1 | 0 | ||
| first codon not known START | 3 | 2 | 0 | 0 | 0 | 1 | 0 | ||
| last codon not STOP | 12 | 6 | 0 | 3 | 0 | 2 | 1 | ||
| contains in-frame STOP | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ||
| coding length not modulo 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ||
| non-canonical splicing | 3 | 0 | 0 | 0 | 0 | 3 | 0 | ||
| has ≥1 good BLAST hit | 3526 | 1464 | 153 | 1169 | 276 | 40 | 424 | ← not tallied in problems | |
| ≤1/3 as long as BLAST hit | 0 | - | - | - | - | - | - | ||
| ≥3× longer than BLAST hit | 207 | 81 | 0 | 63 | 32 | 2 | 29 | ||
| contains ≥1 N in exon | 0 | - | - | - | - | - | - | ||
| contains low-quality sequence | 1958 | 799 | 38 | 482 | 95 | 20 | 524 | ← not tallied in problems | |
| touches gap(s) | 47 | 3 | 0 | 28 | 6 | 0 | 10 | ||
| spans contigs | 47 | 3 | 0 | 28 | 6 | 0 | 10 | ||
| within 1kb of contig edge | 430 | 175 | 4 | 103 | 25 | 2 | 121 | ← not tallied in problems | |
| any overlap (UTR or CDS) | 59 | 30 | 27 | 32 | 5 | 8 | 18 | ← in 59 clusters | |
| CDS overlap only | 0 | - | - | - | - | - | - | ||
| CDS overlap > 50bp | 0 | - | - | - | - | - | - | ||
| CDS overlap > 100bp | 0 | - | - | - | - | - | - | ||
| CDS overlap > 200bp | 0 | - | - | - | - | - | - | ||
| has predicted UTR | 1886 | 842 | 264 | 517 | 79 | 50 | 134 | ← not tallied in problems | |
| UTR ≥ 50% length | 68 | 23 | 25 | 6 | 3 | 6 | 5 | ← not tallied in problems | |
| UTR is spliced | 178 | 79 | 46 | 15 | 9 | 16 | 13 | ← not tallied in problems | |
| one or more problems | 767 | 207 | 29 | 244 | 70 | 19 | 198 | ||




