Aspergillus nidulans Automated Gene Calling
- Overview
- Gene Structure Prediction
- Manual Annotation by TIGR
- Gene Naming
- Gene Locus Numbers
- Structure Prediction Validation
Overview
This document describes some of the details of the methodology used to produce the final gene calls for the genome of Aspergillus nidulans.
The gene calls in the current release were produced in a five-step procedure:
- Gene location and structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. This process is described in section Gene Structure Prediction.
- The Eukaryotic Annotation team TIGR led by Jennifer Wortman, refined the gene structure of automated gene calls and produced a final set of 10,701 genes in the current release. This process is described in section Manual annotation by TIGR.
- The gene predictions from the TIGR Annotation team were then trimmed to removed untranslated regions from the 5' and 3' ends, resulting in the final gene set used in this release. The untrimmed predictions have been preserved as "TIGR-updated Broad" features in this release.
- Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
- The newly created genes were compared against EST data to evaluate accuracy. This process is described in section Structure Prediction Validation.
Gene Structure Prediction
Gene structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. Both FGENESH and FGENESH+ are gene prediction programs acquired from Softberry.com and GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.
Both FGENESH and FGENESH+ utilize a statistical model of gene structure that require training on each organism for accurate prediction. FGENESH+ additionally combines a protein sequence with the statistical model to improve accuracy. We acquired these programs already trained by Softberry on Aspergillus nidulans sequences. FGENESH was used by MIPS for their automated annotation of LGII and LGV.
GENEWISE (as we ran it), splices and aligns a protein sequence with genomic sequence to predict a gene structure. Although GENEWISE does utilize some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein assuming that introns start at GT and end at AG most of the time and in some cases this results a full alignment of the protein to the genome. Since we are interested in predicting complete gene structures, we post-processed GENEWISE incomplete protein alignments by moving the first and last exon upstream or downstream to the nearest start and stop codons respectively.
If a stop codon was encountered upstream of a gene before a start could be found, the gene call was not used.
An assessment of the accuracy of GENEWISE as well as FGENESH, and FGENESH+ is described below in section Structure Prediction Validation.
Briefly, these three gene callers were combined in the following manner:
- FGENESH was run on the entire genomic sequence to provide an initial set of predicted genes. Each FGENESH predicted was put into a set of EVIDENCE_GENES.
- The genome was also searched against the non-redundant protein database using BLASTX
- Regions of the genome with blastx homology spanning over 80% of a protein (when sub-alignments are stitched together in a consistent fashion) were considered "Homologous Gene Regions" (HGRs).
- HGRs were clustered into groups of HGRs that all implicated the same gene structure (most often representing groups of essentially orthologous proteins).
- For each cluster of HGRs, the protein showing the most sequence similarity to the genome was passed to both FGENESH and GENEWISE to produce 2 gene predictions, if the protein had >80% amino acid identity to the translated genome (cumulative across sub-alignments).
- If the protein used in the previous had >90% amino acid identity to the translated genome (cumulative across sub-alignments), then the GENEWISE call, if valid, was favored over the FGENESH+ call, and was used as the EVIDENCE_GENE for the HGR (see below for the reason why) and added to the set of EVIDENCE_GENES. If this protein had >80% but less than 90% amino acid identity to the translated genome (cumulative across sub-alignments), then the FGENESH+ call, if valid, was favored over the GENEWISE call, and was used as the EVIDENCE_GENE for the HGR (see Structure Prediction Validation for the reason why) and added to the set of EVIDENCE_GENES.
- When EVIDENCE_GENES overlapped in their exons, the EVIDENCE_GENE with the least amount of homology support (as measured by the sequence similarity of the protein used to make the call or zero for FGENESH calls) was removed from the set of EVIDENCE_GENES.
- All remaining EVIDENCE_GENES were then called as our official ANNOTATED_GENES and passed to the next step of gene calling for Gene Naming.
(Since EVIDENCE_GENES represent potential alternate gene predictions that may be based on homology, these genes are available to the user on the website)
Manual Annotation by TIGR
The original Broad Institute data set consisted of 9,541 protein-coding genes with an AN##### locus designation. There was also a set of 426 genes with fewer than 100 amino acids that were not released to GenBank with an ANS locus designation. The manual annotation team used both of these data sets as a starting point for annotation. The PASA pipeline was used to align all available EST data to the genome and compare it to the annotation dataset. Over 1,000 gene structure updates were performed by the PASA software, another approximately 2,000 genes and intergenic regions were flagged for manual review. In addition to the EST data, the team also relied on protein alignments and the output of an internal software package, called EvidenceModeler, to help in determining candidates for gene edits, most notably splitting of inappropriately merged gene models. At the end of the curation phase, the revised gene set was mapped to the original locus identifiers.
Some statistics:
Total number of loci: 10,701
One to one mapping between loci: 9,447 (include AN and ANS loci, some of the sequences have changed)
Original locus split into two or more loci: 494
Original loci merged into single locus: 16
New loci: 214
Gene Naming
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene name that fall into 3 categories:
- NAME, or
hypothetical protein similar to NAME, or
conserved hypothetical proteinAssigned to gene predictions where there is excellent homology to an known NR protein. The criteria for this category are:
- At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect ≤ 1e-10), with
- ≥ 80% identity and ≥ 80% coverage of both the query and subject sequence.
The name will follow one of these three formats:
- conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
- NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise:
- hypothetical protein similar to NAME
Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits we choose the one with the highest average identity × the amount of overlap to the target gene.
In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra whitespace, etc.
- Hypothetical protein
Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
- BLASTP hit to NR (complexity filtering off, -F F, expect ≤ 1e-10)
- Predicted protein
Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
Name counts
| "conserved hypothetical protein" | 8216 |
| "hypothetical protein" | 1345 |
| "hypothetical protein similar to ORF" | 25 |
| "predicted protein" | 470 |
| hypothetical protein similar to... | 434 |
| other non-empty name | 211 |
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form AN##### that should be considered the only guaranteed way to identify a gene uniquely and positively. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. Encoding attributes of an object in the identifier for an object we feel is a bad idea. Position is an attribute of a gene that can be retrieved by the locus.
With each new assembly, we do our best to map all genes from the previous assembly and thus preserve loci. Any loci that cannot be mapped will be retired. New genes will receive new loci. Each gene also has a version attribute (so loci are in fact displayed as AN#####.version). When genes are mapped from one assembly to another or if a gene call is altered, we will increment this version. Thus a particular AN#####.version will refer to a particular instantiation and annotation of the locus AN#####.
Note: We assigned new 'AN' locus identifiers to those genes that originally had an 'ANS' identifier, merged genes, split genes, and new loci. The TIGR assigned locus identifiers can be distinguished in that they contain 5 numerals after the prefix, as the gene number is over 10,000
Structure Prediction Validation
To evaluate the accuracy of our gene predictions for Coccidioides immitis assembly 2, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.
The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5–10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in these two sections do not apply evenly to all predicted genes.
Overview of query genes
10701 genes: 10701 transcripts,
9227 spliced, 1474 unspliced
35525 exons, 24824 introns
| min | median | mean | max | |
|---|---|---|---|---|
| overall length (incl. UTR) | 30 | 1233 | 1447 | 21645 |
| coding length | 30 | 1233 | 1447 | 21645 |
| exons per transcript | 1 | 3 | 3.32 | 26 |
| exons per spliced transcript | 2 | 3 | 3.69 | 26 |
| bp per exon | 1 | 219 | 435 | 18015 |
| bp per intron | 1 | 60 | 88 | 1628 |
len |
%cov |
%gc |
%at |
|
| exonic | 15477584 | 51.47 | 53.29 | 46.71 |
| intronic | 2193876 | 7.30 | 45.81 | 54.19 |
| intergenic | 12397054 | 41.23 | 47.42 | 52.58 |
Overview of reference genes
5112 genes: 5743 transcripts,
3557 spliced, 2186 unspliced
12999 exons, 7256 introns
| min | median | mean | max | |
|---|---|---|---|---|
| overall length (incl. UTR) | 50 | 734 | 734 | 3177 |
| coding length | 50 | 734 | 734 | 3177 |
| exons per transcript | 1 | 2 | 2.26 | 11 |
| exons per spliced transcript | 2 | 3 | 3.04 | 11 |
| bp per exon | 3 | 251 | 324 | 3073 |
| bp per intron | 20 | 60 | 80 | 766 |
len |
%cov |
%gc |
%at |
|
| exonic | 3056766 | 10.17 | 51.83 | 48.17 |
| intronic | 340667 | 1.13 | 46.01 | 53.99 |
| intergenic | 26671081 | 88.70 | 50.21 | 49.79 |
Feature mapping
Searched for mappings from 10701 query features to 5112 reference features.
Reference genes farther than 200bp from any query gene: 183
Containing 187 transcripts:
137 unspliced,
25 with non-canonical splices,
25 with canonical splices.
These may indicate missed genes.
363 query genes (3.4%) hit a reference
on the opposite strand.
2050 reference genes (40.1%) hit a query on
the opposite strand.
Gene hits on opposite strands may be orientation problems in the reference set and are considered misses in the following comparisions. The table below shows how query genes (rows, roman type) map to reference genes (columns, italic type). We cluster genes into overlapping groups on the same strand and record eight types of relationship. To the left of each arrow is the number of query genes in each type of cluster; the number of reference genes is to the right.
| none | reference one |
many | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| none | - | 0 | ↔ | 2336 | 0 | ↔ | 4 | |||
| query | one | 8178 | ↔ | 0 | 2271 | ↔ | 2271 | 238 | ↔ | 492 |
| many | 0 | ↔ | 0 | 10 | ↔ | 5 | 4 | ↔ | 4 | |
| counts per cluster type | ||||||||||
| none | reference one |
many | ||||||||
| none | - | 0 | ↔ | 45.7 | 0 | ↔ | 0.1 | |||
| query | one | 76.4 | ↔ | 0 | 21.2 | ↔ | 44.4 | 2.2 | ↔ | 9.6 |
| many | 0 | ↔ | 0 | 0.1 | ↔ | 0.1 | 0.0 | ↔ | 0.1 | |
| percentages per cluster type | ||||||||||
one↔one clear, unambiguous mappings from one query gene to one reference gene. none↔one reference genes that do not map to any query gene (misses). none↔many reference genes that overlap other reference genes but do not map to query genes. None↔X mappings may represent underpredictions. one↔none query genes that do not map to any reference gene. many↔none query genes that overlap other query genes but do not map to reference genes. X↔none mappings may represent overpredictions or deficiencies in the reference gene set. one↔many single query genes overlapped by multiple reference genes (merges). These may occur if the reference set contains partial genes. many↔one single reference genes spanning multiple query genes (splits). These are often annotation errors. many↔many complex clusters where multiple query genes map to multiple reference genes.
Feature comparisons
Predicted genes that missed a reference gene, or hit one on the wrong strand, are ignored in the following calculations. We select the single best transcript↔transcript match when comparing genes with multiple transcripts. For clusters in the one↔many, many↔one, and many↔many categories above, the best query↔reference gene comparison is chosen per query transcript.
| #comp | TP | TN | FP | FN | ||
|---|---|---|---|---|---|---|
| nucleotides | 2522 | 1493140 | 240752 | 9355 | 15255 | |
| splice sites | 2377 | 6201 | - | 285 | 175 | |
sn |
sp |
smc |
acp |
ac |
cc |
|
| nucleotides | 0.9899 | 0.9938 | 0.9860 | 0.9717 | 0.9433 | 0.9433 |
| splice sites | 0.9726 | 0.9561 | 0.9309 | - | - | - |
#comp number of feature:feature comparisons. The
number of comparisons on the splice level may be lower
because our splice comparison algorithm ignores
reference genes with non-canonical splices.
TP true positives: nucleotides
predicted as exonic in both query and reference; or,
splice junctions with exact agreement, in position and
type (donor:donor, acceptor:acceptor), between query and
reference. TN true negatives:
nucleotides predicted as intronic in both query and
reference; not defined for splice sites.
FP false positives (overpredictions):
nucleotides marked as exonic in the query but intronic in the
reference; or, splice junctions identified only in the
query.* FN false negatives
(underpredictions): nucleotides marked as
intronic in the query but exonic in the reference; or, splice
junctions identified only in the reference.*
sn sensitivity,
The table below briefly summarizes the relative coverage of overlapping query and reference genes. The first column shows the amount of sequence (in bp and percent) in each gene set that is overlapped by sequence in the other gene set. The other columns show the overhanging sequence on both the 5′ and 3′ sides. As above, query genes that miss the reference, or reference genes missed by the query, are not considered in this table. A high degree of overlap can indicate a good match between the query and reference sets.
| overlap | overhang, 5′ | overhang, 3′ | |||||||
|---|---|---|---|---|---|---|---|---|---|
| query | 1758502 bp | / | 42.4% | 979406 bp | / | 23.6% | 1411331 bp | / | 34.0% | reference | 1758502 bp | / | 79.8% | 238613 bp | / | 10.8% | 206972 bp | / | 9.4% |
Splice analysis
6201 splice agreements, 460 disagreements,
4652 ignored at ends.
perfect exon:exon/intron:intron matches: 1573/3061
0 query transcripts contained noncanonical splices.
145 reference transcripts contained noncanonical splices
and were ignored in the following comparisions.
| transcripts with no splice problems | 2197 | 92.4% |
| explainable by alternate splicing | 26 | 1.1% |
| clashes | 154 | 6.5% |
Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 1.1% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
| in ref. | in query | ||
|---|---|---|---|
| cassette exons | ![]() |
16 | 3 |
| retained introns | ![]() |
2 | 8 |
| early 3′ splices | ![]() |
12 | 12 |
| late 5′ splices | ![]() |
4 | 6 |
cassette exon an exon that falls completely with an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction. retained intron an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns. early 3′ splices two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon. late 5′ splices two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron. Most terminology is from Matlin AJ, et. al. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.
Possible problems
| AN1_TIGR_UPDATED | 10701 | |||
|---|---|---|---|---|
| short proteins < 50aa | 177 | 177 | ← not tallied in problems | |
| shorter proteins < 30aa | 85 | 85 | ||
| very short proteins < 10aa | 0 | - | ||
| exon-less transcripts | 0 | - | ||
| initial exon ≤ 6bp | 204 | 204 | ||
| internal exon ≤ 6bp | 85 | 85 | ||
| terminal exon ≤ 6bp | 72 | 72 | ||
| intron ≥ 1000bp | 5 | 5 | ||
| coding length not mod 3 | 24 | 24 | ||
| first codon not START | 25 | 25 | ||
| last codon not STOP | 29 | 29 | ||
| contains 1+ N in exon | 0 | - | ||
| non-canonical splicing | 0 | - | ||
| overlapping | 35 | 70 | ← in 35 clusters | |
| spanning contigs | 0 | - | ||
| one or more problems | 488 | 453 | ||




