Aspergillus nidulans Automated Gene Calling

Overview

This document describes some of the details of the methodology used to produce the final gene calls for the genome of Aspergillus nidulans.

The gene calls in the current release were produced in a five-step procedure:

  1. Gene location and structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. This process is described in section Gene Structure Prediction.
  2. The Eukaryotic Annotation team TIGR led by Jennifer Wortman, refined the gene structure of automated gene calls and produced a final set of 10,701 genes in the current release. This process is described in section Manual annotation by TIGR.
  3. The gene predictions from the TIGR Annotation team were then trimmed to removed untranslated regions from the 5' and 3' ends, resulting in the final gene set used in this release. The untrimmed predictions have been preserved as "TIGR-updated Broad" features in this release.
  4. Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
  5. The newly created genes were compared against EST data to evaluate accuracy. This process is described in section Structure Prediction Validation.

Gene Structure Prediction

Gene structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. Both FGENESH and FGENESH+ are gene prediction programs acquired from Softberry.com and GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.

Both FGENESH and FGENESH+ utilize a statistical model of gene structure that require training on each organism for accurate prediction. FGENESH+ additionally combines a protein sequence with the statistical model to improve accuracy. We acquired these programs already trained by Softberry on Aspergillus nidulans sequences. FGENESH was used by MIPS for their automated annotation of LGII and LGV.

GENEWISE (as we ran it), splices and aligns a protein sequence with genomic sequence to predict a gene structure. Although GENEWISE does utilize some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein assuming that introns start at GT and end at AG most of the time and in some cases this results a full alignment of the protein to the genome. Since we are interested in predicting complete gene structures, we post-processed GENEWISE incomplete protein alignments by moving the first and last exon upstream or downstream to the nearest start and stop codons respectively.

If a stop codon was encountered upstream of a gene before a start could be found, the gene call was not used.

An assessment of the accuracy of GENEWISE as well as FGENESH, and FGENESH+ is described below in section Structure Prediction Validation.

Briefly, these three gene callers were combined in the following manner:

  1. FGENESH was run on the entire genomic sequence to provide an initial set of predicted genes. Each FGENESH predicted was put into a set of EVIDENCE_GENES.
  2. The genome was also searched against the non-redundant protein database using BLASTX
  3. Regions of the genome with blastx homology spanning over 80% of a protein (when sub-alignments are stitched together in a consistent fashion) were considered "Homologous Gene Regions" (HGRs).
  4. HGRs were clustered into groups of HGRs that all implicated the same gene structure (most often representing groups of essentially orthologous proteins).
  5. For each cluster of HGRs, the protein showing the most sequence similarity to the genome was passed to both FGENESH and GENEWISE to produce 2 gene predictions, if the protein had >80% amino acid identity to the translated genome (cumulative across sub-alignments).
  6. If the protein used in the previous had >90% amino acid identity to the translated genome (cumulative across sub-alignments), then the GENEWISE call, if valid, was favored over the FGENESH+ call, and was used as the EVIDENCE_GENE for the HGR (see below for the reason why) and added to the set of EVIDENCE_GENES. If this protein had >80% but less than 90% amino acid identity to the translated genome (cumulative across sub-alignments), then the FGENESH+ call, if valid, was favored over the GENEWISE call, and was used as the EVIDENCE_GENE for the HGR (see Structure Prediction Validation for the reason why) and added to the set of EVIDENCE_GENES.
  7. When EVIDENCE_GENES overlapped in their exons, the EVIDENCE_GENE with the least amount of homology support (as measured by the sequence similarity of the protein used to make the call or zero for FGENESH calls) was removed from the set of EVIDENCE_GENES.
  8. All remaining EVIDENCE_GENES were then called as our official ANNOTATED_GENES and passed to the next step of gene calling for Gene Naming.

(Since EVIDENCE_GENES represent potential alternate gene predictions that may be based on homology, these genes are available to the user on the website)

Manual Annotation by TIGR

The original Broad Institute data set consisted of 9,541 protein-coding genes with an AN##### locus designation. There was also a set of 426 genes with fewer than 100 amino acids that were not released to GenBank with an ANS locus designation. The manual annotation team used both of these data sets as a starting point for annotation. The PASA pipeline was used to align all available EST data to the genome and compare it to the annotation dataset. Over 1,000 gene structure updates were performed by the PASA software, another approximately 2,000 genes and intergenic regions were flagged for manual review. In addition to the EST data, the team also relied on protein alignments and the output of an internal software package, called EvidenceModeler, to help in determining candidates for gene edits, most notably splitting of inappropriately merged gene models. At the end of the curation phase, the revised gene set was mapped to the original locus identifiers.

Some statistics:
Total number of loci: 10,701
One to one mapping between loci: 9,447 (include AN and ANS loci, some of the sequences have changed)
Original locus split into two or more loci: 494
Original loci merged into single locus: 16
New loci: 214

Gene Naming

Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently 5 types of gene name that fall into 3 categories:

  1. NAME, or
    hypothetical protein similar to NAME, or
    conserved hypothetical protein

    Assigned to gene predictions where there is excellent homology to an known NR protein. The criteria for this category are:

    • At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect ≤ 1e-10), with
    • ≥ 80% identity and ≥ 80% coverage of both the query and subject sequence.

    The name will follow one of these three formats:

    • conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
    • NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise:
    • hypothetical protein similar to NAME

    Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits we choose the one with the highest average identity × the amount of overlap to the target gene.

    In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra whitespace, etc.

  2. Hypothetical protein

    Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:

    • BLASTP hit to NR (complexity filtering off, -F F, expect ≤ 1e-10)

  3. Predicted protein

    Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.

Name counts

26 transcript(s) had different names than their parent.
670 transcript(s) had non-generic names.

"conserved hypothetical protein" 8216
"hypothetical protein" 1345
"hypothetical protein similar to ORF" 25
"predicted protein" 470
 hypothetical protein similar to... 434
 other non-empty name 211

Gene Locus Numbers

Every annotated gene is given a Locus Number of the form AN##### that should be considered the only guaranteed way to identify a gene uniquely and positively. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. Encoding attributes of an object in the identifier for an object we feel is a bad idea. Position is an attribute of a gene that can be retrieved by the locus.

With each new assembly, we do our best to map all genes from the previous assembly and thus preserve loci. Any loci that cannot be mapped will be retired. New genes will receive new loci. Each gene also has a version attribute (so loci are in fact displayed as AN#####.version). When genes are mapped from one assembly to another or if a gene call is altered, we will increment this version. Thus a particular AN#####.version will refer to a particular instantiation and annotation of the locus AN#####.

Note: We assigned new 'AN' locus identifiers to those genes that originally had an 'ANS' identifier, merged genes, split genes, and new loci. The TIGR assigned locus identifiers can be distinguished in that they contain 5 numerals after the prefix, as the gene number is over 10,000

Structure Prediction Validation

To evaluate the accuracy of our gene predictions for Coccidioides immitis assembly 2, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.

The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5–10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in these two sections do not apply evenly to all predicted genes.

Overview of query genes

10701 genes: 10701 transcripts, 9227 spliced, 1474 unspliced
35525 exons, 24824 introns

min median mean max
overall length (incl. UTR) 30 1233 1447 21645
coding length 30 1233 1447 21645
exons per transcript 1 3 3.32 26
exons per spliced transcript 2 3 3.69 26
bp per exon 1 219 435 18015
bp per intron 1 60 88 1628

len

%cov

%gc

%at
exonic 15477584 51.47 53.29 46.71
intronic 2193876 7.30 45.81 54.19
intergenic 12397054 41.23 47.42 52.58

Overview of reference genes

5112 genes: 5743 transcripts, 3557 spliced, 2186 unspliced
12999 exons, 7256 introns

min median mean max
overall length (incl. UTR) 50 734 734 3177
coding length 50 734 734 3177
exons per transcript 1 2 2.26 11
exons per spliced transcript 2 3 3.04 11
bp per exon 3 251 324 3073
bp per intron 20 60 80 766

len

%cov

%gc

%at
exonic 3056766 10.17 51.83 48.17
intronic 340667 1.13 46.01 53.99
intergenic 26671081 88.70 50.21 49.79

Feature mapping

Searched for mappings from 10701 query features to 5112 reference features.

Reference genes farther than 200bp from any query gene: 183
Containing 187 transcripts: 137 unspliced, 25 with non-canonical splices, 25 with canonical splices.
These may indicate missed genes.

363 query genes (3.4%) hit a reference on the opposite strand.
2050 reference genes (40.1%) hit a query on the opposite strand.

Gene hits on opposite strands may be orientation problems in the reference set and are considered misses in the following comparisions. The table below shows how query genes (rows, roman type) map to ref­er­ence genes (col­umns, ita­lic type). We clus­ter genes into over­lap­ping groups on the same strand and re­cord eight types of rela­tion­ship. To the left of each arrow is the num­ber of query genes in each type of clus­ter; the num­ber of ref­er­ence genes is to the right.

none reference
one
many
none - 0 2336 0 4
query one 8178 0 2271 2271 238 492
many 0 0 10 5 4 4
counts per cluster type
 
none reference
one
many
none - 0 45.7 0 0.1
query one 76.4 0 21.2 44.4 2.2 9.6
many 0 0 0.1 0.1 0.0 0.1
percentages per cluster type

one↔one clear, unam­big­uous map­pings from one query gene to one ref­er­ence gene. none↔one ref­er­ence genes that do not map to any query gene (misses). none↔many ref­er­ence genes that over­lap other ref­er­ence genes but do not map to query genes. None↔X map­pings may rep­re­sent under­pre­dic­tions. one↔none query genes that do not map to any ref­er­ence gene. many↔none query genes that over­lap other query genes but do not map to ref­er­ence genes. X↔none map­pings may repre­sent over­pre­dic­tions or de­fi­cien­cies in the ref­er­ence gene set. one↔many single query genes over­lap­ped by mul­ti­ple ref­er­ence genes (merges). These may oc­cur if the ref­er­ence set con­tains par­tial genes. many↔one single ref­er­ence genes span­ning mul­ti­ple query genes (splits). These are often an­no­ta­tion errors. many↔many com­plex clus­ters where multiple query genes map to multiple ref­er­ence genes.

Feature comparisons

Pre­dict­ed genes that missed a ref­er­ence gene, or hit one on the wrong strand, are ig­nored in the fol­low­ing cal­cula­tions. We se­lect the sin­gle best tran­script↔tran­script match when com­par­ing genes with mul­ti­ple tran­scripts. For clus­ters in the one↔many, many↔one, and many↔many cate­gor­ies above, the best query↔ref­er­ence gene com­par­i­son is chosen per query transcript.

#comp TP TN FP FN
nucleotides 2522 1493140 240752 9355 15255
splice sites 2377 6201 - 285 175

sn

sp

smc

acp

ac

cc
nucleotides 0.9899 0.9938 0.9860 0.9717 0.9433 0.9433
splice sites 0.9726 0.9561 0.9309 - - -

#comp number of feature:feature com­par­isons. The num­ber of com­par­isons on the splice level may be lower be­cause our splice com­par­ison algo­rithm ig­nores ref­er­ence genes with non-canon­ical splices. TP true pos­itives: nuc­leo­tides pre­dict­ed as exonic in both query and ref­er­ence; or, splice junc­tions with exact agree­ment, in posi­tion and type (donor:donor, accep­tor:accep­tor), be­tween query and ref­er­ence. TN true nega­tives: nuc­leo­tides pre­dict­ed as intronic in both query and ref­er­ence; not de­fin­ed for splice sites. FP false posi­tives (over­pre­dic­tions): nuc­leo­tides marked as exonic in the query but intronic in the refer­ence; or, splice junc­tions ident­ified only in the query.* FN false nega­tives (under­pre­dic­tions): nuc­leo­tides marked as intronic in the query but exonic in the ref­er­ence; or, splice junc­tions iden­ti­fied only in the ref­er­ence.* sn sen­si­tiv­ity, TP ÷ (TP+FN). sp spe­ci­fic­ity, TP ÷ (TP+FP). smc sim­ple match­ing co­effi­cient, (TP+TN) ÷ (TP+FN+FP+TN). acp av­er­age con­di­tion­al prob­ability. ac ap­prox­imate cor­re­la­tion. cc cor­re­la­tion co­ef­fi­cient. These last three terms are des­cribed in detail in the cit­a­tion below, and are not well de­fined for splices. *In the very rare cases where the query pre­dicts a donor site exactly where the ref­er­ence has an ac­cep­tor, or vice versa, we count it as ½ FP and ½ FN. Ter­min­ology is from Bur­set M and Guigo R, "Eval­ua­tion of gene struc­ture pre­dic­tion pro­grams." Genomics, 1996 Jun 15;34(3):353-67.

The table below briefly summarizes the relative coverage of overlapping query and reference genes. The first column shows the amount of sequence (in bp and percent) in each gene set that is overlapped by sequence in the other gene set. The other columns show the overhanging sequence on both the 5′ and 3′ sides. As above, query genes that miss the reference, or reference genes missed by the query, are not considered in this table. A high degree of overlap can indicate a good match between the query and reference sets.

overlap overhang, 5′ overhang, 3′
query 1758502 bp / 42.4% 979406 bp / 23.6% 1411331 bp / 34.0%
reference 1758502 bp / 79.8% 238613 bp / 10.8% 206972 bp / 9.4%

Splice analysis

6201 splice agreements, 460 disagreements, 4652 ignored at ends.
perfect exon:exon/intron:intron matches: 1573/3061
0 query transcripts contained noncanonical splices.
145 reference transcripts contained noncanonical splices and were ignored in the following comparisions.

transcripts with no splice problems 2197 92.4%
  explainable by alternate splicing 26 1.1%
  clashes 154 6.5%

Transcripts that have a splice site dis­agree­ment with an over­lap­ping ref­er­ence gene are placed into two cate­gor­ies, de­pend­ing on the se­ver­ity of the clash. If all splice dis­agree­ments could be ex­plain­ed by well-known types of al­ter­nate splic­ing, we call the trans­cript a "pos­sible al­ter­nate splice." If the two trans­cripts cannot be rec­on­ciled in this way, we label the query a "clash." In par­ti­tio­ning splice dis­agree­ments into two cat­e­gor­ies, we are not as­sert­ing that 1.1% of this ge­nome shows al­ter­nate splic­ing. We do this as a form of triage: genes in the "clash" cat­e­go­ry are man­ually in­spect­ed before release.

in ref. in query
cassette exons 16 3
retained introns 2 8
early 3′ splices 12 12
late 5′ splices 4 6

cassette exon an exon that falls completely with an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction. retained intron an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns. early 3′ splices two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon. late 5′ splices two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron. Most ter­mi­nol­ogy is from Mat­lin AJ, et. al. Under­stand­ing al­ter­na­tive splic­ing: to­wards a cel­lular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.

Possible problems

AN1_TIGR_UPDATED 10701
short proteins < 50aa 177 177 ← not tallied in problems
shorter proteins < 30aa 85 85
very short proteins < 10aa 0 -
exon-less transcripts 0 -
initial exon ≤ 6bp 204 204
internal exon ≤ 6bp 85 85
terminal exon ≤ 6bp 72 72
intron ≥ 1000bp 5 5
coding length not mod 3 24 24
first codon not START 25 25
last codon not STOP 29 29
contains 1+ N in exon 0 -
non-canonical splicing 0 -
overlapping 35 70 ← in 35 clusters
spanning contigs 0 -
one or more problems 488 453