Neurospora crassa Automated Gene Calling
Outline
- Overview
- Gene Structure Prediction
- Gene Naming
- Gene Locus Numbers
- Correspondence of release 3 and release 7 gene calls.
- Structure Prediction Validation
Overview
This document describes some of the details of the methodology used to produce the automated gene calls for the genome of Neurospora crassa. Automated gene calls were produced in essentially a three step procedure:- Gene location and structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. This process is described in section Gene Structure Prediction.
- Gene "names" were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
- The newly created genes were compared against loci from the previous release to generate the correspondences. This process is described in section Gene Correspondence.
Gene Structure Prediction
Gene structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. Both FGENESH and FGENESH+ are gene prediction programs acquired from Softberry.com and GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.Both FGENESH and FGENESH+ utilize a statistical model of gene structure that require training on each organism for accurate prediction. FGENESH+ additionally combines a protein sequence with the statistical model to improve accuracy. We acquired these programs already trained by Softberry on Neurospora crassa sequences.
GENEWISE (as we ran it), splices and aligns a protein sequence with genomic sequence to predict a gene structure. Although GENEWISE does utilize some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein assuming that introns start at GT and end at AG most of the time and in some cases this results a full alignment of the protein to the genome. Since we are interested in predicting complete gene structures, we post-processed GENEWISE incomplete protein alignments by moving the first and last exon upstream or downstream to the nearest start and stop codons respectively. If a stop codon was encountered upstream of a gene before a start could be found, the gene call was not used.
An assessment of the accuracy of GENEWISE as well as FGENESH, and FGENESH+ is described below in section Structure Prediction Validation.
Briefly, these three gene callers were combined in the following manner:
- FGENESH was run on the entire genomic sequence to provide an initial set of predicted genes. Each FGENESH predicted was put into a set of EVIDENCE_GENES.
- The genome was also searched against the non-redundant protein database using BLASTX. In order to remove transitive annotation based on our previous release of neurospora calls, we rejected any blast results to neurospora sequences which were submitted after our annotation was submitted and exactly matched any of our previous genes.
- Regions of the genome with blastx homology spanning over 80% of a protein (when sub-alignments are stitched together in a consistent fashion) were passed to both FGENESH+ and GENEWISE to produce 2 gene predictions, if the protein had >80% amino acid identity to the translated genome (cumulative across sub-alignments).
- Overlapping FGENESH+ predictions and overlapping GENEWISE predictions were separately clustered and where overlaps occured, the call with the highest amount of homology support (as measured by the sequence similarity of the protein used to make the call) was selected, producing a set of non-overlapping FGENESH+ calls and a set of non-overlapping GENEWISE calls.
- Where FGENESH+ and/or GENEWISE calls overlapped, the GENEWISE call was selected if its query protein had >90% amino acid identity to the translated genome (cumulative across sub-alignments). Otherwise, the FGENESH+ call was selected. (See Structure Prediction Validation for the reasoning behind these cutoff values. This non-overlapping set of FGENESH+ and GENEWISE calls is refered to as the 'homology genes'.
- The final gene set was produced by taking the homology genes set and adding in any FGENESH calls which did have overlap with any homology genes. UTR was added to these genes by incorporating EST evidence which did not conflict with the gene structure. These genes were then passed to the next step of the gene calling for Gene Naming.
Gene Naming
Genes are assigned names VERY CONSERVATIVELY. Because this is a purely automated gene prediction process, we do not want to propogate mis-information by transfering unverified functional names for genes in one species to predicted genes in another species.We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene names, that make up 3 categories:
- NAME, or
hypothetical protein similar to NAME, or
conserved hypothetical proteinAssigned to gene predictions where there is excellent homology to an known NR protein. The criteria for this category are:
- Top BlastP hit to a known NR protein (complexity filtering off -F F, expect <= 1e-5), with
- >=80% identity and >= 80% coverage of both the query and subject sequence.
The exact name is assigned:- NAME if the homologous protein is from the curated SwissProt gene set (IE we trust the gene name), otherwise:
- conserved hypothetical protein if the homologous protein NAME contains a word in the set {hypothetical, homolog, probable, putative, similar to, predicted, unnamed, unknown} (IE we do not want to transfer suspect names), otherwise
- hypothetical protein similar to NAME
- Hypothetical protein
Assigned to gene predictions that show significant BlastP homology to a protein in NCBI's protein set NR or an EST alignment. The criteria for this category are:- BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), or
- EST hit (>=300nt, >=98%identity, >95% coverage) which overlaps gene
- Predicted protein
Assigned to gene predictions that do not have an EST alignment or show significant BlastP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BlastP analysis was performed on the gene set. The criteria for this category are:- No BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), and
- No EST hit (>=300nt, >=98%identity, >95% coverage) which overlaps gene
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form NCU##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encoding attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.With each new assembly, we do our best to map all genes from the previous assembly and thus preserve loci. Any loci that cannot be mapped will be retired. New genes will receive new loci. Each gene also has a version attribute (so loci are in fact displayed as NCU#####.version). When genes are mapped from one assembly to another or when we release a new set of gene calls, we will increment this version. All the loci in a particular release will have the same version number so that we can ensure consistency.
Version 1 of the annotations was released with sequence release 3. Version 2 of the annotations are being released with release 7 of the sequence. A locus number without a version corresponds to the most recent release of the sequence. E.g., NCU03456.1 refers to locus NCU03456 as annotated on sequence 3. NCU03456.2 and NCU03456 both refer to the same locus as annotated on sequence 7.
Correspondence of release 3 and release 7 gene calls.
To compare release 3 and release 7 gene calls, each of the 10082 gene calls from release 3 was uniquely aligned to the release 7 sequence. Consistency in gene order was verified between release 3 and release 7. The 12 cases where an ordering discrepency was found were manually examined and verified to all be caused by valid changes in the sequence. The resulting alignments were compared against the gene calls for release 7.- 10620 gene calls in release 7 vs. 10082 in release 3.
- 8072 genes correspond one-to-one between release 3 and release 7 and have identical coding sequence.
- 106 release 3 genes have been deleted in release 7. 3 deletes are based on homology evidence which did not pass our filters for release 7. The rest were predicted proteins based on ab-initio gene calls.
- 682 new genes have been added in the release 7 gene set.
- 303 of these gene calls are in or near gaps in the release 3 sequence
- 16% of these gene calls are based on protein homology, vs. 25% across the entire gene set.
- In 68 cases, two release 3 gene calls have been merged, such that they are both included in a single corresponding release 7 gene call.
- In 30 cases, one release 3 gene call has been split into multiple release 7 genes.
- 1874 release 7 gene calls correspond one-to-one with a release 3 gene, but have some difference in coding sequence and have been assigned to one of four categories:
- SEQUENCE_CHANGED - (612 genes) A change in the genome sequence of the loci or a neighboring region resulted in a different gene prediction.
- EXTENDED - (101 genes) The CDS for the release 7 gene is a superset of the corresponding release 3 CDS.
- TRUNCATED - (56 genes) The CDS for the release 7 gene is a subset of the corresponding release 3 CDS.
- EVIDENCE_CHANGED - (1105 genes) The homology evidence used as an input to fgenesh+ or genewise has changed, resulting in a changed gene model, or ab-initio gene calling predicted a slightly different gene structure.
Structure Prediction Validation
The available EST alignments were used to validate the gene structure predictions. To perform this comparison ESTs were first clustered by combining all overlapping ESTs into a cluster. Each EST cluster was then compared to any overlapping predicted genes. If the gene structure matched the alignment in the area of overlap, the cluster had no problems. If the alignment had an region that did not overlap any coding bases of the gene structure, it was considered a missing exon error. If a gap in the alignment fully contained an exon of the gene model, it was considered a wrong exon error. Partial overlaps were classified as splice junction error. In cases where multiple overlapping ESTs suggest different gene structures, the EST that most closely matched the gene structure was used.
The following table shows the accuracy of the release 3 gene calls and the release 7 gene calls. Roughly 25% of genes have some overlap with an EST cluster and 76% of those genes show no problems in both the set of release 3 gene calls and the set of release 7 gene calls.
| Release 3 | Release 7 | |
|---|---|---|
| Summary | ||
| # of genes | 10082 | 10620 |
| # of EST clusters | 6001 | 6008 |
| single exon | 3825 | 3826 |
| hitting genes on opposite strand ignored for stats, believed to be alignment/strand correction problems) | 1584 | 1591 |
| # of EST clusters minus ignored ones | 4417 | 4410 |
| # of genes hit by an EST cluster | 2579 | 2605 |
| # of EST clusters hitting multiple genes | 15 | 17 |
| # of genes hitting multiple EST clusters | 926 | 932 |
| EST Cluster | ||
| # of EST clusters with no problems | 2940 (67%) | 2940 (67%) |
| not counting missed EST clusters | 2940 (82%) | 2940 (82%) |
| # of EST clusters not hitting genes | 848 (19%) | 816 (19%) |
| # of EST clusters hit, with problems | 630 (18%) | 653 (18%) |
| Problems: | ||
| # of EST clusters with missing exons | 61 (2%) | 76 (2%) |
| # of EST clusters with wrong exons | 24 (1%) | 25 (1%) |
| # with splice junction problems | 590 (17%) | 620 (17%) |
| Predicted Genes | ||
| # of genes hit by an EST cluster | 2579 (26%) | 2605 (25%) |
| # of those genes with no problems | 1965 (76%) | 1970 (76%) |
| # of genes hit, with problems | 614 (24%) | 634 (24%) |
| Problems: | ||
| # of genes with missing exons | 60 (2%) | 76 (3%) |
| # of genes with wrong exons | 54 (2%) | 56 (2%) |
| # with splice junction problems | 558 (22%) | 581 (22%) |
The strategy for combining gene prediction programs to identify potential gene structures was based on an assessment of the performance of these programs on test set of 191 genomic sequences for Neurospora genes generously provided by Dr. Chuck Staben of the University of Kentucky. It is important to note, however, that some of the proteins in this test set were undoubtedly used in the training of FGENESH and FGENESH+ by Softberry. Thus the performance of these programs on the test set are possibly inflated relative to the performance on a random set of N. crassa genes. We hope to be able to perform cross-validation tests on these tools in the future to generate more accurate performance statistics. This is not an issue for GENEWISE as it required no training.
In assessing the performance of the gene callers we asked two questions:
- How well does FGENESH perform in predicting genes in the absence of homology?
- How well do FGENESH+ and GENEWISE perform when given protein sequences of varying amino acid identity to the actual N. crassa translated gene?
CC = (Tp*Tn - Fp*Fn) / SQRT( (Tp+Fp)*(Tn+Fn)*(Tp+Fn)*(Tn+Fp) )
where Tp, Tn, Fp, and Fn are the number of true positives, true negatives, false positives, and false negatives respectively, all defined at the nucleotide level relative to the trusted gene. A CC=1 represents a perfect prediction relative to the trusted gene.
The overall results or this analysis are shown in Figure 1. In this figure, each point represents a gene prediction. The X axis is the AA %ID sequence similarity of the protein used by FGENESH+ (in blue) or GENEWISE (in yellow) to produce the gene prediction. FGENESH (in pink) does not incorporate homology and so all its predictions are shown at the very left hand side of the graph
Figure 1. Accuracy gene predictions as a function of protein homology used for the prediction. See text for details.
As can be seen from the figure, FGENESH (the pink
points) produces relatively accurate predictions on this test set. This
can be seen more clearly in Figure 2 below. This figure shows a histogram
of Correlation Coefficients for all the genes predicted by FGENESH on the
test set of 191 genes. It must be stressed that this is likely to be
biased due to inclusion of some or perhaps most test sequences in the training
set for FGENESH.
Figure 2. Histogram of gene prediction correlation coefficients for all FGENESH predictions on the test set of sequences.
FGENESH+ and GENEWISE on the other hand show very poor performance when proteins with less than 80% AA identity to the translated genome sequence are used as a basis for the gene prediction. For proteins with > 80% AA identity, these gene callers do appear to allow an improvement in gene prediction accuracy as compared with FGENESH. This can bee seen more clearly in Figure 3 and 4 below.
In each of these figures, histograms of gene prediction correlation coefficients are shown for predictions based on proteins with >90% AA identity, between 80% and 90% AA identity, and between 70% and 80% protein identity.
In the case of GENEWISE and comparing with Figure 2, it can be seen that for proteins with >90% AA identity, GENEWISE performs very well and appears to offer significant improvement over FGENESH in prediction accuracy. For proteins with <90% AA identity, however, GENEWISE does not perform much better that FGENESH.
Figure 3. Histogram of GENEWISE performance broken down by homology range. The top histogram corresponds to predictions using proteins with between 70% and 80% AA identity, the middle histogram for proteins between 80% and 90%, and the bottom for proteins with >90% identity. The X axis in each pane is the correlation coefficient (CC) of the gene prediction measured against the trusted gene. The Y axis is he number of predictions at each CC value.
In the case of FGENESH+ and comparing with Figure 2 and Figure 3, it can be seen that for proteins with >90% AA identity, FGENESH+ performs slightly better than FGENESH but worse than GENEWISE. For proteins with between 80% and 90% AA identity however, FGENESH+ does outperform GENEWISE and appears to perform slightly better than FGENESH in that the fraction of poor predictions (CC<0.8) diminishes.
Figure 4. Histogram of FGENESH+ performance broken down by homology range. The top histogram corresponds to predictions using proteins with between 70% and 80% AA identity, the middle histogram for proteins between 80% and 90%, and the bottom for proteins with >90% identity. The X axis in each pane is the correlation coefficient (CC) of the gene prediction measured against the trusted gene. The Y axis is he number of predictions at each CC value.
These results suggest that for Neurospora, GENEWISE should be used when a protein with >90% AA identity is available, FGENESH+ should be used for proteins with between 80% and 90% AA identity, and FGENESH should be used in other cases. This is essentially the logic used by the Gene Structure Prediction system.
