Neurospora crassa Automated Gene Calling
Outline
- Overview
- Gene Structure Prediction
- Preliminary Gene Name Annotation
- Gene Locus Numbers
- Structure Prediction Validation
Overview
This document describes the methodology used to produce the automated gene calls for the genome of Neurospora crassa. Automated gene calls were produced in essentially a two-step procedure:
- Gene locations and structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. This process is described in section Gene Structure Prediction.
- Gene "names" were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Preliminary Gene Name Annotation.
Gene Structure Prediction
Gene structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. Both FGENESH and FGENESH+ are gene prediction programs acquired from Softberry.com, and GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.
Both FGENESH and FGENESH+ utilize a statistical model of gene structure that requires training on each organism for accurate prediction. FGENESH+ additionally combines a protein sequence with the statistical model to improve accuracy. We acquired these programs already trained by Softberry on Neurospora sequences. FGENESH was used by MIPS for their automated annotation of LGII and LGV.
GENEWISE (as we ran it) splices and aligns a protein sequence with genomic sequence to predict a gene structure. Although GENEWISE does utilize some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species-specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein, assuming that introns start at GT and end at AG most of the time, and in some cases this results in a full alignment of the protein to the genome. Since we are interested in predicting complete gene structures, we post-processed incomplete GENEWISE protein alignments by extending the first exon upstream to the nearest start codon and the last exon downstream to the first stop codon. If a stop codon was encountered upstream of a gene before a start codon could be found, the gene call was not used.
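The post-processing step described above can be sketched roughly as follows. This is a minimal illustration and not the code actually used: it assumes a forward-strand gene, 0-based half-open coordinates, alignment boundaries that fall on codon boundaries, ATG as the only start codon, and in-frame scanning one codon at a time. The function names are hypothetical. Rejecting a call when no stop codon is found downstream is also an assumption; the text only describes rejection on the upstream side.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: only ATG is treated as a start codon


def extend_to_start(genome, first_exon_start):
    """Scan upstream in frame from the start of the first exon.

    Returns the position of the nearest in-frame ATG, or None if a stop
    codon (or the beginning of the sequence) is reached first.
    """
    pos = first_exon_start
    while pos >= 0:
        codon = genome[pos:pos + 3]
        if codon == START_CODON:
            return pos
        if codon in STOP_CODONS:
            return None  # stop found before a start: the call is rejected
        pos -= 3
    return None


def extend_to_stop(genome, last_exon_end):
    """Scan downstream in frame from the (exclusive) end of the last exon.

    Returns the exclusive end coordinate including the first in-frame stop
    codon, or None if the sequence ends first (assumption: reject).
    """
    pos = last_exon_end
    while pos + 3 <= len(genome):
        if genome[pos:pos + 3] in STOP_CODONS:
            return pos + 3
        pos += 3
    return None


def complete_gene(genome, first_exon_start, last_exon_end):
    """Return extended (start, end) for a partial alignment, or None."""
    start = extend_to_start(genome, first_exon_start)
    if start is None:
        return None
    end = extend_to_stop(genome, last_exon_end)
    if end is None:
        return None
    return start, end
```

Only the outer boundaries of the first and last exons move in this sketch; any introns between them are left exactly as GENEWISE aligned them.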
An assessment of the accuracy of GENEWISE, FGENESH, and FGENESH+ is described below in section Structure Prediction Validation.
Briefly, these three gene callers were combined in the following manner:
- FGENESH was run on the entire genomic sequence to provide an initial set of predicted genes. Each FGENESH prediction was put into a set of EVIDENCE_GENES.
- The genome was also searched against the non-redundant protein database using BLASTX.
- Regions of the genome with blastx homology spanning over 80% of a protein (when sub-alignments are stitched together in a consistent fashion) were considered "Homologous Gene Regions" (HGRs).
- HGRs were clustered into groups of HGRs that all implicated the same gene structure (most often representing groups of essentially orthologous proteins).
- For each cluster of HGRs, the protein showing the most sequence similarity to the genome was passed to both FGENESH+ and GENEWISE to produce two gene predictions, provided the protein had >80% amino acid identity to the translated genome (cumulative across sub-alignments).
- If the protein used in the previous step had >90% amino acid identity to the translated genome (cumulative across sub-alignments), then the GENEWISE call was favored over the FGENESH+ call, and was used as the EVIDENCE_GENE for the HGR (see Structure Prediction Validation for the reason why) and added to the set of EVIDENCE_GENES.
- If the protein used had >80% but less than 90% amino acid identity to the translated genome (cumulative across sub-alignments), then the FGENESH+ call was favored over the GENEWISE call, and was used as the EVIDENCE_GENE for the HGR (see Structure Prediction Validation for the reason why) and added to the set of EVIDENCE_GENES.
- When EVIDENCE_GENES overlapped in their exons, the EVIDENCE_GENE with the least amount of homology support (as measured by the sequence similarity of the protein used to make the call or zero for FGENESH calls) was removed from the set of EVIDENCE_GENES.
- All remaining EVIDENCE_GENES were then called as our official ANNOTATED_GENES and passed to the next step of gene calling for Preliminary Gene Name Annotation.
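The selection and overlap-resolution logic in the list above can be illustrated with a short sketch. It is not the production pipeline: the class and function names are hypothetical, exon coordinates are assumed to be half-open intervals on one strand, and overlaps are resolved greedily by keeping the best-supported call first, which is only one possible reading of "the EVIDENCE_GENE with the least amount of homology support was removed". The treatment of a protein at exactly 90% identity (given here to FGENESH+) is likewise an assumption, since the text does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class EvidenceGene:
    name: str
    caller: str        # "FGENESH", "FGENESH+" or "GENEWISE"
    exons: list        # list of (start, end) half-open intervals
    identity: float    # % AA identity of the protein used; 0.0 for FGENESH


def choose_homology_call(identity, genewise_call, fgenesh_plus_call):
    """Pick the homology-based call for one HGR cluster by % identity."""
    if identity > 90:
        return genewise_call
    if identity > 80:
        return fgenesh_plus_call
    return None  # no homology-based call; FGENESH alone covers the region


def exons_overlap(a, b):
    """True if any exon of gene a overlaps any exon of gene b."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def resolve_overlaps(genes):
    """Keep the best-supported gene wherever exons overlap (greedy)."""
    kept = []
    # sorted() is stable, so ties keep their input order (an assumption)
    for gene in sorted(genes, key=lambda g: g.identity, reverse=True):
        if not any(exons_overlap(gene, k) for k in kept):
            kept.append(gene)
    return kept
```

For example, an FGENESH call whose exon overlaps a GENEWISE call built from a 95% identical protein would be dropped, while a non-overlapping FGENESH call elsewhere would survive as an ANNOTATED_GENE.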
Preliminary Gene Name Annotation
Given a predicted gene structure, gene names (actually the names of the corresponding proteins) were assigned according to the following rules (in order of precedence):
- Genes predicted using FGENESH+ or GENEWISE using a Neurospora protein from Swissprot and with >90% AA identity to the translated genome sequence inherited the name of the Swissprot protein (e.g. CLOCK-CONTROLLED PROTEIN 8).
- Genes predicted using FGENESH+ or GENEWISE using a protein from the MIPS manually curated protein set for LGII and LGV and with >90% AA identity to the translated genome sequence inherited the name of the MIPS protein along with the tag "[MIPS]" (e.g. probable beta-succinyl CoA synthetase precursor [MIPS]), unless the MIPS protein was annotated as 'putative protein', 'hypothetical protein', or 'predicted protein'.
- Genes predicted using FGENESH+ or GENEWISE using a protein from Swissprot and with >70% AA identity to the translated genome sequence inherited the name of the Swissprot protein (e.g. HOMOSERINE O-ACETYLTRANSFERASE (HOMOSERINE O-TRANS-ACETYLASE)). Note: we will append a tag to these gene names in the future to differentiate them from the first rule.
- Other genes predicted using FGENESH+ or GENEWISE are called 'hypothetical protein' followed by the name of the protein used in parentheses (e.g. hypothetical protein ( (AY029769) protein kinase 1 [Cryphonectria parasitica] ).
- Genes predicted based on FGENESH calls with overlapping blastx hits (but not with trusted homology) were called 'hypothetical protein'.
- Genes predicted based on FGENESH calls with no overlapping blastx hits were called 'predicted protein'.
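The precedence of the naming rules above can be expressed as a simple cascade. The sketch below is illustrative only: the function name, argument names, and source labels ("Swissprot", "MIPS") are hypothetical, and the fall-through of a MIPS protein with an uninformative name to the later rules is an assumption, because the text does not say what name such a gene receives.

```python
# MIPS annotations that are not inherited as gene names
UNINFORMATIVE = {"putative protein", "hypothetical protein", "predicted protein"}


def assign_name(caller, source=None, protein_name=None,
                is_neurospora=False, identity=0.0, has_blastx_hit=False):
    """Return a preliminary gene name, checking the rules in order."""
    if caller in ("FGENESH+", "GENEWISE"):
        # Rule 1: Neurospora Swissprot protein, >90% AA identity
        if source == "Swissprot" and is_neurospora and identity > 90:
            return protein_name
        # Rule 2: MIPS curated protein, >90% AA identity, informative name
        if (source == "MIPS" and identity > 90
                and protein_name.lower() not in UNINFORMATIVE):
            return f"{protein_name} [MIPS]"
        # Rule 3: any Swissprot protein, >70% AA identity
        if source == "Swissprot" and identity > 70:
            return protein_name
        # Rule 4: any other homology-based call
        # (assumption: uninformative MIPS names also land here)
        return f"hypothetical protein ({protein_name})"
    # Rules 5 and 6: FGENESH calls, with or without overlapping blastx hits
    if has_blastx_hit:
        return "hypothetical protein"
    return "predicted protein"
```

Because the checks run top to bottom and return on the first match, a gene that satisfies several rules always receives the name from the highest-precedence one.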
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form NCU##### that should be considered the only guaranteed way to identify a gene uniquely and positively. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that encoding attributes of an object in its identifier is a bad idea; position is an attribute of a gene that can be retrieved using the locus.
With each new assembly, we do our best to map all genes from the previous assembly and thus preserve loci. Any loci that cannot be mapped will be retired. New genes will receive new loci. Each gene also has a version attribute (so loci are in fact displayed as NCU#####.version). When genes are mapped from one assembly to another or if a gene call is altered, we will increment this version. Thus a particular NCU#####.version will refer to a particular instantiation and annotation of the locus NCU#####.
The U in NCU##### means unfinished. Once we have finished the Neurospora genome, we will map all NCU##### onto NC##### (no U) locus numbers and also renumber so that loci at the time of the release are sequential on a given chromosome (for convenience only - we do not guarantee this will always be a property of loci). We will publish the mapping between NCU numbers and NC numbers at this time.
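For readers who handle these identifiers in scripts, a minimal parsing sketch follows. It assumes that each # stands for a decimal digit, that there are exactly five of them, and that the version is a non-negative integer; none of this is stated explicitly above, and the function name and the example identifier are hypothetical.

```python
import re

# Assumption: "NCU" + five digits, optionally followed by ".<integer version>"
LOCUS_RE = re.compile(r"^(NCU\d{5})(?:\.(\d+))?$")


def parse_locus(identifier):
    """Split an identifier into (locus, version); version may be None."""
    match = LOCUS_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a locus identifier: {identifier!r}")
    locus, version = match.group(1), match.group(2)
    return locus, (int(version) if version is not None else None)
```

Keeping the locus and the version separate lets a script follow one gene across assemblies by its locus alone while still distinguishing individual annotations by version.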
Structure Prediction Validation
The strategy for combining gene prediction programs to identify potential gene structures was based on an assessment of the performance of these programs on a test set of 191 genomic sequences for Neurospora genes generously provided by Dr. Chuck Staben of the University of Kentucky. It is important to note, however, that some of the proteins in this test set were undoubtedly used in the training of FGENESH and FGENESH+ by Softberry. Thus the performance of these programs on the test set is possibly inflated relative to their performance on a random set of N. crassa genes. We hope to be able to perform cross-validation tests on these tools in the future to generate more accurate performance statistics. This is not an issue for GENEWISE as it required no training.
In assessing the performance of the gene callers we asked two questions:
- How well does FGENESH perform in predicting genes in the absence of homology?
- How well do FGENESH+ and GENEWISE perform when given protein sequences of varying amino acid identity to the actual N. crassa translated gene?
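To measure the accuracy of each prediction, we computed the correlation coefficient (CC) between the predicted gene and the trusted gene:
CC = (Tp*Tn - Fp*Fn) / SQRT( (Tp+Fp)*(Tn+Fn)*(Tp+Fn)*(Tn+Fp) )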
where Tp, Tn, Fp, and Fn are the number of true positives, true negatives, false positives, and false negatives respectively, all defined at the nucleotide level relative to the trusted gene. A CC=1 represents a perfect prediction relative to the trusted gene.
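As a worked illustration, the correlation coefficient defined above can be computed from exon coordinates as follows. This is a sketch under stated assumptions: exons are half-open intervals within a test sequence of known length, each nucleotide is classified as coding or non-coding, and a zero denominator is reported as 0.0 (a convention chosen here, not one taken from the text). The function names are hypothetical.

```python
import math


def exons_to_positions(exons):
    """Expand (start, end) half-open exon intervals into a set of positions."""
    return {p for start, end in exons for p in range(start, end)}


def correlation_coefficient(predicted_exons, trusted_exons, seq_length):
    """Nucleotide-level CC of a predicted gene against the trusted gene."""
    predicted = exons_to_positions(predicted_exons)
    trusted = exons_to_positions(trusted_exons)
    tp = len(predicted & trusted)      # coding in both
    fp = len(predicted - trusted)      # predicted coding, actually non-coding
    fn = len(trusted - predicted)      # coding nucleotides that were missed
    tn = seq_length - tp - fp - fn     # non-coding in both
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    if denom == 0:
        return 0.0  # assumption: undefined CC reported as 0.0
    return (tp * tn - fp * fn) / denom
```

With a 100-nucleotide sequence and trusted exons at (10, 20) and (30, 40), a prediction that recovers only the first exon gives Tp=10, Fp=0, Fn=10, Tn=80, so CC = 800 / 1200, or about 0.67, while a prediction matching both exons gives CC = 1.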
The overall results of this analysis are shown in Figure 1. In this figure, each point represents a gene prediction. The X axis is the AA %ID sequence similarity of the protein used by FGENESH+ (in blue) or GENEWISE (in yellow) to produce the gene prediction. FGENESH (in pink) does not incorporate homology and so all its predictions are shown at the far left of the graph.
Figure 1. Accuracy of gene predictions as a function of the homology of the protein used for the prediction. See text for details.
As can be seen from the figure, FGENESH (the pink points) produces relatively accurate predictions on this test set. This can be seen more clearly in Figure 2 below. This figure shows a histogram of Correlation Coefficients for all the genes predicted by FGENESH on the test set of 191 genes. It must be stressed that this is likely to be biased due to inclusion of some or perhaps most test sequences in the training set for FGENESH.
Figure 2. Histogram of gene prediction correlation coefficients for all FGENESH predictions on the test set of sequences.
FGENESH+ and GENEWISE, on the other hand, show very poor performance when proteins with less than 80% AA identity to the translated genome sequence are used as a basis for the gene prediction. For proteins with >80% AA identity, these gene callers do appear to allow an improvement in gene prediction accuracy as compared with FGENESH. This can be seen more clearly in Figures 3 and 4 below.
In each of these figures, histograms of gene prediction correlation coefficients are shown for predictions based on proteins with >90% AA identity, between 80% and 90% AA identity, and between 70% and 80% AA identity.
In the case of GENEWISE and comparing with Figure 2, it can be seen that for proteins with >90% AA identity, GENEWISE performs very well and appears to offer significant improvement over FGENESH in prediction accuracy. For proteins with <90% AA identity, however, GENEWISE does not perform much better than FGENESH.
Figure 3. Histogram of GENEWISE performance broken down by homology range. The top histogram corresponds to predictions using proteins with between 70% and 80% AA identity, the middle histogram for proteins between 80% and 90%, and the bottom for proteins with >90% identity. The X axis in each pane is the correlation coefficient (CC) of the gene prediction measured against the trusted gene. The Y axis is the number of predictions at each CC value.
In the case of FGENESH+ and comparing with Figure 2 and Figure 3, it can be seen that for proteins with >90% AA identity, FGENESH+ performs slightly better than FGENESH but worse than GENEWISE. For proteins with between 80% and 90% AA identity, however, FGENESH+ does outperform GENEWISE and appears to perform slightly better than FGENESH in that the fraction of poor predictions (CC<0.8) diminishes.
Figure 4. Histogram of FGENESH+ performance broken down by homology range. The top histogram corresponds to predictions using proteins with between 70% and 80% AA identity, the middle histogram for proteins between 80% and 90%, and the bottom for proteins with >90% identity. The X axis in each pane is the correlation coefficient (CC) of the gene prediction measured against the trusted gene. The Y axis is the number of predictions at each CC value.
These results suggest that for Neurospora, GENEWISE should be used when a protein with >90% AA identity is available, FGENESH+ should be used for proteins with between 80% and 90% AA identity, and FGENESH should be used in other cases. This is essentially the logic used by the Gene Structure Prediction system.
