Neurospora crassa Automated Gene Calling

Outline

Overview

This document describes some of the details of the methodology used to produce the automated gene calls for the genome of Neurospora crassa. Automated gene calls were produced in essentially a three step procedure:
  • Gene location and structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. This process is described in section Gene Structure Prediction.
  • Gene "names" were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
  • The newly created genes were compared against loci from the previous release to generate the correspondences. This process is described in section Gene Correspondence.

Gene Structure Prediction

Gene structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. Both FGENESH and FGENESH+ are gene prediction programs acquired from Softberry.com and GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.

Both FGENESH and FGENESH+ utilize a statistical model of gene structure that require training on each organism for accurate prediction. FGENESH+ additionally combines a protein sequence with the statistical model to improve accuracy. We acquired these programs already trained by Softberry on Neurospora crassa sequences.

GENEWISE (as we ran it), splices and aligns a protein sequence with genomic sequence to predict a gene structure. Although GENEWISE does utilize some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein assuming that introns start at GT and end at AG most of the time and in some cases this results a full alignment of the protein to the genome. Since we are interested in predicting complete gene structures, we post-processed GENEWISE incomplete protein alignments by moving the first and last exon upstream or downstream to the nearest start and stop codons respectively. If a stop codon was encountered upstream of a gene before a start could be found, the gene call was not used.

An assessment of the accuracy of GENEWISE as well as FGENESH, and FGENESH+ is described below in section Structure Prediction Validation.

Briefly, these three gene callers were combined in the following manner:

  1. FGENESH was run on the entire genomic sequence to provide an initial set of predicted genes. Each FGENESH predicted was put into a set of EVIDENCE_GENES.
  2. The genome was also searched against the non-redundant protein database using BLASTX. In order to remove transitive annotation based on our previous release of neurospora calls, we rejected any blast results to neurospora sequences which were submitted after our annotation was submitted and exactly matched any of our previous genes.
  3. Regions of the genome with blastx homology spanning over 80% of a protein (when sub-alignments are stitched together in a consistent fashion) were passed to both FGENESH+ and GENEWISE to produce 2 gene predictions, if the protein had >80% amino acid identity to the translated genome (cumulative across sub-alignments).
  4. Overlapping FGENESH+ predictions and overlapping GENEWISE predictions were separately clustered and where overlaps occured, the call with the highest amount of homology support (as measured by the sequence similarity of the protein used to make the call) was selected, producing a set of non-overlapping FGENESH+ calls and a set of non-overlapping GENEWISE calls.
  5. Where FGENESH+ and/or GENEWISE calls overlapped, the GENEWISE call was selected if its query protein had >90% amino acid identity to the translated genome (cumulative across sub-alignments). Otherwise, the FGENESH+ call was selected. (See Structure Prediction Validation for the reasoning behind these cutoff values. This non-overlapping set of FGENESH+ and GENEWISE calls is refered to as the 'homology genes'.
  6. The final gene set was produced by taking the homology genes set and adding in any FGENESH calls which did have overlap with any homology genes. UTR was added to these genes by incorporating EST evidence which did not conflict with the gene structure. These genes were then passed to the next step of the gene calling for Gene Naming.

Gene Naming

Genes are assigned names VERY CONSERVATIVELY. Because this is a purely automated gene prediction process, we do not want to propogate mis-information by transfering unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently 5 types of gene names, that make up 3 categories:

  1. NAME, or
    hypothetical protein similar to NAME, or
    conserved hypothetical protein

    Assigned to gene predictions where there is excellent homology to an known NR protein. The criteria for this category are:

    • Top BlastP hit to a known NR protein (complexity filtering off -F F, expect <= 1e-5), with
    • >=80% identity and >= 80% coverage of both the query and subject sequence.

    The exact name is assigned:
    • NAME if the homologous protein is from the curated SwissProt gene set (IE we trust the gene name), otherwise:
    • conserved hypothetical protein if the homologous protein NAME contains a word in the set {hypothetical, homolog, probable, putative, similar to, predicted, unnamed, unknown} (IE we do not want to transfer suspect names), otherwise
    • hypothetical protein similar to NAME
    In all cases we take the NR protein name and try to filter out the species name, GIs, and extra whitespace

  2. Hypothetical protein
    Assigned to gene predictions that show significant BlastP homology to a protein in NCBI's protein set NR or an EST alignment. The criteria for this category are:
    • BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), or
    • EST hit (>=300nt, >=98%identity, >95% coverage) which overlaps gene

  3. Predicted protein
    Assigned to gene predictions that do not have an EST alignment or show significant BlastP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BlastP analysis was performed on the gene set. The criteria for this category are:
    • No BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), and
    • No EST hit (>=300nt, >=98%identity, >95% coverage) which overlaps gene

Gene Locus Numbers

Every annotated gene is given a Locus Number of the form NCU##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encoding attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.

With each new assembly, we do our best to map all genes from the previous assembly and thus preserve loci. Any loci that cannot be mapped will be retired. New genes will receive new loci. Each gene also has a version attribute (so loci are in fact displayed as NCU#####.version). When genes are mapped from one assembly to another or when we release a new set of gene calls, we will increment this version. All the loci in a particular release will have the same version number so that we can ensure consistency.

Version 1 of the annotations was released with sequence release 3. Version 2 of the annotations are being released with release 7 of the sequence. A locus number without a version corresponds to the most recent release of the sequence. E.g., NCU03456.1 refers to locus NCU03456 as annotated on sequence 3. NCU03456.2 and NCU03456 both refer to the same locus as annotated on sequence 7.

Correspondence of release 3 and release 7 gene calls.

To compare release 3 and release 7 gene calls, each of the 10082 gene calls from release 3 was uniquely aligned to the release 7 sequence. Consistency in gene order was verified between release 3 and release 7. The 12 cases where an ordering discrepency was found were manually examined and verified to all be caused by valid changes in the sequence. The resulting alignments were compared against the gene calls for release 7.
  • 10620 gene calls in release 7 vs. 10082 in release 3.
  • 8072 genes correspond one-to-one between release 3 and release 7 and have identical coding sequence.
  • 106 release 3 genes have been deleted in release 7. 3 deletes are based on homology evidence which did not pass our filters for release 7. The rest were predicted proteins based on ab-initio gene calls.
  • 682 new genes have been added in the release 7 gene set.
    • 303 of these gene calls are in or near gaps in the release 3 sequence
    • 16% of these gene calls are based on protein homology, vs. 25% across the entire gene set.
  • In 68 cases, two release 3 gene calls have been merged, such that they are both included in a single corresponding release 7 gene call.
  • In 30 cases, one release 3 gene call has been split into multiple release 7 genes.
  • 1874 release 7 gene calls correspond one-to-one with a release 3 gene, but have some difference in coding sequence and have been assigned to one of four categories:
    • SEQUENCE_CHANGED - (612 genes) A change in the genome sequence of the loci or a neighboring region resulted in a different gene prediction.
    • EXTENDED - (101 genes) The CDS for the release 7 gene is a superset of the corresponding release 3 CDS.
    • TRUNCATED - (56 genes) The CDS for the release 7 gene is a subset of the corresponding release 3 CDS.
    • EVIDENCE_CHANGED - (1105 genes) The homology evidence used as an input to fgenesh+ or genewise has changed, resulting in a changed gene model, or ab-initio gene calling predicted a slightly different gene structure.

Structure Prediction Validation

The available EST alignments were used to validate the gene structure predictions. To perform this comparison ESTs were first clustered by combining all overlapping ESTs into a cluster. Each EST cluster was then compared to any overlapping predicted genes. If the gene structure matched the alignment in the area of overlap, the cluster had no problems. If the alignment had an region that did not overlap any coding bases of the gene structure, it was considered a missing exon error. If a gap in the alignment fully contained an exon of the gene model, it was considered a wrong exon error. Partial overlaps were classified as splice junction error. In cases where multiple overlapping ESTs suggest different gene structures, the EST that most closely matched the gene structure was used.

The following table shows the accuracy of the release 3 gene calls and the release 7 gene calls. Roughly 25% of genes have some overlap with an EST cluster and 76% of those genes show no problems in both the set of release 3 gene calls and the set of release 7 gene calls.

 Release 3Release 7
Summary
    # of genes1008210620
    # of EST clusters60016008
        single exon38253826
        hitting genes on opposite strand
ignored for stats, believed to be alignment/strand correction problems)
15841591
# of EST clusters minus ignored ones44174410
    # of genes hit by an EST cluster25792605
    # of EST clusters hitting multiple genes1517
    # of genes hitting multiple EST clusters926932
EST Cluster
    # of EST clusters with no problems2940 (67%)2940 (67%)
        not counting missed EST clusters2940 (82%)2940 (82%)
    # of EST clusters not hitting genes848 (19%)816 (19%)
    # of EST clusters hit, with problems630 (18%)653 (18%)
    Problems:
        # of EST clusters with missing exons61 (2%)76 (2%)
        # of EST clusters with wrong exons24 (1%)25 (1%)
        # with splice junction problems590 (17%)620 (17%)
Predicted Genes
    # of genes hit by an EST cluster2579 (26%)2605 (25%)
    # of those genes with no problems1965 (76%)1970 (76%)
    # of genes hit, with problems614 (24%)634 (24%)
    Problems:
        # of genes with missing exons60 (2%)76 (3%)
        # of genes with wrong exons54 (2%)56 (2%)
        # with splice junction problems558 (22%)581 (22%)

The strategy for combining gene prediction programs to identify potential gene structures was based on an assessment of the performance of these programs on test set of 191 genomic sequences for Neurospora genes generously provided by Dr. Chuck Staben of the University of Kentucky. It is important to note, however, that some of the proteins in this test set were undoubtedly used in the training of FGENESH and FGENESH+ by Softberry. Thus the performance of these programs on the test set are possibly inflated relative to the performance on a random set of N. crassa genes. We hope to be able to perform cross-validation tests on these tools in the future to generate more accurate performance statistics. This is not an issue for GENEWISE as it required no training.

In assessing the performance of the gene callers we asked two questions:

  1. How well does FGENESH perform in predicting genes in the absence of homology?
  2. How well do FGENESH+ and GENEWISE perform when given protein sequences of varying amino acid identity to the actual N. crassa translated gene?
A number of metrics were used to compare predicted gene structures against trusted gene structures. These are described in detail in (Guigo, R., P. Agarwal, J.F. Abril, M. Burset, and J.W. Fickett. 2000. An assessment of gene prediction accuracy in large DNA sequences. Genome Res 10: 1631-1642). We will present the results here for the correlation coefficient only, defined as

CC = (Tp*Tn - Fp*Fn) / SQRT( (Tp+Fp)*(Tn+Fn)*(Tp+Fn)*(Tn+Fp) )

where Tp, Tn, Fp, and Fn are the number of true positives, true negatives, false positives, and false negatives respectively, all defined at the nucleotide level relative to the trusted gene. A CC=1 represents a perfect prediction relative to the trusted gene.

The overall results or this analysis are shown in Figure 1. In this figure, each point represents a gene prediction. The X axis is the AA %ID sequence similarity of the protein used by FGENESH+ (in blue) or GENEWISE (in yellow) to produce the gene prediction. FGENESH (in pink) does not incorporate homology and so all its predictions are shown at the very left hand side of the graph

Figure 1. Accuracy gene predictions as a function of protein homology used for the prediction. See text for details.

As can be seen from the figure, FGENESH (the pink points) produces relatively accurate predictions on this test set. This can be seen more clearly in Figure 2 below. This figure shows a histogram of Correlation Coefficients for all the genes predicted by FGENESH on the test set of 191 genes. It must be stressed that this is likely to be biased due to inclusion of some or perhaps most test sequences in the training set for FGENESH.
 

Figure 2. Histogram of gene prediction correlation coefficients for all FGENESH predictions on the test set of sequences.

FGENESH+ and GENEWISE on the other hand show very poor performance when proteins with less than 80% AA identity to the translated genome sequence are used as a basis for the gene prediction. For proteins with > 80% AA identity, these gene callers do appear to allow an improvement in gene prediction accuracy as compared with FGENESH. This can bee seen more clearly in Figure 3 and 4 below.

In each of these figures, histograms of gene prediction correlation coefficients are shown for predictions based on proteins with >90% AA identity, between 80% and 90% AA identity, and between 70% and 80% protein identity.

In the case of GENEWISE and comparing with Figure 2, it can be seen that for proteins with >90% AA identity, GENEWISE performs very well and appears to offer significant improvement over FGENESH in prediction accuracy. For proteins with <90% AA identity, however, GENEWISE does not perform much better that FGENESH.

Figure 3. Histogram of GENEWISE performance broken down by homology range. The top histogram corresponds to predictions using proteins with between 70% and 80% AA identity, the middle histogram for proteins between 80% and 90%, and the bottom for proteins with >90% identity. The X axis in each pane is the correlation coefficient (CC) of the gene prediction measured against the trusted gene. The Y axis is he number of predictions at each CC value.

In the case of FGENESH+ and comparing with Figure 2 and Figure 3, it can be seen that for proteins with >90% AA identity, FGENESH+ performs slightly better than FGENESH but worse than GENEWISE. For proteins with between 80% and 90% AA identity however, FGENESH+ does outperform GENEWISE and appears to perform slightly better than FGENESH in that the fraction of poor predictions (CC<0.8) diminishes.

Figure 4. Histogram of FGENESH+ performance broken down by homology range. The top histogram corresponds to predictions using proteins with between 70% and 80% AA identity, the middle histogram for proteins between 80% and 90%, and the bottom for proteins with >90% identity. The X axis in each pane is the correlation coefficient (CC) of the gene prediction measured against the trusted gene. The Y axis is he number of predictions at each CC value.

These results suggest that for Neurospora, GENEWISE should be used when a protein with >90% AA identity is available, FGENESH+ should be used for proteins with between 80% and 90% AA identity, and FGENESH should be used in other cases. This is essentially the logic used by the Gene Structure Prediction system.