Gene Finding Methods
Outline
Overview
This document describes some of the details of the methodology used to produce the automated gene calls for the genome of Stagonospora nodorum.
- Gene location and structures were predicted using a combination of gene prediction algorithms. This process is described in section Gene Structure Prediction.
- Gene "names" were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
- An estimate of gene prediction accuracy was created by comparing the gene predictions against aligned ESTs from Stagonospora nodorum. This process is described in section Structure Prediction Validation.
Gene Structure Prediction
Gene structures were predicted using a combination of FGENESH, FGENESH+, GENEID, and GENEWISE. FGENESH and FGENESH+ are gene prediction programs acquired from Softberry.com, GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center. GENEID was developed by Roderic Guigó, and is available at the GeneId web site.
Both FGENESH and FGENESH+ utilize a statistical model of gene structure that require training on each organism for accurate prediction. FGENESH+ additionally combines a protein sequence with the statistical model to improve accuracy. We acquired these programs already trained by Softberry on Stagonospora nodorum sequences.
GENEWISE (as we ran it), splices and aligns a protein sequence with genomic sequence to predict a gene structure. Although GENEWISE does utilize some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein assuming that introns start at GT and end at AG most of the time and in some cases this results a full alignment of the protein to the genome. Since we are interested in predicting complete gene structures, we post-processed GENEWISE incomplete protein alignments by moving the first and last exon upstream or downstream to the nearest start and stop codons respectively. If a stop codon was encountered upstream of a gene before a start could be found, the gene call was not used.
- FGENESH was run on the entire genomic sequence to provide an initial set of predicted genes.
- GENEID was run twice on the entire genomic sequence, using 4th order and 5th order markov models. GeneId was trained by the author using 317 SN1 transcripts mannually curated at the Broad Institute.
- The genome was also searched against the non-redundant protein database using BLASTX.
- Regions of the genome with blastx homology spanning over 80% of a protein (when sub-alignments are stitched together in a consistent fashion) were passed to GENEWISE and FGENESH+ to produce gene predictions if the protein had >80% amino acid identity to the translated genome (cumulative across sub-alignments).
- All predicted genes were clustered into sets of overlapping genes.
- For each cluster, if there were high-quality blast hits to proteins from three or more different taxIds, the average length of the three proteins with the highest average identity from three different taxIds was calculated. Genes were then ranked within the cluster by how close their length was to this average length. Ties were broken by the algorithm described in the next step. For purposes of this step, high-quality is defined as a protein whose blast hits at that locus had a query coverage of >80% and an average identity of >60%.
- Otherwise, genes were ranked by their origin:
- Genewise
- Fgenesh
- Fgenesh+
- Geneid
- For each cluster, proceeding in rank order, each input gene was selected as a valid gene prediction as long as it did not overlap with any higher ranked gene.
- Finally, we included 320 genes manually annotated at the Broad Institute on SN1. If any predicted gene overlapped with a manually annotated gene, we removed the automated gene prediction and replaced it with the manually annotated gene.
- Additional notes
- Clusters with only Geneid genes did not get selected as genes because Geneid was shown to predict too many genes.
The gene calling algorithm:
Gene Naming
Genes were named by the following algorithm:
- If a gene was predicted by Genewise, all NR proteins which had excellent homology and whose blast hits overlap the predicted gene (requirement: >80% query coverage and >80% average identity) were subjected to the following naming algorithm. Afterwards, the best resulting name is selected.
- The original NAME of the protein is used if the homologous protein is from the curated SwissProt gene set (IE we trust the gene name), otherwise:
- conserved hypothetical protein if the homologous protein NAME contains a word in the set {hypothetical, homolog, probable, putative, similar to, predicted, unnamed, unknown} (IE we do not want to transfer suspect names), otherwise:
- hypothetical protein similar to NAME
- Otherwise:
- The name hypothetical protein is assigned to gene predictions that show significant BlastP homology to a protein in NCBI's protein set NR. The criteria for this category are:
- Top BlastP hit to a known NR protein (complexity filtering off -F F, expect <= 1e-5)
- Otherwise, the name predicted protein is assigned.
- The name hypothetical protein is assigned to gene predictions that show significant BlastP homology to a protein in NCBI's protein set NR. The criteria for this category are:
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form SNU##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encoding attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
With each new assembly, we do our best to map all genes from the previous assembly and thus preserve loci. Any loci that cannot be mapped will be retired. New genes will receive new loci. Each gene also has a version attribute (so loci are in fact displayed as SNU#####.version). When genes are mapped from one assembly to another or when we release a new set of gene calls, we will increment this version. All the loci in a particular release will have the same version number so that we can ensure consistency.
Structure Prediction Validation
The available EST alignments were used to validate the gene structure predictions. To perform this comparison ESTs were first clustered by combining all overlapping ESTs into a cluster. Each EST cluster was then compared to any overlapping predicted genes. If the gene structure matched the alignment in the area of overlap, the cluster had no problems. If the alignment had a region that did not overlap any coding bases of the gene structure, it was considered a missing exon error. If a gap in the alignment fully contained an exon of the gene model, it was considered a wrong exon error. Partial overlaps were classified as splice junction error. In cases where multiple overlapping ESTs suggest different gene structures, the EST that most closely matched the gene structure was used.
The following table shows the accuracy of the gene calls:
| Summary | ||
| # of genes | 16597 | |
| # of EST clusters | 1155 | |
| single exon | 457 | |
| hitting genes on opposite strand ignored for stats, believed to be alignment/strand correction problems) | 308 | |
| # of EST clusters minus ignored ones | 846 | |
| # of genes hit by an EST cluster | 738 | |
| # of EST clusters hitting multiple genes | 18 | |
| # of genes hitting multiple EST clusters | 20 | |
| EST Cluster | ||
| # of EST clusters with no problems | 534 | 0.63 |
| not counting missed EST clusters | 534 | 0.72 |
| # of EST clusters not hitting genes | 107 | 0.13 |
| # of EST clusters hit, with problems | 204 | 0.28 |
| Problems: | ||
| # of EST clusters with missing exons | 20 | 0.03 |
| # of EST clusters with wrong exons | 8 | 0.01 |
| # with splice junction problems | 199 | 0.27 |
| Predicted Genes | ||
| # of genes hit by an EST cluster | 738 | 0.04 |
| # of those genes with no problems | 533 | 0.72 |
| # of genes hit, with problems | 204 | 0.28 |
| Problems: | ||
| # of genes with missing exons | 20 | 0.03 |
| # of genes with wrong exons | 8 | 0.01 |
| # with splice junction problems | 197 | 0.27 |
