Gene Finding
Overview
This document explains how automated gene calls were produced for the Saccharomyces cerevisiae RM11-1a genome. The annotation was created in two steps:
- Final gene structures were predicted using a combination of mapped ORFs from SGD predictions, Glimmer and GeneMark. This process is described in Gene Structure Prediction.
- We relied on the SGD assigned gene names for naming most of the genes in our annotation. A few additional genes were assigned names using our automated gene naming system. This process is described in Gene Naming.
Gene Structure Prediction
Gene structures were predicted by combining gene predictions from Glimmer and GeneMark, ORFs mapped from the SGD data collection, and manual curation. The Glimmer and GeneMark gene sets contained 7940 and 6454 gene predictions, respectively. We downloaded 5873 SGD-annotated ORFs from the SGD website (as of 12.15.05).
Using the BLAT alignment program developed by Jim Kent (Kent, 2002), we were able to map 5741 of the 5873 SGD annotated ORFs. Of these, 5534 ORFs mapped perfectly over the entire length of the query sequence.
The remaining mapped ORFs, which had one or more problems in their alignments, were manually reviewed, modified where necessary, and tagged with curation flags indicating the nature of the problem (frame shift, partial ORF due to a sequence gap, defective ORF, etc.). The other 132 SGD-annotated ORFs did not map to our genome assembly, mostly because of gaps in the assembly.
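The three mapping outcomes described above (perfect, flagged for review, unmapped) can be illustrated with a minimal Python sketch that reads BLAT output in PSL format. This is not the pipeline that was actually used. The column positions follow the standard 21-column PSL layout, while the definition of "perfect" (full query length, no mismatches, no inserts) and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PslHit:
    matches: int
    mismatches: int
    rep_matches: int
    q_num_insert: int
    t_num_insert: int
    q_name: str
    q_size: int
    q_start: int
    q_end: int


def parse_psl_line(line: str) -> PslHit:
    """Parse one tab-separated PSL record (21 columns, header already removed)."""
    f = line.rstrip("\n").split("\t")
    return PslHit(
        matches=int(f[0]),
        mismatches=int(f[1]),
        rep_matches=int(f[2]),
        q_num_insert=int(f[4]),
        t_num_insert=int(f[6]),
        q_name=f[9],
        q_size=int(f[10]),
        q_start=int(f[11]),
        q_end=int(f[12]),
    )


def classify_hit(hit: PslHit) -> str:
    """Assumed rule: 'perfect' means the whole query aligns identically, ungapped."""
    full_length = hit.q_start == 0 and hit.q_end == hit.q_size
    identical = hit.mismatches == 0 and hit.matches + hit.rep_matches == hit.q_size
    ungapped = hit.q_num_insert == 0 and hit.t_num_insert == 0
    return "perfect" if (full_length and identical and ungapped) else "needs_review"


def summarize(orf_ids, psl_lines):
    """Return {orf_id: 'perfect' | 'needs_review' | 'unmapped'} using each ORF's best hit."""
    best = {}
    for line in psl_lines:
        hit = parse_psl_line(line)
        # Keep the hit with the most matching bases per query ORF.
        if hit.q_name not in best or hit.matches > best[hit.q_name].matches:
            best[hit.q_name] = hit
    return {
        orf: classify_hit(best[orf]) if orf in best else "unmapped"
        for orf in orf_ids
    }
```

ORFs in the `needs_review` group correspond to those that were curated by hand, and `unmapped` ORFs to those with no BLAT alignment at all.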
In addition, we manually reviewed intergenic regions in which Glimmer and/or GeneMark predictions overlapped BLASTX evidence, looking for potential genes that were missed by BLAT mapping. We manually created additional ORFs where the evidence was sufficient. The resulting manually annotated gene set contains 309 such ORFs.
The final gene set contains 5695 gene loci for Saccharomyces cerevisiae strain RM11-1a.
Gene Naming
The gene names for S. cerevisiae RM11-1a were assigned using one of the following protocols:
Gene symbols from the SGD annotation were transferred to the uniquely mapped ORFs on our genome. These gene assignments were then verified using our in-house synteny-based gene mapping procedure. Briefly, the synteny-based gene mapping procedure works as follows:
First, collinear block alignments are generated between the two genomes:
- Pairwise alignments are created between the two genomes.
- Alignments that lie within some maximum distance of each other on both the target and query sequences, and that are oriented in the same direction, are clustered into collinear blocks.
- A global alignment is generated for the entire region that each collinear block covers. The mapping process uses this global alignment of the two regions.
Once the collinear regions are created, an in-house gene mapping process can be run on a gene set to copy it to the target genome.
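The clustering step can be sketched in Python as follows. The in-house procedure is not published here, so this is only one plausible reading of it: the `Aln` record, the `max_gap` parameter, and the greedy chaining of alignments sorted by target position are all assumptions, and the final global alignment of each block is not implemented.

```python
from typing import NamedTuple


class Aln(NamedTuple):
    """One pairwise alignment between a query and a target sequence (hypothetical record)."""
    q_start: int
    q_end: int
    t_start: int
    t_end: int
    strand: str  # "+" or "-"


def interval_gap(a_start, a_end, b_start, b_end):
    """Distance between two intervals; 0 if they touch or overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def collinear_blocks(alns, max_gap):
    """Greedily chain same-strand alignments that are close on both query and target."""
    blocks = []
    for strand in ("+", "-"):
        current = []
        same_strand = sorted((a for a in alns if a.strand == strand),
                             key=lambda a: a.t_start)
        for aln in same_strand:
            if current:
                prev = current[-1]
                close = (
                    interval_gap(prev.q_start, prev.q_end, aln.q_start, aln.q_end) <= max_gap
                    and interval_gap(prev.t_start, prev.t_end, aln.t_start, aln.t_end) <= max_gap
                )
                if not close:
                    blocks.append(current)
                    current = []
            current.append(aln)
        if current:
            blocks.append(current)
    return blocks


def block_span(block):
    """Region covered by a block: (q_start, q_end, t_start, t_end)."""
    return (
        min(a.q_start for a in block),
        max(a.q_end for a in block),
        min(a.t_start for a in block),
        max(a.t_end for a in block),
    )
```

Each span returned by `block_span` marks the pair of regions over which a global alignment would then be computed.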
We mapped 5619 of the SGD features using this Synteny-based approach. Only 4392 of these mapped annotations contain biologically meaningful names/gene symbols assigned by SGD. The remaining ORFs were named using our automated gene naming system as described below:
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene name that fall into 3 categories:
NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein
Names in this category are assigned to gene predictions that show excellent homology to a known NR protein. The criteria for this category are:
- At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
- A minimum of 50% identity and 70% coverage of both the query and subject sequence.
The name will follow one of these three formats:
- conserved hypothetical protein, if the name of the homologous protein contains a word indicating that the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}; otherwise
- NAME, if the homologous protein is from the curated Swiss-Prot gene set; otherwise
- hypothetical protein similar to NAME.
Two further rules apply:
- Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest average identity relative to the amount of overlap with the target gene.
- In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
Hypothetical protein. Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
- BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
Predicted protein. Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
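The three naming categories above amount to a small decision procedure, sketched here in Python. The `Hit` record and its field names are hypothetical, hits are assumed to have already passed the BLASTP expect threshold of 1e-10, and the choice among several qualifying hits (Swiss-Prot first, then percent identity) is a simplification of the rule stated above rather than the actual implementation.

```python
import re
from dataclasses import dataclass

# Words that mark a database name as unverified (list taken from the text above).
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


@dataclass
class Hit:
    """One BLASTP hit against NR (hypothetical record; E-value filter already applied)."""
    name: str            # cleaned NR protein name
    identity: float      # percent identity
    query_cov: float     # percent of the query covered
    subject_cov: float   # percent of the subject covered
    swissprot: bool      # True if the subject is a Swiss-Prot entry


def is_unverified(name: str) -> bool:
    """True if any whole word in the name is in the unverified-word list."""
    words = re.findall(r"[a-z]+", name.lower())
    return any(word in UNVERIFIED_WORDS for word in words)


def assign_name(hits) -> str:
    if not hits:
        # No significant homology to any NR protein.
        return "predicted protein"
    strong = [
        h for h in hits
        if h.identity >= 50 and h.query_cov >= 70 and h.subject_cov >= 70
    ]
    if not strong:
        # Significant hit, but below the identity/coverage thresholds.
        return "hypothetical protein"
    # Simplified tie-break: Swiss-Prot hits first, then highest identity.
    best = max(strong, key=lambda h: (h.swissprot, h.identity))
    if is_unverified(best.name):
        return "conserved hypothetical protein"
    if best.swissprot:
        return best.name
    return f"hypothetical protein similar to {best.name}"
```

For example, `assign_name([])` falls through to "predicted protein", while a single non-Swiss-Prot hit named "hexokinase" at 80% identity and 90% coverage on both sequences yields "hypothetical protein similar to hexokinase".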
Gene Numbering
Every annotated gene is given a Locus Number of the form SC1G_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
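As a small illustration of treating the locus number as an opaque key, the Python sketch below validates the identifier format and looks up position as a separate attribute. It assumes that the `#####` placeholder means exactly five digits; the `GeneRecord` type and the example coordinates are invented for illustration and do not come from the annotation.

```python
import re
from dataclasses import dataclass

# Assumption: "SC1G_#####" means the prefix followed by exactly five digits.
LOCUS_RE = re.compile(r"SC1G_\d{5}")


def is_locus_number(text: str) -> bool:
    """True if the whole string has the SC1G_##### form."""
    return LOCUS_RE.fullmatch(text) is not None


@dataclass
class GeneRecord:
    """Attributes stored alongside a locus, not encoded in it (hypothetical type)."""
    contig: str
    start: int
    end: int


# Invented example data: position is retrieved by locus, never parsed from it.
genes = {"SC1G_00042": GeneRecord(contig="contig_1", start=1200, end=2400)}


def position_of(locus: str) -> GeneRecord:
    if not is_locus_number(locus):
        raise ValueError(f"not a locus number: {locus!r}")
    return genes[locus]
```

Because the identifier carries no positional meaning, a gene can move between assemblies without its locus number changing; only the stored record is updated.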
Name Counts
4457 transcripts had non-generic names.
| Name type | Count |
|---|---|
| "conserved hypothetical protein" | 418 |
| "hypothetical protein" | 480 |
| "predicted protein" | 340 |
| hypothetical protein similar to... | 85 |
| other non-empty name | 4372 |
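The counts above can be cross-checked with a few lines of Python. The sketch assumes that the "non-generic" figure is the sum of the last two rows and that each gene locus has one named transcript, which is what the totals suggest but is not stated explicitly.

```python
# Counts copied from the Name Counts table.
name_counts = {
    "conserved hypothetical protein": 418,
    "hypothetical protein": 480,
    "predicted protein": 340,
    "hypothetical protein similar to...": 85,
    "other non-empty name": 4372,
}

# Assumption: "non-generic" means the last two rows of the table.
non_generic = (
    name_counts["hypothetical protein similar to..."]
    + name_counts["other non-empty name"]
)
total_named = sum(name_counts.values())

print(non_generic)   # 4457, the reported number of non-generic names
print(total_named)   # 5695, the size of the final gene set
```

Under these assumptions the five rows account for every locus in the final gene set.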
Possible Problems
Each column after Total is one gene-call source, with the count reported for that source in parentheses.
| Problem | Total | SC1_CombinedManualMappedFeatures (3) | SC1_ALIGNESTS_SC1_dubious_1 (1) | SC1_CALLREFERENCEGENES_1 (10) | SC1_CALLREFERENCEGENES_2 (24) | SC1_ALIGNESTS_SC1_ORF_1 (5495) | SC1_ALIGNESTS_SC1_est_1 (2) | SC1_GENETRANSFER_SC_12 (22) | SC1_CALLGENES_2 (3) | SC1_BLAST_NR_1 (11) | SC1_GLIMMER_1 (68) | SC1_GeneMark (56) | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| short proteins < 50aa | 62 | 0 | 0 | 0 | 0 | 56 | 0 | 1 | 0 | 0 | 2 | 3 | not tallied in problems |
| shorter proteins < 30aa | 13 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | |
| very short proteins < 10aa | 0 | - | - | - | - | - | - | - | - | - | - | - | |
| exon-less transcripts | 0 | - | - | - | - | - | - | - | - | - | - | - | |
| initial exon ≤ 6bp | 8 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | |
| internal exon ≤ 6bp | 0 | - | - | - | - | - | - | - | - | - | - | - | |
| terminal exon ≤ 6bp | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | |
| intron ≥ 1000bp | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | |
| coding length not mod 3 | 278 | 0 | 0 | 0 | 15 | 257 | 0 | 2 | 0 | 2 | 1 | 1 | |
| first codon not Met | 54 | 0 | 0 | 1 | 10 | 33 | 1 | 1 | 0 | 1 | 6 | 1 | |
| first codon not known START | 47 | 0 | 0 | 1 | 10 | 31 | 1 | 1 | 0 | 1 | 1 | 1 | |
| last codon not STOP | 33 | 0 | 0 | 1 | 6 | 23 | 0 | 0 | 0 | 1 | 0 | 2 | |
| contains in-frame STOP | 279 | 0 | 0 | 1 | 10 | 261 | 0 | 2 | 0 | 2 | 2 | 1 | |
| contains ≥1 N in exon | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | |
| non-canonical splicing | 52 | 0 | 0 | 0 | 2 | 49 | 0 | 0 | 0 | 0 | 1 | 0 | |
| overlapping | 22 | 0 | 0 | 1 | 0 | 43 | 0 | 1 | 0 | 0 | 0 | 0 | in 22 clusters |
| overlap > 50 bases | 4 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | in 4 clusters |
| overlap > 100 bases | 3 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | in 3 clusters |
| overlap > 200 bases | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | in 1 cluster |
| spanning contigs | 40 | 0 | 0 | 0 | 4 | 25 | 0 | 0 | 0 | 0 | 5 | 6 | |
| one or more problems | 467 | 0 | 0 | 2 | 21 | 389 | 1 | 3 | 0 | 2 | 11 | 8 | |
