Gene Finding
Outline
- Overview
- Gene Structure Prediction
- Gene Naming
- Summary of name counts
- Gene Locus Numbers
- Structure Prediction Validation
- Overview of Query Genes
- Possible Problems
Overview
This document describes the methodology used to produce the automated gene calls for the genome of Uncinocarpus reesii. Automated gene calls were produced in essentially a two-step procedure:
- Gene location and structures were predicted using a combination of GENEID, FGENESH, and GENEWISE. This process is described in section Gene Structure Prediction.
- Gene "names" were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
Gene Structure Prediction
Gene structures were predicted using a combination of FGENESH, GENEID and GENEWISE. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigó, is available under the GPL. GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.
FGENESH uses a statistical model of gene structure that requires training on each organism for accurate prediction. Softberry performed the training on Aspergillus nidulans sequences. GENEID is an ab initio gene caller and was run with the Aspergillus nidulans parameter file. Although GENEWISE does utilize some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein assuming that introns start at GT and end at AG. Since we are interested in predicting complete gene structures, we post-processed GENEWISE incomplete protein alignments by moving the first and last exon upstream or downstream to the nearest start and stop codons respectively. If a stop codon was encountered upstream of a gene before a start could be found, the gene call was not used.
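The post-processing of incomplete GENEWISE alignments described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes a forward-strand sequence, codon-aligned alignment ends, and an extension that does not cross an intron. The function names `extend_to_start`, `extend_to_stop` and `complete_gene` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def extend_to_start(seq, cds_start):
    """Walk upstream in frame from cds_start (0-based index of the first
    aligned codon). Return the index of the nearest ATG, or None if a stop
    codon or the sequence edge is reached first."""
    pos = cds_start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon == START_CODON:
            return pos
        if codon in STOP_CODONS:
            return None
        pos -= 3
    return None


def extend_to_stop(seq, cds_end):
    """Walk downstream in frame from cds_end (0-based index just past the
    last aligned codon). Return the index just past the nearest stop codon,
    or None if the sequence ends first."""
    pos = cds_end
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos + 3
        pos += 3
    return None


def complete_gene(seq, cds_start, cds_end):
    """Return (start, end) of the completed gene, or None if the call
    should be discarded."""
    start = extend_to_start(seq, cds_start)
    end = extend_to_stop(seq, cds_end)
    if start is None or end is None:
        return None
    return start, end
```

For example, `complete_gene("CCATGAAACCCGGGTTTTAACC", 8, 14)` extends an alignment covering `CCCGGG` to the span holding `ATGAAACCCGGGTTTTAA`, while an alignment with an in-frame stop upstream and no intervening ATG is rejected.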
Briefly, the results from these gene callers were combined in the following manner:
- FGENESH and GENEID were each run on the entire genomic sequence with A. nidulans parameter files to provide an initial set of predicted genes. GENEWISE was run on each BLAST hit from NR that aligned to the genome with a minimum of 80% identity and 80% similarity. This resulted in a set containing 6,757 FGENESH predictions, 7,193 GENEID predictions, and 3,589 GENEWISE predictions.
- Next, short predictions were filtered out. Any gene prediction shorter than 30 aa (90 nt) was dropped. In addition, any gene prediction of 300 nt or shorter was dropped unless it was supported by BLAST or HMMER hit evidence.
- Wherever remaining predictions had overlapping exons, on identical or opposing strands, we clustered them into separate groups for further analysis. A set of non-overlapping genes was chosen from each cluster by ordering the genes from "best" to "worst", picking the "best" gene, then going down the list and adding any lower-ranked genes that did not overlap ones already selected. Genes were ranked according to the following criteria:
- A gene was considered to have "good BLAST coverage" if it overlapped at least three BLAST hits from different taxa, each with a minimum of 50% average identity and 50% query coverage.
- Predictions with good BLAST coverage were chosen above predictions that did not have good BLAST coverage.
- If two predictions both had good BLAST coverage, we computed the average overlapping BLAST length for each gene; namely, the average length of the three BLAST hits as defined above. (When there were more than three, the average was computed from the three with the highest scores.) The gene closer in length to the average overlapping BLAST length was chosen first.
- Otherwise, GENEID was preferred to FGENESH.
- After sorting and selecting the highest-ranked predictions, the resultant gene set contained 7,798 genes.
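The filtering, ranking and selection steps above can be sketched roughly as follows. All class and function names are hypothetical. Two choices are assumptions rather than facts from the pipeline: lengths are compared in amino acids, and any caller other than GENEID or FGENESH is ranked last when nothing else separates two predictions.

```python
from dataclasses import dataclass, field


@dataclass
class BlastHit:
    taxon: str
    identity: float   # percent average identity
    coverage: float   # percent query coverage
    length: int       # hit length in aa (unit is an assumption)
    score: float


@dataclass
class Prediction:
    name: str
    caller: str       # "GENEID", "FGENESH" or "GENEWISE"
    exons: list       # [(start, end), ...], end exclusive
    blast_hits: list = field(default_factory=list)
    has_hmmer_hit: bool = False

    @property
    def coding_nt(self):
        return sum(end - start for start, end in self.exons)


def passes_length_filter(p):
    """Drop calls under 90 nt, and calls of 300 nt or less without evidence."""
    if p.coding_nt < 90:
        return False
    if p.coding_nt <= 300:
        return bool(p.blast_hits) or p.has_hmmer_hit
    return True


def qualifying_hits(p):
    """Best-scoring hit per taxon among hits with >=50% identity and
    >=50% query coverage, highest score first."""
    best = {}
    for h in p.blast_hits:
        if h.identity >= 50 and h.coverage >= 50:
            if h.taxon not in best or h.score > best[h.taxon].score:
                best[h.taxon] = h
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


def rank_key(p):
    """Smaller keys sort first, i.e. are "better"."""
    hits = qualifying_hits(p)
    caller_rank = {"GENEID": 0, "FGENESH": 1}.get(p.caller, 2)  # assumption
    if len(hits) >= 3:  # "good BLAST coverage"
        avg_len = sum(h.length for h in hits[:3]) / 3
        return (0, abs(p.coding_nt / 3 - avg_len), caller_rank)
    return (1, 0.0, caller_rank)


def exons_overlap(a, b):
    """True if any exon of a overlaps any exon of b, on either strand."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in a.exons for s2, e2 in b.exons)


def select_genes(cluster):
    """Greedy pick: take the best gene, then each lower-ranked gene that
    does not overlap one already chosen."""
    chosen = []
    for p in sorted(cluster, key=rank_key):
        if not any(exons_overlap(p, q) for q in chosen):
            chosen.append(p)
    return chosen
```

A caller would first apply `passes_length_filter` to every prediction and then pass each cluster of overlapping survivors to `select_genes`.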
Gene Naming
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene name that fall into 3 categories:
- NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein
- Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
- At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
- A minimum of 50% identity and 70% coverage of both the query and subject sequence.
- The name will follow one of these three formats:
- conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
- NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise:
- hypothetical protein similar to NAME
Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits we choose the one with the highest average identity × the amount of overlap to the target gene.
In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
- hypothetical protein
- Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criterion for this category is:
- BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
- predicted protein
- Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
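The naming rules above amount to a small decision procedure, sketched below. The function `assign_name` and the dictionary keys are hypothetical. The sketch assumes the best significant BLASTP hit (expect 1e-10) has already been chosen and its name already cleaned, and that "contains a word" means a whole-word match after splitting on non-letters.

```python
import re

UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


def assign_name(hit):
    """hit is None when there is no significant BLASTP hit to NR, else a
    dict with keys: name, identity, query_cov, subject_cov (percentages)
    and is_swissprot (bool)."""
    if hit is None:
        return "predicted protein"

    # Excellent homology: >=50% identity, >=70% coverage of both sequences.
    excellent = (hit["identity"] >= 50
                 and hit["query_cov"] >= 70
                 and hit["subject_cov"] >= 70)
    if not excellent:
        return "hypothetical protein"

    # Whole-word match is an assumption about how the word list is applied.
    words = set(re.findall(r"[a-z]+", hit["name"].lower()))
    if words & UNVERIFIED_WORDS:
        return "conserved hypothetical protein"
    if hit["is_swissprot"]:
        return hit["name"]
    return "hypothetical protein similar to " + hit["name"]
```

Under these assumptions, a strong non-Swiss-Prot hit named "senescence-associated protein" would yield "hypothetical protein similar to senescence-associated protein", and a strong hit named "kinase-like protein" would yield "conserved hypothetical protein".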
Summary of name counts
1824 transcripts had non-generic names.
| name | count |
|---|---|
| "conserved hypothetical protein" | 2204 |
| "hypothetical protein" | 59 |
| "hypothetical protein similar to senescence-associated protein" | 12 |
| "predicted protein" | 3711 |
| hypothetical protein similar to... | 696 |
| other non-empty name | 1116 |
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form UREG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
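As a small illustration of treating the locus number as an opaque key, the sketch below validates the identifier format and looks position up as an attribute. It assumes that "#####" stands for exactly five digits; the records, contig names and coordinates are made up for the example.

```python
import re

# Assumes "#####" means exactly five digits.
LOCUS_PATTERN = re.compile(r"UREG_\d{5}")


def is_locus_number(text):
    """True if text has the form UREG_#####."""
    return LOCUS_PATTERN.fullmatch(text) is not None


# Made-up records: the locus is only a key; position lives in the record
# and can change between assemblies without changing the identifier.
annotations = {
    "UREG_00001": {"contig": "contig_A", "start": 1200, "end": 2650, "strand": "+"},
    "UREG_00002": {"contig": "contig_A", "start": 400, "end": 1100, "strand": "-"},
}


def position_of(locus):
    """Retrieve position by locus rather than decoding it from the name."""
    record = annotations[locus]
    return record["contig"], record["start"], record["end"], record["strand"]
```

Note that `UREG_00002` lies upstream of `UREG_00001` in this made-up data, which is consistent with loci carrying no positional meaning.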
Structure Prediction Validation
Due to lack of any species-specific EST/mRNA data, we did not perform our standard structure prediction validation and hence we do not have a good measure of the accuracy of our gene predictions for this genome.
Overview of Query Genes
7798 genes: 7798 transcripts, 6566 spliced, 1232 unspliced
24094 exons, 16296 introns
| | min | median | mean | max |
|---|---|---|---|---|
| overall length (incl. UTR) | 150 | 1185 | 1437 | 17274 |
| coding length | 150 | 1185 | 1437 | 17274 |
| exons per transcript | 1 | 3 | 3.09 | 26 |
| exons per spliced transcript | 2 | 3 | 3.48 | 26 |
| bp per exon | 1 | 256 | 465 | 11760 |
| bp per intron | 17 | 65 | 96 | 1130 |
| | len | %cov | %gc | %at |
|---|---|---|---|---|
| exonic | 11205728 | 50.60 | 51.20 | 48.80 |
| intronic | 1570282 | 7.09 | 45.15 | 53.37 |
| intergenic | 9370258 | 42.31 | 46.26 | 53.99 |
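Several of the figures above can be cross-checked against one another. The sketch below recomputes the means and coverage percentages from the reported totals. It assumes that every unspliced transcript has exactly one exon and that exonic length equals total coding length, which matches the identical "overall" and "coding" rows.

```python
genes = 7798
spliced, unspliced = 6566, 1232
exons, introns = 24094, 16296
exonic_bp, intronic_bp, intergenic_bp = 11205728, 1570282, 9370258

# Each transcript has one fewer intron than exons.
assert spliced + unspliced == genes
assert exons - genes == introns

exons_per_transcript = round(exons / genes, 2)
exons_per_spliced = round((exons - unspliced) / spliced, 2)
mean_coding_length = round(exonic_bp / genes)
mean_bp_per_exon = round(exonic_bp / exons)
mean_bp_per_intron = round(intronic_bp / introns)

# Percent of the annotated sequence in each class.
total_bp = exonic_bp + intronic_bp + intergenic_bp
coverage = {
    "exonic": round(100 * exonic_bp / total_bp, 2),
    "intronic": round(100 * intronic_bp / total_bp, 2),
    "intergenic": round(100 * intergenic_bp / total_bp, 2),
}

print(exons_per_transcript, exons_per_spliced)
print(mean_coding_length, mean_bp_per_exon, mean_bp_per_intron)
print(total_bp, coverage)
```

The recomputed values agree with the reported 3.09 and 3.48 exons per transcript, the 1437, 465 and 96 bp means, and the 50.60, 7.09 and 42.31 percent coverage figures.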
Possible Problems
| | total | UR2_GENEWISE_3 | UR2_FGENESH_3 | UR2_GENEID_1 | | |
|---|---|---|---|---|---|---|
| genes | 7798 | 138 | 3075 | 4585 | | |
| short proteins < 50aa | 0 | - | - | - | ← not tallied in problems | |
| shorter proteins < 30aa | 0 | - | - | - | ||
| very short proteins < 10aa | 0 | - | - | - | ||
| exon-less transcripts | 0 | - | - | - | ||
| initial exon ≤ 6bp | 73 | 0 | 15 | 58 | ||
| internal exon ≤ 6bp | 4 | 0 | 4 | 0 | ||
| terminal exon ≤ 6bp | 26 | 0 | 0 | 26 | ||
| intron ≥ 1000bp | 5 | 0 | 5 | 0 | ||
| coding length not mod 3 | 2 | 0 | 2 | 0 | ||
| first codon not START | 7 | 0 | 1 | 6 | ||
| last codon not STOP | 2 | 0 | 2 | 0 | ||
| contains in-frame STOP | 0 | - | - | - | ||
| contains ≥1 N in exon | 0 | - | - | - | ||
| non-canonical splicing | 0 | - | - | - | ||
| overlapping | 0 | - | - | - | ||
| spanning contigs | 131 | 0 | 112 | 19 | ||
| one or more problems | 244 | 0 | 135 | 109 | ||
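Some of the checks tallied above are simple to express in code. The sketch below is illustrative only and is not the tool that produced the table. It assumes a forward-strand coding sequence given as a string, exon coordinates in transcript order with exclusive ends, and that the short-exon checks apply only to multi-exon transcripts. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds):
    """Return the sequence-level problems found in a coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("coding length not mod 3")
    if cds[:3] != "ATG":
        problems.append("first codon not START")
    if cds[-3:] not in STOP_CODONS:
        problems.append("last codon not STOP")
    # In-frame codons from the start, excluding the final three bases.
    inner = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    if any(codon in STOP_CODONS for codon in inner):
        problems.append("contains in-frame STOP")
    if "N" in cds:
        problems.append("contains >=1 N in exon")
    return problems


def structure_problems(exons):
    """exons: [(start, end), ...] in transcript order, end exclusive."""
    problems = []
    lengths = [end - start for start, end in exons]
    if len(exons) > 1:  # assumption: single-exon genes are not checked
        if lengths[0] <= 6:
            problems.append("initial exon <= 6bp")
        if any(n <= 6 for n in lengths[1:-1]):
            problems.append("internal exon <= 6bp")
        if lengths[-1] <= 6:
            problems.append("terminal exon <= 6bp")
    introns = [exons[i + 1][0] - exons[i][1] for i in range(len(exons) - 1)]
    if any(n >= 1000 for n in introns):
        problems.append("intron >= 1000bp")
    return problems
```

A transcript would count toward "one or more problems" if either function returns a non-empty list for it.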
