Gene Finding Methods
Outline
- Overview
- Gene Structure Prediction
- Gene Naming
- Summary of Automated Gene Naming
- Overview of Query Genes
- Structure Prediction Validation
- Possible Problems
Overview
This document briefly summarizes the method used to produce the automated gene calls for the Histoplasma capsulatum genome. Automated gene calls were produced in the following steps:
- Gene locations and structures were predicted using a combination of GENEWISE, FGENESH and GENEID. This process is described in section Gene Structure Prediction.
- Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
Gene Structure Prediction
Gene structures were predicted using a combination of FGENESH, GENEID and GENEWISE. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL. GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.
FGENESH uses a statistical model of gene structure that requires training on each organism for accurate prediction. Softberry performed the training on Aspergillus nidulans sequences. GENEID is an ab initio gene caller and was run with the Histoplasma capsulatum parameter file. Although GENEWISE does use some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species-specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein to the genome, assuming that introns start at GT and end at AG. Because we are interested in predicting complete gene structures, we post-processed incomplete GENEWISE protein alignments by extending the first exon upstream to the nearest start codon and the last exon downstream to the nearest stop codon. If a stop codon was encountered upstream of a gene before a start codon could be found, the gene call was not used.
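The GENEWISE post-processing step can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: it assumes a forward-strand sequence, 0-based half-open coordinates, an aligned region that is already in frame, and ATG as the only start codon. The function name `extend_to_start_and_stop` is hypothetical, and returning `None` when no stop codon is found downstream is an assumption the text does not state.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: only ATG counts as a start codon


def extend_to_start_and_stop(seq, first, last_end):
    """Extend an in-frame aligned region [first, last_end) to a full ORF.

    Returns (start, end) with the start codon at `start` and the stop codon
    ending at `end`, or None if the gene call should be discarded.
    """
    seq = seq.upper()

    # Walk upstream one codon at a time until a start codon is found.
    start = first
    while True:
        if start < 0:
            return None  # ran off the sequence without finding a start
        codon = seq[start:start + 3]
        if codon == START_CODON:
            break
        if codon in STOP_CODONS:
            return None  # stop codon met before a start: call not used
        start -= 3

    # Walk downstream one codon at a time until a stop codon is found.
    pos = last_end
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            return (start, pos + 3)
        pos += 3
    return None  # assumption: no downstream stop also discards the call
```

For a partial alignment covering only the middle codons of an open reading frame, the function returns the coordinates of the enclosing ATG-to-stop interval.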
Briefly, the results from these gene callers were combined in the following manner:
- FGENESH and GENEID were run on the entire genomic sequence, using parameter files from Aspergillus nidulans and Histoplasma capsulatum respectively, to provide an initial set of predicted genes. GENEWISE was run on each BLAST hit from NR that aligned to the genome with minimum 80% identity and minimum 80% coverage. This resulted in a set containing 7685 FGENESH predictions, 9561 GENEID predictions, and 286 GENEWISE predictions.
- Next, short predictions were filtered out. Any gene prediction less than 30aa (90nt) long was dropped. In addition, any gene prediction less than 50aa (150nt) long was dropped, unless it was supported by BLAST or HMMER hit evidence.
- 511 genes were annotated manually using publicly available ESTs from other species and BLASTX alignments.
- Wherever remaining predictions had overlapping exons, on identical or opposing strands, we clustered them into separate groups for further analysis. A set of non-overlapping genes was chosen from each cluster by ordering the genes from "best" to "worst", picking the "best" gene, then going down the list and adding any lower-ranked genes that did not overlap ones already selected. Genes were ranked according to the following criteria:
  - A gene was considered to have "good BLAST coverage" if it overlapped at least three BLAST hits from different taxa, each with minimum 50% average identity and minimum 70% query coverage.
  - Predictions with good BLAST coverage were chosen above predictions that did not have good BLAST coverage.
  - If two predictions both had good BLAST coverage, we computed the average overlapping BLAST length for each gene; namely, the average length of the three BLAST hits as defined above. (When there were more than three, the average was computed from the three with the highest scores.) The gene closer in length to the average overlapping BLAST length was chosen first.
  - Otherwise, GENEID was preferred to FGENESH.
- After sorting and selecting the highest-ranked predictions, the resultant gene set contained 9349 genes.
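The length filter and the "best to worst" selection within a cluster can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the pipeline's implementation: overlap is tested on whole gene spans rather than on individual exons, the BLAST-derived values (`good_blast`, `blast_len_diff`) are assumed to be precomputed, and all class and function names are hypothetical. The text does not say how GENEWISE or manual calls were ranked, so the sketch arbitrarily places sources other than GENEID and FGENESH last.

```python
from dataclasses import dataclass

# Only the GENEID-over-FGENESH preference is stated in the text.
SOURCE_RANK = {"GENEID": 0, "FGENESH": 1}


@dataclass
class Prediction:
    name: str
    source: str
    start: int                    # inclusive genomic coordinates
    end: int
    protein_len: int              # length in amino acids
    has_evidence: bool = False    # supported by a BLAST or HMMER hit
    good_blast: bool = False      # meets the "good BLAST coverage" rule
    blast_len_diff: float = 0.0   # |length - average overlapping BLAST length|


def keep_by_length(p):
    """Drop calls under 30aa, and calls under 50aa without evidence."""
    return p.protein_len >= 50 or (p.protein_len >= 30 and p.has_evidence)


def overlaps(a, b):
    """Span overlap (assumption: the real test was on exons)."""
    return a.start <= b.end and b.start <= a.end


def rank_key(p):
    """Smaller keys sort first, i.e. are considered 'better'."""
    src = SOURCE_RANK.get(p.source, len(SOURCE_RANK))
    if p.good_blast:
        return (0, p.blast_len_diff, src)
    return (1, 0.0, src)


def select_non_overlapping(cluster):
    """Greedy pass: take the best call, then any lower-ranked call
    that does not overlap one already selected."""
    chosen = []
    for p in sorted(cluster, key=rank_key):
        if not any(overlaps(p, c) for c in chosen):
            chosen.append(p)
    return chosen
```

In use, predictions would first be filtered with `keep_by_length`, grouped into clusters of overlapping calls, and each cluster passed to `select_non_overlapping`.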
Gene Naming
Genes are assigned names very conservatively. Because this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names from genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene names that fall into 3 categories:
- NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein
  Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
  - At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10), with minimum 50% identity and minimum 70% coverage of both the query and subject sequence.
  The name will follow one of these three formats:
  - conserved hypothetical protein, if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}; otherwise
  - NAME, if the homologous protein is from the curated Swiss-Prot gene set; otherwise
  - hypothetical protein similar to NAME.
  Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest product of average identity and amount of overlap with the target gene.
- Hypothetical protein
  Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
  - BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
- Predicted protein
  Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
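As a rough illustration, the naming rules in this section can be written as a small Python decision function. It is a sketch under assumptions rather than the code that was run: the `Hit` record and `assign_name` are hypothetical names, a hit is assumed to have already passed the BLASTP E-value cutoff, the "hypothetical protein" category is assumed to cover hits that fail the identity and coverage thresholds, and the way a protein name is split into words is a guess.

```python
import re
from typing import NamedTuple, Optional

UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


class Hit(NamedTuple):
    """Best BLASTP hit for a predicted gene (already past the E-value cutoff)."""
    name: str
    identity: float      # percent identity
    query_cov: float     # percent of the query covered
    subject_cov: float   # percent of the subject covered
    swissprot: bool      # True if the subject comes from Swiss-Prot


def assign_name(hit: Optional[Hit]) -> str:
    # No significant hit to NR at all.
    if hit is None:
        return "predicted protein"

    # Significant hit that falls short of the "excellent homology" criteria.
    excellent = (hit.identity >= 50
                 and hit.query_cov >= 70
                 and hit.subject_cov >= 70)
    if not excellent:
        return "hypothetical protein"

    # Assumption: words are runs of letters, so "kinase-like" contains "like".
    words = set(re.findall(r"[a-z]+", hit.name.lower()))
    if words & UNVERIFIED_WORDS:
        return "conserved hypothetical protein"
    if hit.swissprot:
        return hit.name
    return f"hypothetical protein similar to {hit.name}"
```

Choosing which hit to pass in when several qualify (Swiss-Prot names first, then the highest product of identity and overlap) is left outside this sketch.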
Summary of Automated Gene Naming
| Gene name | Number of genes |
|---|---|
| "conserved hypothetical protein" | 2151 |
| "hypothetical protein" | 26 |
| "hypothetical protein similar to reverse transcriptase" | 52 |
| "predicted protein" | 5580 |
| hypothetical protein similar to... | 632 |
| other non-empty name | 908 |
Overview of Query Genes
9349 genes: 9349 transcripts,
8332 spliced, 1017 unspliced
32844 exons, 23495 introns
| | min | median | mean | max |
|---|---|---|---|---|
| overall length (incl. UTR) | 93 | 1068 | 1309 | 14592 |
| coding length | 93 | 1068 | 1309 | 14592 |
| exons per transcript | 1 | 3 | 3.51 | 26 |
| exons per spliced transcript | 2 | 3 | 3.82 | 26 |
| bp per exon | 1 | 183 | 372 | 10566 |
| bp per intron | 23 | 97 | 140 | 2926 |
| | len | %cov | %gc | %at |
|---|---|---|---|---|
| exonic | 12240597 | 39.99 | 51.00 | 48.98 |
| intronic | 3312691 | 10.82 | 41.97 | 55.87 |
| intergenic | 15054327 | 49.18 | 43.15 | 57.34 |
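The counts reported in this section are mutually consistent, which a few lines of Python arithmetic can confirm. The sketch assumes that each unspliced transcript contributes exactly one exon and that %cov is each region's share of the summed exonic, intronic and intergenic lengths; both assumptions are inferred from the numbers, not stated in the text.

```python
# Counts copied from the overview.
genes = 9349
spliced, unspliced = 8332, 1017
exons, introns = 32844, 23495

# Every transcript with k exons has k - 1 introns.
assert spliced + unspliced == genes
assert exons - genes == introns

exons_per_transcript = exons / genes
# Assumption: each unspliced transcript has exactly one exon.
exons_per_spliced = (exons - unspliced) / spliced

# Region lengths in bp, copied from the composition table.
lengths = {"exonic": 12240597, "intronic": 3312691, "intergenic": 15054327}
total_bp = sum(lengths.values())
coverage = {region: 100 * bp / total_bp for region, bp in lengths.items()}

print(f"exons per transcript:         {exons_per_transcript:.2f}")
print(f"exons per spliced transcript: {exons_per_spliced:.2f}")
for region, pct in coverage.items():
    print(f"{region:<11}{pct:6.2f} %")
```

The computed means and coverage percentages agree with the tabulated values to within rounding.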
Structure Prediction Validation
Due to lack of any species-specific EST/mRNA data, we did not perform our standard structure prediction validation and hence we do not have a good measure of the accuracy of our gene predictions for this genome.
Possible Problems
| problem | total | HC1_GENEWISE_1 (51) | HC1_FGENESH_5 (2137) | HC1_GENEID_R (6652) | HC1_MANUAL_1 (509) | note |
|---|---|---|---|---|---|---|
| short proteins < 50aa | 72 | 3 | 15 | 54 | 0 | not tallied in problems |
| shorter proteins < 30aa | 0 | - | - | - | - | |
| very short proteins < 10aa | 0 | - | - | - | - | |
| exon-less transcripts | 0 | - | - | - | - | |
| initial exon ≤ 6bp | 187 | 0 | 57 | 104 | 26 | |
| internal exon ≤ 6bp | 29 | 0 | 26 | 0 | 3 | |
| terminal exon ≤ 6bp | 121 | 0 | 27 | 91 | 3 | |
| intron ≥ 1000bp | 3 | 0 | 3 | 0 | 0 | |
| coding length not mod 3 | 9 | 0 | 9 | 0 | 0 | |
| first codon not START | 55 | 0 | 11 | 44 | 0 | |
| last codon not STOP | 39 | 0 | 21 | 18 | 0 | |
| contains in-frame STOP | 0 | - | - | - | - | |
| contains ≥1 N in exon | 1 | 1 | 0 | 0 | 0 | |
| non-canonical splicing | 3 | 0 | 0 | 0 | 3 | |
| overlapping | 0 | - | - | - | - | |
| spanning contigs | 369 | 1 | 119 | 246 | 3 | |
| one or more problems | 771 | 1 | 253 | 481 | 36 | |
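Several of the checks tallied in this table are simple to express in code. The following Python sketch shows how such checks might be run on a single gene model. It is illustrative only: `find_problems` is a hypothetical name, the gene is assumed to lie on the forward strand with exons given as 0-based half-open intervals, ATG is assumed to be the only start codon, and checks that need more context (non-canonical splicing, overlapping genes, spanning contigs) are left out.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_problems(seq, exons):
    """Return the set of problem labels for one forward-strand gene model.

    seq   -- genomic sequence
    exons -- list of (start, end) 0-based half-open intervals, in order
    """
    seq = seq.upper()
    exon_seqs = [seq[s:e] for s, e in exons]
    cds = "".join(exon_seqs)
    problems = set()

    # Very short exons, labeled by their position in the transcript.
    for i, ex in enumerate(exon_seqs):
        if len(ex) <= 6:
            if i == 0:
                kind = "initial"
            elif i == len(exon_seqs) - 1:
                kind = "terminal"
            else:
                kind = "internal"
            problems.add(f"{kind} exon <= 6bp")

    # Long introns: the gap between consecutive exons.
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        if start2 - end1 >= 1000:
            problems.add("intron >= 1000bp")

    if "N" in cds:
        problems.add("contains >=1 N in exon")
    if len(cds) % 3 != 0:
        problems.add("coding length not mod 3")
    if cds[:3] != "ATG":
        problems.add("first codon not START")
    if cds[-3:] not in STOP_CODONS:
        problems.add("last codon not STOP")

    # In-frame stops: complete codons read from the first base,
    # excluding the final codon when the length is a multiple of 3.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    internal = codons[:-1] if len(cds) % 3 == 0 else codons
    if any(c in STOP_CODONS for c in internal):
        problems.add("contains in-frame STOP")

    return problems
```

A gene model with no flagged issue yields an empty set, so counting the non-empty results over a gene set would give a tally analogous to the "one or more problems" row.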
