Gene Finding
Outline
- Overview
- Gene Structure Prediction
- Gene Naming
- Gene Numbering
- Possible Problems: P. aeruginosa 2192
- Possible Problems: P. aeruginosa C3719
Overview
This document explains how automated gene calls were produced for the Pseudomonas aeruginosa genome. The annotation was created in two steps:
- Final gene structures were predicted using a combination of ORFs mapped from the PseudoCAP database and predictions from Glimmer and GeneMark. This process is described in Gene Structure Prediction.
- We relied on the PseudoCAP assigned gene names and symbols for naming most of the genes in our annotation. A few additional genes were assigned names using our automated gene naming system. This process is described in Gene Naming.
The automated gene naming system assigns names in three categories:
- NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein: assigned to gene predictions with excellent homology to a known NR protein.
  - The criteria for this category are:
    - At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10)
    - A minimum of 50% identity and 70% coverage of both the query and subject sequences
  - The name follows one of these three formats:
    - conserved hypothetical protein, if the homologous protein NAME contains a word indicating that the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}; otherwise
    - NAME, if the homologous protein is from the curated Swiss-Prot gene set; otherwise
    - hypothetical protein similar to NAME
  - Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest average identity relative to the amount of overlap with the target gene. In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
- Hypothetical protein: assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's non-redundant protein set (NR). The criterion for this category is:
  - BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
- Predicted protein: assigned to gene predictions that do not show significant BLASTP homology to any protein in NR at the time the complete BLASTP analysis was performed on the gene set.
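The naming categories can be read as a small decision procedure. The following is a minimal sketch of that logic, not the actual naming system: the `Hit` record, its field names, and the tie-break between multiple hits (Swiss-Prot first, then highest identity) are assumptions made for illustration, and the name clean-up step (removing species names, GIs, and so on) is omitted.

```python
import re
from dataclasses import dataclass
from typing import List

# Words that mark a database name as unverified (list taken from the text).
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}

@dataclass
class Hit:
    """Hypothetical summary of one BLASTP hit against NR."""
    name: str           # cleaned protein name of the subject
    evalue: float
    identity: float     # percent identity
    query_cov: float    # percent of the query covered
    subject_cov: float  # percent of the subject covered
    swissprot: bool     # True if the subject is a Swiss-Prot entry

def assign_name(hits: List[Hit]) -> str:
    significant = [h for h in hits if h.evalue <= 1e-10]
    if not significant:
        return "predicted protein"
    strong = [h for h in significant
              if h.identity >= 50 and h.query_cov >= 70 and h.subject_cov >= 70]
    if not strong:
        return "hypothetical protein"
    # Assumed tie-break: prefer Swiss-Prot hits, then higher identity.
    best = max(strong, key=lambda h: (h.swissprot, h.identity))
    words = set(re.findall(r"[a-z]+", best.name.lower()))
    if words & UNVERIFIED_WORDS:
        return "conserved hypothetical protein"
    if best.swissprot:
        return best.name
    return f"hypothetical protein similar to {best.name}"
```

Splitting the name on non-letters means a hyphenated name such as "kinase-like protein" is also treated as unverified, which is one reasonable reading of the word list.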
Gene Structure Prediction
Gene structures were predicted using a combination of gene predictions from Glimmer and GeneMark and mapped ORFs from PseudoCAP. PseudoCAP ORFs (on date 09.29.05) were downloaded from the PseudoCAP
Using the BLAT alignment program developed by Jim Kent (Kent, 2002), we mapped PseudoCAP annotated ORFs onto our genome assembly. In addition, we mapped PseudoCAP ORFs onto the assembly using an in-house synteny-based gene mapping protocol. Briefly, the synteny-based gene mapping procedure works as follows:
First, collinear block alignments are generated between the two genomes. Pair-wise alignments are created between the two genomes, and alignments are then clustered into a collinear block when they lie within some maximum distance of each other on both the target and query sequences and are oriented in the same direction. Finally, a global alignment is generated for the entire region that each collinear block covers. It is this global alignment of the two regions that the mapping process uses. Once the collinear regions are created, an in-house gene mapping process can be run on a gene set to copy it to the target genome.
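The clustering step can be illustrated with a short sketch. This is not the in-house protocol: the `Aln` record, the `max_gap` value of 10,000 bases, and the rule that chained alignments may not overlap are all assumptions chosen to keep the example small. The global alignment of each block is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Aln:
    """Hypothetical pair-wise alignment; coordinates are half-open."""
    q_start: int
    q_end: int
    t_start: int
    t_end: int
    strand: str  # "+" or "-" orientation of the target relative to the query

def cluster_collinear(alns: List[Aln], max_gap: int = 10_000) -> List[List[Aln]]:
    """Group alignments into blocks that are close on both sequences
    and share the same orientation."""
    blocks: List[List[Aln]] = []
    for aln in sorted(alns, key=lambda a: (a.strand, a.q_start)):
        if blocks:
            prev = blocks[-1][-1]
            q_gap = aln.q_start - prev.q_end
            # On the minus strand the target runs backwards along the query.
            if aln.strand == "+":
                t_gap = aln.t_start - prev.t_end
            else:
                t_gap = prev.t_start - aln.t_end
            if (prev.strand == aln.strand
                    and 0 <= q_gap <= max_gap
                    and 0 <= t_gap <= max_gap):
                blocks[-1].append(aln)
                continue
        blocks.append([aln])
    return blocks

def block_span(block: List[Aln]) -> Tuple[int, int, int, int]:
    """Region covered by a block: the input to the global alignment step."""
    return (min(a.q_start for a in block), max(a.q_end for a in block),
            min(a.t_start for a in block), max(a.t_end for a in block))
```

A production implementation would likely tolerate small overlaps between neighbouring alignments; the strict non-negative gap here is only for clarity.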
A few PseudoCAP ORFs did not map to our genome assembly, mostly because of gaps in the assembly. Some of the ORFs that did map had one or more problems in their alignments. These problematic alignments were manually reviewed, modified if necessary, and tagged with curation flags indicating the nature of the problem (frame shifts, partial due to sequence gap, defective ORFs, etc.). Further, we reviewed any Glimmer or GeneMark predictions in the intergenic regions between these mapped ORFs. Predictions with sufficient BLASTX evidence were annotated as ORFs.
The final gene set was produced using an automated and evidence-based prokaryotic gene calling method as described below.
Prokaryotic gene calling works as follows:
Genes for Pseudomonas aeruginosa are selected algorithmically from three sources of input: manual annotations performed locally at the Broad, annotations imported from PseudoCAP, and genes predicted by two ab initio gene predictors, Glimmer and GeneMark. Adjacent genes are allowed to overlap by up to 200 bases. Regions known to contain tRNAs or rRNAs are excluded from consideration on that strand.
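The two positional constraints can be sketched as simple interval tests. The tuple layout `(start, end, strand)` with half-open coordinates and the function names are assumptions for illustration, and the 200-base figure is read here as an upper limit on overlap.

```python
from typing import Iterable, Tuple

Feature = Tuple[int, int, str]  # (start, end, strand), half-open coordinates

def overlap_len(a: Feature, b: Feature) -> int:
    """Number of bases shared by two features, ignoring strand."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def may_coexist(a: Feature, b: Feature, max_overlap: int = 200) -> bool:
    """Adjacent genes are tolerated if they overlap by at most max_overlap bases."""
    return overlap_len(a, b) <= max_overlap

def hits_rna_region(gene: Feature, rnas: Iterable[Feature]) -> bool:
    """True if the gene overlaps a tRNA or rRNA region on the same strand."""
    return any(gene[2] == r[2] and overlap_len(gene, r) > 0 for r in rnas)
```

For example, `may_coexist((0, 1000, "+"), (900, 2000, "+"))` is true because the shared region is 100 bases, while a candidate overlapping an rRNA on the opposite strand is not excluded by `hits_rna_region`.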
Where multiple sources overlap, Broad annotations are selected over all other sources, because these manually created annotations represent targeted editing of imperfectly mapped ORFs and of predicted ORFs with one or more problems. Otherwise, perfectly mapped ORFs (transferred from the external source annotation) are chosen over ab initio predictions. In the absence of ORFs from these two sources, the choice between Glimmer and GeneMark predictions goes to the prediction whose length agrees more closely with the average BLAST hit at that location. In loci with no BLAST coverage, an ab initio predicted ORF is chosen only when both Glimmer and GeneMark predict a gene with a common stop codon on that strand; in this case, the longer prediction is used.
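The precedence rules for a single locus can be sketched as follows. The `Candidate` record, the source labels, and the `avg_blast_hit_len` parameter (assumed to be in the same units as the candidate length, and `None` when the locus has no BLAST coverage) are hypothetical; the sketch also assumes every "mapped" candidate is a perfectly mapped ORF.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    source: str  # "manual", "mapped", "glimmer", or "genemark"
    start: int   # half-open coordinates on the forward strand
    end: int
    strand: str  # "+" or "-"

def length(c: Candidate) -> int:
    return c.end - c.start

def stop_pos(c: Candidate) -> int:
    """Coordinate of the gene's 3' end, where the stop codon sits."""
    return c.end if c.strand == "+" else c.start

def choose(cands: List[Candidate],
           avg_blast_hit_len: Optional[float] = None) -> Optional[Candidate]:
    # 1. Manual annotations win; 2. then perfectly mapped ORFs.
    for src in ("manual", "mapped"):
        picked = [c for c in cands if c.source == src]
        if picked:
            return picked[0]
    ab_initio = [c for c in cands if c.source in ("glimmer", "genemark")]
    if not ab_initio:
        return None
    # 3. With BLAST coverage: prediction closest in length to the average hit.
    if avg_blast_hit_len is not None:
        return min(ab_initio, key=lambda c: abs(length(c) - avg_blast_hit_len))
    # 4. Without BLAST coverage: both predictors must share a stop codon.
    glimmer = [c for c in ab_initio if c.source == "glimmer"]
    genemark = [c for c in ab_initio if c.source == "genemark"]
    agreeing = [(g, m) for g in glimmer for m in genemark
                if g.strand == m.strand and stop_pos(g) == stop_pos(m)]
    if not agreeing:
        return None
    return max(agreeing[0], key=length)  # keep the longer of the pair
```

Returning `None` when the predictors disagree mirrors the rule that an unsupported ab initio call from only one predictor is not accepted.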
Gene Naming
The gene names for the ORFs on the Pseudomonas aeruginosa assembly were assigned using the following steps:
For loci with mapped ORFs from PseudoCAP, we transferred the PseudoCAP names to the final ORFs. For the remaining loci with other types of evidence, we used our automated gene naming system to assign the gene names as described below:
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene names that fall into 3 categories; the categories and their criteria are listed at the end of the Overview section.
Gene Numbering
Every annotated gene is given a Locus Number of the form PA2G_##### (for 2192) or PACG_##### (for C3719). This locus number is the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
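As a small illustration of the identifier format, the sketch below checks a locus number and reports which strain its prefix belongs to. It assumes that `#####` means exactly five digits; the numeric part is treated as opaque, in keeping with the point that it encodes no position or order.

```python
import re
from typing import Optional

# Prefixes taken from the text; the five-digit width is an assumption.
LOCUS_RE = re.compile(r"^(PA2G|PACG)_(\d{5})$")
STRAIN_BY_PREFIX = {"PA2G": "2192", "PACG": "C3719"}

def strain_for_locus(locus: str) -> Optional[str]:
    """Return the strain for a well-formed locus number, else None."""
    match = LOCUS_RE.match(locus)
    if match is None:
        return None
    return STRAIN_BY_PREFIX[match.group(1)]
```

Attributes such as position would be looked up in the annotation by locus number rather than parsed out of the identifier.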
Possible Problems: P. aeruginosa 2192
| Gene source | Genes |
|---|---|
| PA_2192_V1_GENETRANSFER_PA_01_5 | 4927 |
| PA_2192_GENEMARK | 64 |
| PA_2192_V1_GLIMMER_1 | 105 |
| PA_2192_V1_MANUAL_3 | 1101 |

| Problem | Total | GENETRANSFER | GENEMARK | GLIMMER | MANUAL | Note |
|---|---|---|---|---|---|---|
| first codon not Met | 788 | 541 | 19 | 42 | 186 | not tallied in problems |
| first codon not xTG | 28 | 11 | 0 | 0 | 17 | |
| last codon not STOP | 33 | 8 | 0 | 0 | 25 | |
| contains in-frame STOP | 210 | 20 | 0 | 0 | 190 | |
| coding length not modulo 3 | 246 | 19 | 0 | 0 | 227 | |
| contains ≥1 N in exon | 29 | 27 | 0 | 0 | 2 | |
| touches gap(s) | 29 | 27 | 0 | 0 | 2 | |
| spans contigs | 29 | 27 | 0 | 0 | 2 | |
Possible Problems: P. aeruginosa C3719
| Gene source | Genes |
|---|---|
| PA_C3719_V1_GENETRANSFER_PA_01_2 | 4803 |
| PA_C3719_V1_GENEMARK_2 | 40 |
| PA_C3719_V1_GLIMMER_1 | 40 |
| PA_C3719_V1_MANUAL_2 | 703 |

| Problem | Total | GENETRANSFER | GENEMARK | GLIMMER | MANUAL | Note |
|---|---|---|---|---|---|---|
| first codon not Met | 682 | 528 | 7 | 12 | 135 | not tallied in problems |
| first codon not xTG | 38 | 11 | 0 | 0 | 27 | |
| first codon not known START | 36 | 11 | 0 | 0 | 25 | |
| last codon not STOP | 45 | 5 | 0 | 0 | 40 | |
| contains in-frame STOP | 281 | 47 | 0 | 0 | 234 | |
| coding length not modulo 3 | 321 | 40 | 0 | 0 | 281 | |
| contains ≥1 N in exon | 60 | 55 | 0 | 0 | 5 | |
| touches gap(s) | 60 | 55 | 0 | 0 | 5 | |
| spans contigs | 60 | 55 | 0 | 0 | 5 | |
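Most row labels in the problem tables correspond to simple sequence-level checks on a coding sequence. The sketch below shows one way such flags could be computed; it is not the pipeline that produced the counts. It assumes "not Met" means the first codon is not ATG, that the stop codons and the "known START" set are those of NCBI translation table 11, and it leaves out the gap and contig checks, which need assembly coordinates.

```python
from typing import List

STOPS = {"TAA", "TAG", "TGA"}
# Assumption: start codons of NCBI translation table 11.
KNOWN_STARTS = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}

def cds_problems(seq: str) -> List[str]:
    """Return the problem labels that apply to one coding sequence."""
    seq = seq.upper()
    n = len(seq)
    # Complete codons read in frame from the first base.
    codons = [seq[i:i + 3] for i in range(0, n - 2, 3)]
    first, last = seq[:3], seq[-3:]
    # The terminal codon is only "in frame" when the length is a multiple of 3.
    internal = codons[:-1] if n % 3 == 0 else codons

    problems = []
    if first != "ATG":
        problems.append("first codon not Met")
    if len(first) != 3 or not first.endswith("TG"):
        problems.append("first codon not xTG")
    if first not in KNOWN_STARTS:
        problems.append("first codon not known START")
    if last not in STOPS:
        problems.append("last codon not STOP")
    if any(c in STOPS for c in internal):
        problems.append("contains in-frame STOP")
    if n % 3 != 0:
        problems.append("coding length not modulo 3")
    if "N" in seq:
        problems.append("contains ≥1 N in exon")
    return problems
```

For instance, `cds_problems("GTGAAATAA")` reports only "first codon not Met", which matches the way that row is kept separate from the other problem tallies.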
