Gene Finding Methods

Overview

This document explains how automated gene calls were produced for the Burkholderia cenocepacia genome. The annotation was created in two steps:

  1. Final gene structures were predicted using a combination of ORFs from Glimmer, GeneMark and manual annotation. This process is described in the section Gene Structure Prediction.
  2. Gene "names" were assigned to the predicted gene structures based on homology to previously annotated genes. This process is described in the section Gene Naming.


Gene Naming

The gene names for the annotated Burkholderia cenocepacia ORFs were assigned using the following procedure:

We used the top three blastx hits as evidence to name our final gene set. All available evidence was reviewed manually to choose the most informative gene product name and to resolve any discrepancy among the names derived from two or more top hits.

Blast names beginning with "X Homolog", "X Family", "Similar to X protein", etc. were changed to begin with the prefix "hypothetical protein similar to X". Finally, genes for which none of the above steps yielded a name were named "hypothetical protein".
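
The renaming rule above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the function name normalize_name and the three regular expressions are assumptions, and because the text ends its list of forms with "etc", the real rule set was probably larger.

```python
import re

# Hypothetical patterns covering only the three forms quoted in the text.
_PATTERNS = [
    re.compile(r"^similar to (?P<x>.+?) protein$", re.IGNORECASE),
    re.compile(r"^(?P<x>.+?) homolog$", re.IGNORECASE),
    re.compile(r"^(?P<x>.+?) family$", re.IGNORECASE),
]


def normalize_name(blast_name):
    """Apply the renaming rule to one blast hit description."""
    # No usable name at all: fall back to the generic product name.
    if blast_name is None or not blast_name.strip():
        return "hypothetical protein"
    name = blast_name.strip()
    for pattern in _PATTERNS:
        match = pattern.match(name)
        if match:
            return "hypothetical protein similar to " + match.group("x")
    # Any other name is kept unchanged.
    return name


print(normalize_name("RecA homolog"))
print(normalize_name("Similar to ABC transporter protein"))
```

Choosing among the names derived from the top three hits remained a manual step, so it is not modelled here.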


Gene Structure Prediction

Gene structures were predicted by combining gene predictions from Glimmer and GeneMark with manually annotated ORFs, using our evidence-based prokaryotic gene calling method. Regions known to contain tRNAs or rRNAs are excluded from consideration on that strand. In loci with no blast coverage, an ab initio predicted ORF is chosen only when both Glimmer and GeneMark predict a gene with a common stop codon on that strand; in this case, the longer prediction is used. Adjacent genes are allowed to overlap by at most 200 bases.

Automated gene models whose ORF length differed significantly from the available blast evidence were reviewed manually, and gene boundaries were merged, split or refined where necessary. We also reviewed all intergenic regions longer than 1.0 kb that contained good blast evidence and manually created new ORFs where sufficient evidence was present. Problematic annotations containing recognizable sequence gaps, errors or frameshifts are flagged with appropriate curation flags.
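
As an illustration of the ab initio rule for loci without blast coverage, the following Python sketch keeps a prediction only when Glimmer and GeneMark agree on strand and stop position, and then takes the longer of the two. The Orf class, the coordinate convention (1-based, inclusive) and all function names are assumptions made for this example; the actual gene caller is not described in enough detail to reproduce.

```python
from dataclasses import dataclass

MAX_OVERLAP = 200  # bases; the upper limit stated in the text


@dataclass(frozen=True)
class Orf:
    start: int   # leftmost coordinate (1-based, inclusive; assumed convention)
    end: int     # rightmost coordinate (1-based, inclusive)
    strand: str  # "+" or "-"

    @property
    def stop_pos(self):
        # The stop codon lies at the right end on "+" and the left end on "-".
        return self.end if self.strand == "+" else self.start

    @property
    def length(self):
        return self.end - self.start + 1


def consensus_orfs(glimmer, genemark):
    """Keep ORFs predicted by both tools with a common stop; prefer the longer."""
    by_stop = {(o.strand, o.stop_pos): o for o in genemark}
    chosen = []
    for g in glimmer:
        m = by_stop.get((g.strand, g.stop_pos))
        if m is not None:
            chosen.append(g if g.length >= m.length else m)
    return chosen


def overlap(a, b):
    """Number of bases shared by two ORFs (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def within_overlap_limit(a, b):
    return overlap(a, b) <= MAX_OVERLAP
```

For example, if Glimmer predicts Orf(100, 400, "+") and GeneMark predicts Orf(40, 400, "+"), both end at the same stop on the same strand, so the longer GeneMark prediction would be kept.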


Gene Locus Numbers

Every annotated gene is given a Locus Number of the form BCPG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even across different assemblies. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its locus number.
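
The design choice above amounts to treating the locus number as an opaque key and keeping position in a separate record. A minimal Python sketch follows. It assumes each "#" in BCPG_##### stands for one digit, and the contig name and coordinates in the lookup table are made-up placeholders, not real annotation data.

```python
import re

# Assumption: "BCPG_#####" means the prefix followed by exactly five digits.
LOCUS_RE = re.compile(r"BCPG_\d{5}")


def is_valid_locus(locus):
    """Check only the format; the digits carry no positional meaning."""
    return LOCUS_RE.fullmatch(locus) is not None


# Placeholder records keyed by locus number: (contig, start, end, strand).
# These values are invented for illustration only.
POSITIONS = {
    "BCPG_00001": ("contig_1", 120, 980, "+"),
    "BCPG_00002": ("contig_7", 4410, 5030, "-"),
}


def position_of(locus, table=POSITIONS):
    """Look up a gene's position; returns None for an unknown locus."""
    return table.get(locus)
```

Because position lives in the table rather than in the identifier, a gene can move between assemblies without its locus number changing.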


Structure Prediction Validation

Because no species-specific EST/mRNA data were available, we did not perform our standard structure prediction validation, and hence we do not have a good measure of the accuracy of our gene predictions for this genome.


Possible Problems

Gene sets (table columns, in order):

  GENETRANSFER  BC_PC184_V1_GENETRANSFER_BD_AUO158_V1_1779
  GENEMARK      BC_PC184_V1_GENEMARK_22681
  GLIMMER       BC_PC184_V1_GLIMMER_21394
  MANUAL        BC_PC184_V1_MANUAL_3895

Problem                          Total  GENETRANSFER  GENEMARK  GLIMMER  MANUAL  Note
short proteins < 50aa               20             5        10        4       1  not tallied in problems
shorter proteins < 30aa              0             -         -        -       -
very short proteins < 10aa           0             -         -        -       -
initial exon ≤ 6bp                   0             -         -        -       -
internal exon ≤ 6bp                  0             -         -        -       -
terminal exon ≤ 6bp                  0             -         -        -       -
≥ 15 exons                           0             -         -        -       -
intron ≥ 1000bp                      0             -         -        -       -
intron ≤ 20bp                        0             -         -        -       -
first codon not Met               1626           120       735      594     177  not tallied in problems
first codon not xTG                 10             5         0        0       5
first codon not known START         10             5         0        0       5
last codon not STOP                 31             8         0        0      23
contains in-frame STOP             169            27         0        0     142
coding length not modulo 3         210            36         0        0     174
non-canonical splicing               0             -         -        -       -
has ≥1 good BLAST hit                0             -         -        -       -  not tallied in problems
≤1/3 as long as BLAST hit            0             -         -        -       -
≥3× longer than BLAST hit            0             -         -        -       -
contains ≥1 N in exon               52             6        20        0      26
contains low-quality sequence      502            39       169       84     210  not tallied in problems
touches gap(s)                      52             6        20        0      26
spans contigs                       51             5        20        0      26
within 1kb of contig edge          383            47       148       79     109  not tallied in problems
any overlap (UTR or CDS)           674           146       597      429     358  in 674 clusters
CDS overlap only                   674           146       597      429     358  in 674 clusters
CDS overlap > 50bp                 220            28       227      137      65  in 220 clusters
CDS overlap > 100bp                117            19       127       72      25  in 117 clusters
CDS overlap > 200bp                 13             2        13       14       0  in 13 clusters
has predicted UTR                    0             -         -        -       -  not tallied in problems
UTR ≥ 50% length                     0             -         -        -       -  not tallied in problems
UTR is spliced                       0             -         -        -       -  not tallied in problems
one or more problems              1751           184       610      429     528
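
Several of the sequence-level checks in this list can be expressed directly on a coding sequence. The Python sketch below is one possible reading of those check names, not the code that produced the counts: the function name cds_problems, the exact definition of each check, and the use of TAA, TAG and TGA as stop codons (the standard bacterial set) are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed: standard bacterial stop codons


def cds_problems(cds):
    """Return the labels of the checks that a coding sequence fails."""
    seq = cds.upper()
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    problems = []
    # Reported in the list above but not tallied as a problem there.
    if codons and codons[0] != "ATG":
        problems.append("first codon not Met")
    # "xTG" is read here as any codon ending in TG (ATG, GTG, TTG, CTG).
    if codons and codons[0][1:] != "TG":
        problems.append("first codon not xTG")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("last codon not STOP")
    # A stop codon anywhere before the final codon is in-frame.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("contains in-frame STOP")
    if len(seq) % 3 != 0:
        problems.append("coding length not modulo 3")
    if "N" in seq:
        problems.append("contains N")
    return problems


print(cds_problems("GTGAAATAA"))
```

Checks that depend on the assembly or on blast evidence (gaps, contig edges, overlaps, hit length) need more than the sequence itself and are left out of this sketch.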