Gene Finding Methods

Overview

This document explains how automated gene calls were produced for the Burkholderia dolosa genome. The annotation was created in two steps:

  1. Final gene structures were predicted using a combination of ORFs from Glimmer, GeneMark and manual annotation. This process is described in the section Gene Structure Prediction.
  2. Gene "names" were assigned to the predicted gene structures based on homology to previously annotated genes. This process is described in the section Gene Naming.


Gene Naming

The gene names for the annotated Burkholderia dolosa ORFs were assigned using the following procedure:

We used the top three blastx hits as evidence to name each gene in our final gene set.

Blast names of the form "X Homolog", "X Family", "Similar to X protein", etc. were changed to "hypothetical protein similar to X". All available evidence from the top hits was then manually reviewed to assign the most informative gene product name and to resolve any discrepancy among the names derived from two or more top hits. Finally, genes for which none of the above steps yielded a name were named as "hypothetical proteins".
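The renaming rule above can be sketched in a few lines of Python. This is only an illustration, not the pipeline that was actually used: the function name, the regular expressions, and the choice to match the whole hit name are assumptions, as is the singular fallback string.

```python
import re

# Hypothetical patterns for the three name forms quoted in the text.
# Each captures the "X" part of the blast hit name.
_SIMILARITY_PATTERNS = [
    re.compile(r"^similar to (?P<x>.+?)(?: protein)?$", re.IGNORECASE),
    re.compile(r"^(?P<x>.+?) homolog$", re.IGNORECASE),
    re.compile(r"^(?P<x>.+?) family$", re.IGNORECASE),
]


def normalize_blast_name(name):
    """Apply the renaming rule to one blast hit name (illustrative only)."""
    cleaned = (name or "").strip()
    if not cleaned:
        # No usable name: fall back to the generic product name
        # (singular form assumed here).
        return "hypothetical protein"
    for pattern in _SIMILARITY_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return "hypothetical protein similar to " + match.group("x").strip()
    # Any other name is kept unchanged and left for manual review.
    return cleaned
```

For example, under these assumptions a hit named "RecA homolog" becomes "hypothetical protein similar to RecA", while a hit with an ordinary product name passes through untouched. The manual review step is not modelled.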


Gene Structure Prediction

Gene structures were predicted with our evidence-based prokaryotic gene calling method, which combines gene predictions from Glimmer and GeneMark with manually annotated ORFs. Regions known to contain tRNAs and rRNAs were excluded from consideration on that strand. In loci with no blast coverage, an ab initio predicted ORF was chosen only when both Glimmer and GeneMark predicted a gene with a common stop codon on that strand; in that case, the longer prediction was used. Adjacent genes were allowed to overlap by at most 200 bases.

Automated gene models whose ORF length differed significantly from the available blast evidence were manually reviewed, and gene boundaries were merged, split or refined by manual editing where necessary. Further, we reviewed all intergenic regions longer than 1.0 kb that contained good blast evidence and manually created new ORFs if sufficient evidence was present. Problematic annotations containing recognizable sequence gaps, errors and frame shifts were flagged with appropriate curation flags.
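The rule for loci without blast coverage (keep an ab initio ORF only when both predictors agree on the stop codon, then take the longer one) and the 200 base overlap limit can be illustrated as follows. This is a minimal sketch under stated assumptions: the Orf record, the coordinate convention (1-based, inclusive, start < end on both strands) and all function names are hypothetical, and the handling of blast evidence, tRNA/rRNA exclusion and manual review is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    start: int    # leftmost coordinate, 1-based (assumed convention)
    end: int      # rightmost coordinate, inclusive
    strand: str   # "+" or "-"
    source: str   # e.g. "glimmer" or "genemark"

    @property
    def stop_position(self):
        # The stop codon sits at the 3' end: right edge on "+", left edge on "-".
        return self.end if self.strand == "+" else self.start

    @property
    def length(self):
        return self.end - self.start + 1


def consensus_orfs(glimmer, genemark):
    """Keep predictions that share strand and stop position; prefer the longer."""
    by_stop = {(o.strand, o.stop_position): o for o in genemark}
    chosen = []
    for g in glimmer:
        m = by_stop.get((g.strand, g.stop_position))
        if m is not None:
            chosen.append(g if g.length >= m.length else m)
    return sorted(chosen, key=lambda o: o.start)


def overlap_bases(a, b):
    """Number of bases shared by two ORFs, regardless of strand."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def overlaps_exceeding_limit(orfs, max_overlap=200):
    """Return adjacent ORF pairs whose overlap is larger than max_overlap."""
    ordered = sorted(orfs, key=lambda o: o.start)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if overlap_bases(a, b) > max_overlap]
```

Keying on (strand, stop position) reflects the "common stop codon on that strand" condition: two predictions that end at the same stop differ only in their start, so choosing the longer one amounts to choosing the more upstream start.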


Gene Locus Numbers

Every annotated gene is given a Locus Number of the form BDAG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even across different assemblies. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its locus number.
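The idea of an opaque identifier with position stored as a separate attribute can be sketched like this. The sketch assumes that "#####" stands for exactly five digits; the gene records, contig names and coordinates are made up for illustration and do not come from the annotation.

```python
import re

# Assumption: "BDAG_#####" means the literal prefix followed by five digits.
LOCUS_PATTERN = re.compile(r"BDAG_\d{5}")


def is_locus_number(text):
    """True if text has the BDAG_##### shape (five digits assumed)."""
    return LOCUS_PATTERN.fullmatch(text) is not None


# Made-up records: the key is the identifier, position is just an attribute.
# Note that the numeric part says nothing about genomic order.
GENES = {
    "BDAG_00002": {"contig": "contig_7", "start": 5200, "end": 6100, "strand": "-"},
    "BDAG_00001": {"contig": "contig_9", "start": 120, "end": 980, "strand": "+"},
}


def position_of(locus):
    """Look up a gene's position by locus number instead of parsing the number."""
    if not is_locus_number(locus):
        raise ValueError("not a locus number: " + repr(locus))
    gene = GENES[locus]
    return gene["contig"], gene["start"], gene["end"], gene["strand"]
```

Because position lives in the record and not in the identifier, a gene can move between assemblies without its locus number changing.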


Structure Prediction Validation

Because no species-specific EST/mRNA data were available, we did not perform our standard structure prediction validation, and hence we do not have a good measure of the accuracy of our gene predictions for this genome.


Possible Problems

Gene sets (one column each in the table below, in this order):

BD_AUO158_V1_GENETRANSFER_BC_PC184_V1_71610
BD_AUO158_V1_GENEMARK_12218
BD_AUO158_V1_GLIMMER_1920
BD_AUO158_V1_MANUAL_1266

Problem                       Total  GENETRANSFER  GENEMARK  GLIMMER  MANUAL  Note
short proteins < 50aa            12             1        10        1       0  not tallied in problems
shorter proteins < 30aa           0             -         -        -       -
very short proteins < 10aa        0             -         -        -       -
initial exon ≤ 6bp                0             -         -        -       -
internal exon ≤ 6bp               0             -         -        -       -
terminal exon ≤ 6bp               0             -         -        -       -
≥ 15 exons                        0             -         -        -       -
intron ≥ 1000bp                   0             -         -        -       -
intron ≤ 20bp                     0             -         -        -       -
first codon not Met            1333           286       597      418      32  not tallied in problems
first codon not xTG              30            26         0        0       4
first codon not known START      29            25         0        0       4
last codon not STOP              44            27         0        0      17
contains in-frame STOP          144            88         0        0      56
coding length not modulo 3      189           138         0        0      51
non-canonical splicing            0             -         -        -       -
has ≥1 good BLAST hit             0             -         -        -       -  not tallied in problems
≤1/3 as long as BLAST hit         0             -         -        -       -
≥3× longer than BLAST hit         0             -         -        -       -
contains ≥1 N in exon            69            36        25        0       8
contains low-quality sequence   578           153       234      119      72  not tallied in problems
touches gap(s)                   69            36        25        0       8
spans contigs                    69            36        25        0       8
within 1kb of contig edge       486           143       230       74      39  not tallied in problems
any overlap (UTR or CDS)        594           395       556      302      80  in 594 clusters
CDS overlap only                594           395       556      302      80  in 594 clusters
CDS overlap > 50bp              167            86       162       85       8  in 167 clusters
CDS overlap > 100bp              86            42        89       40       4  in 86 clusters
CDS overlap > 200bp               0             -         -        -       -
has predicted UTR                 0             -         -        -       -  not tallied in problems
UTR ≥ 50% length                  0             -         -        -       -  not tallied in problems
UTR is spliced                    0             -         -        -       -  not tallied in problems
one or more problems           1539           521       572      302     144
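Several of the sequence-level problem categories listed above can be checked directly on a coding sequence. The sketch below is an illustration only, not the code used to produce these counts. It assumes that "Met" means an ATG first codon, that "xTG" means any codon ending in TG, and that the known START set is the initiation codon set of NCBI translation table 11; the source does not state which set was actually used.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: initiation codons of NCBI translation table 11 (bacterial).
KNOWN_STARTS = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}


def cds_problems(cds):
    """Return the problem labels that apply to one coding sequence (5' to 3')."""
    seq = cds.upper()
    # Split into complete codons; a trailing partial codon is ignored here
    # and reported through the length check instead.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    problems = []
    if len(seq) % 3 != 0:
        problems.append("coding length not modulo 3")
    if codons:
        first, last = codons[0], codons[-1]
        if first != "ATG":
            problems.append("first codon not Met")
        if not first.endswith("TG"):
            problems.append("first codon not xTG")
        if first not in KNOWN_STARTS:
            problems.append("first codon not known START")
        if last not in STOP_CODONS:
            problems.append("last codon not STOP")
        if any(codon in STOP_CODONS for codon in codons[:-1]):
            problems.append("contains in-frame STOP")
    if "N" in seq:
        problems.append("contains >=1 N in exon")
    return problems
```

A sequence such as "GTGAAATAA" would be reported only as "first codon not Met" under these assumptions, which is consistent with that category being far more common in the table than the xTG and known START categories and with it not being tallied as a problem.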