Gene Finding for Mycobacterium tuberculosis C

Overview

This document briefly describes how we generated the annotation for the genome of M. tuberculosis strain C. The annotation was produced in three steps:

  1. Gene locations were predicted by mapping M. tuberculosis F11 and H37Rv ORFs to the M. tuberculosis C genome assembly, followed by manual review. This process is described in the Gene Prediction section.
  2. Gene names were assigned to predicted genes based on homology to previously annotated genes. This process is described in the Gene Naming section.
  3. The newly predicted genes were compared with a reference set to evaluate accuracy. This process is described in the Comparison with Reference Genes section.

Gene Prediction

Gene locations were predicted by synteny-based mapping of M. tuberculosis F11 ORFs and H37Rv GenBank (AL123456) ORFs to the M. tuberculosis C genome. Discrepancies were resolved by manual review, using GeneMark and Glimmer2 predictions and other relevant information (see below).

GeneMark uses a species-specific inhomogeneous Markov model to calculate the probability that a given segment of the sequence is gene-encoding. GeneMark was developed by Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993, 17, 123-133).

Glimmer2 uses interpolated Markov models to identify coding regions and distinguish them from noncoding DNA, especially in the genomes of bacteria, archaea, and viruses. Glimmer2 was developed by Steven Salzberg's group (Delcher, et al., Nucleic Acids Res., 1999, 27, 4636-4641).

The gene calling was performed in two steps:

In Step 1, we used a synteny-based approach to map ORFs from M. tuberculosis strains H37Rv (AL123456) and F11 (AAIX01000000) onto the M. tuberculosis C genome, and flagged loci where the two mappings differed.
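The flagging in Step 1 can be pictured with a minimal Python sketch. It assumes each mapped ORF has already been reduced to a hypothetical (start, stop, strand) tuple in M. tuberculosis C coordinates; the function name and data layout are illustrative assumptions, not the pipeline that was actually used.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation of a mapped ORF: (start, stop, strand)
# in M. tuberculosis C genome coordinates.
Orf = Tuple[int, int, str]


def flag_discrepancies(
    f11_orfs: Iterable[Orf], h37rv_orfs: Iterable[Orf]
) -> Tuple[List[Orf], List[Orf]]:
    """Split mapped ORFs into those both sources agree on and those that differ."""
    f11, h37rv = set(f11_orfs), set(h37rv_orfs)
    agreed = sorted(f11 & h37rv)   # identical coordinates in both mappings
    flagged = sorted(f11 ^ h37rv)  # present in only one mapping: needs review
    return agreed, flagged


# Made-up coordinates, for illustration only.
f11_mapped = [(100, 400, "+"), (500, 800, "-")]
h37rv_mapped = [(100, 400, "+"), (530, 800, "-")]
agreed, flagged = flag_discrepancies(f11_mapped, h37rv_mapped)
print(agreed)
print(flagged)
```

In this toy input the second ORF has two candidate starts (500 and 530), so both versions are flagged for the manual review described in Step 2.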

In Step 2, the flagged loci were manually reviewed and, where appropriate, gene coordinates were adjusted using a combination of in silico open reading frame (ORF) predictions by GeneMark and Glimmer2, mapped ORFs from M. tuberculosis strains H37Rv, CDC1551 and F11, BLAST hits against the non-redundant protein database, PROmer alignments with related genomes (Delcher, et al., Nucleic Acids Res., 2002, 30, 2478-2483), and/or unpublished mass spectrometry peptide data kindly provided by Sarah Fortune and Eric Rubin of Harvard Medical School and Michael Chase and David Sarracino of Harvard Partners' Center for Genetics and Genomics.

This gave us a final set of 3851 genes for the M. tuberculosis strain C genome.

Gene Naming

The M. tuberculosis C gene product names were assigned using the following protocol:

  1. If an ORF could be mapped to an ORF in H37Rv (AL123456), it was given the H37Rv ORF product name. This named a total of 3794 genes.
  2. If an ORF could not be mapped to any H37Rv ORF, it was given the product name "hypothetical protein". This applied to 57 genes.
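The two naming rules can be expressed as a short Python sketch. The function name, the dictionary layout, and the example identifiers are hypothetical and only illustrate the protocol; they are not taken from the actual annotation pipeline.

```python
from typing import Dict, Optional

DEFAULT_NAME = "hypothetical protein"


def assign_product_names(
    mapping: Dict[str, Optional[str]], h37rv_names: Dict[str, str]
) -> Dict[str, str]:
    """Name each C-strain ORF after its mapped H37Rv ORF, if it has one.

    mapping     : C-strain ORF id -> mapped H37Rv ORF id, or None if unmapped
    h37rv_names : H37Rv ORF id -> product name
    """
    names = {}
    for orf_id, ref_id in mapping.items():
        if ref_id is not None and ref_id in h37rv_names:
            names[orf_id] = h37rv_names[ref_id]  # rule 1: reuse the H37Rv name
        else:
            names[orf_id] = DEFAULT_NAME         # rule 2: no H37Rv counterpart
    return names


# Made-up identifiers and product names, for illustration only.
mapping = {"orf_a": "ref_1", "orf_b": None}
h37rv_names = {"ref_1": "example product"}
print(assign_product_names(mapping, h37rv_names))
```

Under these rules every gene receives exactly one product name, which is consistent with the two counts adding up to the full gene set (3794 + 57 = 3851).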

Gene Locus Numbers

Every annotated gene is given a Locus Number of the form TBCG_#####, which should be considered the only guaranteed way to identify a gene uniquely. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We do not encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its locus number.
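As a small illustration, the Python sketch below treats the locus number purely as a lookup key and keeps position as a separate attribute. It assumes that "#####" stands for exactly five digits; the record layout and the coordinates are made up for the example.

```python
import re
from typing import Dict, Tuple

# Assumption: "#####" in TBCG_##### means exactly five digits.
LOCUS_PATTERN = re.compile(r"TBCG_\d{5}")


def is_locus_number(text: str) -> bool:
    """Return True if text has the TBCG_##### form."""
    return LOCUS_PATTERN.fullmatch(text) is not None


def position_of(locus: str, genes: Dict[str, dict]) -> Tuple[int, int, str]:
    """Look up a gene's position by locus number; the id itself encodes nothing."""
    record = genes[locus]
    return record["start"], record["stop"], record["strand"]


# Hypothetical record with made-up coordinates.
genes = {"TBCG_00042": {"start": 100, "stop": 400, "strand": "+"}}
print(is_locus_number("TBCG_00042"))
print(position_of("TBCG_00042", genes))
```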

Comparison with Reference Genes

In the absence of a set of M. tuberculosis ORFs with experimentally verified start and stop coordinates, we decided to use the mapped M. tuberculosis H37Rv (AL123456) ORFs ("Ref ORF") for comparison.

The following table summarizes the comparison of the 3851 genes with the Ref ORFs mapped from H37Rv:

                             Total   Longer in C   Shorter in C   Same Length
Match both start and stop     2829        -             -              -
Match start only               145       61            80              4
Match stop only                676      532           140              4
Both ends inside Ref ORF       130        -             -              -
Both ends extend Ref ORF        26        -             -              -
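The row categories in this table can be reproduced with a simple coordinate comparison. The Python sketch below is an illustration under stated assumptions: both ORFs are on the forward strand, start < stop, and both are in the same coordinate system. The type and function names are hypothetical, and the "other" category is added here only to cover partial overlaps that the table does not list.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    start: int  # first base of the start codon
    stop: int   # last base of the stop codon


def classify(gene: Orf, ref: Orf) -> str:
    """Assign a predicted gene to one of the comparison categories."""
    same_start = gene.start == ref.start
    same_stop = gene.stop == ref.stop
    if same_start and same_stop:
        return "match both start and stop"
    if same_start:
        return "match start only"
    if same_stop:
        return "match stop only"
    if gene.start > ref.start and gene.stop < ref.stop:
        return "both ends inside Ref ORF"
    if gene.start < ref.start and gene.stop > ref.stop:
        return "both ends extend Ref ORF"
    return "other"  # not a table row: one end inside, the other extending


# Made-up coordinates, for illustration only.
print(classify(Orf(100, 400), Orf(100, 400)))
print(classify(Orf(130, 400), Orf(100, 400)))
```

For reverse-strand genes the same logic would apply after swapping the roles of the two coordinates.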

The following table is a comparison of the different types of start sites in M. tuberculosis strains C and H37Rv:

Start codon   C        H37Rv
ATG           57.68%   60.96%
GTG           35.12%   33.58%
TTG           4.70%    4.81%
Others        2.49%    0.65%
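A tally of this kind can be computed from the first codon of each coding sequence. The following Python sketch assumes the coding sequences are available as plain nucleotide strings read 5' to 3'; the function name and the toy sequences are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable

NAMED_STARTS = ("ATG", "GTG", "TTG")


def start_codon_percentages(cds_sequences: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of sequences beginning with each start codon."""
    counts = Counter()
    for seq in cds_sequences:
        codon = seq[:3].upper()
        # Anything other than ATG, GTG or TTG is pooled as "Others".
        counts[codon if codon in NAMED_STARTS else "Others"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: 100.0 * counts[key] / total for key in NAMED_STARTS + ("Others",)}


# Toy sequences, not real genes.
print(start_codon_percentages(["ATGAAA", "GTGCCC", "ATGTTT", "CTGAAA"]))
```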

Gene Predictions with Possible Problems

The numbers of M. tuberculosis C gene predictions with each type of possible problem are listed below.

Genes shorter than 50 amino acids: 10
Internal stop codons or frameshifts: 319
Incomplete 5', 3' or both: 79
Spanning contigs: 69
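Some of these checks can be automated directly from the coding sequence. The Python sketch below is a minimal illustration: it assumes each gene is given as a nucleotide string read 5' to 3', uses the standard bacterial stop codons (TAA, TAG, TGA), and reports only sequence-level symptoms. The function name and flag wording are hypothetical, and detecting contig-spanning genes would need assembly information that is not modelled here.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def flag_problems(cds: str, min_aa: int = 50) -> List[str]:
    """Return a list of possible problems for one coding sequence."""
    cds = cds.upper()
    flags = []
    if len(cds) % 3 != 0:
        flags.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    has_terminal_stop = bool(codons) and codons[-1] in STOP_CODONS
    coding = codons[:-1] if has_terminal_stop else codons
    if any(codon in STOP_CODONS for codon in coding):
        flags.append("internal stop codon")
    if not has_terminal_stop:
        flags.append("no terminal stop codon")
    if len(coding) < min_aa:
        flags.append("shorter than %d amino acids" % min_aa)
    return flags


# Toy sequences, not real genes.
print(flag_problems("ATG" + "GCT" * 60 + "TAA"))
print(flag_problems("ATG" + "TAG" + "GCT" * 60 + "TAA"))
```

A sequence whose length is not a multiple of three may indicate a frameshift or an incomplete end, so flagged genes would still need the kind of manual review described in the Gene Prediction section.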