Gene Finding for Mycobacterium tuberculosis F11

Overview

This document briefly describes how we generated the annotation for the genome of M. tuberculosis strain F11. The annotation was produced in three steps:

  1. Gene locations were predicted by combining GeneMark and Glimmer2 predictions with ORFs mapped from M. tuberculosis strains H37Rv and CDC1551. This process is described in the Gene Prediction section.
  2. Gene names were assigned to predicted genes based on homology to previously annotated genes. This process is described in the Gene Naming section.
  3. The newly predicted genes were compared with a reference to evaluate accuracy. This process is described in the Comparison with Reference Genes section.

Gene Prediction

Genes were predicted using a combination of in silico Open Reading Frame (ORF) predictions by GeneMark and Glimmer2, mapped ORFs from M. tuberculosis strains H37Rv and CDC1551, and manual curation.

GeneMark uses a species-specific inhomogeneous Markov model to calculate the probability that a given segment of the sequence is gene-encoding. GeneMark was developed by Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993, 17, 123-133).

Glimmer2 uses interpolated Markov models to identify coding regions and distinguish them from noncoding DNA. It is designed primarily for the genomes of bacteria, archaea, and viruses. Glimmer2 was developed by Steven Salzberg's group (Delcher, et al., Nucleic Acids Res., 1999, 27, 4636-4641).

The gene calling was performed in two steps:

In step 1, we ran GeneMark and Glimmer2 on the complete genome sequence to generate in silico ORFs. We also mapped onto the M. tuberculosis F11 genome the ORFs from M. tuberculosis strains H37Rv (AL123456) and CDC1551 (NC_002755), together with a complete set of 3866 manually curated ORFs from M. tuberculosis strain H37Rv. We then performed automatic ORF calling with a Perl script that clusters all of these ORFs into loci by stop codon location and makes an automatic gene call for each locus. The script also flagged loci where the five ORF tracks differed.
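The clustering idea in step 1 can be illustrated with a minimal Python sketch. This is not the Perl script used for the annotation, whose rules are not described here. The names Orf, cluster_by_stop and call_locus are hypothetical, and the calling rule (most common start, ties broken in favour of the longer ORF) is an assumption made only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Orf(NamedTuple):
    track: str   # source of the ORF, e.g. "GeneMark", "Glimmer2", "H37Rv"
    strand: str  # "+" or "-"
    start: int   # coordinate of the start codon end of the ORF
    stop: int    # coordinate of the stop codon end of the ORF


def cluster_by_stop(orfs):
    """Group ORFs that share a strand and a stop coordinate into loci."""
    loci = defaultdict(list)
    for orf in orfs:
        loci[(orf.strand, orf.stop)].append(orf)
    return dict(loci)


def call_locus(members, n_tracks):
    """Return (called_start, flagged) for the ORFs of one locus.

    Assumed rule: pick the most common start, breaking ties in favour
    of the longer ORF. The locus is flagged when the tracks disagree on
    the start or when at least one track has no ORF at this locus.
    """
    counts = defaultdict(int)
    for orf in members:
        counts[orf.start] += 1
    stop = members[0].stop
    called_start = max(counts, key=lambda s: (counts[s], abs(stop - s)))
    tracks_present = {orf.track for orf in members}
    flagged = len(counts) > 1 or len(tracks_present) < n_tracks
    return called_start, flagged


if __name__ == "__main__":
    orfs = [
        Orf("GeneMark", "+", 100, 400),
        Orf("Glimmer2", "+", 100, 400),
        Orf("H37Rv", "+", 130, 400),
    ]
    for key, members in cluster_by_stop(orfs).items():
        print(key, call_locus(members, n_tracks=3))
```

Keying loci on the stop coordinate works because all ORFs in one reading frame that end at the same stop codon can differ only in their start, so disagreements between tracks reduce to disagreements about the start site.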

In step 2, the flagged loci were manually reviewed and, where appropriate, gene coordinates were adjusted based on BLAST hits against the non-redundant protein database, PROmer alignments with related genomes (Delcher, et al., Nucleic Acids Res., 2002, 30, 2478-2483), and/or unpublished mass spectrometry peptide data kindly provided by Sarah Fortune and Eric Rubin of Harvard Medical School and Michael Chase and David Sarracino of Harvard Partners' Center for Genetics and Genomics.

This gave us a final set of 3911 genes for the M. tuberculosis strain F11 genome.

Gene Naming

The M. tuberculosis F11 gene product names were assigned using the following protocol:

  1. If an ORF has a biologically meaningful product name in H37Rv (AL123456), use that name. This gave us a total of 2270 known genes.
  2. If the H37Rv product name is not meaningful but the H37Rv-manual or CDC1551 product name is, use the H37Rv-manual product name (if present) or otherwise the CDC1551 product name. We obtained 307 gene product names this way.
  3. If none of the above has a meaningful product name, use the product name assigned by NCBI (using CDD and COG). We obtained 972 gene product names this way.
  4. For the rest of the genes, assign "hypothetical protein" as the product name. This applied to 659 genes.
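The naming protocol above is a simple priority cascade, which can be sketched in Python as follows. The function names and the UNINFORMATIVE set are hypothetical. In particular, the test for a "biologically meaningful" name is an assumption, since the criterion actually used is not specified here.

```python
# Assumed examples of names that carry no biological meaning.
UNINFORMATIVE = {"", "hypothetical protein", "conserved hypothetical protein"}


def is_meaningful(name):
    """Assumed criterion: a name is meaningful unless it is missing or generic."""
    return name is not None and name.strip().lower() not in UNINFORMATIVE


def assign_product_name(h37rv=None, h37rv_manual=None, cdc1551=None, ncbi=None):
    """Return (product_name, source) following the four-step priority order."""
    if is_meaningful(h37rv):
        return h37rv, "H37Rv"
    if is_meaningful(h37rv_manual):
        return h37rv_manual, "H37Rv-manual"
    if is_meaningful(cdc1551):
        return cdc1551, "CDC1551"
    if is_meaningful(ncbi):
        return ncbi, "NCBI"
    return "hypothetical protein", "default"


if __name__ == "__main__":
    print(assign_product_name(h37rv="hypothetical protein",
                              cdc1551="example transporter"))
```

Returning the source alongside the name makes it easy to tally how many names came from each step.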

Gene Locus Numbers

Every annotated gene is given a Locus Number of the form TBFG_#####, which should be considered the only guaranteed way to identify a gene uniquely. Locus Numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We do not encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its Locus Number.

Comparison with Reference Genes

In the absence of a set of M. tuberculosis ORFs with experimentally verified start and stop coordinates, we decided to use the mapped M. tuberculosis H37Rv (AL123456) ORFs ("Ref ORF") for comparison.

The following table summarizes the comparison of the 3911 genes with the Ref ORFs mapped from H37Rv:

                             Total   Longer in F11   Shorter in F11
Match both start and stop     3080               -                -
Match Start only                19              18                1
Match Stop only                696             541              155
Both ends inside Ref ORF         4               -                -
Both ends extend Ref ORF        10               -                -
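The categories in this table can be expressed as a small classification function. The sketch below is only an illustration of the category definitions, not the code used to produce the counts. The function name compare_to_ref and the "other" fallback category are hypothetical, and both ORFs are assumed to lie on the same strand, with start greater than stop for the minus strand.

```python
def compare_to_ref(gene, ref):
    """Classify an F11 gene against a Ref ORF.

    gene and ref are (start, stop) coordinate pairs on the same strand.
    Returns (category, relative_length), where relative_length is
    "longer", "shorter", or None when it does not apply.
    """
    g_start, g_stop = gene
    r_start, r_stop = ref
    same_start = g_start == r_start
    same_stop = g_stop == r_stop

    if same_start and same_stop:
        return "match both start and stop", None

    if same_start or same_stop:
        category = "match start only" if same_start else "match stop only"
        g_len = abs(g_stop - g_start)
        r_len = abs(r_stop - r_start)
        return category, "longer" if g_len > r_len else "shorter"

    # Neither end matches: compare strand-independent intervals.
    g_left, g_right = sorted(gene)
    r_left, r_right = sorted(ref)
    if r_left < g_left and g_right < r_right:
        return "both ends inside Ref ORF", None
    if g_left < r_left and r_right < g_right:
        return "both ends extend Ref ORF", None
    return "other", None  # staggered overlap or no overlap


if __name__ == "__main__":
    print(compare_to_ref((70, 400), (100, 400)))
```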

The following table is a comparison of the different types of start sites in M. tuberculosis strains F11 and H37Rv:

Start codon      F11    H37Rv
ATG           58.82%   60.96%
GTG           36.12%   33.58%
TTG            4.73%    4.81%
Others         0.34%    0.65%
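Percentages of this kind can be tallied from the coding sequences of the called genes. The following Python sketch shows one way to do it, assuming each sequence is given 5' to 3' on its coding strand so that the first three bases are the start codon. The function name start_codon_usage is hypothetical.

```python
from collections import Counter

COMMON_STARTS = ("ATG", "GTG", "TTG")


def start_codon_usage(cds_sequences):
    """Return the percentage of genes starting with ATG, GTG, TTG or others."""
    counts = Counter()
    for seq in cds_sequences:
        codon = seq[:3].upper()
        counts[codon if codon in COMMON_STARTS else "Others"] += 1
    total = sum(counts.values())
    categories = COMMON_STARTS + ("Others",)
    if total == 0:
        return {name: 0.0 for name in categories}
    return {name: 100.0 * counts[name] / total for name in categories}


if __name__ == "__main__":
    usage = start_codon_usage(["ATGAAATAA", "atgcccTGA", "GTGAAATAG", "CTGAAATAA"])
    for name, pct in usage.items():
        print(f"{name:7s}{pct:6.2f}%")
```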

Gene Predictions with Possible Problems

The following numbers of M. tuberculosis F11 gene predictions have one or more possible problems:

Genes shorter than 50 amino acids: 3
Internal stop codons or frame-shift: 37
Incomplete 5', 3' or both: 28
Spanning contigs: 1
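Several of these categories can be detected directly from a gene's coding sequence. The sketch below is a hedged illustration only, since the checks actually applied to the F11 gene set are not described here. The function name check_cds is hypothetical, and the accepted start codons (ATG, GTG, TTG), the treatment of a missing start or stop codon as an incomplete end, and the exclusion of the stop codon from the amino acid count are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: only these count as starts


def check_cds(seq, min_aa=50):
    """Return a list of possible problems for one coding sequence (5' to 3')."""
    seq = seq.upper()
    problems = []

    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frame-shift)")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    has_stop = bool(codons) and codons[-1] in STOP_CODONS
    body = codons[:-1] if has_stop else codons  # codons that encode amino acids

    if any(codon in STOP_CODONS for codon in body):
        problems.append("internal stop codon")
    if not codons or codons[0] not in START_CODONS:
        problems.append("incomplete 5' end")
    if not has_stop:
        problems.append("incomplete 3' end")
    if len(body) < min_aa:
        problems.append(f"shorter than {min_aa} amino acids")
    return problems


if __name__ == "__main__":
    print(check_cds("ATG" + "GCT" * 10 + "TAA"))
```

Genes spanning contigs cannot be detected from the sequence alone and would need the gene coordinates together with the contig boundaries.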