Gene Finding for Mycobacterium tuberculosis Haarlem
Outline
- Overview
- Gene Prediction
- Gene Naming
- Gene Locus Numbers
- Comparison with Reference Genes
- Gene Predictions with Possible Problems
Overview
This document briefly describes how we generated the annotation for the genome of the M. tuberculosis Haarlem strain. The annotation was produced in three steps:
- Gene locations were predicted by Open Reading Frame (ORF) mapping from annotated M. tuberculosis genomes to the M. tuberculosis Haarlem genome, followed by manual review. This process is described in section Gene Prediction.
- Gene names were assigned to predicted genes based on homology to previously annotated genes. This process is described in section Gene Naming.
- The newly predicted genes were compared with a reference to evaluate accuracy. This process is described in section Comparison with Reference Genes.
Gene Prediction
Gene locations were predicted by combining in silico Open Reading Frame (ORF) predictions and synteny-based ORF mapping from annotated M. tuberculosis genomes to the M. tuberculosis Haarlem genome.
The gene calling was performed in three steps:
In Step 1, we used GeneMark and Glimmer3 to predict in silico ORFs.
GeneMark uses a species-specific inhomogeneous Markov model to calculate the probability that a given segment of the sequence is gene-encoding. GeneMark was developed by Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993, 17, 123-133).
Glimmer3 is an update of Glimmer2, which uses interpolated Markov models to identify coding regions and distinguish them from noncoding DNA, particularly in the genomes of bacteria, archaea, and viruses. Glimmer2 was developed by Steven Salzberg's group (Delcher et al., Nucleic Acids Res., 1999, 27, 4636-4641).
In Step 2, we used a synteny-based approach to map ORFs from M. tuberculosis H37Rv, CDC1551, F11 and C genomes to the M. tuberculosis Haarlem genome.
In Step 3, we predicted ORFs by comparing the in silico ORFs and the mapped ORFs with hits to Pfam (Finn et al., Nucleic Acids Res., 2006, 34, D247-D251), the top BLAST hits against the non-redundant protein database, and/or unpublished mass spectrometry peptide data kindly provided by Sarah Fortune and Eric Rubin of Harvard Medical School and Michael Chase and David Sarracino of Harvard Partners' Center for Genetics and Genomics. Discrepancies among the different features were resolved via manual review.
This gave us a final set of 3,866 genes for the M. tuberculosis strain Haarlem genome.
Gene Naming
The M. tuberculosis Haarlem gene product names were assigned using the following protocol:
- If an ORF could be mapped to an ORF in H37Rv (AL123456), we used the H37Rv ORF product name. This named a total of 3,814 genes.
- If an ORF could not be mapped to any H37Rv ORF, we assigned "hypothetical protein" as the product name. This applied to 52 genes.
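The two naming rules above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the pipeline that was actually used: the function name, the dictionary layout, and the example identifiers and product names are all hypothetical.

```python
from typing import Dict, Optional

HYPOTHETICAL = "hypothetical protein"


def assign_product_names(
    orf_to_ref: Dict[str, Optional[str]],
    ref_products: Dict[str, str],
) -> Dict[str, str]:
    """Assign a product name to every Haarlem ORF.

    orf_to_ref maps a Haarlem ORF id to the H37Rv ORF it was mapped to,
    or to None when no mapping exists (assumed layout).
    ref_products maps an H37Rv ORF id to its product name.
    """
    names = {}
    for orf_id, ref_id in orf_to_ref.items():
        if ref_id is not None and ref_id in ref_products:
            # Rule 1: inherit the product name of the mapped H37Rv ORF.
            names[orf_id] = ref_products[ref_id]
        else:
            # Rule 2: no H37Rv counterpart, so use the generic name.
            names[orf_id] = HYPOTHETICAL
    return names


# Placeholder identifiers and product names, for illustration only.
example = assign_product_names(
    {"orf_1": "ref_A", "orf_2": None},
    {"ref_A": "example product A"},
)
```

With the placeholder input above, `orf_1` inherits the name of `ref_A` and `orf_2` falls back to "hypothetical protein".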
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form TBHG_##### that should be considered the only guaranteed way to identify a gene uniquely. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We do not encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its locus number.
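One way to picture this design is a lookup table keyed by locus number, with position stored as an ordinary attribute of the gene record. The sketch below is an assumption-laden illustration: the `Gene` fields, the contig name, the coordinates, and the reading of `#####` as exactly five digits are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable

# Assumption: "TBHG_#####" means the prefix followed by five digits.
LOCUS_PATTERN = re.compile(r"TBHG_\d{5}")


@dataclass(frozen=True)
class Gene:
    locus: str    # opaque identifier, carries no positional meaning
    contig: str   # hypothetical field names for the position attributes
    start: int
    end: int
    strand: str


def build_locus_index(genes: Iterable[Gene]) -> Dict[str, Gene]:
    """Index genes by locus number, rejecting malformed or duplicate loci."""
    index: Dict[str, Gene] = {}
    for gene in genes:
        if not LOCUS_PATTERN.fullmatch(gene.locus):
            raise ValueError(f"malformed locus number: {gene.locus!r}")
        if gene.locus in index:
            raise ValueError(f"duplicate locus number: {gene.locus!r}")
        index[gene.locus] = gene
    return index


# Made-up coordinates: position is looked up by locus, never parsed from it.
index = build_locus_index([
    Gene("TBHG_00002", "contig_1", 5000, 5900, "-"),
    Gene("TBHG_00001", "contig_1", 100, 1600, "+"),
])
position = (index["TBHG_00001"].start, index["TBHG_00001"].end)
```

Note that the example deliberately lists `TBHG_00002` at a higher coordinate and before `TBHG_00001` in the input: nothing in the index relies on the numeric part of the locus reflecting order.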
Comparison with Reference Genes
In the absence of a set of M. tuberculosis ORFs with experimentally verified start and stop coordinates, we decided to use the mapped M. tuberculosis H37Rv (AL123456) ORFs ("Ref ORF") for comparison.
The following table summarizes the comparison of the 3,866 genes with the Ref ORFs mapped from H37Rv:
| Category | Number of Haarlem ORFs |
|---|---|
| Total ORFs | 3866 |
| Same lengths | 3241 |
| Shorter in Haarlem | 191 |
| Longer in Haarlem | 378 |
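A tally like the one in this table could be produced roughly as follows. This is a minimal sketch, assuming ORF lengths are available as dictionaries keyed by the Haarlem ORF identifier; the function name, the extra "No Ref ORF" bucket, and the example values are hypothetical and not taken from the original analysis.

```python
from collections import Counter
from typing import Dict


def compare_orf_lengths(
    haarlem_lengths: Dict[str, int],
    ref_lengths: Dict[str, int],
) -> Counter:
    """Classify each Haarlem ORF by its length relative to its Ref ORF.

    ref_lengths holds the length of the Ref ORF mapped to each Haarlem
    ORF id; ORFs without an entry are counted separately (assumption).
    """
    counts = Counter({"Total ORFs": len(haarlem_lengths)})
    for orf_id, length in haarlem_lengths.items():
        ref_length = ref_lengths.get(orf_id)
        if ref_length is None:
            counts["No Ref ORF"] += 1
        elif length == ref_length:
            counts["Same lengths"] += 1
        elif length < ref_length:
            counts["Shorter in Haarlem"] += 1
        else:
            counts["Longer in Haarlem"] += 1
    return counts


# Made-up lengths in nucleotides, for illustration only.
summary = compare_orf_lengths(
    {"a": 300, "b": 290, "c": 310, "d": 150},
    {"a": 300, "b": 300, "c": 300},
)
```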
The following table is a comparison of the different types of start sites in M. tuberculosis strains Haarlem and H37Rv:
| Start | Haarlem | H37Rv |
|---|---|---|
| ATG | 59.03% | 60.96% |
| GTG | 35.57% | 33.58% |
| TTG | 5.12% | 4.81% |
| Others | 0.28% | 0.65% |
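Percentages of this kind can be computed by counting the first codon of each predicted coding sequence. The sketch below assumes the coding sequences are given 5' to 3' on the coding strand as plain DNA strings; the function name and the example sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

COMMON_STARTS = ("ATG", "GTG", "TTG")


def start_codon_usage(cds_sequences: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of genes using each type of start codon."""
    counts = Counter()
    for seq in cds_sequences:
        codon = seq[:3].upper()
        # Anything other than the three common starts goes into "Others".
        counts[codon if codon in COMMON_STARTS else "Others"] += 1
    total = sum(counts.values())
    categories = COMMON_STARTS + ("Others",)
    if total == 0:
        return {category: 0.0 for category in categories}
    return {category: 100.0 * counts[category] / total for category in categories}


# Four made-up sequences: two ATG starts, one GTG start, one other start.
usage = start_codon_usage(["ATGAAATAA", "atgcccTGA", "GTGAAATAG", "CTGAAATAA"])
```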
Gene Predictions with Possible Problems
The following table lists the number of M. tuberculosis Haarlem genes with each type of possible problem:
| Problem | Number of genes |
|---|---|
| Genes shorter than 50 amino acids | 2 |
| Internal stop codons or frameshifts | 227 |
| Incomplete 5', 3' or both | 21 |
| Spanning contigs | 43 |
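Two of these checks, short genes and internal stop codons, are simple enough to sketch directly from a coding sequence. The code below is an illustration only: the function name and flag wording are hypothetical, and treating a length that is not a multiple of three as a possible frameshift is an assumption, not necessarily how the counts above were derived.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def flag_problems(cds: str, min_aa: int = 50) -> List[str]:
    """Return a list of possible problems for one coding sequence."""
    seq = cds.upper()
    flags = []
    if len(seq) % 3 != 0:
        # Assumption: a partial trailing codon hints at a frameshift.
        flags.append("length not a multiple of 3")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]  # the terminal stop is expected, so drop it
    if any(codon in STOP_CODONS for codon in codons):
        flags.append("internal stop codon")
    if len(codons) < min_aa:
        flags.append(f"shorter than {min_aa} amino acids")
    return flags


# Made-up sequences, for illustration only.
clean = flag_problems("ATG" + "GCT" * 60 + "TAA")
interrupted = flag_problems("ATG" + "GCT" * 30 + "TAA" + "GCT" * 30 + "TGA")
short = flag_problems("ATGGCTTAA")
```

Checks for incomplete 5' or 3' ends and for genes spanning contigs need assembly coordinates in addition to the sequence, so they are left out of this sketch.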
