Gene Finding for Mycobacterium tuberculosis F11 (finished)
Overview
This document briefly describes how we generated the annotation for the finished genome of M. tuberculosis strain F11. The annotation was produced in three steps:
- Gene locations were predicted by in silico Open Reading Frame (ORF) predictions and ORF mapping from annotated M. tuberculosis genomes, followed by manual review. This process is described in section Gene Prediction.
- Gene names were assigned to predicted genes based on similarity to previously annotated genes. This process is described in section Gene Naming.
- The newly predicted genes were compared with a reference to evaluate accuracy. This process is described in section Comparison with Reference Genes.
Gene Prediction
ORFs were predicted by a combination of in silico ORF predictions and synteny-based ORF mapping from annotated M. tuberculosis genomes to the finished M. tuberculosis F11 genome.
The gene calling was performed in three steps:
In Step 1, we used GeneMark and Glimmer3 to predict in silico ORFs.
GeneMark uses a species-specific inhomogeneous Markov model to calculate the probability that a given segment of the sequence is gene-encoding. GeneMark was developed by Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993, 17, 123-133).
Glimmer3 is the successor to Glimmer2. It uses interpolated Markov models to identify coding regions and distinguish them from noncoding DNA, and is intended mainly for the genomes of bacteria, archaea, and viruses. Glimmer2 was developed by Steven Salzberg's group (Delcher, et al., Nucleic Acids Res., 1999, 27, 4636-4641).
In Step 2, we used a synteny-based approach to map ORFs from the M. tuberculosis H37Rv, CDC1551, F11 (draft) and C genomes to the finished M. tuberculosis F11 genome.
In Step 3, we predicted the final ORFs by comparing the in silico ORFs and the mapped ORFs with hits to Pfam (Finn et al., Nucleic Acids Res., 2006, 34, D247-D251), the top BLAST hits against the non-redundant protein database, and/or unpublished mass spectrometry peptide data kindly provided by Sarah Fortune and Eric Rubin of Harvard Medical School and Michael Chase and David Sarracino of Harvard Partners' Center for Genetics and Genomics. Discrepancies among the different features were resolved by manual review.
This gave us a final set of 3,959 genes for the finished M. tuberculosis F11 genome.
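To make the notion of an in silico ORF from Step 1 concrete, the following is a minimal Python sketch of a naive six-frame ORF scan. It is not GeneMark or Glimmer3, which score candidates with Markov models; it only finds start-to-stop stretches. The function names, the `min_codons` threshold, and the choice of ATG, GTG, and TTG as start codons are assumptions made for illustration.

```python
# Toy ORF scan for illustration only; real gene finders score candidates statistically.
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed start codons for this sketch
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for all six reading frames.

    Coordinates are 0-based and end-exclusive on the scanned strand, so
    minus-strand coordinates refer to the reverse-complemented sequence.
    The first start codon in a frame is kept, giving the longest ORF per stop.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


example_orfs = find_orfs("CCATGAAATTTTAGCC")
```

In practice a length threshold far above three codons would be used; the small default here only keeps the example short.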
Gene Naming
The M. tuberculosis F11 gene product names were assigned using the following protocol:
- If an ORF could be mapped to an ORF in H37Rv (AL123456), the H37Rv ORF product name was used. This named a total of 3,865 genes.
- If an ORF could not be mapped to any H37Rv ORF, we assigned "hypothetical protein" as the product name. This applied to 94 genes.
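The two naming rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout, and the placeholder identifiers and product names are hypothetical and are not taken from the actual annotation pipeline.

```python
def assign_product_names(f11_to_h37rv, h37rv_products):
    """Apply the two naming rules to every F11 ORF.

    f11_to_h37rv: dict mapping each F11 ORF id to its mapped H37Rv ORF id,
                  or None when no mapping exists (hypothetical layout).
    h37rv_products: dict mapping H37Rv ORF ids to product names.
    """
    names = {}
    for f11_orf, h37rv_orf in f11_to_h37rv.items():
        if h37rv_orf is not None and h37rv_orf in h37rv_products:
            # Rule 1: inherit the product name of the mapped H37Rv ORF.
            names[f11_orf] = h37rv_products[h37rv_orf]
        else:
            # Rule 2: no H37Rv counterpart, so use the default name.
            names[f11_orf] = "hypothetical protein"
    return names


# Placeholder identifiers and products, made up for this example.
example_names = assign_product_names(
    {"f11_orf_a": "h37rv_orf_x", "f11_orf_b": None},
    {"h37rv_orf_x": "example product"},
)
```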
Gene Locus Numbers
Every annotated gene in the finished F11 genome is given a Locus Number of the form TBFG_1#### that should be considered the only guaranteed way to identify a gene uniquely. These locus numbers are different from the ones assigned to the genes of the F11 draft assembly, which are of the form TBFG_0####. The mapping between TBFG_1#### (finished) and TBFG_0#### (draft) can be found here.
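As a small illustration, the two identifier forms can be told apart with a pattern match. The sketch below assumes that each `#` stands for exactly one decimal digit; the function and pattern names are hypothetical. It only separates finished from draft identifiers and reads no other meaning into the digits.

```python
import re

# Assumption: each '#' in TBFG_1#### / TBFG_0#### is a single decimal digit.
FINISHED_LOCUS = re.compile(r"TBFG_1\d{4}")
DRAFT_LOCUS = re.compile(r"TBFG_0\d{4}")


def locus_assembly(locus):
    """Return 'finished', 'draft', or None for an unrecognized identifier."""
    if FINISHED_LOCUS.fullmatch(locus):
        return "finished"
    if DRAFT_LOCUS.fullmatch(locus):
        return "draft"
    return None
```

Translating a draft locus into its finished counterpart still requires the published mapping, since the two numbering schemes are assigned independently.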
Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We do not encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
Comparison with Reference Genes
In the absence of a set of M. tuberculosis ORFs with experimentally verified start and stop coordinates, we decided to use the M. tuberculosis H37Rv (AL123456) ORFs ("Ref ORF") for comparison.
The following table summarizes the comparison of the 3,959 genes with the Ref ORFs from H37Rv:
| Category | F11 |
|---|---|
| Total ORFs | 3959 |
| Same lengths | 2848 |
| Shorter in F11 | 407 |
| Longer in F11 | 704 |
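The three length categories in the table sum to the 3,959 total. The Python sketch below shows one way such a classification could be computed from paired ORF lengths and then checks the reported counts. The function name and the input layout are assumptions, and the example lengths are made up.

```python
from collections import Counter


def compare_lengths(length_pairs):
    """Classify (f11_length, ref_length) pairs into the three table categories."""
    counts = Counter()
    for f11_len, ref_len in length_pairs:
        if f11_len == ref_len:
            counts["Same lengths"] += 1
        elif f11_len < ref_len:
            counts["Shorter in F11"] += 1
        else:
            counts["Longer in F11"] += 1
    return counts


# Made-up lengths, one per category.
toy_counts = compare_lengths([(300, 300), (297, 300), (303, 300)])

# Counts as reported in the table above.
reported = {"Same lengths": 2848, "Shorter in F11": 407, "Longer in F11": 704}
total = sum(reported.values())
fractions = {category: n / total for category, n in reported.items()}
```

Here `fractions` gives each category as a share of all compared ORFs, which makes the table easier to compare across strains.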
The following table is a comparison of the different types of start sites in M. tuberculosis strains F11 and H37Rv:
| Start | F11 | H37Rv |
|---|---|---|
| ATG | 59.96% | 60.96% |
| GTG | 34.86% | 33.58% |
| TTG | 5.00% | 4.81% |
| Others | 0.18% | 0.65% |
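Percentages like those in the table can be derived by tallying the first codon of each coding sequence. The following Python sketch shows the idea; the function name and the input format (a list of coding sequences as strings, each beginning at its start codon) are assumptions, and the example sequences are made up.

```python
from collections import Counter

CATEGORIES = ("ATG", "GTG", "TTG", "Others")


def start_codon_usage(cds_sequences):
    """Return the percentage of sequences beginning with each start codon."""
    counts = Counter()
    for seq in cds_sequences:
        codon = seq[:3].upper()
        # Anything other than the three common starts is grouped as "Others".
        counts[codon if codon in CATEGORIES[:3] else "Others"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * counts[category] / total for category in CATEGORIES}


# Made-up sequences for illustration.
example_usage = start_codon_usage(["ATGAAA", "atgccc", "GTGAAA", "CTGAAA"])
```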
To see the Genome Statistics for this assembly, click here.
