Gene Finding Methods
Outline
- Gene Finding for Mycobacterium tuberculosis F11 (finished)
- Gene Finding for Mycobacterium tuberculosis C
- Gene Finding for Mycobacterium tuberculosis Haarlem
- Gene Finding at Centers outside the Broad Institute
Gene Finding for Mycobacterium tuberculosis F11 (finished)
Overview
This document provides a brief description of how we generated the annotation for the finished genome of the M. tuberculosis F11 strain. The annotation was produced in three steps:
- Gene locations were predicted by in silico Open Reading Frame (ORF) predictions and ORF mapping from annotated M. tuberculosis genomes, followed by manual review. This process is described in the Gene Prediction section.
- Gene names were assigned to predicted genes based on similarity to previously annotated genes. This process is described in the Gene Naming section.
- The newly predicted genes were compared with a reference to evaluate accuracy. This process is described in the Comparison with Reference Genes section.
Gene Prediction
ORFs were predicted by a combination of in silico ORF predictions and synteny-based ORF mapping from annotated M. tuberculosis genomes to the finished M. tuberculosis F11 genome.
The gene calling was performed in three steps:
In Step 1, we used GeneMark and Glimmer3 to predict ORFs in silico.
GeneMark uses a species-specific inhomogeneous Markov model to calculate the probability that a given segment of the sequence is gene-encoding. GeneMark was developed by Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993, 17, 123-133).
Glimmer3 is an update of Glimmer2, which uses interpolated Markov models to identify coding regions and distinguish them from noncoding DNA, especially in the genomes of bacteria, archaea, and viruses. Glimmer2 was developed by Steven Salzberg's group (Delcher, et al., Nucleic Acids Res., 1999, 27, 4636-4641).
In Step 2, we used a synteny-based approach to map ORFs from the M. tuberculosis H37Rv, CDC1551, F11 (draft) and C genomes to the finished M. tuberculosis F11 genome.
In Step 3, we predicted ORFs by comparing the in silico ORFs and the mapped ORFs with hits to Pfam (Finn et al., Nucleic Acids Res., 2006, 34, D247-D251), the top BLAST hits against the non-redundant protein database, and/or unpublished mass spectrometry peptide data kindly provided by Sarah Fortune and Eric Rubin of Harvard Medical School and Michael Chase and David Sarracino of Harvard Partners' Center for Genetics and Genomics. Discrepancies among the different features were resolved via manual review.
This gave us a final set of 3,959 genes for the finished M. tuberculosis F11 genome.
Gene Naming
The M. tuberculosis F11 gene product names were assigned using the following protocol:
- If an ORF could be mapped to an ORF in H37Rv (AL123456), the H37Rv ORF product name was used. This gave us the names for a total of 3,865 genes.
- If an ORF could not be mapped to any H37Rv ORF, we assigned "hypothetical protein" as the product name. This left 94 genes with "hypothetical protein" as the product name.
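The two naming rules above can be sketched in a few lines of Python. This is only an illustration of the protocol as described, not the pipeline that was actually run; the function name `assign_product_names`, the shape of the mapping dictionary, and all identifiers in the example data are assumptions made up for the sketch.

```python
from typing import Dict, Optional

HYPOTHETICAL = "hypothetical protein"


def assign_product_names(
    orf_to_ref: Dict[str, Optional[str]],
    ref_products: Dict[str, str],
) -> Dict[str, str]:
    """Name each predicted ORF after its mapped reference ORF, if any.

    orf_to_ref maps a predicted ORF id to the reference (H37Rv) ORF id it
    was mapped to, or to None when no mapping exists.
    """
    names = {}
    for orf_id, ref_id in orf_to_ref.items():
        if ref_id is not None and ref_id in ref_products:
            # Rule 1: reuse the product name of the mapped reference ORF.
            names[orf_id] = ref_products[ref_id]
        else:
            # Rule 2: unmapped ORFs become hypothetical proteins.
            names[orf_id] = HYPOTHETICAL
    return names


# Made-up identifiers and product names, for illustration only.
ref_products = {"ref_1": "example product A", "ref_2": "example product B"}
orf_to_ref = {"new_1": "ref_1", "new_2": None, "new_3": "ref_2"}
names = assign_product_names(orf_to_ref, ref_products)
print(names)
```

Counting the entries equal to `HYPOTHETICAL` in the result gives the number of unnamed genes, which for F11 corresponds to the 94 genes reported above.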
Gene Locus Numbers
Every annotated gene in the finished F11 genome is given a Locus Number of the form TBFG_1#### that should be considered the only guaranteed way to identify a gene uniquely. These locus numbers are different from the ones assigned to the genes of the F11 draft assembly, which are of the form TBFG_0####.
Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We do not encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
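As a small illustration of treating the locus number as an opaque key, the sketch below tells finished-assembly loci from draft-assembly loci by their documented form and looks up position as a separate attribute. It assumes that each `#` in the forms above stands for one decimal digit; the helper name `f11_assembly` and the coordinates in the example are made up.

```python
import re

# Assumption: each '#' in TBFG_1#### / TBFG_0#### is a single decimal digit.
FINISHED_LOCUS = re.compile(r"TBFG_1\d{4}")
DRAFT_LOCUS = re.compile(r"TBFG_0\d{4}")


def f11_assembly(locus: str) -> str:
    """Report which F11 assembly a locus number belongs to."""
    if FINISHED_LOCUS.fullmatch(locus):
        return "finished"
    if DRAFT_LOCUS.fullmatch(locus):
        return "draft"
    return "unknown"


# Position is stored as an attribute keyed by the locus, never parsed
# out of the identifier itself. The coordinates here are made up.
gene_positions = {"TBFG_10001": {"start": 1, "stop": 1524, "strand": "+"}}

print(f11_assembly("TBFG_10001"))
print(gene_positions["TBFG_10001"]["start"])
```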
Comparison with Reference Genes
In the absence of a set of M. tuberculosis ORFs with experimentally verified start and stop coordinates, we decided to use the M. tuberculosis H37Rv (AL123456) ORFs ("Ref ORF") for comparison.
The following table summarizes the comparison of the 3,959 genes with the Ref ORFs from H37Rv:
| F11 | |
|---|---|
| Total ORFs | 3959 |
| Same lengths | 2848 |
| Shorter in F11 | 407 |
| Longer in F11 | 704 |
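A length comparison of this kind can be sketched as follows. The code assumes each predicted ORF has already been paired with its reference ORF and that only the two lengths are compared; the function names and the example lengths are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def compare_length(pred_len: int, ref_len: int) -> str:
    """Classify a predicted ORF relative to its reference ORF by length."""
    if pred_len == ref_len:
        return "same"
    return "shorter" if pred_len < ref_len else "longer"


def summarize(pairs: Iterable[Tuple[int, int]]) -> Dict[str, int]:
    """Count how many predicted ORFs fall into each length category."""
    counts = {"same": 0, "shorter": 0, "longer": 0}
    for pred_len, ref_len in pairs:
        counts[compare_length(pred_len, ref_len)] += 1
    return counts


# Made-up (predicted length, reference length) pairs, in base pairs.
example_pairs = [(900, 900), (300, 450), (1200, 1100)]
print(summarize(example_pairs))

# The three categories reported in the table account for every F11 gene.
assert 2848 + 407 + 704 == 3959
```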
The following table is a comparison of the different types of start sites in M. tuberculosis strains F11 and H37Rv:
| Start | F11 | H37Rv |
|---|---|---|
| ATG | 59.96% | 60.96% |
| GTG | 34.86% | 33.58% |
| TTG | 5.00% | 4.81% |
| Others | 0.18% | 0.65% |
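Percentages like these can be derived by tallying the first codon of every coding sequence. The sketch below assumes the sequences are given 5' to 3' on the coding strand; the function name and the four example sequences are made up.

```python
from collections import Counter
from typing import Dict, Iterable

CATEGORIES = ("ATG", "GTG", "TTG", "Others")


def start_codon_usage(cds_sequences: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of genes starting with each start codon."""
    counts = Counter()
    for seq in cds_sequences:
        codon = seq[:3].upper()
        # Any first codon other than ATG, GTG or TTG is pooled as "Others".
        counts[codon if codon in CATEGORIES[:3] else "Others"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no coding sequences given")
    return {c: round(100.0 * counts[c] / total, 2) for c in CATEGORIES}


# Made-up coding sequences, for illustration only.
usage = start_codon_usage(["ATGAAATAA", "GTGCCCTGA", "ATGTTTTAG", "CTGAAATGA"])
print(usage)
```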
Gene Finding for Mycobacterium tuberculosis C
Overview
This document provides a brief description of how we generated the annotation for the genome of the M. tuberculosis C strain. The annotation was produced in three steps:
- Gene locations were predicted by mapping M. tuberculosis F11 and H37Rv ORFs to the M. tuberculosis C genome assembly, followed by manual review. This process is described in the Gene Prediction section.
- Gene names were assigned to predicted genes based on homology to previously annotated genes. This process is described in the Gene Naming section.
- The newly predicted genes were compared with a reference to evaluate accuracy. This process is described in the Comparison with Reference Genes section.
Gene Prediction
Gene locations were predicted by synteny-based mapping of M. tuberculosis F11 ORFs and H37Rv GenBank (AL123456) ORFs to the M. tuberculosis C genome. Discrepancies were resolved by manual review, using GeneMark and Glimmer2 predictions and other relevant information (see below).
GeneMark uses a species-specific inhomogeneous Markov model to calculate the probability that a given segment of the sequence is gene-encoding. GeneMark was developed by Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993, 17, 123-133).
Glimmer2 uses interpolated Markov models to identify the coding regions and distinguish them from noncoding DNA, especially for the genomes of bacteria, archaea, and viruses. Glimmer2 was developed by Steven Salzberg's group (Delcher, et al., Nucleic Acids Res., 1999, 27, 4636-4641).
The gene calling was performed in two steps:
In Step 1, we used a synteny-based approach to map ORFs from M. tuberculosis strains H37Rv (AL123456) and F11 (AAIX01000000) onto the M. tuberculosis C genome, and flagged loci where the two mappings differed.
In Step 2, the flagged loci were manually reviewed and, where appropriate, gene coordinates were adjusted using a combination of in silico Open Reading Frame (ORF) predictions by GeneMark and Glimmer2, mapped ORFs from M. tuberculosis strains H37Rv, CDC1551 and F11, BLAST hits against the non-redundant protein database, promer alignments with related genomes (Delcher, et al., Nucleic Acids Res., 2002, 30, 2478-2483), and/or unpublished mass spectrometry peptide data kindly provided by Sarah Fortune and Eric Rubin of Harvard Medical School and Michael Chase and David Sarracino of Harvard Partners' Center for Genetics and Genomics.
This gave us a final set of 3851 genes for the M. tuberculosis strain C genome.
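The flagging in Step 1 can be pictured as a comparison of two sets of mapped coordinates. The sketch below is one possible way to do it, not the procedure actually used: it assumes that ORFs from the two sources are paired by shared stop coordinate and strand, and that any locus seen in only one source, or with a different start, is flagged. The `Orf` type, the function name, and the coordinates are made up.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Orf(NamedTuple):
    start: int   # first base of the start codon on the target genome
    stop: int    # last base of the stop codon on the target genome
    strand: str  # "+" or "-"


def flag_differences(
    from_h37rv: Iterable[Orf], from_f11: Iterable[Orf]
) -> List[Tuple[int, str]]:
    """Return (stop, strand) keys where the two mappings disagree."""
    # Assumption: two ORFs sharing a stop coordinate and strand are the
    # same locus, possibly with different start sites.
    a = {(o.stop, o.strand): o for o in from_h37rv}
    b = {(o.stop, o.strand): o for o in from_f11}
    return [key for key in sorted(a.keys() | b.keys()) if a.get(key) != b.get(key)]


# Made-up mapped coordinates, for illustration only.
h37rv_mapped = [Orf(100, 400, "+"), Orf(700, 1000, "+")]
f11_mapped = [Orf(100, 400, "+"), Orf(640, 1000, "+"), Orf(1500, 1200, "-")]
flagged = flag_differences(h37rv_mapped, f11_mapped)
print(flagged)
```

In this example the first locus agrees in both mappings, the second differs in its start site, and the third is present in only one mapping, so the last two would go to manual review.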
Gene Naming
The M. tuberculosis C gene product names were assigned using the following protocol:
- If an ORF could be mapped to an ORF in H37Rv (AL123456), the H37Rv ORF product name was used. This gave us the names for a total of 3794 genes.
- If an ORF could not be mapped to any H37Rv ORF, we assigned "hypothetical protein" as the product name. This left 57 genes with "hypothetical protein" as the product name.
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form TBCG_##### that should be considered the only guaranteed way to identify a gene uniquely. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We do not encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
Comparison with Reference Genes
In the absence of a set of M. tuberculosis ORFs with experimentally verified start and stop coordinates, we decided to use the mapped M. tuberculosis H37Rv (AL123456) ORFs ("Ref ORF") for comparison.
The following table summarizes the comparison of the 3851 genes with the Ref ORFs mapped from H37Rv:
| | Total | Longer in C | Shorter in C | Same Length |
|---|---|---|---|---|
| Match both start and stop | 2829 | - | - | - |
| Match Start only | 145 | 61 | 80 | 4 |
| Match Stop only | 676 | 532 | 140 | 4 |
| Both ends inside Ref ORF | 130 | - | - | - |
| Both ends extend Ref ORF | 26 | - | - | - |
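The row categories in this table can be expressed as a simple classification of a predicted ORF against its Ref ORF. The sketch below assumes both ORFs are on the same strand and already in the same coordinate system, with each ORF given as a (start, stop) pair where start is the 5' end; the function name, the extra "other" category, and the example coordinates are assumptions.

```python
from typing import Tuple

Coords = Tuple[int, int]  # (start, stop), start being the 5' end


def classify(pred: Coords, ref: Coords) -> str:
    """Assign a predicted ORF to one of the table's row categories."""
    p_start, p_stop = pred
    r_start, r_stop = ref
    if p_start == r_start and p_stop == r_stop:
        return "Match both start and stop"
    if p_start == r_start:
        return "Match Start only"
    if p_stop == r_stop:
        return "Match Stop only"
    # Normalize to (low, high) so the tests below work on either strand.
    p_lo, p_hi = sorted(pred)
    r_lo, r_hi = sorted(ref)
    if r_lo < p_lo and p_hi < r_hi:
        return "Both ends inside Ref ORF"
    if p_lo < r_lo and r_hi < p_hi:
        return "Both ends extend Ref ORF"
    return "other"  # partial overlaps, not a row in the table


# Made-up coordinates, for illustration only.
print(classify((100, 400), (100, 400)))
print(classify((1000, 500), (1000, 440)))  # minus strand, same start

# The length breakdown in each partial-match row sums to the row total.
assert 61 + 80 + 4 == 145
assert 532 + 140 + 4 == 676
```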
The following table is a comparison of the different types of start sites in M. tuberculosis strains C and H37Rv:
| Start | C | H37Rv |
|---|---|---|
| ATG | 57.68% | 60.96% |
| GTG | 35.12% | 33.58% |
| TTG | 4.70% | 4.81% |
| Others | 2.49% | 0.65% |
Gene Predictions with Possible Problems
The following M. tuberculosis C genes have one or more problems.
| Problem | Genes |
|---|---|
| Genes shorter than 50 amino acids | 10 |
| Internal stop codons or frame-shift | 319 |
| Incomplete 5', 3' or both | 79 |
| Spanning contigs | 69 |
Gene Finding for Mycobacterium tuberculosis Haarlem
Overview
This document provides a brief description of how we generated the annotation for the genome of the M. tuberculosis Haarlem strain. The annotation was produced in three steps:
- Gene locations were predicted by Open Reading Frame (ORF) mapping from annotated M. tuberculosis genomes to the M. tuberculosis Haarlem genome, followed by manual review. This process is described in the Gene Prediction section.
- Gene names were assigned to predicted genes based on homology to previously annotated genes. This process is described in the Gene Naming section.
- The newly predicted genes were compared with a reference to evaluate accuracy. This process is described in the Comparison with Reference Genes section.
Gene Prediction
Gene locations were predicted by combining in silico Open Reading Frame (ORF) predictions and synteny-based ORF mapping from annotated M. tuberculosis genomes to the M. tuberculosis Haarlem genome.
The gene calling was performed in three steps:
In Step 1, we used GeneMark and Glimmer3 to predict ORFs in silico.
GeneMark uses a species-specific inhomogeneous Markov model to calculate the probability that a given segment of the sequence is gene-encoding. GeneMark was developed by Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993, 17, 123-133).
Glimmer3 is an update of Glimmer2, which uses interpolated Markov models to identify coding regions and distinguish them from noncoding DNA, especially in the genomes of bacteria, archaea, and viruses. Glimmer2 was developed by Steven Salzberg's group (Delcher, et al., Nucleic Acids Res., 1999, 27, 4636-4641).
In Step 2, we used a synteny-based approach to map ORFs from M. tuberculosis H37Rv, CDC1551, F11 and C genomes to the M. tuberculosis Haarlem genome.
In Step 3, we predicted ORFs by comparing the in silico ORFs and the mapped ORFs with hits to Pfam (Finn et al., Nucleic Acids Res., 2006, 34, D247-D251), the top BLAST hits against the non-redundant protein database, and/or unpublished mass spectrometry peptide data kindly provided by Sarah Fortune and Eric Rubin of Harvard Medical School and Michael Chase and David Sarracino of Harvard Partners' Center for Genetics and Genomics. Discrepancies among the different features were resolved via manual review.
This gave us a final set of 3,866 genes for the M. tuberculosis strain Haarlem genome.
Gene Naming
The M. tuberculosis Haarlem gene product names were assigned using the following protocol:
- If an ORF could be mapped to an ORF in H37Rv (AL123456), the H37Rv ORF product name was used. This gave us the names for a total of 3,814 genes.
- If an ORF could not be mapped to any H37Rv ORF, we assigned "hypothetical protein" as the product name. This left 52 genes with "hypothetical protein" as the product name.
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form TBHG_##### that should be considered the only guaranteed way to identify a gene uniquely. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We do not encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
Comparison with Reference Genes
In the absence of a set of M. tuberculosis ORFs with experimentally verified start and stop coordinates, we decided to use the mapped M. tuberculosis H37Rv (AL123456) ORFs ("Ref ORF") for comparison.
The following table summarizes the comparison of the 3,866 genes with the Ref ORFs mapped from H37Rv:
| Haarlem | |
|---|---|
| Total ORFs | 3866 |
| Same lengths | 3241 |
| Shorter in Haarlem | 191 |
| Longer in Haarlem | 378 |
The following table is a comparison of the different types of start sites in M. tuberculosis strains Haarlem and H37Rv:
| Start | Haarlem | H37Rv |
|---|---|---|
| ATG | 59.03% | 60.96% |
| GTG | 35.57% | 33.58% |
| TTG | 5.12% | 4.81% |
| Others | 0.28% | 0.65% |
Gene Predictions with Possible Problems
The following M. tuberculosis Haarlem genes have one or more problems.
| Problem | Genes |
|---|---|
| Genes shorter than 50 amino acids | 2 |
| Internal stop codons or frame-shift | 227 |
| Incomplete 5', 3' or both | 21 |
| Spanning contigs | 43 |
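Three of these categories can be approximated from the coding sequence alone. The sketch below is a rough heuristic under stated assumptions, not the checks that produced the counts above: it treats a length that is not a multiple of three as a possible frame-shift, and a missing start or stop codon as an incomplete 5' or 3' end. Genes spanning contigs cannot be detected from the sequence and are not covered. The function name and example sequences are made up.

```python
from typing import List

START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gene_problems(cds: str, min_aa: int = 50) -> List[str]:
    """List possible problems for one coding sequence (5' to 3')."""
    seq = cds.upper()
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    has_stop = bool(codons) and codons[-1] in STOP_CODONS
    n_aa = len(codons) - (1 if has_stop else 0)

    problems = []
    if n_aa < min_aa:
        problems.append("shorter than 50 amino acids")
    # Assumption: a length not divisible by 3 hints at a frame-shift.
    if len(seq) % 3 != 0 or any(c in STOP_CODONS for c in codons[:-1]):
        problems.append("internal stop codon or frame-shift")
    # Assumption: a missing start or stop codon means a truncated end.
    if not codons or codons[0] not in START_CODONS or not has_stop:
        problems.append("incomplete 5', 3' or both")
    return problems


# Made-up sequences, for illustration only.
ok_gene = "ATG" + "AAA" * 60 + "TAA"
short_gene = "ATG" + "AAA" * 5 + "TAA"
broken_gene = "ATG" + "AAA" * 30 + "TAG" + "AAA" * 30 + "TAA"
for gene in (ok_gene, short_gene, broken_gene):
    print(gene_problems(gene))
```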
Gene Finding at Centers outside the Broad Institute
Please refer to the Centers' specific websites for a description of how they carry out their gene finding: