Mycoplasma mobile Annotation and Analysis
Proteogenomic Mapping
Proteogenomic mapping by mass spectrometry was performed as previously described (Jaffe et al. 2004a), with the following modifications. Two complete biological repeats of the proteogenomic mapping protocol were performed (i.e., protein was extracted from two separate cultures and processed). For each repeat, 1 mg of total cell protein was fractionated by strong cation exchange (SCX) chromatography. Each of the 80 SCX fractions from each repeat was then separated by reversed-phase chromatography, and tandem mass spectra were collected on LCQ DecaXP Plus mass spectrometers (ThermoFinnigan) configured for nanoelectrospray ionization. The spectra were searched with SEQUEST-PVM (Sadygov et al. 2002), using 61 processors of a computing cluster, against a six-frame in silico translation of the primary DNA sequence determined for M. mobile. Results from the two repeats were pooled to form the final proteogenomic map.
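As an illustration of the six-frame in silico translation used as the search database, the following minimal Python sketch translates a DNA string in all six reading frames. It assumes NCBI translation table 4 (the table named in the gene prediction section below), in which TGA encodes tryptophan rather than a stop. The function names and the handling of ambiguous codons (reported as X) are illustrative assumptions, not the software used in this study.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 4 (Mycoplasma/Spiroplasma), codons in TCAG order.
# It differs from the standard code only in TGA, which encodes Trp (W).
TABLE4_AAS = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), TABLE4_AAS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown bases become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Stop codons (TAA and TAG) are kept as `*` in the output, so a downstream step would still need to split each frame into stop-to-stop segments before building a peptide search database.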
Gene Prediction (automated and manual gene annotation)
Automated gene prediction of the M. mobile genome was performed using the Calhoun annotation system (Galagan et al. 2002, Galagan et al. 2003). GLIMMER (Delcher et al. 1999) was run on the whole genome using translation table 4 to generate an initial set of ORFs. GLIMMER ORFs longer than 200 bp were annotated as genes if they did not overlap with an adjacent ORF by more than 30 bp. These ORFs were refined based on homology to the closest known protein by searching against the GenBank NR database using BLASTX with a threshold of E < 1 × 10^-10 (Altschul et al. 1997). Subsequently, proteogenomic information was incorporated into the annotation to validate the expression of the predicted proteins where possible. Several new ORFs that had proteomic evidence but were not identified in the initial round of automated annotation were added at this stage. Start codon differences between the proteomic and sequence-based models for ORFs were resolved with the aid of RBSFINDER (Suzek et al. 2001) and, when possible, multiple sequence alignments of orthologous genes. Every ORF was then searched with Gapped BLAST against the nonredundant database of protein sequences available at NCBI using a cutoff E-value of 0.001 and sequence composition-based statistics (Altschul et al. 1997; Schaffer et al. 2001). RPS-BLAST searches against the COG database were additionally used to assign COGs to the proteins where possible (Tatusov et al. 2000; Marchler-Bauer et al. 2003). E.C. numbers were assigned where applicable by comparison of the proteins to the KEGG database (Ogata et al. 1999). Results were inspected manually, and final protein annotations were selected. Multiple alignments, when necessary, were performed with CLUSTALX (Thompson et al. 1997). A controlled vocabulary was used to reflect the degree of certainty about a predicted ORF with unknown function.
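The length and overlap criteria applied to the GLIMMER ORFs can be sketched as a simple filter. The Python example below is a minimal illustration, not the Calhoun implementation. It assumes 1-based inclusive coordinates, treats "adjacent" as the immediate neighbors when ORFs are sorted by start position regardless of strand, and excludes every ORF involved in an overlap of more than 30 bp. The Orf type and function names are hypothetical.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive; start <= end


def overlap_bp(a, b):
    """Number of positions shared by two ORFs (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def select_candidate_genes(orfs, min_len=200, max_overlap=30):
    """Keep ORFs longer than min_len bp whose overlap with each
    adjacent ORF is at most max_overlap bp.

    Assumption: neighbors are taken from the full sorted ORF list,
    including ORFs that themselves fail the length criterion.
    """
    ordered = sorted(orfs, key=lambda o: (o.start, o.end))
    kept = []
    for i, orf in enumerate(ordered):
        if orf.end - orf.start + 1 <= min_len:
            continue  # not longer than the length threshold
        neighbors = ordered[max(0, i - 1):i] + ordered[i + 1:i + 2]
        if all(overlap_bp(orf, n) <= max_overlap for n in neighbors):
            kept.append(orf)
    return kept
```

In the workflow described above, ORFs excluded at this stage could still be added later if they were supported by proteomic evidence, so a filter of this kind would only produce the initial candidate set.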
tRNA genes were detected by the tRNAscan-SE program (Lowe and Eddy 1997). rRNA genes were detected by homology search using BLASTN and other tools (Altschul et al. 1997; Wuyts et al. 2004). tmRNA, 4.5S SRP-RNA, and RNase P RNA sequences were also detected via homology search and subsequent comparison to known secondary structure features of these molecules (Brown 1999; Laslett et al. 2002).
References
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Brown, J.W. 1999. The Ribonuclease P Database. Nucleic Acids Res. 27: 314.
Delcher, A.L., Harmon, D., Kasif, S., White, O., and Salzberg, S.L. 1999. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27: 4636-4641.
Galagan, J.E., Nusbaum, C., Roy, A., Endrizzi, M.G., Macdonald, P., FitzHugh, W., Calvo, S., Engels, R., Smirnov, S., Atnoor, D., et al. 2002. The genome of M. acetivorans reveals extensive metabolic and physiological diversity. Genome Res. 12: 532-542.
Galagan, J.E., Calvo, S.E., Borkovich, K.A., Selker, E.U., Read, N.D., Jaffe, D., FitzHugh, W., Ma, L.J., Smirnov, S., Purcell, S., et al. 2003. The genome sequence of the filamentous fungus Neurospora crassa. Nature 422: 859-868.
Jaffe, J.D., Berg, H.C., and Church, G.M. 2004a. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics 4: 59-77.
Laslett, D., Canback, B., and Andersson, S. 2002. BRUCE: A program for the detection of transfer-messenger RNA genes in nucleotide sequences. Nucleic Acids Res. 30: 3449-3453.
Lowe, T.M. and Eddy, S.R. 1997. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25: 955-964.
Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs, A.R., Lanczycki, C.J., et al. 2003. CDD: A curated Entrez database of conserved domain alignments. Nucleic Acids Res. 31: 383-387.
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., and Kanehisa, M. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27: 29-34.
Sadygov, R.G., Eng, J., Durr, E., Saraf, A., McDonald, H., MacCoss, M.J., and Yates III, J.R. 2002. Code developments to improve the efficiency of automated MS/MS spectra interpretation. J. Proteome Res. 1: 211-215.
Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., and Altschul, S.F. 2001. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29: 2994-3005.
Suzek, B.E., Ermolaeva, M.D., Schreiber, M., and Salzberg, S.L. 2001. A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics 17: 1123-1130.
Tatusov, R.L., Galperin, M.Y., Natale, D.A., and Koonin, E.V. 2000. The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28: 33-36.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. 1997. The CLUSTAL_X windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25: 4876-4882.
Wuyts, J., Perriere, G., and Van De Peer, Y. 2004. The European ribosomal RNA database. Nucleic Acids Res. 32 Database issue: D101-D103.
