Gene Finding Methods

Overview

This document briefly summarizes the method used to produce the automated gene calls for the Histoplasma capsulatum genome. Automated gene calls were produced in the following steps:

  • Gene locations and structures were predicted using a combination of GENEWISE, FGENESH and GENEID. This process is described in section Gene Structure Prediction.
  • Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.

Gene Structure Prediction

Gene structures were predicted using a combination of FGENESH, GENEID and GENEWISE. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL. GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.

FGENESH uses a statistical model of gene structure that requires training on each organism for accurate prediction. Softberry performed the training on Aspergillus nidulans sequences. GENEID is an ab initio gene caller and was run with the Histoplasma capsulatum parameter file. Although GENEWISE does use some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species-specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein, assuming that introns start at GT and end at AG. Since we are interested in predicting complete gene structures, we post-processed incomplete GENEWISE protein alignments by extending the first exon upstream to the nearest start codon and the last exon downstream to the nearest stop codon. If a stop codon was encountered upstream of a gene before a start codon could be found, the gene call was not used.
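The post-processing rule described above can be illustrated with a minimal sketch. It assumes a single-exon alignment on the forward strand with half-open coordinates; the function name and the handling of a missing downstream stop codon are assumptions made for illustration, not the pipeline's actual code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def extend_to_start_and_stop(seq, cds_start, cds_end):
    """Extend an in-frame alignment [cds_start, cds_end) to a full ORF.

    Returns (start, end) with the stop codon included, or None when a stop
    codon (or the sequence edge) is reached upstream before a start codon.
    """
    seq = seq.upper()

    # Walk upstream one codon at a time until a start codon is found.
    start = cds_start
    while seq[start:start + 3] != START_CODON:
        start -= 3
        if start < 0 or seq[start:start + 3] in STOP_CODONS:
            return None  # the gene call is not used

    # Walk downstream one codon at a time until a stop codon is found.
    end = cds_end
    while seq[end - 3:end] not in STOP_CODONS:
        end += 3
        if end > len(seq):
            return None  # assumption: no stop before the sequence ends

    return start, end


# The alignment covers only "CCC"; the ORF is ATG AAA CCC GGG TAA.
print(extend_to_start_and_stop("GGATGAAACCCGGGTAACC", 8, 11))
```

A real implementation would also have to handle the reverse strand and spliced genes, where only the outer boundaries of the first and last exons move.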

Briefly, the results from these gene callers were combined in the following manner:

  1. FGENESH and GENEID were run on the entire genomic sequence, using parameter files for Aspergillus nidulans and Histoplasma capsulatum respectively, to provide an initial set of predicted genes. GENEWISE was run on each BLAST hit from NR that aligned to the genome with minimum 80% identity and minimum 80% coverage. This resulted in a set containing 7685 FGENESH predictions, 9561 GENEID predictions, and 286 GENEWISE predictions.
  2. Next, short predictions were filtered out. Any gene prediction shorter than 30aa (90nt) was dropped. In addition, any gene prediction shorter than 50aa (150nt) was dropped unless it was supported by BLAST or HMMER hit evidence.
  3. 511 genes were annotated manually using publicly available ESTs from other species and BLASTX alignments.
  4. Wherever remaining predictions had overlapping exons, on identical or opposing strands, we clustered them into separate groups for further analysis. A set of non-overlapping genes was chosen from each cluster by ordering the genes from "best" to "worst", picking the "best" gene, then going down the list and adding any lower-ranked genes that did not overlap ones already selected. Genes were ranked according to the following criteria:
    • A gene was considered to have "good BLAST coverage" if it overlapped at least three BLAST hits from different taxa, each with minimum 50% average identity and minimum 70% query coverage.
    • Predictions with good BLAST coverage were chosen above predictions that did not have good BLAST coverage.
    • If two predictions both had good BLAST coverage, we computed the average overlapping BLAST length for each gene; namely, the average length of the three BLAST hits as defined above. (When there were more than three, the average was computed from the three with the highest scores.) The gene closer in length to the average overlapping BLAST length was chosen first.
    • Otherwise, GENEID was preferred to FGENESH.
  5. After sorting and selecting the highest-ranked predictions, the resultant gene set contained 9349 genes.
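The length filter in step 2 and the greedy selection in step 4 can be sketched as follows. The Prediction fields, the function names, and the use of the caller preference as a final tie-breaker are assumptions made for illustration; the "good BLAST coverage" flag and the average overlapping BLAST length are taken as precomputed inputs.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    name: str
    caller: str                 # e.g. "GENEID" or "FGENESH"
    exons: list                 # [(start, end), ...], half-open coordinates
    protein_len: int            # length in amino acids
    has_evidence: bool = False  # supported by a BLAST or HMMER hit
    good_blast: bool = False    # meets the "good BLAST coverage" rule
    avg_blast_len: float = 0.0  # average overlapping BLAST length


def passes_length_filter(p):
    """Drop anything under 30aa, and anything under 50aa without evidence."""
    if p.protein_len < 30:
        return False
    if p.protein_len < 50:
        return p.has_evidence
    return True


def rank_key(p):
    """Smaller keys sort first, i.e. are 'better'."""
    length_gap = abs(p.protein_len - p.avg_blast_len) if p.good_blast else 0
    caller_rank = 0 if p.caller == "GENEID" else 1
    return (not p.good_blast, length_gap, caller_rank)


def overlaps(a, b):
    """True if any exon of a overlaps any exon of b (strand is ignored)."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in a.exons for s2, e2 in b.exons)


def select_non_overlapping(cluster):
    """Take the best gene, then add lower-ranked genes that do not overlap."""
    chosen = []
    for p in sorted(cluster, key=rank_key):
        if not any(overlaps(p, q) for q in chosen):
            chosen.append(p)
    return chosen


cluster = [
    Prediction("A", "GENEID", [(0, 100)], 120),
    Prediction("B", "FGENESH", [(50, 150)], 100,
               good_blast=True, avg_blast_len=110),
    Prediction("C", "GENEID", [(150, 200)], 60),
]
kept = [p for p in cluster if passes_length_filter(p)]
print([p.name for p in select_non_overlapping(kept)])  # ['B', 'C']
```

In this toy cluster, B wins because it has good BLAST coverage, A is discarded because it overlaps B, and C is kept because it overlaps nothing already selected.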

Gene Naming

    Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

    We hope to improve the gene naming process in the future based on Gene Ontology categories.

    There are currently 5 types of gene names that fall into 3 categories:

    1. NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein

      Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:

      • At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10), with minimum 50% identity and minimum 70% coverage of both the query and subject sequence.

      The name will follow one of these three formats:

      • conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}; otherwise
      • NAME if the homologous protein is from the curated Swiss-Prot gene set; otherwise
      • hypothetical protein similar to NAME

      Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest product of average identity and amount of overlap with the target gene.

      In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
    2. Hypothetical protein

      Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:

      • BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)

    3. Predicted protein

      Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
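The three naming categories can be read as a small decision procedure, sketched below. The hit dictionary keys, the whole-word matching on alphabetic tokens, and the function name are assumptions made for illustration; every hit passed in is assumed to have already met the BLASTP expect threshold, and name clean-up (species, GIs, parentheticals) is left out.

```python
import re

UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


def assign_name(hits):
    """Pick a gene name from BLASTP hits to NR.

    Each hit is a dict with hypothetical keys: name, identity (%),
    query_cov (%), subject_cov (%), overlap (aa), swissprot (bool).
    """
    if not hits:
        return "predicted protein"

    strong = [h for h in hits
              if h["identity"] >= 50
              and h["query_cov"] >= 70
              and h["subject_cov"] >= 70]
    if not strong:
        return "hypothetical protein"

    # Prefer Swiss-Prot names, then the highest identity x overlap.
    best = max(strong,
               key=lambda h: (h["swissprot"], h["identity"] * h["overlap"]))

    words = set(re.findall(r"[a-z]+", best["name"].lower()))
    if words & UNVERIFIED_WORDS:
        return "conserved hypothetical protein"
    if best["swissprot"]:
        return best["name"]
    return "hypothetical protein similar to " + best["name"]


hit = {"name": "chitin synthase", "identity": 62, "query_cov": 85,
       "subject_cov": 80, "overlap": 700, "swissprot": False}
print(assign_name([hit]))
```

Splitting names on non-letters means that a name such as "kinase-like protein" is treated as unverified, which seems to match the intent of the word list, although the original matching rule is not spelled out.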

    Summary of Automated Gene Naming

    1592 transcripts had non-generic names.

    name                                                      count
    "conserved hypothetical protein"                           2151
    "hypothetical protein"                                       26
    "hypothetical protein similar to reverse transcriptase"      52
    "predicted protein"                                        5580
    hypothetical protein similar to...                          632
    other non-empty name                                        908

    Overview of Query Genes

    9349 genes: 9349 transcripts, 8332 spliced, 1017 unspliced
    32844 exons, 23495 introns

                                    min  median   mean    max
    overall length (incl. UTR)       93    1068   1309  14592
    coding length                    93    1068   1309  14592
    exons per transcript              1       3   3.51     26
    exons per spliced transcript      2       3   3.82     26
    bp per exon                       1     183    372  10566
    bp per intron                    23      97    140   2926

                       len   %cov    %gc    %at
    exonic        12240597  39.99  51.00  48.98
    intronic       3312691  10.82  41.97  55.87
    intergenic    15054327  49.18  43.15  57.34
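The summary figures above are mutually consistent, as a short arithmetic check suggests. The check assumes that each unspliced transcript has exactly one exon, that the exonic and intronic lengths are sums over all exons and introns, and that the bp means in the table were truncated rather than rounded; none of this is stated in the original.

```python
transcripts, spliced, unspliced = 9349, 8332, 1017
exons, introns = 32844, 23495
exonic_bp, intronic_bp, intergenic_bp = 12240597, 3312691, 15054327

# Every transcript is either spliced or unspliced.
assert spliced + unspliced == transcripts

# A transcript with k exons has k - 1 introns.
assert exons - transcripts == introns

# Mean exons per transcript, overall and for spliced transcripts only.
mean_exons = round(exons / transcripts, 2)
mean_exons_spliced = round((exons - unspliced) / spliced, 2)
assert (mean_exons, mean_exons_spliced) == (3.51, 3.82)

# Mean coding length, bp per exon, and bp per intron (truncated).
assert exonic_bp // transcripts == 1309
assert exonic_bp // exons == 372
assert intronic_bp // introns == 140

# Coverage percentages of the three region classes.
total_bp = exonic_bp + intronic_bp + intergenic_bp
coverage = [round(100 * n / total_bp, 2)
            for n in (exonic_bp, intronic_bp, intergenic_bp)]
assert coverage == [39.99, 10.82, 49.18]
```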

    Structure Prediction Validation

    Due to lack of any species-specific EST/mRNA data, we did not perform our standard structure prediction validation and hence we do not have a good measure of the accuracy of our gene predictions for this genome.

    Possible Problems

    gene set          genes
    HC1_GENEWISE_1       51
    HC1_FGENESH_5      2137
    HC1_GENEID_R       6652
    HC1_MANUAL_1        509

    problem                       total  GENEWISE  FGENESH  GENEID  MANUAL
    short proteins < 50aa            72         3       15      54       0   (not tallied in problems)
    shorter proteins < 30aa           0         -        -       -       -
    very short proteins < 10aa        0         -        -       -       -
    exon-less transcripts             0         -        -       -       -
    initial exon ≤ 6bp              187         0       57     104      26
    internal exon ≤ 6bp              29         0       26       0       3
    terminal exon ≤ 6bp             121         0       27      91       3
    intron ≥ 1000bp                   3         0        3       0       0
    coding length not mod 3           9         0        9       0       0
    first codon not START            55         0       11      44       0
    last codon not STOP              39         0       21      18       0
    contains in-frame STOP            0         -        -       -       -
    contains ≥1 N in exon             1         1        0       0       0
    non-canonical splicing            3         0        0       0       3
    overlapping                       0         -        -       -       -
    spanning contigs                369         1      119     246       3
    one or more problems            771         1      253     481      36
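Several of the per-transcript checks tallied above are simple enough to sketch. The function below is an illustration only: its name, its inputs, and the choice to apply the exon-position checks only to multi-exon transcripts are assumptions, and the checks that need genomic context (splice sites, overlaps, contig spanning) are left out.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_problems(cds, exon_lengths, intron_lengths):
    """Return the problem labels that apply to one transcript.

    cds is the spliced coding sequence; exon_lengths and intron_lengths
    are in transcript order.
    """
    cds = cds.upper()
    problems = []

    if len(exon_lengths) > 1:
        if exon_lengths[0] <= 6:
            problems.append("initial exon <= 6bp")
        if any(n <= 6 for n in exon_lengths[1:-1]):
            problems.append("internal exon <= 6bp")
        if exon_lengths[-1] <= 6:
            problems.append("terminal exon <= 6bp")

    if any(n >= 1000 for n in intron_lengths):
        problems.append("intron >= 1000bp")
    if len(cds) % 3 != 0:
        problems.append("coding length not mod 3")
    if cds[:3] != "ATG":
        problems.append("first codon not START")
    if cds[-3:] not in STOP_CODONS:
        problems.append("last codon not STOP")

    # Scan every codon except the last for a premature stop.
    if any(cds[i:i + 3] in STOP_CODONS for i in range(0, len(cds) - 3, 3)):
        problems.append("contains in-frame STOP")
    if "N" in cds:
        problems.append("contains >=1 N in exon")

    return problems


print(find_problems("ATGAAATAA", [9], []))      # []
print(find_problems("ATGTAAAAATAA", [12], []))  # ['contains in-frame STOP']
```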