Gene Finding Methods

Overview

This document describes some of the details of the method used to produce the automated gene calls for the Botrytis cinerea genome. Automated gene calls were produced in a three-step procedure:

  • Gene location and structures were predicted using a combination of FGENESH and GENEID. This process is described in section Gene Structure Prediction.
  • Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.
  • The newly created genes were compared against EST data to evaluate accuracy. This process is described in section Structure Prediction Validation.

Gene Structure Prediction

Gene structures were predicted using a combination of FGENESH and GENEID. FGENESH is a commercial gene prediction program from Softberry, while GENEID, by Enrique Blanco and Roderic Guigó, is available under the GPL. The version of GENEID used for these calls was 1.2a. FGENESH is unversioned.

FGENESH uses a statistical model of gene structure that requires training on each organism for accurate prediction. GENEID is an ab initio gene caller. We used parameter files from the Sclerotinia sclerotiorum analysis to run both FGENESH and GENEID (for details, see SS1 gene structure prediction).

An assessment of the accuracy of FGENESH and GENEID is described below in section Structure Prediction Validation.

Briefly, the results from these two gene callers were combined in the following manner:

  1. Both FGENESH and GENEID were run on the entire genomic sequence to provide an initial set of predicted genes. This resulted in a set containing 13864 FGENESH predictions and 16907 GENEID predictions.
  2. Next, short predictions were filtered out. Any gene prediction shorter than 30 aa (90 nt) was dropped. In addition, any gene prediction shorter than 50 aa (150 nt) was dropped unless it was overlapped by another gene prediction, BLAST evidence, a hmmer hit, or EST evidence. Applying these criteria removed 2373 genes from consideration.
  3. We incorporated 756 manually-annotated genes into the gene set. Any predicted gene that overlapped a manual prediction was dropped, and the manual annotation used instead.
  4. Wherever remaining FGENESH and GENEID predictions had overlapping exons, on identical or opposing strands, we clustered them into separate groups for further analysis. A set of non-overlapping genes was chosen from each cluster by ordering the genes from "best" to "worst", picking the "best" gene, then going down the list and adding any lower-ranked genes that did not overlap ones already selected. Genes were ranked according to the following criteria:

    • A gene was considered to have "good BLAST coverage" if it overlapped at least three BLAST hits from different taxa, each with ≥ 60% average identity and ≥ 80% query coverage.
    • Predictions with good BLAST coverage were chosen above predictions that did not have good BLAST coverage.
    • If two predictions both had good BLAST coverage, we computed the average overlapping BLAST length for each gene; namely, the average length of the three BLAST hits as defined above. (When there were more than three, the average was computed from the three with the highest scores.) When both genes were longer than their respective average overlapping BLAST lengths, the gene closer in length to the average overlapping BLAST length was chosen first.
    • Otherwise, GENEID was preferred to FGENESH.

  5. After sorting and selecting the highest-ranked predictions, the resultant gene set contained 16454 genes.
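
The length filter in step 2 and the ranking and selection in step 4 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Prediction` record and its field names are hypothetical, overlap is simplified to whole-gene spans rather than exons, and the pairwise ranking is one reading of the criteria listed above.

```python
from dataclasses import dataclass
from functools import cmp_to_key
from typing import List, Optional


@dataclass
class Prediction:
    # Hypothetical record; field names are illustrative only.
    name: str
    caller: str                      # "GENEID" or "FGENESH"
    start: int                       # inclusive genomic start
    end: int                         # inclusive genomic end
    length_aa: int
    good_blast: bool = False         # >= 3 qualifying BLAST hits from different taxa
    avg_blast_len_aa: Optional[float] = None
    has_support: bool = False        # overlapped by a prediction, BLAST, hmmer or EST


def passes_length_filter(p: Prediction) -> bool:
    """Step 2: drop < 30 aa always, and < 50 aa unless supported."""
    if p.length_aa < 30:
        return False
    if p.length_aa < 50 and not p.has_support:
        return False
    return True


def overlaps(a: Prediction, b: Prediction) -> bool:
    # Simplification: compare gene spans, not individual exons.
    return a.start <= b.end and b.start <= a.end


def compare(a: Prediction, b: Prediction) -> int:
    """Return a negative number when a ranks above b."""
    if a.good_blast != b.good_blast:
        return -1 if a.good_blast else 1
    if (a.good_blast and b.good_blast
            and a.avg_blast_len_aa is not None
            and b.avg_blast_len_aa is not None):
        excess_a = a.length_aa - a.avg_blast_len_aa
        excess_b = b.length_aa - b.avg_blast_len_aa
        # Only applies when both genes are longer than their BLAST average.
        if excess_a > 0 and excess_b > 0 and excess_a != excess_b:
            return -1 if excess_a < excess_b else 1
    if a.caller != b.caller:
        return -1 if a.caller == "GENEID" else 1
    return 0


def select_from_cluster(cluster: List[Prediction]) -> List[Prediction]:
    """Step 4: take the best gene, then add lower-ranked non-overlapping ones."""
    ranked = sorted(cluster, key=cmp_to_key(compare))
    chosen: List[Prediction] = []
    for p in ranked:
        if all(not overlaps(p, q) for q in chosen):
            chosen.append(p)
    return chosen
```

Because the criteria are pairwise rules, the comparator is not guaranteed to give a strict total order for every possible input; a production implementation would need to settle such ties explicitly.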

Gene Naming

Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently 5 types of gene name that fall into 3 categories:

  1. NAME, or
    hypothetical protein similar to NAME, or
    conserved hypothetical protein

    Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:

    • At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect ≤ 1e-5), with
    • ≥ 80% identity and ≥ 80% coverage of both the query and subject sequence.

    The name will follow one of these three formats:

    • conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
    • NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise:
    • hypothetical protein similar to NAME

    Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest product of average identity and amount of overlap with the target gene.

    In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra whitespace, etc.

  2. Hypothetical protein

    Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:

    • BLASTP hit to NR (complexity filtering off, -F F, expect ≤ 1e-5)

  3. Predicted protein

    Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
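
The three naming categories can be read as a small decision procedure. The sketch below is one interpretation of the rules above, not the actual naming code: the `Hit` record and its fields are hypothetical, Swiss-Prot hits are preferred before the identity × overlap score is applied, and the name clean-up step (removing species names, GIs and parenthetical comments) is omitted.

```python
import re
from dataclasses import dataclass
from typing import List

# Words that mark a subject name as unverified (list taken from the text above).
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


@dataclass
class Hit:
    # Hypothetical BLASTP hit record; field names are illustrative only.
    name: str
    identity: float        # percent identity
    query_cov: float       # percent of the query covered
    subject_cov: float     # percent of the subject covered
    evalue: float
    swissprot: bool = False
    overlap: int = 0       # amount of overlap with the target gene


def assign_name(hits: List[Hit]) -> str:
    significant = [h for h in hits if h.evalue <= 1e-5]
    if not significant:
        return "predicted protein"            # category 3

    excellent = [
        h for h in significant
        if h.identity >= 80 and h.query_cov >= 80 and h.subject_cov >= 80
    ]
    if not excellent:
        return "hypothetical protein"         # category 2

    # Category 1: prefer Swiss-Prot, then the highest identity x overlap.
    best = max(excellent, key=lambda h: (h.swissprot, h.identity * h.overlap))
    words = set(re.findall(r"[a-z]+", best.name.lower()))
    if words & UNVERIFIED_WORDS:
        return "conserved hypothetical protein"
    if best.swissprot:
        return best.name
    return f"hypothetical protein similar to {best.name}"
```

Splitting the subject name on non-letters means that a name such as "kinase-like protein" is treated as containing the word "like", which matches the conservative intent of the rule.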

Gene Locus Numbers

Every annotated gene is given a Locus Number of the form BC1_G##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
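
As a small illustration of treating the locus as an opaque key, the sketch below validates the identifier's form and looks position up in a separate table instead of parsing it from the identifier. It assumes that `#####` stands for exactly five digits; the `Position` record, the `positions` table and its contents are made up for the example.

```python
import re
from typing import Dict, NamedTuple

# Assumption: "#####" in BC1_G##### means exactly five decimal digits.
LOCUS_RE = re.compile(r"BC1_G\d{5}")


class Position(NamedTuple):
    # Hypothetical position record; not part of the locus identifier.
    contig: str
    start: int
    end: int
    strand: str


def is_locus(identifier: str) -> bool:
    """Check only the form of the identifier; the digits carry no meaning."""
    return LOCUS_RE.fullmatch(identifier) is not None


def position_of(locus: str, table: Dict[str, Position]) -> Position:
    """Retrieve position as an attribute stored against the locus."""
    if not is_locus(locus):
        raise ValueError(f"not a locus number: {locus!r}")
    return table[locus]


# Made-up example data.
positions = {"BC1_G00001": Position("contig_1", 1200, 3400, "+")}
```

If a gene moves in a later assembly, only the table entry changes; the locus number stays the same.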

Structure Prediction Validation

The available EST alignments were used to validate the gene structure predictions. To perform this comparison, ESTs were first clustered by combining all overlapping ESTs into a cluster. Each EST cluster was then compared to any overlapping predicted genes. If the gene structure matched the alignment in the area of overlap, the cluster was counted as having no problems. If the alignment had an exon that did not overlap any coding bases of the gene structure, it was considered a "missing exon" error. If a gap in the alignment fully contained an exon of the gene model, it was considered a "wrong exon" error. Partial overlaps were classified as splice junction errors. In cases where multiple overlapping ESTs suggested different gene structures, the EST that most closely matched the gene structure was used.
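
The clustering and error classification described above can be sketched as interval operations. This is a simplified illustration under stated assumptions, not the validation code itself: coordinates are inclusive, an alignment and a gene model are each a sorted list of exon intervals, the caller passes only pairs that overlap, and the step that picks the best-matching EST within a cluster is left out.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # inclusive (start, end)


def cluster_ests(spans: List[Interval]) -> List[Interval]:
    """Merge overlapping EST spans into clusters."""
    clusters: List[Interval] = []
    for start, end in sorted(spans):
        if clusters and start <= clusters[-1][1]:
            clusters[-1] = (clusters[-1][0], max(clusters[-1][1], end))
        else:
            clusters.append((start, end))
    return clusters


def clip(blocks: List[Interval], lo: int, hi: int) -> List[Interval]:
    """Restrict a list of intervals to the region [lo, hi]."""
    return [(max(s, lo), min(e, hi)) for s, e in blocks if s <= hi and e >= lo]


def classify(aln: List[Interval], exons: List[Interval]) -> Set[str]:
    """Return the set of problems; an empty set means 'no problems'."""
    lo = max(aln[0][0], exons[0][0])
    hi = min(aln[-1][1], exons[-1][1])
    if clip(aln, lo, hi) == clip(exons, lo, hi):
        return set()  # structures agree in the area of overlap

    problems: Set[str] = set()

    # Alignment exons that touch no coding exon of the gene model.
    aln_kept = [(s, e) for s, e in aln
                if any(s <= ge and gs <= e for gs, ge in exons)]
    if len(aln_kept) < len(aln):
        problems.add("missing exon")

    # Gene exons that sit entirely inside a gap of the alignment.
    gaps = [(a[1] + 1, b[0] - 1) for a, b in zip(aln, aln[1:])]
    exons_kept = [(gs, ge) for gs, ge in exons
                  if not any(s <= gs and ge <= e for s, e in gaps)]
    if len(exons_kept) < len(exons):
        problems.add("wrong exon")

    # Whatever still disagrees is a partial overlap.
    if clip(aln_kept, lo, hi) != clip(exons_kept, lo, hi):
        problems.add("splice junction")
    return problems
```

Returning a set rather than a single label allows one cluster to carry more than one kind of problem.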

The following table shows the accuracy of the gene calls. About 12% of genes (2011 of 16454) overlap an EST cluster; 81% of those genes, and 82% of the EST clusters that hit a gene, show no problems.

    # of genes                                          16454
    # of EST clusters                                   2345
        single exon                                     1075
        hitting genes on opposite strand                56
        (ignored for stats, believed to be alignment/strand correction problems)
    # of EST clusters minus ignored ones                2288
    # of genes hit by an EST cluster                    2011
    # of EST clusters hitting multiple genes            18
    # of genes hitting multiple EST clusters            81

    EST Cluster
    # of EST clusters with no problems                  1692 (74%)
        not counting missed EST clusters                1692 (82%)
    # of EST clusters not hitting genes                 214 (9%)
    # of EST clusters hit, with problems                382 (18%)
    Problems:
        # of EST clusters with missing exons            35 (2%)
        # of EST clusters with wrong exons              13 (1%)
        # with splice junction problems                 364 (18%)

    Predicted Genes
    # of genes hit by an EST cluster                    2011 (12%)
    # of those genes with no problems                   1628 (81%)
    # of genes hit, with problems                       383 (19%)
    Problems:
        # of genes with missing exons                   35 (2%)
        # of genes with wrong exons                     13 (1%)
        # with splice junction problems                 364 (18%)

Partial Gene Predictions

A file containing the gene predictions that resulted in an incomplete transcript (those missing a start codon or a stop codon, or whose sequence spanned multiple contigs) can be found in the downloads section.

Here is a summary of the partial genes, with the number belonging to each category:

    Description                                                             Count
    Locus accession of genes with coding length not divisible by 3          9
    Locus accession of genes spanning two or more contigs                   444
    Locus accession of genes whose last codon is not a valid stop codon     150
    Locus accession of genes whose first codon is not an ATG start codon    68
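
The four categories in the table correspond to simple checks on a coding sequence. The sketch below shows one way such checks could be written; it assumes the standard genetic code's stop codons (TAA, TAG, TGA), and the function name and the `n_contigs` parameter are hypothetical.

```python
from typing import List

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def partial_reasons(cds: str, n_contigs: int = 1) -> List[str]:
    """List the reasons a coding sequence would count as a partial gene.

    cds       -- coding sequence, 5' to 3', including start and stop codons
    n_contigs -- number of contigs the gene spans (hypothetical input)
    """
    cds = cds.upper()
    reasons: List[str] = []
    if len(cds) % 3 != 0:
        reasons.append("coding length not divisible by 3")
    if n_contigs >= 2:
        reasons.append("spans two or more contigs")
    if cds[-3:] not in STOP_CODONS:
        reasons.append("last codon is not a valid stop codon")
    if cds[:3] != "ATG":
        reasons.append("first codon is not an ATG start codon")
    return reasons
```

A gene can fall into more than one category, so the function returns every reason that applies; an empty list means the transcript looks complete under these checks.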