Gene Finding Methods

Overview

This document explains how the automated gene models were produced for the Puccinia genome.

The annotation was created in two steps:

  1. Final gene structures were predicted by combining predictions from GENEID, AUGUSTUS, FGENESH, EST_GENES and manual annotation. This process is described in the Gene Structure Prediction section.
  2. Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in the Gene Naming section.

Gene Structure Prediction

Gene structures were predicted using a combination of manual annotation, FGENESH, GENEID, AUGUSTUS and EST-based gene models called FindORFs. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL. AUGUSTUS, by Mario Stanke, is freely available at http://augustus.gobics.de.

Where multiple predictions overlapped each other and EST evidence, we chose the one most in accord with the EST splice sites. Where EST coverage was sufficient to build a complete ORF, we replaced any overlapping predicted gene model with the ORF predicted purely from ESTs.
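
The selection rule above can be illustrated with a minimal sketch. It assumes each prediction is a list of (start, end) exon coordinates and that EST evidence has been reduced to a set of intron boundaries. The function names and the scoring (most EST-supported introns first, fewest unsupported introns as a tie-break) are hypothetical and are not taken from the actual gene caller.

```python
def intron_set(exons):
    """Return the (donor_end, acceptor_start) pairs implied by a list of exons."""
    ordered = sorted(exons)
    return {(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)}


def pick_best_prediction(predictions, est_introns):
    """Return the name of the prediction whose introns best match the EST introns.

    predictions: dict mapping a predictor name to a list of (start, end) exons.
    est_introns: set of (donor_end, acceptor_start) pairs seen in EST alignments.
    """
    def score(name):
        model = intron_set(predictions[name])
        supported = len(model & est_introns)
        unsupported = len(model - est_introns)
        # Hypothetical scoring: more supported introns first,
        # fewer unsupported introns as a tie-break.
        return (supported, -unsupported)

    return max(predictions, key=score)


# Made-up coordinates for three overlapping predictions at one locus.
predictions = {
    "GENEID": [(1, 100), (200, 300)],
    "AUGUSTUS": [(1, 100), (210, 300)],
    "FGENESH": [(1, 300)],
}
est_introns = {(100, 210)}
best = pick_best_prediction(predictions, est_introns)
```

Here only the AUGUSTUS model has an intron that matches the EST-derived boundary, so it is the one selected.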

FindORFs gene models are those clustered EST gene models with a valid start and stop codon. They were built as follows. First, ESTs were aligned to the genome and grouped into loci consisting of overlapping ESTs. Then, each locus was examined for compatible splicing. If two ESTs in the same locus had identical splice sites where they overlapped, they were considered fragments of a larger transcript. Putative transcripts were built out incrementally by adding further ESTs to either end. Each putative transcript is built from one or more ESTs, but may not represent the full biological transcript if the EST coverage is incomplete.

We searched each putative transcript for ORFs beginning with ATG and ending with a stop codon, with no frame shifts. If a putative transcript contained an ORF longer than 180 bases that covered 1/3 or more of its spliced length, we considered it a valid gene prediction. Further, from these EST-based FindORFs transcripts we selected only the putative full-length gene models that fully overlap the best ab initio gene prediction. An independent BLAST-based analysis was also carried out to validate both the length and the correctness of the reading frame by comparing the models to their best hits among known proteins in the NR database.
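
The ORF test described above can be sketched as follows. This is an illustration only, not the FindORFs code: it assumes the putative transcript is already a spliced, forward-strand DNA string, that the standard stop codons apply, and that ORF length is counted from the ATG through the stop codon. The function and constant names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_ORF_LENGTH = 180  # an ORF must be longer than this many bases


def find_orfs(transcript):
    """Return (start, end) for each ATG-to-stop ORF in the three forward frames.

    Coordinates are 0-based and half-open; the stop codon is included.
    """
    seq = transcript.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop opens an ORF
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3))
                start = None
    return orfs


def is_valid_gene_prediction(transcript):
    """True if some ORF is longer than 180 bases and covers >= 1/3 of the transcript."""
    total = len(transcript)
    return any(
        (end - start) > MIN_ORF_LENGTH and 3 * (end - start) >= total
        for start, end in find_orfs(transcript)
    )


# A made-up 186-base ORF: ATG, 60 alanine codons, then a stop codon.
orf = "ATG" + "GCT" * 60 + "TAA"
standalone = is_valid_gene_prediction(orf)           # ORF is the whole transcript
diluted = is_valid_gene_prediction(orf + "C" * 400)  # ORF is under 1/3 of 586 bases
```

Reading each frame in steps of three bases means an ORF reported here cannot contain a frame shift, which matches the requirement stated in the text.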

In the final step, we ran our automated gene caller to select the best gene model for each locus. Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence or BLAST evidence. Automated gene models overlapping tRNA/RNA and repeat features were removed from the final gene set. The resulting final gene set contains 20,567 genes.
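
The removal of gene models that overlap RNA or repeat features can be pictured as a simple interval filter. The sketch below assumes features are (contig, start, end) tuples with half-open coordinates and that any overlap of one base or more disqualifies a model; the source does not state an overlap threshold, so that rule, the function names and the data are all assumptions.

```python
def overlaps(a, b):
    """True if two (contig, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def remove_masked_models(gene_models, masked_features):
    """Drop every gene model that overlaps any RNA or repeat feature.

    gene_models: dict mapping a gene identifier to a (contig, start, end) span.
    masked_features: list of (contig, start, end) spans.
    """
    return {
        gene_id: span
        for gene_id, span in gene_models.items()
        if not any(overlaps(span, feature) for feature in masked_features)
    }


# Made-up spans: only g1 overlaps the feature on contig c1.
gene_models = {
    "g1": ("c1", 100, 500),
    "g2": ("c1", 600, 900),
    "g3": ("c2", 100, 500),
}
masked_features = [("c1", 450, 470)]
kept = remove_masked_models(gene_models, masked_features)
```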

Gene Naming

The Puccinia gene product names were assigned based on comparison with the top BLAST hits against the non-redundant (NR) protein database.

Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently 5 types of gene name that fall into 3 categories:

  1. NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein.
    Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
    • At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
    • A minimum of 50% identity and 70% coverage of both the query and subject sequence.
    The name will follow one of these three formats:
    • Conserved hypothetical protein
      Assigned to gene predictions with at least one BLASTP hit exceeding 50% identity and 70% coverage, where the names of all such hits contain a word (fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed) indicating that the name is unreliable.
    • Hypothetical protein similar to NAME
      Assigned to gene predictions with at least one BLASTP hit, exceeding 50% identity and 70% coverage, that does not contain any flagged words. If there are multiple such hits, the name is derived from the hit with the highest score.
    • NAME
      Assigned to gene predictions with at least one BLASTP hit to a Swiss-Prot curated protein, exceeding 50% identity and 70% coverage.

    Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest average identity relative to the amount of overlap with the target gene. In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.

  2. Hypothetical protein.
    Assigned to gene predictions that touch at least one BLASTP and/or EST alignment, but for which no BLAST hit exceeds 50% identity and 70% coverage.
  3. Predicted protein.
    Assigned to gene predictions that show no significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the BLASTP analysis was performed on the gene set, and touch no EST alignments.
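
The naming categories above amount to a decision cascade, which the following sketch walks through. It is not the production naming code. It assumes the BLASTP hits have already been filtered at the stated expect threshold, reads "exceeding" as "at least", applies the coverage threshold to both query and subject, and picks the highest-scoring hit whenever several qualify. The Hit type, its fields and the function names are hypothetical, and the cleanup of species names and GIs is left out.

```python
import re
from typing import NamedTuple

FLAGGED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


class Hit(NamedTuple):
    name: str
    identity: float          # percent identity
    query_coverage: float    # percent of the query covered
    subject_coverage: float  # percent of the subject covered
    score: float
    swissprot: bool = False


def is_strong(hit):
    """True if the hit meets the 50% identity and 70% coverage thresholds."""
    return (hit.identity >= 50
            and hit.query_coverage >= 70
            and hit.subject_coverage >= 70)


def is_flagged(name):
    """True if the hit name contains a word marking it as unreliable."""
    return any(word in FLAGGED_WORDS for word in re.findall(r"[a-z]+", name.lower()))


def assign_name(hits, touches_est=False):
    """Return a gene product name following the categories described above."""
    strong = [h for h in hits if is_strong(h)]
    curated = [h for h in strong if h.swissprot]
    if curated:
        return max(curated, key=lambda h: h.score).name
    reliable = [h for h in strong if not is_flagged(h.name)]
    if reliable:
        best = max(reliable, key=lambda h: h.score)
        return "hypothetical protein similar to " + best.name
    if strong:
        return "conserved hypothetical protein"
    if hits or touches_est:
        return "hypothetical protein"
    return "predicted protein"


# Made-up hits illustrating two of the outcomes.
example_hits = [
    Hit("putative kinase", 80, 90, 90, 300),
    Hit("chitin synthase", 60, 80, 75, 250),
]
example_name = assign_name(example_hits)
no_evidence_name = assign_name([])
```

In this example the higher-scoring hit carries a flagged word, so the name is taken from the unflagged hit instead.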

Gene Locus Numbers

Every annotated gene is given a Locus Number of the form PGTG_##### or PGTB_#####, which should be considered the only guaranteed way to identify a gene uniquely. Locus Numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We do not encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its Locus Number.
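
A short sketch of how such identifiers might be handled in practice: treat the Locus Number as an opaque key, and look up attributes such as position in a separate table. The pattern assumes that each # stands for exactly one digit, and the position table and its contents are made up for illustration.

```python
import re

# Assumes ##### means exactly five digits.
LOCUS_PATTERN = re.compile(r"PGT[GB]_\d{5}")


def is_locus_number(text):
    """True if the text has the form PGTG_##### or PGTB_#####."""
    return LOCUS_PATTERN.fullmatch(text) is not None


# Hypothetical lookup table: position is stored against the identifier,
# never parsed out of it.
positions = {
    "PGTG_00001": ("supercontig_1", 1200, 3400),
}

locus = "PGTG_00001"
position = positions[locus] if is_locus_number(locus) else None
```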