Gene Finding Methods

Overview

This document explains how we generated 22,658 automated gene models for the Phytophthora infestans genome.

The annotation was created in two steps:

  1. Final gene structures were predicted by combining predictions from GENEID, Orthosearch and SECRETED_GENES. This process is described in Gene Structure Prediction.
  2. Gene names were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in Gene Naming.

Gene Structure Prediction

Gene structures were predicted by combining the output of three gene finding programs: Orthosearch, GENEID and SECRETED_GENES.

GENEID was developed by Enrique Blanco and Roderic Guigo and is available under the GPL. GENEID was trained using 520 EST-based gene models.

Orthosearch is a computational gene prediction algorithm developed at the Broad Institute. It uses a purely comparative approach that predicts genes from DNA alignments between a target genome and one or more informant genomes. Genes are predicted based on putative orthology, taking into account both gene structure and the expected levels of sequence and protein divergence between the target and informant species. As a side effect of gene prediction, Orthosearch identifies the best supporting ortholog for each target prediction in each informant genome. Orthosearch requires informant genomes that are at an intermediate genetic distance from the target genome. In the annotation of P. infestans, we used the public releases of the P. sojae and P. ramorum genomes as informants.

Orthosearch uses a series of heuristic search algorithms to predict N-way orthologous gene transcripts. The scoring metric used during the search is based on structural transcript correctness in each species (assuming canonical splice sites) and on maximizing the conservation of predicted protein products within these constraints. Affine penalties may be applied to the length of introns and exons. For the P. infestans genome, we heavily penalized introns shorter than 40 bases and applied a small penalty to introns longer than 100 bases, based on EST evidence for an average intron length of around 70 bases with a small number of longer introns.
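The intron length penalties described above can be illustrated with a minimal sketch. Only the 40-base and 100-base cutoffs come from the text; the function name, the opening costs and the per-base rates are hypothetical values chosen for illustration, and the actual Orthosearch scoring function may differ.

```python
def intron_length_penalty(length, short_cutoff=40, long_cutoff=100,
                          short_open=10.0, short_rate=1.0,
                          long_open=0.5, long_rate=0.01):
    """Return an affine penalty for an intron of the given length in bases.

    The cutoffs (40 and 100) are taken from the text. The opening costs
    and per-base rates are hypothetical, picked only so that short
    introns are penalized heavily and long introns lightly.
    """
    if length < short_cutoff:
        # Heavy penalty: fixed cost plus a cost per missing base.
        return short_open + short_rate * (short_cutoff - length)
    if length > long_cutoff:
        # Small penalty: fixed cost plus a cost per extra base.
        return long_open + long_rate * (length - long_cutoff)
    # Introns between the two cutoffs are not penalized.
    return 0.0
```

With these illustrative parameters, an intron of the typical 70 bases incurs no penalty, while a 30-base intron is penalized far more than a 500-base one.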

We applied no constraints on exon length.

The search strategy used by Orthosearch is composed of multiple phases. First, two-way synteny maps are constructed between the target genome and each informant. These maps are then combined, and the search space is subdivided into multiple, possibly overlapping, loci, each of which has consistent N-way synteny. Within each locus, a set of seed predictions is generated using a linear-time algorithm that performs a two-way comparison between the target genome and each informant. Seed predictions are generated using a range of values for expected conservation, allowing some genes to be more conserved than others. The two-way seed predictions are then converted to N-way predictions by iteratively projecting each predicted transcript onto all other genomes. The N-way predictions are further refined using hill-climbing optimizations that make small adjustments to the existing predictions (removing introns or exons, selecting alternate start or stop codons, adjusting splice sites, and splitting or joining adjacent predictions). Predictions that are too short (transcript length less than 300 bases) or too poorly conserved are eliminated. The final prediction set is generated by choosing the set of non-overlapping predictions that maximizes the total conservation score across the target genome.
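The final step described above, choosing non-overlapping predictions with the highest total score, is an instance of weighted interval scheduling. The sketch below shows one standard dynamic-programming solution; it is not taken from Orthosearch itself. The function name, the tuple layout and the half-open coordinate convention are assumptions made for illustration.

```python
from bisect import bisect_right


def best_non_overlapping(predictions):
    """Pick non-overlapping predictions that maximize the total score.

    predictions: list of (start, end, score) tuples using half-open
    coordinates [start, end). This layout is a hypothetical convention.
    Returns (total_score, chosen_predictions).
    """
    preds = sorted(predictions, key=lambda p: p[1])  # sort by end
    ends = [p[1] for p in preds]
    n = len(preds)
    best = [0.0] * (n + 1)   # best[i]: best total using the first i predictions
    take = [False] * (n + 1)  # take[i]: whether prediction i is used in best[i]
    prev = [0] * (n + 1)     # prev[i]: how many earlier predictions are compatible

    for i, (start, _end, score) in enumerate(preds, start=1):
        # Count earlier predictions that end at or before this start.
        j = bisect_right(ends, start, 0, i - 1)
        prev[i] = j
        if best[j] + score > best[i - 1]:
            best[i] = best[j] + score
            take[i] = True
        else:
            best[i] = best[i - 1]

    # Walk back through the table to recover the chosen predictions.
    chosen = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(preds[i - 1])
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen
```

Sorting by end coordinate and using a binary search keeps the whole selection at O(n log n) for n candidate predictions.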

SECRETED_GENES predictions are derived from a set of 529 GLIMMER predictions. These predictions were selected based on the following criteria:

  1. presence of a signal peptide at the N-terminus of the protein
  2. positive score of the RXLRdEER HMM
  3. Phytophthora-like codon usage frequency
  4. sequence similarity to one or more identified RXLRdEER effectors

Raw predictions from these three gene prediction programs were combined to produce the final annotation. The Broad automated gene-calling pipeline was run to select the best gene model for each locus. The selection was based on agreement with the best EST and BLAST evidence.

Where multiple predictions overlapped each other and EST evidence, we chose the one most in accord with the EST splice sites. Otherwise, we chose the gene model whose protein length agreed best with that of the best BLAST hit. For loci with no BLAST or EST evidence, the gene model produced by the best-performing gene prediction tool (Orthosearch) was selected. Finally, all automated gene models overlapping either tRNA/rRNA features or transposons were removed from the final gene set. The resulting final gene set contains 22,658 genes.
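The selection order described above can be sketched as a small function. This is an illustration under stated assumptions, not the Broad pipeline itself: the GeneModel class, its fields and the way agreement with EST splice sites is measured (fewest mismatched splice coordinates) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    # Hypothetical container for one candidate model at a locus.
    source: str              # "ORTHOSEARCH", "GENEID" or "SECRETED_GENES"
    splice_sites: frozenset  # genomic coordinates of predicted splice sites
    protein_length: int      # length of the predicted protein


def select_model(models, est_splice_sites=None, blast_hit_length=None):
    """Choose one gene model for a locus, following the order in the text."""
    if not models:
        raise ValueError("no candidate models for this locus")
    if est_splice_sites:
        # 1. EST evidence: fewest splice sites that disagree with the ESTs
        #    (this particular measure of agreement is an assumption).
        return min(models, key=lambda m: len(m.splice_sites ^ est_splice_sites))
    if blast_hit_length is not None:
        # 2. BLAST evidence: protein length closest to the best hit.
        return min(models, key=lambda m: abs(m.protein_length - blast_hit_length))
    # 3. No evidence: prefer the Orthosearch model when one exists.
    for model in models:
        if model.source == "ORTHOSEARCH":
            return model
    return models[0]
```

For example, given a GENEID model and an Orthosearch model at the same locus, the function returns whichever matches the EST splice sites best, and falls back to the Orthosearch model when neither EST nor BLAST evidence is supplied.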

Gene Naming

Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently 5 types of gene name that fall into 3 categories:

  1. NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein. Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
    • At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
    • A minimum of 50% identity and 70% coverage of both the query and subject sequence.

     The name will follow one of these three formats:
    • conserved hypothetical protein, if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}; otherwise
    • NAME, if the homologous protein is from the curated Swiss-Prot gene set; otherwise
    • hypothetical protein similar to NAME

     Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest average identity relative to the amount of overlap with the target gene.

     In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.

  2. Hypothetical protein. Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
    • BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
  3. Predicted protein. Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
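The naming rules above can be sketched as a single function. The thresholds and the list of unverified words come from the text. The function name and the dictionary keys are hypothetical, and the sketch assumes that the hit has already passed the 1e-10 expect cutoff, that its name has already been cleaned, and that a significant hit falling short of the identity and coverage thresholds yields "hypothetical protein".

```python
import re

# Words that mark a database name as unverified (list taken from the text).
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


def assign_name(hit):
    """Return a gene name for the best BLASTP hit, or for no hit (None).

    hit is a dict with the hypothetical keys: name, identity, query_cov,
    subject_cov (all percentages) and is_swissprot (bool).
    """
    if hit is None:
        # No significant BLASTP homology to NR.
        return "predicted protein"
    excellent = (hit["identity"] >= 50
                 and hit["query_cov"] >= 70
                 and hit["subject_cov"] >= 70)
    if not excellent:
        # Significant hit, but below the identity and coverage thresholds.
        return "hypothetical protein"
    words = set(re.findall(r"[a-z]+", hit["name"].lower()))
    if words & UNVERIFIED_WORDS:
        return "conserved hypothetical protein"
    if hit["is_swissprot"]:
        return hit["name"]
    return f"hypothetical protein similar to {hit['name']}"
```

Splitting the name into lowercase words means that a name such as "kinase-like protein" is treated as unverified because it contains the word "like".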

Name counts

Gene Numbering

Every annotated gene is given a Locus Number of the form PITG_#####, which should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene, even across different assemblies. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its locus number.
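As a small illustration of treating the locus number as an opaque key, the sketch below validates the PITG_##### format and looks up position as a separate attribute. It assumes that each # stands for one decimal digit. The lookup table, its contents and the function names are hypothetical.

```python
import re

# Assumes each "#" in PITG_##### stands for exactly one decimal digit.
LOCUS_PATTERN = re.compile(r"PITG_\d{5}")

# Hypothetical lookup table: locus number -> (contig, start, end, strand).
# The entry below is made up for illustration only.
POSITIONS = {
    "PITG_00001": ("contig_1", 1000, 2500, "+"),
}


def is_locus_number(text):
    """Return True if text has the form PITG_ followed by five digits."""
    return LOCUS_PATTERN.fullmatch(text) is not None


def position_of(locus):
    """Retrieve the position by locus number; nothing is parsed from the ID."""
    if not is_locus_number(locus):
        raise ValueError(f"not a locus number: {locus!r}")
    return POSITIONS[locus]
```

Because position lives in the lookup table and not in the identifier, a gene can move between assemblies without its locus number changing.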

Structure Prediction Validation

To evaluate the accuracy of our gene predictions for Phytophthora infestans assembly 1, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.

The Feature Comparisons and Splice Analysis sections report only on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect accuracy to be lower, on the order of 5–10%, in regions that lack supporting EST evidence. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in those two sections and immediately below do not apply evenly to all predicted genes.

Overview of Query Genes

Overview of Reference genes

Feature Comparisons

Splice Analysis

Possible Problems