Gene Finding Methods

Outline

Overview

This document describes how we created gene models for the fourth annotated gene set for Neurospora crassa. Automated gene calls were created in a four-step procedure:

  • Gene location and structures were predicted by combining automated gene predictions, BLAST hits, EST alignments and manual annotation. This process is described in section Gene Structure Prediction.
  • Gene names were assigned to predicted gene structures based on homology to genes from related species. This process is described in section Gene Naming.
  • Names, symbols, ontology terms and other tags were assigned from the ongoing community annotation project. Details of this project may be found here.
  • The newly created genes were compared against those in the previous version. This process is described in section Gene Correspondence.
  • Gene Structure Prediction

    Gene structures were predicted using a combination of manual annotations, ab initio predictions (AUGUSTUS, GeneMark.hmm, FGENESH, GENEID and SNAP), homology-based predictions (GENEWISE and blastx) and EST based gene structures made by CallReference and PASA.

    FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigó, is available under the GPL. GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center. We used GENEID 1.2 and GENEWISE 2.2. FGENESH is not versioned. AUGUSTUS is developed and maintained by Mario Stanke, and we used version 2.3. For GeneMark.hmm we used version 3.0.

    Both FGENESH and GENEID are classified as ab initio predictors; they predict genes from genomic sequence using statistical models of gene structure. Their models must first be trained against a set of trusted genes to ensure accurate prediction. Softberry trained FGENESH using proprietary methods and generated a parameter file for us. (We used the identical FGENESH parameter file for the version 2 gene set.) GENEID was trained using a set of 515 manual annotations on a previous assembly. The parameter file is available here. (Note, we changed the intron probabilities slightly from the parameters given by this link).

    GENEWISE is a homology-based predictor of gene structure. As we ran it, it splices and aligns an input protein sequence to genomic sequence. When the resultant spliced alignment forms a valid gene model, we treat it as such. While GENEWISE does accept some parameters that vary from species to species (most notably for intron nucleotide statistics and splice site consensus sequences) these can be set to generic default values which are non–species-specific. In this default case, GENEWISE essentially produces the best local alignment of a protein assuming that introns tend to start at GT and end at AG. Since we are interested in predicting complete gene structures, we post-process incomplete GENEWISE protein alignments by moving the first and last exon upstream or downstream to the closest possible start and stop codons respectively. If a stop codon was encountered upstream of a gene before a start could be found, the gene call was not used.

    Briefly, the gene predictions were combined as follows:

    1. All five ab initio predictors were run on the entire genomic sequence to provide an initial set of predicted genes. This resulted in a set containing 10,058 FGENESH predictions, 12,358 GENEID predictions, 9,639 GeneMark predictions, 8,800 AUGUSTUS predictions and 8574 SNAP predictions.
    2. Next, GENEWISE was run on each BLAST hit from NR (after removing hits from N. crassa) that aligned to the genome with ≥ 80% identity and ≥ 80% similarity. This created 6,867 GENEWISE predictions. Because BLAST hits often overlap, the GENEWISE predictions fell into just 513 loci.
    3. 1265 genes were manually annotated from EST alignments and BLAST hits.
    4. Next, short predictions were filtered out. Any gene prediction less than 30aa (90nt) long was dropped. In addition, any FGENESH or GENEWISE gene prediction less than 50aa (150nt) long was dropped, unless it was overlapped by BLAST evidence, a hmmer hit, or EST evidence. All GENEID predictions, regardless of length, were dropped unless they had overlapping evidence.
    5. Wherever remaining predictions had overlapping exons, on identical or opposing strands, we clustered them into separate groups for further filtering. A set of non-overlapping transcripts was chosen from each cluster by ordering the genes from "best" to "worst", picking the "best" gene, then going down the list and adding any lower-ranked transcripts that did not overlap ones already selected. Transcripts were ranked according to the following criteria:

      • Intron lengths in fungi are much shorter than those in other eukaryotes. Even after training, FGENESH and GENEID occasionally predict genes that have introns that are much too long for fungi. If a transcript has an intron longer than 1000bp, it will not be chosen if there is a competing gene model with more reasonable intron lengths.
      • Where EST evidence is available, ESTs are linked together to form partial splice models. Each transcript is compared to these EST transcripts, and its splice junctions are scored. Each splice junction may be a true positive, a false positive (the prediction has a splice site where the EST evidence does not), or a false negative (the EST evidence is spliced where the prediction is not). Transcripts with fewer false predictions are chosen above those with more. Those with more true positives are chosen over those with less. Splice scores are adjusted to account for alignment errors in ESTs as well as potential instances of alternate splicing.
      • A transcript was considered to have "good BLAST coverage" if it overlapped one or more BLAST hits, each with ≥ 50% average identity and ≥ 70% query coverage. BLAST hits from N. crassa were ignored.
      • Predictions with good BLAST coverage were chosen above predictions that did not have good BLAST coverage.
      • If two predictions both had good BLAST coverage, we computed the "average BLAST length" for each transcript; that is, the average length of the top 1–3 overlapping BLAST hits from different taxa. (When there were more than three, the average was computed from the three with the highest scores.) The prediction closer in length to its average BLAST length was chosen first.
      • Otherwise, FGENESH was preferred to GENEWISE, and both were preferred to GENEID.
    6. After sorting and selecting the highest-ranked predictions, the resultant gene set contained 9,826 genes.

    Untranslated Region (UTR) Prediction

    1949 genes in this set have annotated UTR. Of these, 1242 were annotated by hand. The remaining 707 instances of UTR were predicted computationally, as follows. When an EST alignment uniquely aligns with > 95% identity and overlaps a gene prediction, and the region of overlap has 100% coding agreement (e.g., every nucleotide in the region of overlap is coding in both the prediction and the alignment, or is noncoding in both), we consider the prediction and alignment to be compatible. UTR predictions are generated by walking out along compatible EST alignments from the end(s) of the prediction. A chain of one or more overlapping, compatible EST alignments that begin at the 3′ or 5′ end forms each UTR extension. Our UTR predictions are conservative. When an EST aligns to more than one location on the genome, or touches more than one gene prediction (on either strand) it is ignored.

    Gene Naming

    Our gene naming protocol currently relies on high confidence blast or Pfam domain homology and in some cases, community inputs to assign gene product names. Hits to previous N. crassa gene annotations are not considered. We hope to improve the gene naming process in the future based on other functional annotation protocols and tools. We use two types of names depending on the available evidence:

    • known protein name: supported by blast hits or Pfam domain hits that are known proteins;
    • all others are named 'hypothetical protein'. We don't use names like 'conserved hypothetical protein' and 'predicted protein' anymore.

    Name counts

    Gene Locus Numbers

    Every annotated gene is assigned a Locus Number of the form NCU#####. Locus numbers are the only guaranteed way to uniquely identify a gene. When the same gene exists in different assemblies and annotations, it will have the same locus number. Loci are simply identifiers; they do not have any internal structure. In particular they are not ordered in any way. We believe it is unreliable to encode attributes of an object, such as position, in its stable identifier. Instead, position and other attributes of a gene can be retrieved once you know its locus.

    With each new gene set, we map all genes from the previous set and thus preserve loci whenever possible. Any loci that cannot be mapped 1:1 (due to deletions, splits or merges) will be retired. New genes that do not have a 1:1 mapping to a previous gene receive new locus identifiers. In addition to its locus, each gene has a version number (so loci are in fact displayed as NCU#####.#). When genes are mapped from one assembly to another or when we publish a new set of gene calls, we will increment this version. To ensure consistency, all loci in a particular gene set have the same version number, even if some have not changed from the previous set.

    Both annotations and assemblies are versioned. Loci versions refer to the version of the annotation, not the assembly release number. This is annotation version 3 for N. crassa. Like the previous annotation (version 2) it was generated from assembly release 7. Version 1 of the annotation was built on assembly release 3. (Assembly releases 4–6 were not annotated.)

    A locus number without an explicitly specified version corresponds to the most recent version of the annotation. For example, NCU03456.1 refers to locus NCU03456, annotation 1 (as annotated on assembly release 3). NCU03456.3 and NCU03456 both refer to the same locus as annotated in version 3 (on assembly release 7).

    Correspondence of version 3 and version 4 gene calls

    To compare version 3 and version 4 gene calls, 9,826 gene calls (version 3 annotations) from assembly NC7 were mapped to the finished assembly NC10.

  • 9,565 genes correspond one-to-one between version 2 and version 3. Of these,
    • 6,218 were completely unchanged from version 2,
    • 1,399 had identical CDS, but the annotation was extended to include UTR,
    • 612 had the same start and stop, but at least one splice site changed, and
    • 1,336 had a different start and/or stop codon due to a substantial change in the underlying gene model.
  • 94 genes were added in loci where there was no previous annotation due to new BLAST and/or EST evidence.
  • 966 genes were deleted. Over half of these were short, coding for proteins less than 30aa long. The remainder were removed due to lack of evidence or poor gene models.
  • 10 genes in version 2 were merged into 5 larger genes in version 3.
  • 79 genes in version 2 were split into 162 smaller genes in version 3.

Annotations added by the Community Annotation Project were transferred automatically to gene predictions in version 3 that were unambiguously equivalent to the same prediction in version 2 (e.g. same locus with altered UTR, splicing, etc.). Version 3 predictions that are ambiguous (merges, splits, etc.) are being manually processed to ensure that the annotation is moved to the correct new gene.

Five genes which have been characterized for phenotype have been altered (deleted, merged, or split) in version 3. Transfer of the phenotype information will be reviewed manually to ensure that, where possible, the correct gene from version 3 will receive the appropriate phenotype characterizations.

Reporting Annotation Accuracy

We run in-depth reports on each annotation we produce to get a measure of our annotation accuracy by comparing with EST based reference genes made from CallReference and PASA. A host of accuracy statistics is compiled for each prediction that touches an EST; following Guigó's Evaluation of gene structure prediction programs (Genomics, 1996 Jun 15; 34(3):353-67). We calculate specificity, sensitivity, correlation coefficient and simple matching coefficient on the levels of nucleotides, splice sites, introns and exons.