Gene Finding

Overview

This document explains how automated gene calls were produced for the Culex pipiens genome. The final gene structures were created by merging ESTs, predicted ORFs, ab initio predictions, and gene predictions from 3 draft datasets generated by the Broad Institute, the J. Craig Venter Institute (JCVI, formerly TIGR), and VectorBase at the European Bioinformatics Institute. This automated gene calling process is described in Gene Structure Prediction. Genes were assigned names using our automated gene naming system, which is described in Gene Naming.

Gene Structure Prediction

Automated annotation of draft gene sets was carried out at each of the 3 centers involved: the Broad Institute, VectorBase (at the European Bioinformatics Institute), and JCVI (formerly TIGR).

The Broad Institute created its draft set by running the center's internal gene caller with manual annotations and ESTs as the primary evidence. The gene finding programs Augustus (Mario Stanke), run on both the Culex and Aedes assemblies, GeneID (Enrique Blanco and Roderic Guigo), FGeneSH (a commercial gene prediction program sold by Softberry), GeneWise (Ewan Birney), and SNAP (Ian Korf) were used as supporting evidence, in the order listed. In addition to this evidence, we used BLASTX against NR to verify gene model loci and query identities.

JCVI used transcript alignments (PASA), insect protein homology (GeneWise and nap), and the gene finders AUGUSTUS, SNAP, TWINSCAN, GlimmerHMM, and PHAT to create its draft gene set. The protein alignment datasets included proteins from the NCBI Diptera set and from the VectorBase Diptera plus Aedes and louse set. A small contribution came from the Broad predictions. All evidence was then combined using Evidence Modeler, weighted to emphasize the transcript and protein alignments.

VectorBase created its draft set with an automatic analysis pipeline that builds each gene model either from a GeneWise/Exonerate alignment of a database protein or from a set of aligned cDNAs/ESTs followed by an ORF prediction. GeneWise/Exonerate models are further combined with available aligned cDNAs/ESTs to annotate UTRs (for more information see V. Curwen et al., Genome Res. 2004 14:942-50).

An extensive comparison was made between the 3 draft datasets to ensure that all genes in all loci were predicted and supported. Discrepancies between the gene call datasets emerged because the centers used BLAST evidence differently. Where necessary, the centers adjusted their gene callers so that the top BLAST hit was used as the primary evidence for gene length and, if applicable, gene name.

Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence, where EST and BLASTX evidence suggested a gene locus but gene predictions were absent, or where gene predictions clashed in pairwise comparisons of the 3 draft sets. In total, 528 manual annotations were created.
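The three triggers for targeted manual annotation can be written as a simple predicate. The following is a minimal sketch only: the `Locus` record and its field names are hypothetical and do not describe any of the centers' actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    """Hypothetical per-locus evidence summary (field names are illustrative)."""
    has_est: bool                      # EST evidence aligns at this locus
    has_blastx: bool                   # BLASTX evidence aligns at this locus
    has_prediction: bool               # at least one draft set predicts a gene here
    prediction_clashes_with_est: bool  # a predicted structure contradicts the ESTs
    drafts_disagree: bool              # draft sets clash in a pairwise comparison


def needs_manual_annotation(locus: Locus) -> bool:
    """Return True if the locus matches any of the three triggers in the text."""
    # 1. Gene predictions clash with EST evidence.
    if locus.has_prediction and locus.has_est and locus.prediction_clashes_with_est:
        return True
    # 2. EST and BLASTX evidence suggest a gene, but no prediction exists.
    if locus.has_est and locus.has_blastx and not locus.has_prediction:
        return True
    # 3. Gene predictions clash between the draft sets.
    return locus.drafts_disagree


if __name__ == "__main__":
    missed = Locus(has_est=True, has_blastx=True, has_prediction=False,
                   prediction_clashes_with_est=False, drafts_disagree=False)
    print(needs_manual_annotation(missed))
```

In this sketch a locus with consistent evidence and agreeing draft sets is left to the automated merge.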

The Broad's automated gene caller was used to consolidate the data into the final merged gene set. It was run using manual annotation as the primary evidence followed by, in no particular order, draft gene sets from Broad, JCVI and VectorBase. The final merged set presents 20,306 genes.

Gene Naming

Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

There are currently 5 types of gene names that fall into 3 categories:

  1. NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein
     Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
    • At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
    • A minimum of 50% identity and 70% coverage of both the query and subject sequence.
     The name will follow one of these three formats:
    • conserved hypothetical protein, if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}; otherwise
    • NAME, if the homologous protein is from the curated Swiss-Prot gene set; otherwise
    • hypothetical protein similar to NAME.
     Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest average identity to the amount of overlap to the target gene. In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
  2. hypothetical protein
     Assigned to gene predictions with EST evidence that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
    • BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
  3. predicted protein
     Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
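The naming rules above can be read as a small decision procedure. The sketch below is an illustration under stated assumptions, not the pipeline's actual code. The `Hit` record and `assign_name` function are hypothetical. Ranking competing hits by Swiss-Prot membership and then by percent identity is a simplification of the selection rule. Name clean-up (species names, GIs, parenthetical comments) is omitted. A prediction with a significant hit but no EST evidence is not covered by the text, and this sketch lets it fall through to "predicted protein".

```python
import re
from dataclasses import dataclass

# Words that mark a database name as unverified (taken from the list above).
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}
MAX_EVALUE = 1e-10


@dataclass
class Hit:
    """Hypothetical summary of one BLASTP hit against NR."""
    name: str                 # cleaned protein name
    evalue: float
    identity: float           # percent identity
    query_coverage: float     # percent of the query covered
    subject_coverage: float   # percent of the subject covered
    is_swissprot: bool = False


def is_excellent(hit: Hit) -> bool:
    """Category 1 criteria: E-value, 50% identity, 70% coverage both ways."""
    return (hit.evalue <= MAX_EVALUE and hit.identity >= 50
            and hit.query_coverage >= 70 and hit.subject_coverage >= 70)


def assign_name(hits: list[Hit], has_est_evidence: bool) -> str:
    excellent = [h for h in hits if is_excellent(h)]
    if excellent:
        # Assumption: prefer Swiss-Prot hits, then higher identity.
        best = max(excellent, key=lambda h: (h.is_swissprot, h.identity))
        words = set(re.findall(r"[a-z]+", best.name.lower()))
        if words & UNVERIFIED_WORDS:
            return "conserved hypothetical protein"
        if best.is_swissprot:
            return best.name
        return f"hypothetical protein similar to {best.name}"
    if has_est_evidence and any(h.evalue <= MAX_EVALUE for h in hits):
        return "hypothetical protein"
    return "predicted protein"


if __name__ == "__main__":
    hit = Hit("actin", 1e-80, 92.0, 98.0, 97.0, is_swissprot=True)
    print(assign_name([hit], has_est_evidence=False))
```

Splitting the name on non-letter characters means that a hyphenated name such as "kinase-like protein" is also caught by the unverified-word check.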

Gene Numbering

Every annotated gene is given a Locus Number of the form CpipJ_CPIPJ###### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
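The idea that a locus number is an opaque key, with position stored separately, can be illustrated as follows. This is a sketch under assumptions: each `#` in the format above is taken to stand for one digit, and the example locus, contig name, and coordinates are made up for illustration.

```python
import re

# Assumption: each '#' in CpipJ_CPIPJ###### stands for exactly one digit.
LOCUS_PATTERN = re.compile(r"CpipJ_CPIPJ\d{6}")

# Hypothetical lookup table; the entry below is invented for illustration.
gene_positions = {
    "CpipJ_CPIPJ000001": {"contig": "contig_1", "start": 1200, "end": 4800, "strand": "+"},
}


def is_locus_number(text: str) -> bool:
    """Check only the shape of the identifier; the digits carry no meaning."""
    return LOCUS_PATTERN.fullmatch(text) is not None


def position_of(locus: str) -> dict:
    """Position is an attribute looked up by locus, never parsed from it."""
    if not is_locus_number(locus):
        raise ValueError(f"not a locus number: {locus!r}")
    return gene_positions[locus]


if __name__ == "__main__":
    print(position_of("CpipJ_CPIPJ000001"))
```

Because the position lives in the table rather than in the identifier, a gene can move between assemblies without its locus number changing.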

Possible Problems

Gene source                    Count
multiple sources                5977
CP_JHB3_Broad_Final_Geneset    11066
CP_JHB3_SINEADS_MANUAL_1         515
CP_JHB3_TIGR_DRAFT_0            1722
CP_JHB3_VECTORBASE_DRAFT_1      1050

The per-source columns below follow the order of the source list above. A bracketed run holds the five per-source counts for rows where the digits cannot be separated unambiguously.

Problem                       Count       Multiple  Broad  Manual  TIGR  VectorBase  Note
short protein, < 50 aa        73+55       6         79     2       0     41          not tallied in problems
shorter protein, < 30 aa      0+0         -         -      -       -     -
very short protein, < 10 aa   0+0         -         -      -       -     -
initial exon ≤ 6 nt           240+417     82        501    8       63    3
internal exon ≤ 6 nt          3+9         1         4      0       1     6
terminal exon ≤ 6 nt          53+100      [512211114]
≥ 15 exons                    15+85       [35217226]
intron ≥ 1000 nt              2753+6174   1498      5902   252     882   393
intron ≤ 20 nt                168+242     10        346    15      0     39
first codon not Met           103+87      5         7      9       31    138         not tallied in problems
first codon not xTG           95+79       5         5      7       28    129
first codon not known START   98+82       5         6      8       30    131
last codon not known STOP     85+67       [441144548]
contains in-frame STOP        0+0         -         -      -       -     -
coding length not modulo 3    33+19       3         6      0       42    1
non-canonical splicing        6+16        0         0      4       0     18
has ≥1 tagged BLAST hit       5434+11143  5024      9332   420     1086  715         not tallied in problems
≤1/3 as long as BLAST tags    558+286     [93639156433]
≥3× longer than BLAST tags    9+21        [516810]
contains ≥1 N in exon         0+33        [1511241]
low-quality exonic sequence   0+13442     4137      7254   381     1029  641         not tallied in problems
touches gap(s)                919+2583    513       2305   132     383   169
spans contigs                 919+2580    512       2305   131     383   168
within 1 kb of contig edge    1701+3767   1115      3401   167     502   283         not tallied in problems
any overlap (UTR or CDS)      145+347     141       203    25      83    40          in 230 clusters
CDS overlap only              109+227     63        146    21      69    37          in 155 clusters
CDS overlap > 50 nt           109+227     63        146    21      69    37          in 155 clusters
CDS overlap > 100 nt          109+227     63        146    21      69    37          in 155 clusters
CDS overlap > 200 nt          101+223     62        138    21      67    36          in 150 clusters
has predicted UTR             681+2271    1191      1260   140     206   155         not tallied in problems
UTR ≥ CDS length              92+201      [1107891284]                               not tallied in problems
UTR is spliced                79+198      145       74     18      21    19          not tallied in problems
one or more problems          3568+7083   1794      6924   297     1035  601
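Several of the structural checks in this table can be expressed directly in code. The sketch below is one possible reading of the row labels, not the code that produced these counts. The function `find_problems` is hypothetical. It assumes a forward-strand gene whose exons are all coding and are given as 1-based inclusive `(start, end)` pairs in transcript order. It uses the standard stop codons TAA, TAG, and TGA, and it applies the initial and terminal exon checks only to multi-exon genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_problems(cds: str, exons: list[tuple[int, int]]) -> list[str]:
    """Return the table's problem labels that apply to one gene model."""
    cds = cds.upper()
    problems = []

    # Split the CDS into complete codons, ignoring a trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    ends_with_stop = bool(codons) and codons[-1] in STOP_CODONS
    protein_length = len(codons) - (1 if ends_with_stop else 0)

    if protein_length < 50:
        problems.append("short protein, < 50 aa")

    # Exon length checks (initial/terminal only make sense with >= 2 exons).
    lengths = [end - start + 1 for start, end in exons]
    if len(lengths) >= 2:
        if lengths[0] <= 6:
            problems.append("initial exon ≤ 6 nt")
        if any(n <= 6 for n in lengths[1:-1]):
            problems.append("internal exon ≤ 6 nt")
        if lengths[-1] <= 6:
            problems.append("terminal exon ≤ 6 nt")
    if len(exons) >= 15:
        problems.append("≥ 15 exons")

    # Intron length is the gap between consecutive exons.
    introns = [nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])]
    if any(n >= 1000 for n in introns):
        problems.append("intron ≥ 1000 nt")
    if any(n <= 20 for n in introns):
        problems.append("intron ≤ 20 nt")

    if codons and codons[0] != "ATG":
        problems.append("first codon not Met")
    if not ends_with_stop:
        problems.append("last codon not known STOP")
    if any(c in STOP_CODONS for c in codons[:-1]):
        problems.append("contains in-frame STOP")
    if len(cds) % 3 != 0:
        problems.append("coding length not modulo 3")
    return problems


if __name__ == "__main__":
    cds = "ATG" + "GCT" * 60 + "TAA"
    print(find_problems(cds, [(1, 100), (2000, 2085)]))
```

Checks that depend on external evidence, such as BLAST tags, assembly gaps, or overlaps with neighboring genes, would need additional inputs and are left out of this sketch.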