Gene Finding
Overview
This document explains how automated gene calls were produced for the Culex pipiens genome. The final gene structures were created by merging ESTs, predicted ORFs, ab initio predictions, and gene predictions from 3 draft datasets generated by the Broad Institute, the J. Craig Venter Institute (JCVI, formerly TIGR), and VectorBase at the European Bioinformatics Institute. This automated gene calling process is described in Gene Structure Prediction. Genes were assigned names using our automated gene naming system, which is described in Gene Naming.
Gene Structure Prediction
Automated annotation of draft gene sets was carried out at each of the 3 centers involved: the Broad Institute, VectorBase (at the European Bioinformatics Institute), and JCVI (formerly TIGR).
The Broad Institute created its draft set by running the center's internal gene caller with manual annotations and ESTs as the primary evidence. The gene-finding programs Augustus (Mario Stanke), run on both the Culex and Aedes assemblies, GeneID (Enrique Blanco and Roderic Guigo), FGeneSH (a commercial gene prediction program sold by Softberry), GeneWise (Ewan Birney), and SNAP (Ian Korf) were used as supporting evidence, in the order listed. In addition to this evidence, we used BLASTX against NR to verify gene model loci and query identities.
JCVI used transcript alignments (PASA), insect protein homology (GeneWise and nap), and the gene finders AUGUSTUS, SNAP, TWINSCAN, GlimmerHMM, and PHAT to create its draft gene set. The protein alignment datasets included proteins from the NCBI Diptera set and from the VectorBase Diptera plus Aedes and louse set. A small contribution came from the Broad predictions. All evidence was then combined using EVidenceModeler, weighted to emphasize the transcript and protein alignments.
VectorBase created its draft set with an automatic analysis pipeline that builds each gene from either a GeneWise/Exonerate model of a database protein or a set of aligned cDNAs/ESTs followed by an ORF prediction. GeneWise/Exonerate models are further combined with available aligned cDNAs/ESTs to annotate UTRs (for more information see V. Curwen et al., Genome Res. 2004 14:942-50).
An extensive comparison was made among the 3 draft datasets to ensure that all genes in all loci were predicted and supported. Discrepancies between the gene call datasets emerged because the centers used BLAST evidence differently. Where necessary, the centers adjusted their gene callers so that the top BLAST hit was used as the primary evidence for gene length and, if applicable, gene name.
Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence, where EST and BLASTX evidence suggested a gene locus but gene predictions were absent, or where gene predictions clashed in pairwise comparisons of the 3 draft sets. In total, 528 manual annotations were created.
The Broad's automated gene caller was used to consolidate the data into the final merged gene set. It was run using manual annotation as the primary evidence followed by, in no particular order, draft gene sets from Broad, JCVI and VectorBase. The final merged set presents 20,306 genes.
Gene Naming
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
There are currently 5 types of gene names that fall into 3 categories:
- NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein
  - Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
    - At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
    - A minimum of 50% identity and 70% coverage of both the query and subject sequence.
  - The name will follow one of these three formats:
    - conserved hypothetical protein, if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}; otherwise
    - NAME, if the homologous protein is from the curated Swiss-Prot gene set; otherwise
    - hypothetical protein similar to NAME
  - Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest average identity to the amount of overlap to the target gene. In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
- Hypothetical protein
  - Assigned to gene predictions with EST evidence that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
    - BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
- Predicted protein
  - Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
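The naming rules above can be read as a small decision procedure. The following Python sketch is illustrative only and is not the pipeline's actual code: the `Hit` record, its field names, and the function names are hypothetical, the EST-evidence requirement is not modeled, and the choice among several qualifying hits is simplified to "Swiss-Prot first, then highest identity".

```python
import re
from dataclasses import dataclass

# Words that mark a source name as unverified (list taken from the text above).
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


@dataclass
class Hit:
    """Hypothetical record for one BLASTP hit against NR."""
    name: str           # NR protein name, already cleaned of species, GIs, etc.
    evalue: float
    identity: float     # percent identity
    query_cov: float    # percent of the query covered
    subject_cov: float  # percent of the subject covered
    swissprot: bool     # True if the hit comes from Swiss-Prot


def name_from_hit(hit: Hit) -> str:
    """Pick one of the three name formats for a hit that meets all criteria."""
    words = set(re.findall(r"[a-z]+", hit.name.lower()))
    if words & UNVERIFIED_WORDS:
        return "conserved hypothetical protein"
    if hit.swissprot:
        return hit.name
    return f"hypothetical protein similar to {hit.name}"


def assign_name(hits: list[Hit]) -> str:
    """Assign a gene name from a list of BLASTP hits (may be empty)."""
    significant = [h for h in hits if h.evalue <= 1e-10]
    if not significant:
        return "predicted protein"
    strong = [h for h in significant
              if h.identity >= 50 and h.query_cov >= 70 and h.subject_cov >= 70]
    if not strong:
        return "hypothetical protein"
    # Simplified tie-break (assumption): Swiss-Prot hits first, then identity.
    best = max(strong, key=lambda h: (h.swissprot, h.identity))
    return name_from_hit(best)


print(assign_name([Hit("odorant receptor", 1e-30, 60.0, 80.0, 75.0, False)]))
```

Keeping the thresholds in one place makes the conservative intent visible: a functional name is transferred only when the hit is both strong and carries a verified name.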
Gene Numbering
Every annotated gene is given a Locus Number of the form CpipJ_CPIPJ######, which should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene, even across different assemblies. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its locus number.
Possible Problems
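| Gene source | Genes |
|---|---|
| multiple sources | 5977 |
| CP_JHB3_Broad_Final_Geneset | 11066 |
| CP_JHB3_SINEADS_MANUAL_1 | 515 |
| CP_JHB3_TIGR_DRAFT_0 | 1722 |
| CP_JHB3_VECTORBASE_DRAFT_1 | 1050 |

| Problem | All sources | multiple sources | CP_JHB3_Broad_Final_Geneset | CP_JHB3_SINEADS_MANUAL_1 | CP_JHB3_TIGR_DRAFT_0 | CP_JHB3_VECTORBASE_DRAFT_1 | Note | |
|---|---|---|---|---|---|---|---|---|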
| short protein, < 50 aa | 73+55 | 6 | 79 | 2 | 0 | 41 | ← not tallied in problems | |
| shorter protein, < 30 aa | 0+0 | - | - | - | - | - | ||
| very short protein, < 10 aa | 0+0 | - | - | - | - | - | ||
| initial exon ≤ 6 nt | 240+417 | 82 | 501 | 8 | 63 | 3 | ||
| internal exon ≤ 6 nt | 3+9 | 1 | 4 | 0 | 1 | 6 | ||
| terminal exon ≤ 6 nt | 53+100 | 5 | 122 | 1 | 11 | 14 | ||
| ≥ 15 exons | 15+85 | 3 | 52 | 17 | 22 | 6 | ||
| intron ≥ 1000 nt | 2753+6174 | 1498 | 5902 | 252 | 882 | 393 | ||
| intron ≤ 20 nt | 168+242 | 10 | 346 | 15 | 0 | 39 | ||
| first codon not Met | 103+87 | 5 | 7 | 9 | 31 | 138 | ← not tallied in problems | |
| first codon not xTG | 95+79 | 5 | 5 | 7 | 28 | 129 | ||
| first codon not known START | 98+82 | 5 | 6 | 8 | 30 | 131 | ||
| last codon not known STOP | 85+67 | 4 | 41 | 14 | 45 | 48 | ||
| contains in-frame STOP | 0+0 | - | - | - | - | - | ||
| coding length not modulo 3 | 33+19 | 3 | 6 | 0 | 42 | 1 | ||
| non-canonical splicing | 6+16 | 0 | 0 | 4 | 0 | 18 | ||
| has ≥1 tagged BLAST hit | 5434+11143 | 5024 | 9332 | 420 | 1086 | 715 | ← not tallied in problems | |
| ≤1/3 as long as BLAST tags | 558+286 | 93 | 639 | 15 | 64 | 33 | ||
| ≥3× longer than BLAST tags | 9+21 | 5 | 16 | 8 | 1 | 0 | ||
| contains ≥1 N in exon | 0+33 | 15 | 11 | 2 | 4 | 1 | ||
| low-quality exonic sequence | 0+13442 | 4137 | 7254 | 381 | 1029 | 641 | ← not tallied in problems | |
| touches gap(s) | 919+2583 | 513 | 2305 | 132 | 383 | 169 | ||
| spans contigs | 919+2580 | 512 | 2305 | 131 | 383 | 168 | ||
| within 1 kb of contig edge | 1701+3767 | 1115 | 3401 | 167 | 502 | 283 | ← not tallied in problems | |
| any overlap (UTR or CDS) | 145+347 | 141 | 203 | 25 | 83 | 40 | ← in 230 clusters | |
| CDS overlap only | 109+227 | 63 | 146 | 21 | 69 | 37 | ← in 155 clusters | |
| CDS overlap > 50 nt | 109+227 | 63 | 146 | 21 | 69 | 37 | ← in 155 clusters | |
| CDS overlap > 100 nt | 109+227 | 63 | 146 | 21 | 69 | 37 | ← in 155 clusters | |
| CDS overlap > 200 nt | 101+223 | 62 | 138 | 21 | 67 | 36 | ← in 150 clusters | |
| has predicted UTR | 681+2271 | 1191 | 1260 | 140 | 206 | 155 | ← not tallied in problems | |
| UTR ≥ CDS length | 92+201 | 110 | 78 | 9 | 12 | 84 | ← not tallied in problems | |
| UTR is spliced | 79+198 | 145 | 74 | 18 | 21 | 19 | ← not tallied in problems | |
| one or more problems | 3568+7083 | 1794 | 6924 | 297 | 1035 | 601 | ||
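
Several of the structural checks in the table can be expressed directly in code. The sketch below is a hedged illustration, not the script that produced these counts: the function name, the exon coordinate layout (1-based inclusive, forward strand, transcript order), the use of the standard stop codons, and the decision to skip exon-length checks for single-exon genes are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def flag_problems(exons, cds):
    """Return a list of problem labels for one gene model.

    exons: coding exons as (start, end) tuples, 1-based inclusive,
           forward strand, in transcript order (hypothetical layout).
    cds:   the spliced coding sequence as a string.
    """
    flags = []
    lengths = [end - start + 1 for start, end in exons]
    introns = [nxt[0] - prev[1] - 1 for prev, nxt in zip(exons, exons[1:])]

    # Exon-length checks only apply to multi-exon genes in this sketch.
    if len(exons) > 1:
        if lengths[0] <= 6:
            flags.append("initial exon ≤ 6 nt")
        if any(n <= 6 for n in lengths[1:-1]):
            flags.append("internal exon ≤ 6 nt")
        if lengths[-1] <= 6:
            flags.append("terminal exon ≤ 6 nt")
    if len(exons) >= 15:
        flags.append("≥ 15 exons")
    if any(n >= 1000 for n in introns):
        flags.append("intron ≥ 1000 nt")
    if any(n <= 20 for n in introns):
        flags.append("intron ≤ 20 nt")

    cds = cds.upper()
    in_frame = len(cds) % 3 == 0
    # Complete codons read from the first base of the CDS.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if cds[-3:] not in STOP_CODONS:
        flags.append("last codon not known STOP")
    # The final codon is excluded only when it is a complete in-frame codon.
    internal = codons[:-1] if in_frame else codons
    if any(c in STOP_CODONS for c in internal):
        flags.append("contains in-frame STOP")
    if not in_frame:
        flags.append("coding length not modulo 3")
    return flags


# A two-exon model with a 6 nt first exon and an 1100 nt intron.
print(flag_problems([(101, 106), (1207, 1305)], "ATG" + "GCT" * 33 + "TAA"))
```

Returning labels rather than a single pass/fail value mirrors the table, where one gene can be counted under several problem rows.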
