Gene Finding Methods

Overview

This document explains how automated gene calls were produced for each of the Fusaria genomes. The final gene structures were created using a combination of ESTs, predicted ORFs, ab initio gene predictions, and ORFs mapped from either Fusarium graminearum (in the case of FV3) or Fusarium verticillioides (in the case of FO2). This automated gene calling process is described in Gene Structure Prediction. Genes were assigned names using our automated gene naming system, which is described in Gene Naming.

Gene Structure Prediction

Gene structures were predicted using a combination of manual annotation, FGENESH, GENEID, and EST-based gene models built by FindORFs. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigó, is available under the GPL.

In FV3, the FGENESH and GENEID gene sets contained 13,806 and 15,408 gene predictions, respectively. Where multiple predictions overlapped each other and EST evidence, we chose the one most in accord with the EST splice sites. Where EST coverage was sufficient to build complete ORFs, we replaced any overlapping predicted gene model with the ORF predicted purely from ESTs.
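The text does not define how "most in accord with the EST splice sites" was measured. As a minimal sketch, assuming agreement is simply the number of predicted introns whose two splice sites both match an EST-supported intron, the choice among overlapping predictions could look like this. The function names and the example coordinates are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Intron = Tuple[int, int]  # (intron start, intron end) in genome coordinates


def splice_agreement(pred_introns: List[Intron], est_introns: Set[Intron]) -> int:
    """Count predicted introns whose two splice sites both match an EST intron."""
    return sum(1 for intron in pred_introns if intron in est_introns)


def pick_best_prediction(predictions: Dict[str, List[Intron]],
                         est_introns: Set[Intron]) -> str:
    """Return the name of the prediction sharing the most introns with the ESTs."""
    return max(predictions,
               key=lambda name: splice_agreement(predictions[name], est_introns))


# Hypothetical locus: two overlapping ab initio models and two EST-supported introns.
est_introns = {(200, 300), (450, 520)}
predictions = {
    "fgenesh_model": [(200, 300), (450, 520)],
    "geneid_model": [(200, 310), (450, 520)],  # first intron ends 10 bp later
}
best = pick_best_prediction(predictions, est_introns)
```

A real scoring scheme would probably also penalize predicted introns that contradict the ESTs; this sketch only counts matches.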

FindORFs is a gene finding program developed at Broad. The 3,856 findOrfs genes were built as follows. First, ESTs are aligned to the genome and grouped into loci consisting of overlapping ESTs. Then, each locus is examined for compatible splicing. If two ESTs in the same locus have identical splice sites where they overlap, they are considered fragments of a larger transcript. Putative transcripts are incrementally built out by adding additional ESTs to either end.
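The locus-building and splice-compatibility steps described above can be sketched as follows. This is an illustration under stated assumptions, not the FindORFs implementation: each aligned EST is represented as an ordered list of exon blocks in half-open genome coordinates, and all names are hypothetical.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]   # (start, end), half-open genome coordinates
Est = List[Exon]         # an aligned EST as an ordered list of exon blocks


def span(est: Est) -> Tuple[int, int]:
    """Genomic extent of an aligned EST."""
    return est[0][0], est[-1][1]


def group_into_loci(ests: List[Est]) -> List[List[Est]]:
    """Group ESTs whose genomic spans overlap into loci."""
    loci: List[List[Est]] = []
    locus_end = -1
    for est in sorted(ests, key=span):
        start, end = span(est)
        if loci and start < locus_end:
            loci[-1].append(est)          # overlaps the current locus
            locus_end = max(locus_end, end)
        else:
            loci.append([est])            # starts a new locus
            locus_end = end
    return loci


def introns(est: Est) -> List[Tuple[int, int]]:
    """Introns implied by the gaps between consecutive exon blocks."""
    return [(est[i][1], est[i + 1][0]) for i in range(len(est) - 1)]


def compatible(a: Est, b: Est) -> bool:
    """True if two ESTs overlap and have identical introns where they overlap."""
    lo = max(span(a)[0], span(b)[0])
    hi = min(span(a)[1], span(b)[1])
    if lo >= hi:
        return False  # no overlap, so nothing links the two ESTs

    def within_overlap(est: Est) -> Set[Tuple[int, int]]:
        return {(s, e) for s, e in introns(est) if s < hi and e > lo}

    return within_overlap(a) == within_overlap(b)


est1 = [(100, 200), (300, 400)]
est2 = [(150, 200), (300, 450), (500, 600)]   # same intron as est1 in the overlap
est3 = [(150, 220), (300, 450)]               # different donor site
est4 = [(1000, 1200)]                         # a separate locus
loci = group_into_loci([est4, est2, est1])
```

Compatible ESTs such as `est1` and `est2` would then be merged into one putative transcript, which grows as further ESTs are added to either end.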

Each putative transcript is built from one or more ESTs, but may not represent the full biological transcript if the EST coverage is incomplete. We searched each putative transcript for ORFs beginning with ATG and ending with a stop codon, with no frameshifts. If a putative transcript contained an ORF longer than 180 nt that covered one third or more of its spliced length, we considered it a valid gene prediction. The full set of these predictions is in the group labeled 'findOrfs'. From these EST-based findOrfs transcripts, we further selected as putative full-length gene models only the subset contained entirely within the best ab initio gene prediction. An independent BLAST-based analysis was also used to validate the reading frames by comparing the predictions to their best-hit known proteins in the NR database.
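The ORF criterion in this paragraph can be sketched as below. The sketch assumes that only the forward strand of the spliced transcript is scanned and that the ORF length includes the stop codon; the source does not state either detail, and the function names are hypothetical.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest ATG-to-stop ORF in the three forward frames."""
    seq = seq.upper()
    best: Optional[Tuple[int, int]] = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # include the stop codon
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None                   # look for the next ORF in this frame
    return best


def is_valid_prediction(seq: str, min_len: int = 180) -> bool:
    """ORF longer than min_len nt that covers at least 1/3 of the spliced length."""
    orf = longest_orf(seq)
    if orf is None:
        return False
    length = orf[1] - orf[0]
    return length > min_len and 3 * length >= len(seq)


# Hypothetical transcript: 2 nt of 5' sequence, a 216 nt ORF, 2 nt of 3' sequence.
transcript = "CC" + "ATG" + "GCT" * 70 + "TAA" + "GG"
```

Using `3 * length >= len(seq)` keeps the one-third test in integer arithmetic and avoids floating-point rounding at the boundary.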

Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence or where EST and BLASTX evidence suggested a gene locus but gene predictions were absent. In total, 1,016 manual annotations were carried out. At all loci with no manually annotated gene models but containing EST-based findOrfs, we picked the full-length findOrfs over any ab initio gene predictions.

In FG3, the FGENESH and GENEID gene sets contained 11,782 and 12,670 gene predictions, respectively. The final gene set from FV3 was mapped onto the FG3 genome, with 6,994 transcripts successfully transferring. MIPS gene data from FG1 was also mapped onto FG3, with 13,909 transcripts successfully mapping. FindORFs, as described above, was run on this set and produced 2,431 genes. Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence or where EST and BLASTX evidence suggested a gene locus but gene predictions were absent. In total, 120 manual annotations were carried out. At all loci with no manually annotated gene models but containing ESTs and mapped genes from the MIPS FG1 data and the FV3 gene transfers, we picked ESTs and full-length mapped genes over any ab initio gene predictions.

In FO2, the FGENESH and GENEID gene sets contained 18,482 and 21,117 gene predictions, respectively. The final gene set from FV3 was mapped onto the FO2 genome, with 12,840 transcripts successfully transferring. FindORFs, as described above, was run on this set and produced 4,204 genes. Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence or where EST and BLASTX evidence suggested a gene locus but gene predictions were absent. In total, 94 manual annotations were carried out. At all loci with no manually annotated gene models but containing ESTs and mapped genes from the FV3 gene transfers, we picked ESTs and full-length mapped genes over any ab initio gene predictions.

Gene Naming

Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently 5 types of gene name that fall into 3 categories:

  1. NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein
     Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
       • At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
       • A minimum of 50% identity and 70% coverage of both the query and subject sequence.
     The name will follow one of these three formats:
       • conserved hypothetical protein, if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}; otherwise
       • NAME, if the homologous protein is from the curated Swiss-Prot gene set; otherwise
       • hypothetical protein similar to NAME
     Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest average identity relative to the amount of overlap with the target gene. In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
  2. Hypothetical protein
     Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
       • BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
  3. Predicted protein
     Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
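The naming rules above amount to a short decision procedure, sketched here. The sketch assumes the best BLASTP hit has already been selected and already passes the 1e-10 expect threshold, and that the name has been cleaned of species names and GIs; the `Hit` record and `assign_name` function are hypothetical names, not part of the actual naming system.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Words that mark a hit's name as unverified, taken from the list above.
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


@dataclass
class Hit:
    name: str            # cleaned NR protein name
    identity: float      # percent identity
    query_cov: float     # percent of the query covered
    subject_cov: float   # percent of the subject covered
    is_swissprot: bool


def assign_name(hit: Optional[Hit]) -> str:
    """Map the best significant BLASTP hit (or None) to a gene name."""
    if hit is None:
        return "predicted protein"
    strong = hit.identity >= 50 and hit.query_cov >= 70 and hit.subject_cov >= 70
    if not strong:
        return "hypothetical protein"
    words = set(re.findall(r"[a-z]+", hit.name.lower()))
    if words & UNVERIFIED_WORDS:
        return "conserved hypothetical protein"
    if hit.is_swissprot:
        return hit.name
    return f"hypothetical protein similar to {hit.name}"
```

For example, a strong non-Swiss-Prot hit named "chitin synthase" yields "hypothetical protein similar to chitin synthase", while a strong hit named "putative kinase" yields "conserved hypothetical protein".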

Name Counts

Name Counts (FO2)

3814 transcript(s) had non-generic names.

"conserved hypothetical protein"      9467
"hypothetical protein"                 527
"predicted protein"                   3927
hypothetical protein similar to...    2393
other non-empty name                  1421

Name Counts (FV3)

3078 transcript(s) had non-generic names.

"conserved hypothetical protein"      8913
"hypothetical protein"                 628
"predicted protein"                   1580
hypothetical protein similar to...    1921
other non-empty name                  1157

Name Counts (FG3)

3290 transcript(s) had non-generic names.

"conserved hypothetical protein"      6368
"hypothetical protein"                 607
"hypothetical protein "                 41
"predicted protein"                   3067
hypothetical protein similar to...    1980
other non-empty name                  1269

Gene Numbering

Every annotated gene is given a Locus Number of the form FV3G_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.

Overview of Query Genes

Query Genes (FO2)

17735 genes (8847 on '+' strand, 8888 on '-')
17735 transcripts (13331 spliced, 4404 unspliced)
47896 exons, 30161 introns

                    len    %cov    %gc    %at
genic          26717541   44.58  50.73  49.27
intergenic     33219242   55.42  46.52  53.48
exonic         23693318   39.53  51.41  48.59
intronic        3015569    5.03  45.47  54.53
coding         21905719   36.55  51.88  48.12
5' UTR           773537    1.29  47.96  52.04
3' UTR          1171889    1.96  44.93  55.07
alt. spliced       8654    0.01  44.94  55.06
genomic        59936783  100.00  48.40  51.60

                                      min  median   mean    n50    max
total length (incl. UTR + introns)     90    1292   1519   1849  22869
coding length                          90    1023   1236   1542  22596
exons per transcript                    1       2   2.70      3     20
exons per spliced transcript            2       3   3.26      3     20
bp per exon                             1     288    499    939  13941
bp per intron                           4      58    101    129   2370
5' UTR bp                               1     120    186    306   6664
3' UTR bp                               1     194    279    391   5839
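The %cov and n50 columns are not defined in the text. As a sketch, assuming %cov is a feature's length as a percentage of the genomic length and n50 follows the usual length-weighted definition (the value at which the running total, taken from the largest item down, reaches half of the overall total), they could be computed as follows. Both function names are hypothetical.

```python
from typing import Sequence


def percent_cov(feature_len: int, genome_len: int) -> float:
    """Feature length as a percentage of the genome, rounded to two decimals."""
    return round(100 * feature_len / genome_len, 2)


def n50(values: Sequence[int]) -> int:
    """Smallest value v such that items >= v account for at least half the total."""
    if not values:
        raise ValueError("n50 of an empty sequence is undefined")
    ordered = sorted(values, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for value in ordered:
        running += value
        if running >= half:
            return value
    return ordered[-1]  # not reached for non-negative input; kept for safety


# Genic coverage for FO2, using the len and genomic values from the table above.
fo2_genic_cov = percent_cov(26717541, 59936783)
```

Under this assumption, the FO2 genic length of 26717541 bp against a genomic length of 59936783 bp reproduces the 44.58 shown in the %cov column.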

Query Genes (FV3)

14179 genes: 14199 transcripts (10878 spliced, 3321 unspliced; 7182/7017 +/-)
39508 exons, 25309 introns

                    len    %cov    %gc    %at
genic          22749525   54.55  50.70  49.27
intergenic     18950820   45.45  46.31  53.73
exonic         20322204   48.73  51.33  48.67
intronic        2436142    5.84  45.44  54.26
coding         17889392   42.90  52.03  47.97
5' UTR          1071342    2.57  48.34  51.66
3' UTR          1519984    3.65  45.35  54.65
alt. spliced       8821    0.02  45.81  54.19
genomic        41700345  100.00  48.70  51.30

                                 min  median   mean    max
overall length (incl. UTR)        90    1239   1445  22650
coding length                     90    1059   1260  22650
exons per transcript               1       2   2.78     21
exons per spliced transcript       2       3   3.33     21
bp per exon                        1     303    519  22425
bp per intron                     22      57     96    989
5' UTR bp                          1     123    195   2994
3' UTR bp                          1     191    277   3118

Query Genes (FG3)

13332 genes: 13332 transcripts (10278 spliced, 3054 unspliced; 6703/6629 +/-)
37575 exons, 24243 introns

                    len    %cov    %gc    %at
genic          20999352   57.97  50.41  49.53
intergenic     15224289   42.03  45.47  54.62
exonic         18972091   52.37  51.14  48.83
intronic        2032901    5.61  43.53  56.05
coding         17874256   49.34  51.57  48.42
5' UTR           387612    1.07  46.54  53.44
3' UTR           799949    2.21  44.15  55.62
alt. spliced       5640    0.02  43.97  56.03
genomic        36223641  100.00  48.33  51.67

                                 min  median   mean    max
overall length (incl. UTR)        90    1206   1430  33594
coding length                     90    1103   1340  33594
exons per transcript               1       2   2.82     21
exons per spliced transcript       2       3   3.36     21
bp per exon                        1     283    507  29066
bp per intron                      1      56     83    975
5' UTR bp                          1     111    158   4754
3' UTR bp                          1     190    263   3601

Overview of Reference genes

Reference Genes (FO2)

6568 genes (3278 on '+' strand, 4038 on '-')
7316 transcripts (3928 spliced, 3388 unspliced)
14174 exons, 6858 introns

                    len    %cov    %gc    %at
genic           7096135   11.84  51.14  48.86
intergenic     52840648   88.16  48.03  51.97
exonic          6664716   11.12  51.51  48.49
intronic         389597    0.65  45.04  54.96
coding          5489634    9.16  52.54  47.46
5' UTR           574430    0.96  49.65  50.35
3' UTR           750471    1.25  45.28  54.72
alt. spliced      41822    0.07  48.48  51.52
genomic        59936783  100.00  48.40  51.60

                                      min  median   mean    n50    max
total length (incl. UTR + introns)     27     958   1097   1252   5036
coding length                          27     763    826    918   4932
exons per transcript                    1       2   1.94      2     10
exons per spliced transcript            2       2   2.75      5     10
bp per exon                             4     422    530    805   4541
bp per intron                          20      55     75     63    782
5' UTR bp                               1      67    116    247   1897
3' UTR bp                               1      86    153    326   2411

Reference Genes (FV3)

9522 genes: 10855 transcripts (6669 spliced, 4186 unspliced; 5455/5400 +/-)
24590 exons, 13735 introns

                    len    %cov    %gc    %at
genic          11182255   26.82  50.57  49.43
intergenic     30518090   73.18  48.02  51.98
exonic         10465540   25.10  50.94  49.06
intronic         805534    1.93  45.41  54.55
coding          8913639   21.38  51.95  48.05
5' UTR           376175    0.90  47.25  52.75
3' UTR          1392140    3.34  45.23  54.77
alt. spliced      88819    0.21  47.96  52.04
genomic        41700345  100.00  48.70  51.30

                                 min  median   mean    max
overall length (incl. UTR)        79     957   1137   7912
coding length                     79     849    944   7824
exons per transcript               1       2   2.27     21
exons per spliced transcript       2       3   3.06     21
bp per exon                        3     378    501   4737
bp per intron                     20      54     72    791
5' UTR bp                          1     132    185   2994
3' UTR bp                          1     182    257   3021

Reference Genes (FG3)

4714 genes: 5163 transcripts (2714 spliced, 2449 unspliced; 2623/2540 +/-)
10006 exons, 4843 introns

                    len    %cov    %gc    %at
genic           4051152   11.18  50.66  49.34
intergenic     32172489   88.82  48.04  51.96
exonic          3758384   10.38  51.15  48.85
intronic         321070    0.89  44.58  55.39
coding                0    0.00      -      -
5' UTR                0    0.00      -      -
3' UTR                0    0.00      -      -
alt. spliced      28302    0.08  47.37  52.63
genomic        36223641  100.00  48.33  51.67

                                 min  median   mean    max
overall length (incl. UTR)        29     760    802   5079
coding length                      -       -      -      -
exons per transcript               1       2   1.94     12
exons per spliced transcript       2       2   2.78     12
bp per exon                        4     324    414   4244
bp per intron                     24      56     76    767
5' UTR bp                          -       -      -      -
3' UTR bp                          -       -      -      -

Splice Analysis

Splice Analysis (FO2)

15205 splice agreements, 740 disagreements.
8566 ignored: 194 due to EST misalignment, 8372 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 5159/5521
10 query transcripts contained noncanonical splices.

transcripts with no splice problems        5145  94.2%
... with complete reference coverage        791  14.5%
explainable by alternate splicing           212   3.9%
... with spliced reference                  151   2.8%
clashes                                     102   1.9%
... with spliced reference                   67   1.2%

Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 3.9% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.

in ref.in query
cassette exons90
retained introns12328
early 3' splices650
late 5' splices826

cassette exon: an exon that falls completely within an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction.
retained intron: an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns.
early 3' splices: two introns agree on their 5' splice site but differ on the 3' side, relative to the affected intron. In other words, differing 3' splice sites lie on the leading edge of an exon.
late 5' splices: two introns agree on their 3' splice site but differ on the 5' side, again relative to the affected intron.

Most terminology is from Matlin AJ, et al. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.
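The four definitions above can be expressed as simple interval tests. This is a sketch only, assuming plus-strand features in half-open coordinates so that an interval's start is its 5' side; the function names and return labels are hypothetical and are not the categories' official names.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open; start is the 5' side on '+'


def contains(outer: Interval, inner: Interval) -> bool:
    """True if inner lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def is_cassette_exon(exon: Interval, variant_introns: List[Interval]) -> bool:
    """An exon that falls completely within an intron of the variant transcript."""
    return any(contains(intron, exon) for intron in variant_introns)


def is_retained_intron(intron: Interval, variant_exons: List[Interval]) -> bool:
    """An intron that falls within an exon of the variant transcript."""
    return any(contains(exon, intron) for exon in variant_exons)


def compare_introns(a: Interval, b: Interval) -> str:
    """Classify how two introns relate at their splice sites."""
    if a == b:
        return "agree"
    if a[0] == b[0]:
        return "3' splice site differs"   # same 5' site, different 3' site
    if a[1] == b[1]:
        return "5' splice site differs"   # same 3' site, different 5' site
    return "unrelated"
```

A disagreement that fits none of these patterns would fall into the "clash" category described earlier and go to manual inspection.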

Splice Analysis (FV2)

32001 splice agreements, 1492 disagreements.
6873 ignored: 112 due to EST misalignment, 6761 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 13696/10397
0 query transcripts contained noncanonical splices.

transcripts with no splice problems        7089  92.0%
... with complete reference coverage       3783  49.1%
explainable by alternate splicing           553   7.2%
... with spliced reference                  443   5.7%
clashes                                      66   0.9%
... with spliced reference                   55   0.7%

Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 7.2% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.

in ref.in query
cassette exons272
retained introns56235
early 3' splices16189
late 5' splices2896

Splice Analysis (FG3)

11676 splice agreements, 590 disagreements.
4631 ignored: 90 due to EST misalignment, 4541 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 4246/4024
2 query transcripts contained noncanonical splices.

transcripts with no splice problems        3550  93.5%
... with complete reference coverage        886  23.3%
explainable by alternate splicing           156   4.1%
... with spliced reference                  105   2.8%
clashes                                      91   2.4%
... with spliced reference                   43   1.1%

Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 4.1% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.

in ref.in query
cassette exons20
retained introns9716
early 3' splices1325
late 5' splices516

Possible Problems

Problems-Fusarium oxysporum (FO2)

FO2_GENETRANSFER_NT_FG1_MIPS_GENES      24
FO2_GENETRANSFER_FG1_MIPS_GENES        372
FO2_GENETRANSFER_NT_FV3_1               27
{multiple sources}                    8884
FO2_GENETRANSFER_FV3_6                 273
FO2_FGENESH_1                         4812
FO2_GENEID_1                          3284
FO2_MANUAL_1                            59
short proteins < 50aa13410028118860<-- not tallied in problems
shorter proteins < 30aa0--------
very short proteins < 10aa0--------
initial exon ≤ 6bp18101149192370
internal exon ≤ 6bp610001804300
terminal exon ≤ 6bp12000013054530
≥ 15 exons700020221
intron ≥ 1000bp400001300
intron ≤ 20bp501040000
first codon not Met17016308123020<-- not tallied in problems
first codon not xTG14515906320020
first codon not known START15515907122020
last codon not STOP22057639725941
contains in-frame STOP0--------
coding length not modulo 317392813910500
non-canonical splicing110014927006
has ≥1 good BLAST hit12437212691971271913208155349<-- not tallied in problems
≤1/3 as long as BLAST hit2960313731151361
≥3× longer than BLAST hit380301901150
contains ≥1 N in exon40352217002
contains low-quality sequence1742371359711661532710<-- not tallied in problems
touches gap(s)1903102521610304
spans contigs179392441510303
within 1kb of contig edge11835343356334083368<-- not tallied in problems
any overlap (UTR or CDS)3090704931559488<-- in 309 clusters
CDS overlap only0--------
CDS overlap > 50bp0--------
CDS overlap > 100bp0--------
CDS overlap > 200bp0--------
has predicted UTR505644012451811913817154<-- not tallied in problems
UTR ≥ 50% length50301045475306<-- not tallied in problems
UTR is spliced413031354810289<-- not tallied in problems
one or more problems188612113159136846727919

Problems-Fusarium verticillioides (FV3)

{multiple sources}    1917
FV3_FINDORFS_6        3127
FV3_FGENESH_1         4468
FV3_MANUAL_1          1029
FV3_GENEID_4          3658
short proteins < 50aa16600145696<-- not tallied in problems
shorter proteins < 30aa0-----
very short proteins < 10aa0-----
initial exon ≤ 6bp1307081042
internal exon ≤ 6bp44073511
terminal exon ≤ 6bp740034040
≥ 15 exons1102423
intron ≥ 1000bp0-----
intron ≤ 20bp0-----
first codon not Met700061<-- not tallied in problems
first codon not xTG700061
first codon not known START700061
last codon not STOP12001101
contains in-frame STOP400040
coding length not modulo 313000130
non-canonical splicing0-----
has ≥1 good BLAST hit112621629288936096542481<-- not tallied in problems
≤1/3 as long as BLAST hit263477563114
≥3× longer than BLAST hit686437615
contains ≥1 N in exon0-----
contains low-scoring sequence36729781383686
touches gap(s)26012041
spans contigs26012041
within 1kb of contig edge1672027532344<-- not tallied in problems
any overlap (UTR or CDS)38278304161103127<-- in 382 clusters
CDS overlap only0-----
CDS overlap > 50bp0-----
CDS overlap > 100bp0-----
CDS overlap > 200bp0-----
has predicted UTR65676703123858982934<-- not tallied in problems
UTR ≥ 50% length9246531752382108<-- not tallied in problems
UTR is spliced7525523363246155<-- not tallied in problems
one or more problems1686122389549217409

Problems-Fusarium graminearum (FG3)

FG3_GENETRANSFER_FG1_MIPS_GENES    4327
FG3_GENETRANSFER_FV3_GENES          246
FG3_GENETRANSFER_AA_FV3_1           117
{multiple sources}                 6269
FG3_FL_EST_GENES                     23
FG3_FGENESH_1                       811
FG3_GENEID_1                       1460
FG3_MANUAL_1                         79
short proteins < 50aa53422503325<-- not tallied in problems
shorter proteins < 30aa0--------
very short proteins < 10aa0--------
initial exon ≤ 6bp126585026020161
internal exon ≤ 6bp68482080901
terminal exon ≤ 6bp693210207270
≥ 15 exons1380010121
intron ≥ 1000bp0--------
intron ≤ 20bp1451440100000
first codon not Met2588120015<-- not tallied in problems
first codon not xTG2478120015
first codon not known START2478120015
last codon not STOP34119810032
contains in-frame STOP200000002
coding length not modulo 31849020003
non-canonical splicing23157000001
has ≥1 good BLAST hit9446248722510352841570857648<-- not tallied in problems
≤1/3 as long as BLAST hit1652010152
≥3× longer than BLAST hit41190090850
contains ≥1 N in exon1027000001
contains low-quality sequence311651911128155284<-- not tallied in problems
touches gap(s)70280705012
spans contigs67180505012
within 1kb of contig edge4791191561630818312<-- not tallied in problems
any overlap (UTR or CDS)256912533401192613<-- in 256 clusters
CDS overlap only0--------
CDS overlap > 50bp0--------
CDS overlap > 100bp0--------
CDS overlap > 200bp0--------
has predicted UTR335933316762549238312375<-- not tallied in problems
UTR ≥ 50% length43726321319143222<-- not tallied in problems
UTR is spliced22217201144022315<-- not tallied in problems
one or more problems1078403521239511068524