Gene Finding Methods
Outline
- Overview
- Gene Structure Prediction
- Gene Naming
- Name Counts
- Gene Numbering
- Overview of Query Genes
- Overview of Reference genes
- Splice Analysis
- Possible Problems
Overview
This document explains how automated gene calls were produced for each of the Fusaria genomes. The final gene structures were annotated using a combination of ESTs, predicted ORFs, ab initio gene predictions, and mapped ORFs from either Fusarium graminearum (in the case of FV3) or Fusarium verticillioides (in the case of FO2). This automated gene calling process is described in Gene Structure Prediction. Genes were assigned names using our automated gene naming system. This process is described in Gene Naming.
Gene Structure Prediction
Gene structures were predicted using a combination of manual annotation, FGENESH, GENEID and EST-based genes called FindORFs. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL.
In FV3, the FGENESH and GENEID gene sets contained 13,806 and 15,408 gene predictions, respectively. Where multiple predictions overlapped each other and EST evidence, we chose the one most in accord with the EST splice sites. Where we had sufficient EST coverage to build complete ORFs, we replaced any overlapping predicted gene model with the ORF predicted purely from ESTs.
FindORFs is a gene finding program developed at Broad. The 3,856 findOrfs genes were built as follows. First, ESTs are aligned to the genome and grouped into loci consisting of overlapping ESTs. Then, each locus is examined for compatible splicing. If two ESTs in the same locus have identical splice sites where they overlap, they are considered fragments of a larger transcript. Putative transcripts are incrementally built out by adding additional ESTs to either end.
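The locus grouping and splice-compatibility test described above can be sketched roughly as follows. This is an illustration only, not the FindORFs implementation: the exon-list representation, the half-open coordinates, and the function names `group_into_loci` and `compatible` are all assumptions made for the example.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]   # (start, end) on the genome, end exclusive (assumed convention)
Est = List[Span]         # exons of one aligned EST, sorted by start (hypothetical representation)


def span(est: Est) -> Span:
    """Genomic extent of an aligned EST, from first exon start to last exon end."""
    return est[0][0], est[-1][1]


def group_into_loci(ests: List[Est]) -> List[List[Est]]:
    """Group ESTs whose genomic extents overlap, directly or transitively."""
    loci: List[List[Est]] = []
    locus_end = None
    for est in sorted(ests, key=span):
        start, end = span(est)
        if locus_end is not None and start < locus_end:
            loci[-1].append(est)
            locus_end = max(locus_end, end)
        else:
            loci.append([est])
            locus_end = end
    return loci


def introns(est: Est) -> List[Span]:
    """Gaps between consecutive exons."""
    return [(left[1], right[0]) for left, right in zip(est, est[1:])]


def compatible(a: Est, b: Est) -> bool:
    """True if the two ESTs overlap and use identical introns inside the overlap."""
    lo = max(span(a)[0], span(b)[0])
    hi = min(span(a)[1], span(b)[1])
    if lo >= hi:
        return False  # no overlap, so nothing to compare

    def in_overlap(est: Est) -> Set[Span]:
        return {(s, e) for s, e in introns(est) if s < hi and e > lo}

    return in_overlap(a) == in_overlap(b)


est1 = [(100, 200), (300, 400)]
est2 = [(150, 200), (300, 450), (500, 600)]  # same intron as est1 where they overlap
est3 = [(150, 250), (300, 400)]              # different 5' splice site
est4 = [(900, 1000)]                         # separate locus
loci = group_into_loci([est1, est2, est3, est4])
```

In this toy data, `est1` and `est2` could be merged into one putative transcript, while `est3` would stay separate despite sharing the locus.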
Each putative transcript is built from one or more ESTs, but may not represent the full biological transcript if the EST coverage is incomplete. We search each putative transcript for ORFs beginning with ATG and ending with a stop codon, with no frameshifts. If a putative transcript contains an ORF longer than 180 nt that covers 1/3 or more of its spliced length, we consider it a valid gene prediction. The full set of these predictions is in the group labeled 'findOrfs'. From these EST-based findOrfs transcripts, we further select as putative full-length gene models only those that are contained entirely within the best ab initio gene prediction. An independent BLAST-based analysis was also used to validate the reading frames by comparing the models to their best-hit known proteins in the NR database.
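As a rough illustration of the ORF criterion (longer than 180 nt and covering at least 1/3 of the spliced transcript), the sketch below scans the three forward frames of a spliced sequence. It is not the FindORFs code. It assumes the standard stop codons, counts the stop codon as part of the ORF length, and ignores the reverse strand; `longest_orf` and `is_valid_prediction` are hypothetical names.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def longest_orf(seq: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest ATG..stop ORF in the three forward frames."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # include the stop codon (assumption)
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None                   # look for the next ORF in this frame
    return best


def is_valid_prediction(seq: str, min_len: int = 180) -> bool:
    """Apply the thresholds from the text: ORF > 180 nt and >= 1/3 of spliced length."""
    orf = longest_orf(seq)
    if orf is None:
        return False
    length = orf[1] - orf[0]
    return length > min_len and 3 * length >= len(seq)


good = "CC" + "ATG" + "GCT" * 60 + "TAA" + "GG"     # 186 nt ORF in a 190 nt transcript
short = "ATGGCTTAA"                                  # ORF too short
sparse = "ATG" + "GCT" * 60 + "TAA" + "C" * 600      # ORF covers under 1/3 of transcript
```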
Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence or where EST and BlastX suggested a gene locus but gene predictions were absent. In total, 1016 manual annotations were carried out. At all loci with no manually annotated gene models but containing EST-based findOrfs, we picked the full-length findOrfs over any ab-initio gene predictions.
In FG3, the FGENESH and GENEID gene sets contained 11,782 and 12,670 gene predictions, respectively. The final gene set from FV3 was mapped onto the FG3 genome, with 6,994 transcripts successfully transferring. MIPS gene data from FG1 were also mapped onto FG3, with 13,909 transcripts successfully mapping. FindORFs, as described above, was run on this set and produced 2,431 genes. Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence or where EST and BlastX suggested a gene locus but gene predictions were absent. In total, 120 manual annotations were carried out. At all loci with no manually annotated gene models but containing ESTs, mapped genes from the MIPS FG1 data and the FV3 gene transfers, we picked ESTs and full-length mapped genes over any ab initio gene predictions.
In FO2, the FGENESH and GENEID gene sets contained 18,482 and 21,117 gene predictions, respectively. The final gene set from FV3 was mapped onto the FO2 genome, with 12,840 transcripts successfully transferring. FindORFs, as described above, was run on this set and produced 4,204 genes. Targeted manual annotation was carried out in loci where gene predictions clashed with EST evidence or where EST and BlastX suggested a gene locus but gene predictions were absent. In total, 94 manual annotations were carried out. At all loci with no manually annotated gene models but containing ESTs and mapped genes from the FV3 gene transfers, we picked ESTs and full-length mapped genes over any ab initio gene predictions.
Gene Naming
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently 5 types of gene name that fall into 3 categories:
- NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein
  - Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
    - At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
    - A minimum of 50% identity and 70% coverage of both the query and subject sequence.
  - The name will follow one of these three formats:
    - conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
    - NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise
    - hypothetical protein similar to NAME
  - Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest average identity to the amount of overlap to the target gene. In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
- Hypothetical protein
  - Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
    - BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
- Predicted protein
  - Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
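The naming rules above amount to a small decision procedure, sketched below. This is a simplified reading, not the actual naming system: it handles a single best hit, assumes the hit has already passed the 1e-10 expect threshold and had species names and GIs stripped, and uses a hypothetical function `assign_name` with hypothetical parameters.

```python
import re
from typing import Optional

# Words that mark a protein name as unverified, taken from the list above.
UNVERIFIED = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


def assign_name(hit_name: Optional[str], identity: float, query_cov: float,
                subject_cov: float, is_swissprot: bool) -> str:
    """Pick a gene name from the best BLASTP hit (None means no significant hit)."""
    if hit_name is None:
        return "predicted protein"
    # Below 50% identity or 70% coverage on either sequence: homology only.
    if identity < 50 or query_cov < 70 or subject_cov < 70:
        return "hypothetical protein"
    words = set(re.findall(r"[a-z]+", hit_name.lower()))
    if words & UNVERIFIED:
        return "conserved hypothetical protein"
    if is_swissprot:
        return hit_name
    return f"hypothetical protein similar to {hit_name}"


example = assign_name("chitin synthase", identity=80, query_cov=90,
                      subject_cov=90, is_swissprot=False)
```

Splitting the name on non-letters means a hyphenated name such as "kinase-like protein" is also caught by the unverified-word filter.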
Name Counts
Name Counts (FO2)
| "conserved hypothetical protein" | 9467 |
| "hypothetical protein" | 527 |
| "predicted protein" | 3927 |
| hypothetical protein similar to... | 2393 |
| other non-empty name | 1421 |
Name Counts (FV3)
| "conserved hypothetical protein" | 8913 |
| "hypothetical protein" | 628 |
| "predicted protein" | 1580 |
| hypothetical protein similar to... | 1921 |
| other non-empty name | 1157 |
Name Counts (FG3)
| "conserved hypothetical protein" | 6368 |
| "hypothetical protein" | 607 |
| "hypothetical protein " | 41 |
| "predicted protein" | 3067 |
| hypothetical protein similar to... | 1980 |
| other non-empty name | 1269 |
Gene Numbering
Every annotated gene is given a Locus Number of the form FV3G_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved by the locus.
Overview of Query Genes
Query Genes (FO2)
17735 genes (8847 on '+' strand, 8888 on '-')
17735 transcripts (13331 spliced, 4404 unspliced)
47896 exons, 30161 introns
| | len | %cov | %gc | %at | |
|---|---|---|---|---|---|
| genic | 26717541 | 44.58 | 50.73 | 49.27 | |
| intergenic | 33219242 | 55.42 | 46.52 | 53.48 | |
| exonic | 23693318 | 39.53 | 51.41 | 48.59 | |
| intronic | 3015569 | 5.03 | 45.47 | 54.53 | |
| coding | 21905719 | 36.55 | 51.88 | 48.12 | |
| 5' UTR | 773537 | 1.29 | 47.96 | 52.04 | |
| 3' UTR | 1171889 | 1.96 | 44.93 | 55.07 | |
| alt. spliced | 8654 | 0.01 | 44.94 | 55.06 | |
| genomic | 59936783 | 100.00 | 48.40 | 51.60 | |

| | min | median | mean | n50 | max |
|---|---|---|---|---|---|
| total length (incl. UTR + introns) | 90 | 1292 | 1519 | 1849 | 22869 |
| coding length | 90 | 1023 | 1236 | 1542 | 22596 |
| exons per transcript | 1 | 2 | 2.70 | 3 | 20 |
| exons per spliced transcript | 2 | 3 | 3.26 | 3 | 20 |
| bp per exon | 1 | 288 | 499 | 939 | 13941 |
| bp per intron | 4 | 58 | 101 | 129 | 2370 |
| 5' UTR bp | 1 | 120 | 186 | 306 | 6664 |
| 3' UTR bp | 1 | 194 | 279 | 391 | 5839 |
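The tables do not define the n50 and %cov columns. Assuming n50 follows the usual N50 convention (the length at which items of that length or longer account for at least half of the summed length) and %cov is the feature length as a share of the genomic length, both can be computed as in this sketch. The function names are hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that items of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 of an empty collection is undefined")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return ordered[-1]  # unreachable for positive lengths; kept for completeness


def percent_cov(feature_len: int, genome_len: int) -> float:
    """Feature length as a percentage of genome length, rounded to two decimals."""
    return round(100 * feature_len / genome_len, 2)


# Genic and genomic lengths taken from the FO2 table above.
fo2_genic_cov = percent_cov(26717541, 59936783)
```

Under this reading, `fo2_genic_cov` reproduces the 44.58 shown in the genic row.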
Query Genes (FV3)
14179 genes: 14199 transcripts (10878 spliced, 3321 unspliced; 7182/7017 +/-)
39508 exons, 25309 introns
| | len | %cov | %gc | %at |
|---|---|---|---|---|
| genic | 22749525 | 54.55 | 50.70 | 49.27 |
| intergenic | 18950820 | 45.45 | 46.31 | 53.73 |
| exonic | 20322204 | 48.73 | 51.33 | 48.67 |
| intronic | 2436142 | 5.84 | 45.44 | 54.26 |
| coding | 17889392 | 42.90 | 52.03 | 47.97 |
| 5' UTR | 1071342 | 2.57 | 48.34 | 51.66 |
| 3' UTR | 1519984 | 3.65 | 45.35 | 54.65 |
| alt. spliced | 8821 | 0.02 | 45.81 | 54.19 |
| genomic | 41700345 | 100.00 | 48.70 | 51.30 |

| | min | median | mean | max |
|---|---|---|---|---|
| overall length (incl. UTR) | 90 | 1239 | 1445 | 22650 |
| coding length | 90 | 1059 | 1260 | 22650 |
| exons per transcript | 1 | 2 | 2.78 | 21 |
| exons per spliced transcript | 2 | 3 | 3.33 | 21 |
| bp per exon | 1 | 303 | 519 | 22425 |
| bp per intron | 22 | 57 | 96 | 989 |
| 5' UTR bp | 1 | 123 | 195 | 2994 |
| 3' UTR bp | 1 | 191 | 277 | 3118 |
Query Genes (FG3)
13332 genes: 13332 transcripts (10278 spliced, 3054 unspliced; 6703/6629 +/-)
37575 exons, 24243 introns
| | len | %cov | %gc | %at |
|---|---|---|---|---|
| genic | 20999352 | 57.97 | 50.41 | 49.53 |
| intergenic | 15224289 | 42.03 | 45.47 | 54.62 |
| exonic | 18972091 | 52.37 | 51.14 | 48.83 |
| intronic | 2032901 | 5.61 | 43.53 | 56.05 |
| coding | 17874256 | 49.34 | 51.57 | 48.42 |
| 5' UTR | 387612 | 1.07 | 46.54 | 53.44 |
| 3' UTR | 799949 | 2.21 | 44.15 | 55.62 |
| alt. spliced | 5640 | 0.02 | 43.97 | 56.03 |
| genomic | 36223641 | 100.00 | 48.33 | 51.67 |

| | min | median | mean | max |
|---|---|---|---|---|
| overall length (incl. UTR) | 90 | 1206 | 1430 | 33594 |
| coding length | 90 | 1103 | 1340 | 33594 |
| exons per transcript | 1 | 2 | 2.82 | 21 |
| exons per spliced transcript | 2 | 3 | 3.36 | 21 |
| bp per exon | 1 | 283 | 507 | 29066 |
| bp per intron | 1 | 56 | 83 | 975 |
| 5' UTR bp | 1 | 111 | 158 | 4754 |
| 3' UTR bp | 1 | 190 | 263 | 3601 |
Overview of Reference genes
Reference Genes (FO2)
6568 genes (3278 on '+' strand, 4038 on '-')
7316 transcripts (3928 spliced, 3388 unspliced)
14174 exons, 6858 introns
| | len | %cov | %gc | %at | |
|---|---|---|---|---|---|
| genic | 7096135 | 11.84 | 51.14 | 48.86 | |
| intergenic | 52840648 | 88.16 | 48.03 | 51.97 | |
| exonic | 6664716 | 11.12 | 51.51 | 48.49 | |
| intronic | 389597 | 0.65 | 45.04 | 54.96 | |
| coding | 5489634 | 9.16 | 52.54 | 47.46 | |
| 5' UTR | 574430 | 0.96 | 49.65 | 50.35 | |
| 3' UTR | 750471 | 1.25 | 45.28 | 54.72 | |
| alt. spliced | 41822 | 0.07 | 48.48 | 51.52 | |
| genomic | 59936783 | 100.00 | 48.40 | 51.60 | |

| | min | median | mean | n50 | max |
|---|---|---|---|---|---|
| total length (incl. UTR + introns) | 27 | 958 | 1097 | 1252 | 5036 |
| coding length | 27 | 763 | 826 | 918 | 4932 |
| exons per transcript | 1 | 2 | 1.94 | 2 | 10 |
| exons per spliced transcript | 2 | 2 | 2.75 | 5 | 10 |
| bp per exon | 4 | 422 | 530 | 805 | 4541 |
| bp per intron | 20 | 55 | 75 | 63 | 782 |
| 5' UTR bp | 1 | 67 | 116 | 247 | 1897 |
| 3' UTR bp | 1 | 86 | 153 | 326 | 2411 |
Reference Genes (FV3)
9522 genes: 10855 transcripts (6669 spliced, 4186 unspliced; 5455/5400 +/-)
24590 exons, 13735 introns
| | len | %cov | %gc | %at |
|---|---|---|---|---|
| genic | 11182255 | 26.82 | 50.57 | 49.43 |
| intergenic | 30518090 | 73.18 | 48.02 | 51.98 |
| exonic | 10465540 | 25.10 | 50.94 | 49.06 |
| intronic | 805534 | 1.93 | 45.41 | 54.55 |
| coding | 8913639 | 21.38 | 51.95 | 48.05 |
| 5' UTR | 376175 | 0.90 | 47.25 | 52.75 |
| 3' UTR | 1392140 | 3.34 | 45.23 | 54.77 |
| alt. spliced | 88819 | 0.21 | 47.96 | 52.04 |
| genomic | 41700345 | 100.00 | 48.70 | 51.30 |

| | min | median | mean | max |
|---|---|---|---|---|
| overall length (incl. UTR) | 79 | 957 | 1137 | 7912 |
| coding length | 79 | 849 | 944 | 7824 |
| exons per transcript | 1 | 2 | 2.27 | 21 |
| exons per spliced transcript | 2 | 3 | 3.06 | 21 |
| bp per exon | 3 | 378 | 501 | 4737 |
| bp per intron | 20 | 54 | 72 | 791 |
| 5' UTR bp | 1 | 132 | 185 | 2994 |
| 3' UTR bp | 1 | 182 | 257 | 3021 |
Reference Genes (FG3)
4714 genes:
5163 transcripts
(2714 spliced, 2449 unspliced;
2623/2540 +/-)
10006 exons, 4843 introns
| | len | %cov | %gc | %at |
|---|---|---|---|---|
| genic | 4051152 | 11.18 | 50.66 | 49.34 |
| intergenic | 32172489 | 88.82 | 48.04 | 51.96 |
| exonic | 3758384 | 10.38 | 51.15 | 48.85 |
| intronic | 321070 | 0.89 | 44.58 | 55.39 |
| coding | 0 | 0.00 | - | - |
| 5' UTR | 0 | 0.00 | - | - |
| 3' UTR | 0 | 0.00 | - | - |
| alt. spliced | 28302 | 0.08 | 47.37 | 52.63 |
| genomic | 36223641 | 100.00 | 48.33 | 51.67 |

| | min | median | mean | max |
|---|---|---|---|---|
| overall length (incl. UTR) | 29 | 760 | 802 | 5079 |
| coding length | - | - | - | - |
| exons per transcript | 1 | 2 | 1.94 | 12 |
| exons per spliced transcript | 2 | 2 | 2.78 | 12 |
| bp per exon | 4 | 324 | 414 | 4244 |
| bp per intron | 24 | 56 | 76 | 767 |
| 5' UTR bp | - | - | - | - |
| 3' UTR bp | - | - | - | - |
Splice Analysis
Splice Analysis (FO2)
15205 splice agreements, 740 disagreements.
8566 ignored:
194 due to EST misalignment,
8372 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 5159/5521
10 query transcripts contained noncanonical splices.
| transcripts with no splice problems | 5145 | 94.2% |
| ... with complete reference coverage | 791 | 14.5% |
| explainable by alternate splicing | 212 | 3.9% |
| ... with spliced reference | 151 | 2.8% |
| clashes | 102 | 1.9% |
| ... with spliced reference | 67 | 1.2% |
Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 3.9% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
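The triage rule and the reported percentages can be reproduced with a short sketch. Two assumptions are made here: that the "well-known types" are the four categories tabulated in this section, and that each percentage is taken over the sum of the three top-level rows (5145 + 212 + 102 for FO2), which matches the figures shown. The names `triage` and `percentages` are hypothetical.

```python
from typing import Dict, Iterable

# Assumption: these four categories are the "well-known types of alternate splicing".
EXPLAINABLE = {"cassette exon", "retained intron", "early 3' splice", "late 5' splice"}


def triage(disagreements: Iterable[str]) -> str:
    """Place one query transcript into a triage category from its disagreement types."""
    kinds = list(disagreements)
    if not kinds:
        return "no splice problems"
    if all(kind in EXPLAINABLE for kind in kinds):
        return "possible alternate splice"
    return "clash"


def percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each category among all triaged transcripts, to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Top-level FO2 counts from the table above.
fo2_counts = {"no splice problems": 5145, "possible alternate splice": 212, "clash": 102}
fo2_percent = percentages(fo2_counts)
```

Only transcripts that land in the "clash" bucket would go on to manual inspection.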
| | in ref. | in query |
|---|---|---|
| cassette exons | 9 | 0 |
| retained introns | 123 | 28 |
| early 3' splices | 6 | 50 |
| late 5' splices | 8 | 26 |
- cassette exon: an exon that falls completely within an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction.
- retained intron: an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns.
- early 3' splices: two introns agree on their 5' splice site but differ on the 3' side, relative to the affected intron. In other words, differing 3' splice sites lie on the leading edge of an exon.
- late 5' splices: two introns agree on their 3' splice site but differ on the 5' side, again relative to the affected intron.

Most terminology is from Matlin AJ, et al. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.
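The geometric tests behind these definitions can be illustrated with interval comparisons. The sketch below is not the analysis code used for the tables. It assumes plus-strand transcripts with half-open (start, end) coordinates, so that an intron's first coordinate is its 5' splice site, and all function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), end exclusive, plus strand (assumed convention)


def introns_of(exons: List[Span]) -> List[Span]:
    """Gaps between consecutive exons of one transcript."""
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]


def cassette_exons(exons_a: List[Span], introns_b: List[Span]) -> List[Span]:
    """Exons of transcript A that fall completely within an intron of transcript B."""
    return [e for e in exons_a
            if any(i[0] <= e[0] and e[1] <= i[1] for i in introns_b)]


def retained_introns(introns_a: List[Span], exons_b: List[Span]) -> List[Span]:
    """Introns of transcript A that fall within an exon of transcript B."""
    return [i for i in introns_a
            if any(e[0] <= i[0] and i[1] <= e[1] for e in exons_b)]


def compare_introns(a: Span, b: Span) -> str:
    """Describe how two introns relate at their splice sites."""
    if a == b:
        return "agreement"
    if a[0] == b[0]:
        return "same 5' site, differing 3' site"
    if a[1] == b[1]:
        return "same 3' site, differing 5' site"
    return "no shared splice site"


ref_exons = [(0, 100), (200, 300), (400, 500)]
query_exons = [(0, 100), (200, 500)]   # reads through the second reference intron
skip_exons = [(0, 100), (400, 500)]    # skips the middle reference exon
```

With this toy data, the second reference intron is retained in `query_exons`, and the middle reference exon is a cassette exon relative to `skip_exons`.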
Splice Analysis (FV3)
32001 splice agreements, 1492 disagreements.
6873 ignored:
112 due to EST misalignment,
6761 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 13696/10397
0 query transcripts contained noncanonical splices.
| transcripts with no splice problems | 7089 | 92.0% |
| ... with complete reference coverage | 3783 | 49.1% |
| explainable by alternate splicing | 553 | 7.2% |
| ... with spliced reference | 443 | 5.7% |
| clashes | 66 | 0.9% |
| ... with spliced reference | 55 | 0.7% |
Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 7.2% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
| | in ref. | in query |
|---|---|---|
| cassette exons | 27 | 2 |
| retained introns | 56 | 235 |
| early 3' splices | 16 | 189 |
| late 5' splices | 28 | 96 |
Splice Analysis (FG3)
11676 splice agreements, 590 disagreements.
4631 ignored:
90 due to EST misalignment,
4541 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 4246/4024
2 query transcripts contained noncanonical splices.
| transcripts with no splice problems | 3550 | 93.5% |
| ... with complete reference coverage | 886 | 23.3% |
| explainable by alternate splicing | 156 | 4.1% |
| ... with spliced reference | 105 | 2.8% |
| clashes | 91 | 2.4% |
| ... with spliced reference | 43 | 1.1% |
Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 4.1% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
| | in ref. | in query |
|---|---|---|
| cassette exons | 2 | 0 |
| retained introns | 97 | 16 |
| early 3' splices | 13 | 25 |
| late 5' splices | 5 | 16 |
Possible Problems
Problems-Fusarium oxysporum (FO2)
| problem | total | FO2_GENETRANSFER_NT_FG1_MIPS_GENES | FO2_GENETRANSFER_FG1_MIPS_GENES | FO2_GENETRANSFER_NT_FV3_1 | {multiple sources} | FO2_GENETRANSFER_FV3_6 | FO2_FGENESH_1 | FO2_GENEID_1 | FO2_MANUAL_1 | note | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count per source | | 24 | 372 | 27 | 8884 | 273 | 4812 | 3284 | 59 | | |
| short proteins < 50aa | 134 | 1 | 0 | 0 | 28 | 1 | 18 | 86 | 0 | <-- not tallied in problems | |
| shorter proteins < 30aa | 0 | - | - | - | - | - | - | - | - | ||
| very short proteins < 10aa | 0 | - | - | - | - | - | - | - | - | ||
| initial exon ≤ 6bp | 181 | 0 | 1 | 1 | 49 | 1 | 92 | 37 | 0 | ||
| internal exon ≤ 6bp | 61 | 0 | 0 | 0 | 18 | 0 | 43 | 0 | 0 | ||
| terminal exon ≤ 6bp | 120 | 0 | 0 | 0 | 13 | 0 | 54 | 53 | 0 | ||
| ≥ 15 exons | 7 | 0 | 0 | 0 | 2 | 0 | 2 | 2 | 1 | ||
| intron ≥ 1000bp | 4 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | ||
| intron ≤ 20bp | 5 | 0 | 1 | 0 | 4 | 0 | 0 | 0 | 0 | ||
| first codon not Met | 170 | 1 | 63 | 0 | 81 | 23 | 0 | 2 | 0 | <-- not tallied in problems | |
| first codon not xTG | 145 | 1 | 59 | 0 | 63 | 20 | 0 | 2 | 0 | ||
| first codon not known START | 155 | 1 | 59 | 0 | 71 | 22 | 0 | 2 | 0 | ||
| last codon not STOP | 220 | 5 | 76 | 3 | 97 | 25 | 9 | 4 | 1 | ||
| contains in-frame STOP | 0 | - | - | - | - | - | - | - | - | ||
| coding length not modulo 3 | 173 | 9 | 2 | 8 | 139 | 10 | 5 | 0 | 0 | ||
| non-canonical splicing | 110 | 0 | 1 | 4 | 92 | 7 | 0 | 0 | 6 | ||
| has ≥1 good BLAST hit | 12437 | 21 | 269 | 19 | 7127 | 191 | 3208 | 1553 | 49 | <-- not tallied in problems | |
| ≤1/3 as long as BLAST hit | 296 | 0 | 3 | 1 | 37 | 3 | 115 | 136 | 1 | ||
| ≥3× longer than BLAST hit | 38 | 0 | 3 | 0 | 19 | 0 | 11 | 5 | 0 | ||
| contains ≥1 N in exon | 40 | 3 | 5 | 2 | 21 | 7 | 0 | 0 | 2 | ||
| contains low-quality sequence | 1742 | 3 | 71 | 3 | 597 | 116 | 615 | 327 | 10 | <-- not tallied in problems | |
| touches gap(s) | 190 | 3 | 10 | 2 | 52 | 16 | 103 | 0 | 4 | ||
| spans contigs | 179 | 3 | 9 | 2 | 44 | 15 | 103 | 0 | 3 | ||
| within 1kb of contig edge | 1183 | 5 | 34 | 3 | 356 | 33 | 408 | 336 | 8 | <-- not tallied in problems | |
| any overlap (UTR or CDS) | 309 | 0 | 7 | 0 | 493 | 15 | 59 | 48 | 8 | <-- in 309 clusters | |
| CDS overlap only | 0 | - | - | - | - | - | - | - | - | ||
| CDS overlap > 50bp | 0 | - | - | - | - | - | - | - | - | ||
| CDS overlap > 100bp | 0 | - | - | - | - | - | - | - | - | ||
| CDS overlap > 200bp | 0 | - | - | - | - | - | - | - | - | ||
| has predicted UTR | 5056 | 4 | 40 | 12 | 4518 | 119 | 138 | 171 | 54 | <-- not tallied in problems | |
| UTR ≥ 50% length | 503 | 0 | 1 | 0 | 454 | 7 | 5 | 30 | 6 | <-- not tallied in problems | |
| UTR is spliced | 413 | 0 | 3 | 1 | 354 | 8 | 10 | 28 | 9 | <-- not tallied in problems | |
| one or more problems | 1886 | 12 | 113 | 15 | 913 | 68 | 467 | 279 | 19 | ||
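Several of the rows above are simple checks on a transcript's coding sequence. The sketch below shows how four of them might be implemented. It is an illustration under stated assumptions, not the pipeline's code: it uses the standard stop codons, treats only ATG as Met, and the function name `cds_problems` is hypothetical.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def cds_problems(cds: str) -> List[str]:
    """Return the labels of the coding-sequence checks that a CDS fails."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("coding length not modulo 3")
    # Split into complete codons, dropping any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("first codon not Met")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("last codon not STOP")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("contains in-frame STOP")
    return problems


clean = cds_problems("ATGGCTTAA")
truncated = cds_problems("ATGGCTTA")
```

A gene can fail several checks at once, which is why the final "one or more problems" row is smaller than the sum of the individual rows.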
Problems-Fusarium verticillioides (FV3)
| problem | total | {multiple sources} | FV3_FINDORFS_6 | FV3_FGENESH_1 | FV3_MANUAL_1 | FV3_GENEID_4 | note | |
|---|---|---|---|---|---|---|---|---|
| count per source | | 1917 | 3127 | 4468 | 1029 | 3658 | | |
| short proteins < 50aa | 166 | 0 | 0 | 14 | 56 | 96 | <-- not tallied in problems | |
| shorter proteins < 30aa | 0 | - | - | - | - | - | ||
| very short proteins < 10aa | 0 | - | - | - | - | - | ||
| initial exon ≤ 6bp | 130 | 7 | 0 | 81 | 0 | 42 | ||
| internal exon ≤ 6bp | 44 | 0 | 7 | 35 | 1 | 1 | ||
| terminal exon ≤ 6bp | 74 | 0 | 0 | 34 | 0 | 40 | ||
| ≥ 15 exons | 11 | 0 | 2 | 4 | 2 | 3 | ||
| intron ≥ 1000bp | 0 | - | - | - | - | - | ||
| intron ≤ 20bp | 0 | - | - | - | - | - | ||
| first codon not Met | 7 | 0 | 0 | 0 | 6 | 1 | <-- not tallied in problems | |
| first codon not xTG | 7 | 0 | 0 | 0 | 6 | 1 | ||
| first codon not known START | 7 | 0 | 0 | 0 | 6 | 1 | ||
| last codon not STOP | 12 | 0 | 0 | 1 | 10 | 1 | ||
| contains in-frame STOP | 4 | 0 | 0 | 0 | 4 | 0 | ||
| coding length not modulo 3 | 13 | 0 | 0 | 0 | 13 | 0 | ||
| non-canonical splicing | 0 | - | - | - | - | - | ||
| has ≥1 good BLAST hit | 11262 | 1629 | 2889 | 3609 | 654 | 2481 | <-- not tallied in problems | |
| ≤1/3 as long as BLAST hit | 263 | 4 | 7 | 75 | 63 | 114 | ||
| ≥3× longer than BLAST hit | 68 | 6 | 4 | 37 | 6 | 15 | ||
| contains ≥1 N in exon | 0 | - | - | - | - | - | ||
| contains low-scoring sequence | 367 | 29 | 78 | 138 | 36 | 86 | ||
| touches gap(s) | 26 | 0 | 1 | 20 | 4 | 1 | ||
| spans contigs | 26 | 0 | 1 | 20 | 4 | 1 | ||
| within 1kb of contig edge | 167 | 20 | 27 | 53 | 23 | 44 | <-- not tallied in problems | |
| any overlap (UTR or CDS) | 382 | 78 | 304 | 161 | 103 | 127 | <-- in 382 clusters | |
| CDS overlap only | 0 | - | - | - | - | - | ||
| CDS overlap > 50bp | 0 | - | - | - | - | - | ||
| CDS overlap > 100bp | 0 | - | - | - | - | - | ||
| CDS overlap > 200bp | 0 | - | - | - | - | - | ||
| has predicted UTR | 6567 | 670 | 3123 | 858 | 982 | 934 | <-- not tallied in problems | |
| UTR ≥ 50% length | 924 | 65 | 317 | 52 | 382 | 108 | <-- not tallied in problems | |
| UTR is spliced | 752 | 55 | 233 | 63 | 246 | 155 | <-- not tallied in problems | |
| one or more problems | 1686 | 122 | 389 | 549 | 217 | 409 | ||
Problems-Fusarium graminearum (FG3)
| problem | total | FG3_GENETRANSFER_FG1_MIPS_GENES | FG3_GENETRANSFER_FV3_GENES | FG3_GENETRANSFER_AA_FV3_1 | {multiple sources} | FG3_FL_EST_GENES | FG3_FGENESH_1 | FG3_GENEID_1 | FG3_MANUAL_1 | note | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count per source | | 4327 | 246 | 117 | 6269 | 23 | 811 | 1460 | 79 | | |
| short proteins < 50aa | 53 | 4 | 2 | 2 | 5 | 0 | 3 | 32 | 5 | <-- not tallied in problems | |
| shorter proteins < 30aa | 0 | - | - | - | - | - | - | - | - | ||
| very short proteins < 10aa | 0 | - | - | - | - | - | - | - | - | ||
| initial exon ≤ 6bp | 126 | 58 | 5 | 0 | 26 | 0 | 20 | 16 | 1 | ||
| internal exon ≤ 6bp | 68 | 48 | 2 | 0 | 8 | 0 | 9 | 0 | 1 | ||
| terminal exon ≤ 6bp | 69 | 32 | 1 | 0 | 2 | 0 | 7 | 27 | 0 | ||
| ≥ 15 exons | 13 | 8 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | ||
| intron ≥ 1000bp | 0 | - | - | - | - | - | - | - | - | ||
| intron ≤ 20bp | 145 | 144 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ||
| first codon not Met | 25 | 8 | 8 | 1 | 2 | 0 | 0 | 1 | 5 | <-- not tallied in problems | |
| first codon not xTG | 24 | 7 | 8 | 1 | 2 | 0 | 0 | 1 | 5 | ||
| first codon not known START | 24 | 7 | 8 | 1 | 2 | 0 | 0 | 1 | 5 | ||
| last codon not STOP | 34 | 11 | 9 | 8 | 1 | 0 | 0 | 3 | 2 | ||
| contains in-frame STOP | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | ||
| coding length not modulo 3 | 18 | 4 | 9 | 0 | 2 | 0 | 0 | 0 | 3 | ||
| non-canonical splicing | 23 | 15 | 7 | 0 | 0 | 0 | 0 | 0 | 1 | ||
| has ≥1 good BLAST hit | 9446 | 2487 | 225 | 103 | 5284 | 15 | 708 | 576 | 48 | <-- not tallied in problems | |
| ≤1/3 as long as BLAST hit | 16 | 5 | 2 | 0 | 1 | 0 | 1 | 5 | 2 | ||
| ≥3× longer than BLAST hit | 41 | 19 | 0 | 0 | 9 | 0 | 8 | 5 | 0 | ||
| contains ≥1 N in exon | 10 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 1 | ||
| contains low-quality sequence | 311 | 65 | 19 | 11 | 128 | 1 | 55 | 28 | 4 | <-- not tallied in problems | |
| touches gap(s) | 70 | 2 | 8 | 0 | 7 | 0 | 50 | 1 | 2 | ||
| spans contigs | 67 | 1 | 8 | 0 | 5 | 0 | 50 | 1 | 2 | ||
| within 1kb of contig edge | 479 | 119 | 15 | 6 | 163 | 0 | 81 | 83 | 12 | <-- not tallied in problems | |
| any overlap (UTR or CDS) | 256 | 91 | 25 | 3 | 340 | 1 | 19 | 26 | 13 | <-- in 256 clusters | |
| CDS overlap only | 0 | - | - | - | - | - | - | - | - | ||
| CDS overlap > 50bp | 0 | - | - | - | - | - | - | - | - | ||
| CDS overlap > 100bp | 0 | - | - | - | - | - | - | - | - | ||
| CDS overlap > 200bp | 0 | - | - | - | - | - | - | - | - | ||
| has predicted UTR | 3359 | 333 | 167 | 6 | 2549 | 23 | 83 | 123 | 75 | <-- not tallied in problems | |
| UTR ≥ 50% length | 437 | 26 | 32 | 1 | 319 | 1 | 4 | 32 | 22 | <-- not tallied in problems | |
| UTR is spliced | 222 | 17 | 20 | 1 | 144 | 0 | 2 | 23 | 15 | <-- not tallied in problems | |
| one or more problems | 1078 | 403 | 52 | 12 | 395 | 1 | 106 | 85 | 24 | ||












