Gene Finding Methods

Overview

This document explains how automated gene calls were produced for each of the Fusarium genomes. The final gene structures were created using a combination of RNAseq/EST evidence, ab initio gene predictions and BLAST hits to the UniRef90 protein sequence database. This automated gene calling process is described in RNAseq-Based Gene Structure Prediction. Gene product names were assigned as described in Gene Naming.

RNAseq-Based Gene Structure Prediction

For the Fusarium genome annotation, we use EVM and RNAseq/PASA to annotate, or update the annotation of, 12 Fusarium oxysporum strains plus Fusarium graminearum PH-1 and Fusarium verticillioides 7600. We use a large collection of RNAseq/EST data: 39 strand-specific, paired-read data sets generated at the Broad Institute (source organisms: Fusarium oxysporum, Fusarium graminearum PH-1 and Fusarium verticillioides 7600), 18 non-stranded, paired-read RNAseq data sets from collaborators (source organisms: Fusarium oxysporum f. sp. lycopersici 4287, Fusarium solani and F. oxysporum f. sp. pisi NRRL 37622), and 4 EST data sets from 454 sequencing (source organism: Fusarium oxysporum II5). We use the Trinity transcript assembler (Grabherr et al., Nature Biotech., 2011, 29, 644-652) to assemble each RNAseq data set separately, then combine all the strand-specific assemblies into one FASTA file, and all the non-stranded assemblies and the ESTs into a second FASTA file. We then use PASA to align the two transcript FASTA files to the genome assemblies, producing two sets of PASA alignments: one for the strand-specific data and one for the non-stranded data and the ESTs. Combining the individual Trinity assemblies for PASA alignment avoids the major memory problem that arises when all the raw-read BAM files are first merged into a single BAM file and then assembled with Trinity.
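The pooling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual script: the function names and the `label|id` naming scheme are assumptions introduced here, on the reasoning that separate Trinity runs may emit identical transcript IDs and so each record is prefixed with its data set label before pooling.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pooled_records(datasets: Dict[str, Iterable[str]]) -> Iterator[Tuple[str, str]]:
    """Yield records from every data set, prefixing each header with the
    data set label (hypothetical scheme) so IDs stay unique after pooling."""
    for label, lines in datasets.items():
        for header, seq in read_fasta(lines):
            yield f"{label}|{header}", seq


def write_pooled_fasta(paths: Dict[str, str], out_path: str) -> int:
    """Pool the per-data-set assembly files into one FASTA file.

    `paths` maps a data set label to the path of its assembly FASTA.
    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w") as out:
        for label, path in paths.items():
            with open(path) as handle:
                for header, seq in pooled_records({label: handle}):
                    out.write(f">{header}\n{seq}\n")
                    count += 1
    return count
```

Under this sketch, `write_pooled_fasta` would be called twice: once with the strand-specific assemblies, and once with the non-stranded assemblies plus the ESTs, giving the two FASTA files that are passed to PASA.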

For gene prediction with EVM, we combine ab initio gene models predicted by GeneMarkES, GeneId, Augustus, GlimmerHMM and SNAP with the strand-specific PASA alignments and GeneWise features derived from BLAST hits against the UniRef90 database. The EVM gene models are first updated with PASA alignments from the 39 stranded RNAseq data sets, and the output is updated again with PASA alignments from the 22 non-stranded RNAseq/EST data sets. The resulting track is filtered to remove spurious genes arising from repeat sequences (based on TransposonPSI predictions, repeat PFAM domains, BLAST hits to RepBase and CDS alignment to >10 different locations in the genome), and excessive UTRs that overlap neighboring CDSs are trimmed back to the gene's own CDS start, CDS stop, or both. Additional gene models are added from non-overlapping ORFs from the 39 stranded RNAseq data sets and from the 22 non-stranded RNAseq/EST data sets. We also generate a track of EVMLITE gene models from PASA ORFs and the ab initio gene models, and use it to add back a gene model if it does not overlap the EVM gene models but is present in OrthoMCL clusters with at least 2 genomes. For genomes with a previous annotation (FO2, FG3 and FV3), the old gene models are repeat-filtered and included in the final gene set if they do not overlap existing gene models.
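The add-back rule for EVMLITE models can be illustrated with a small interval check. This is a sketch under stated assumptions, not the production code: the `GeneModel` fields, the `add_back` function, and the `genomes_in_cluster` mapping (gene ID to the number of genomes in its OrthoMCL cluster) are hypothetical names, overlap is tested on whole gene spans regardless of strand, and a candidate is also rejected if it overlaps a model that was added back earlier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GeneModel:
    gene_id: str
    contig: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    """True if two models share at least one base on the same contig."""
    return a.contig == b.contig and a.start <= b.end and b.start <= a.end


def add_back(
    primary: List[GeneModel],
    candidates: List[GeneModel],
    genomes_in_cluster: Dict[str, int],
    min_genomes: int = 2,
) -> List[GeneModel]:
    """Return the primary models plus every candidate that (1) belongs to
    an ortholog cluster covering at least `min_genomes` genomes and
    (2) overlaps no model already accepted."""
    accepted = list(primary)
    for cand in candidates:
        # Candidates without cluster support are skipped (assumed count 0).
        if genomes_in_cluster.get(cand.gene_id, 0) < min_genomes:
            continue
        if any(overlaps(cand, kept) for kept in accepted):
            continue
        accepted.append(cand)
    return accepted
```

The same overlap test would also serve for the final step, where repeat-filtered models from a previous annotation are kept only if they do not overlap existing models; in that case the cluster condition would simply be dropped.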

Identification of non-coding RNA genes: rRNA genes are predicted by RNAmmer and tRNAs by tRNAscan-SE.

Gene Naming

Gene product names are assigned based on Kinannote predictions (Goldberg et al., Bioinformatics, 2013, Aug 13, doi: 10.1093/bioinformatics/btt419), the best BLAST hit to the SwissProt database (protein identity >= 70% and query coverage >= 70%), HMM profile alignment to TIGRfam equivalogs, and the best BLAST hit to KEGG protein sequences with KO numbers (protein identity >= 50% and query coverage >= 50%). The remaining genes are assigned "hypothetical protein" as the gene product name.
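The naming rules can be expressed as a short decision function. The sketch below assumes that the evidence sources are consulted in the order they are listed above, which the text does not state explicitly; the `BlastHit` type and the function names are hypothetical and serve only to make the thresholds concrete.

```python
from typing import NamedTuple, Optional


class BlastHit(NamedTuple):
    product: str
    identity: float  # percent protein identity
    coverage: float  # percent query coverage


def passes(hit: Optional[BlastHit], min_identity: float, min_coverage: float) -> bool:
    """True if a hit exists and meets both percentage thresholds."""
    return (
        hit is not None
        and hit.identity >= min_identity
        and hit.coverage >= min_coverage
    )


def assign_product_name(
    kinannote: Optional[str],
    swissprot: Optional[BlastHit],
    tigrfam: Optional[str],
    kegg: Optional[BlastHit],
) -> str:
    """Pick a product name; the priority order is an assumption."""
    if kinannote:
        return kinannote
    if passes(swissprot, 70.0, 70.0):
        return swissprot.product
    if tigrfam:
        return tigrfam
    if passes(kegg, 50.0, 50.0):
        return kegg.product
    return "hypothetical protein"
```

For example, a gene whose only evidence is a SwissProt hit at 65% identity and 90% coverage fails the 70% identity cutoff and would be named "hypothetical protein" under this sketch.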

Gene Numbering

Every annotated gene is given a Locus Number of the form FV3G_#####, which should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene, even across different assemblies. Locus numbers are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that it is a bad idea to encode attributes of an object, such as position, in its identifier. Position is an attribute of a gene that can be retrieved using its locus number.
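In practice this means treating the locus number as an opaque key and looking position up in a separate table. The sketch below assumes that each # in FV3G_##### stands for one digit; the `Position` type, the function names, and the coordinates in the usage note are made-up placeholders, not real annotation data.

```python
import re
from typing import Dict, NamedTuple

# Assumption: the pattern FV3G_##### means the prefix plus exactly five digits.
LOCUS_RE = re.compile(r"FV3G_\d{5}")


class Position(NamedTuple):
    contig: str
    start: int
    end: int
    strand: str


def is_valid_locus(locus: str) -> bool:
    """Check the identifier's shape only; no meaning is read from the digits."""
    return LOCUS_RE.fullmatch(locus) is not None


def position_of(locus: str, table: Dict[str, Position]) -> Position:
    """Retrieve a gene's position as an attribute keyed by its locus number."""
    if not is_valid_locus(locus):
        raise ValueError(f"not a locus number: {locus!r}")
    return table[locus]
```

A caller would build `table` per assembly, for instance `{"FV3G_00001": Position("contig_1", 1200, 3400, "+")}` with placeholder values, so that the same locus number can map to different coordinates in different assemblies without the identifier itself changing.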