Gene Finding for Plasmodium falciparum HB3
Overview
This document explains how automated gene calls were produced for the Plasmodium falciparum HB3 WGS assembly. Gene structures were built from a combination of ab initio gene predictions and transcripts mapped from Plasmodium falciparum 3D7, as described in Gene Prediction. Genes were then named by our automated gene naming system, as described in Gene Naming.
Gene Prediction
Gene structures were predicted using a combination of transferred 3D7 genes, in silico gene finding and manual annotation.
The in silico gene predictions came from FGENESH, GENEID, PHAT, GeneWise and an EST-based gene finder called findORFs. FGENESH is a commercial gene prediction program that was developed by Solovyev and is available from Softberry. GENEID was developed by Enrique Blanco and Roderic Guigó, and is available under the GPL. PHAT was developed by Anthony Wirth, Simon Cawley and Terry Speed. GeneWise was developed by Ewan Birney, Michele Clamp and Richard Durbin. The FGENESH, GENEID, PHAT and GeneWise gene sets contained 7,168, 6,516, 7,268 and 3,293 gene predictions, respectively.
findORFs is a gene finding program developed at Broad. It was used in this study to find genes as a complement to synteny-based mapping. The 4,238 findORFs genes were built as follows. First, P. falciparum 3D7 transcripts were aligned to the genome and grouped into loci consisting of overlapping CDSs. Then, each locus was examined for compatible splicing: if two CDSs in the same locus had identical splice sites where they overlapped, they were considered fragments of a larger transcript. Putative transcripts were built out incrementally by adding additional ESTs to either end. Each putative transcript was built from one or more CDSs, but may not represent the full biological transcript if CDS coverage is incomplete. Each putative transcript was then searched for ORFs beginning with ATG and ending with a stop codon, with no frameshifts. A putative transcript containing an ORF longer than 180 nt that covers 1/3 or more of its spliced length was considered a valid gene prediction. The full set of these predictions is in the group labeled 'findORFs'.
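The ORF acceptance criterion above can be sketched in Python. This is only an illustration under stated assumptions, not the findORFs implementation: the function names are hypothetical, only the forward strand of an already-spliced transcript is scanned, the standard stop codons TAA, TAG and TGA are assumed, and the ORF length is assumed to include the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_orfs(seq):
    """Yield (start, end) for each ATG...stop ORF in the three forward frames.

    `end` is exclusive and includes the stop codon. Reading a single frame
    codon by codon means the ORF cannot contain a frameshift.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                yield (start, i + 3)
                start = None  # look for the next ORF in this frame


def is_valid_prediction(seq, min_len=180):
    """True if some ORF is longer than min_len nt and covers >= 1/3 of seq."""
    for start, end in find_orfs(seq):
        length = end - start
        # Integer arithmetic avoids floating-point issues with 1/3.
        if length > min_len and 3 * length >= len(seq):
            return True
    return False
```

For example, a 186 nt ORF passes when it is the whole transcript, but fails the 1/3 coverage test when embedded in a 600 nt transcript.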
The HB3 gene finding was performed in three steps:
In Step 1, P. falciparum 3D7 transcripts (from PlasmoDB 5.0) were transferred to HB3 via synteny-based mapping. In synteny-based mapping, we first built a one-to-one alignment between HB3 sequences and 3D7 sequences, and then used that correspondence to transfer genes from 3D7 to HB3. The transferred 3D7 transcripts (5,102 in total) with no problems were used directly as the gene calls in HB3, while the transferred transcripts with one or more problems were compared with predictions from FGENESH, GENEID, PHAT, GeneWise, and findORFs (built from aligned 3D7 transcripts). If such a prediction had no problems, was at least 70% of the length of the transferred transcript, and differed from it in CDS length by less than 500 bp, the transferred feature was replaced by the prediction. The remaining transferred 3D7 transcripts with problems were manually reviewed and adjusted when appropriate.
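The Step 1 replacement rule can be written as a small decision function. The sketch below is illustrative only: the `Model` class, its field names, and the function names are hypothetical, and it assumes the 70% test is applied to transcript length while the 500 bp test is applied to the absolute difference in CDS length.

```python
from dataclasses import dataclass


@dataclass
class Model:
    """Hypothetical summary of one gene model."""
    transcript_length: int  # bp
    cds_length: int         # bp
    has_problem: bool       # e.g. a frameshift or internal stop


def can_replace(transferred, candidate,
                min_length_ratio=0.70, max_cds_diff=500):
    """Apply the three replacement conditions described in Step 1."""
    if candidate.has_problem:
        return False
    long_enough = (candidate.transcript_length
                   >= min_length_ratio * transferred.transcript_length)
    cds_close = abs(candidate.cds_length - transferred.cds_length) < max_cds_diff
    return long_enough and cds_close


def choose_gene_call(transferred, candidates):
    """Return the model to use, or None if manual review is needed."""
    if not transferred.has_problem:
        return transferred  # clean transfers are used directly
    for candidate in candidates:
        if can_replace(transferred, candidate):
            return candidate
    return None  # left for manual review
```

Returning `None` here stands in for the manual review step; how candidates were ordered when several qualified is not stated in the text, so the sketch simply takes the first.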
In Step 2, additional transcripts were added from FGENESH, GENEID, PHAT or GeneWise predictions when a prediction did not overlap any synteny-mapped feature but did overlap spliced EST reference genes and/or had BLAST hits to the NR protein database (expect <= 1e-10).
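The Step 2 filter amounts to an interval-overlap test plus an evidence test. The following sketch assumes all features lie on the same sequence and are given as half-open (start, end) coordinate pairs, ignores strand, and uses hypothetical function and parameter names.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def keep_additional_prediction(prediction, synteny_features, est_genes,
                               best_blast_evalue=None, max_evalue=1e-10):
    """Decide whether an ab initio prediction is added in Step 2.

    prediction        -- (start, end) of the candidate gene
    synteny_features  -- list of (start, end) for synteny-mapped features
    est_genes         -- list of (start, end) for spliced EST reference genes
    best_blast_evalue -- lowest expect value against NR, or None if no hit
    """
    # Reject anything that overlaps a synteny-mapped feature.
    if any(overlaps(prediction, f) for f in synteny_features):
        return False
    has_est = any(overlaps(prediction, g) for g in est_genes)
    has_blast = best_blast_evalue is not None and best_blast_evalue <= max_evalue
    return has_est or has_blast
```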
In Step 3, the transcripts from Step 1 and Step 2 were combined. Targeted manual annotation was carried out in loci with problems, e.g., where gene predictions clashed with EST evidence, or where the CDS contained frameshifts or stop codons. In all, 539 manual annotations were carried out. This produced the final P. falciparum HB3 gene set of 5,623 genes.
Gene Naming
Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.
We hope to improve the gene naming process in the future based on Gene Ontology categories.
There are currently five types of gene names that fall into three categories:
- NAME, or hypothetical protein similar to NAME, or conserved hypothetical protein
  - Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:
    - At least one BLASTP hit to a known NR protein (complexity filtering off, -F F, expect = 1e-10),
    - A minimum of 50% identity and 70% coverage of both the query and subject sequence.
  - The name will follow one of these three formats:
    - conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, otherwise
    - NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise
    - hypothetical protein similar to NAME
Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits, we choose the one with the highest average identity relative to the amount of overlap with the target gene.
In all cases we take the NR protein name and filter out the species name, GIs, parenthetical comments, extra white space, etc.
- Hypothetical protein
  - Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:
    - BLASTP hit to NR (complexity filtering off, -F F, expect = 1e-10)
- Predicted protein
  - Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.
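The naming categories above can be read as a decision procedure over the best BLASTP hit. The sketch below is a hedged illustration, not the naming system itself: the dictionary keys and the function name are hypothetical, the unverified-word test is assumed to match whole words after splitting on non-letters, names are returned in lower case, and the cleanup of NR names (species, GIs, parenthetical comments) is assumed to have been done already.

```python
import re

# Words that mark a name as unverified, taken from the list above.
UNVERIFIED_WORDS = {
    "fragment", "homolog", "hypothetical", "like", "predicted", "probable",
    "putative", "related", "similar", "synthetic", "unknown", "unnamed",
}


def assign_name(hit, max_evalue=1e-10):
    """Return a gene name from the best BLASTP hit, or from no hit (None).

    `hit` is a dict with hypothetical keys: name, evalue, identity (%),
    query_cov (%), subject_cov (%), swissprot (bool).
    """
    if hit is None or hit["evalue"] > max_evalue:
        return "predicted protein"
    strong = (hit["identity"] >= 50
              and hit["query_cov"] >= 70
              and hit["subject_cov"] >= 70)
    if not strong:
        return "hypothetical protein"
    # Split on non-letters so that e.g. "kinase-like" yields the word "like".
    words = set(re.findall(r"[a-z]+", hit["name"].lower()))
    if words & UNVERIFIED_WORDS:
        return "conserved hypothetical protein"
    if hit["swissprot"]:
        return hit["name"]
    return "hypothetical protein similar to " + hit["name"]
```

Checking the unverified-word list before the Swiss-Prot test mirrors the order of the three formats in the list, so a Swiss-Prot entry named "putative kinase" would still become a conserved hypothetical protein.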
Gene Locus Numbers
Every annotated gene is given a Locus Number of the form PFHG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. Position is an attribute of a gene that can be retrieved by the locus.
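Since loci are opaque identifiers, code that handles them should only check their shape and never infer order or position from the digits. A minimal sketch, assuming that each `#` in PFHG_##### stands for exactly one decimal digit (the function name is hypothetical):

```python
import re

# Assumption: "PFHG_#####" means the prefix followed by exactly five digits.
LOCUS_RE = re.compile(r"PFHG_\d{5}")


def is_locus_number(text):
    """True if text has the form PFHG_ followed by five digits."""
    return LOCUS_RE.fullmatch(text) is not None
```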
Structure Prediction Validation
To evaluate the accuracy of our gene predictions for Plasmodium falciparum HB3 nuclear assembly 1, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.
The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5–10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in those two sections and immediately below do not apply evenly to all predicted genes.
