Argo File Formats: GFF2

GFF (General Feature Format) is a simple tab-delimited format for describing genomic features. It is designed for easy sharing of the bare basics. The problem is that there are at least four slightly but significantly different flavors: GFF1, GFF2, GFF3 and GTF2. If you are unfamiliar with GFF and its flavors, read the GFF overview to decide which flavor best suits your needs (or to identify the flavor of a file you already have). This document describes the GFF2 flavor.

File Extensions

Give your GFF2 files the extension '.gff2' instead of '.gff' so that Argo will interpret them correctly. If you do not use '.gff2', Argo will prompt you to choose a flavor, which is annoying and error-prone.

GFF Records

A GFF file consists of one or more records, each of which represents a simple start-to-stop feature. Records are separated by newlines, one record per line. Each record has 9 tab-separated fields, the last of which is optional. This last field is the only one that differs among the GFF flavors, but the difference is significant. In GFF2, it holds descriptive attributes. Attribute keys and values are separated by spaces. Values containing spaces must be double quoted. Attribute pairs are separated by semicolons. The GFF2 format cannot group records, so it cannot describe compound features made up of multiple subfeatures.
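
The attribute rules above can be sketched in Python. This is only an illustration, not how Argo itself reads the field: the function name parse_gff2_attributes is hypothetical, and the sketch assumes quotes are balanced and that a semicolon inside double quotes belongs to the value.

```python
import re

def parse_gff2_attributes(field):
    """Split a GFF2 attribute field into {key: [values]}.

    Hypothetical helper, not part of Argo. Semicolons separate pairs,
    whitespace separates a key from its values, and double quotes
    protect values that contain spaces or semicolons.
    """
    attributes = {}
    # Each pair is a run of quoted strings and characters that are
    # neither a semicolon nor a quote.
    for pair in re.findall(r'(?:[^;"]|"[^"]*")+', field):
        tokens = []
        for match in re.finditer(r'"([^"]*)"|(\S+)', pair):
            quoted, bare = match.group(1), match.group(2)
            tokens.append(quoted if quoted is not None else bare)
        if tokens:
            # First token is the key, the rest are its values.
            attributes[tokens[0]] = tokens[1:]
    return attributes

print(parse_gff2_attributes('Target "HBA_HUMAN" 11 55 ; E_value 0.0003'))
```

Values are kept as strings, because the format does not say which attributes are numeric.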

GFF files contain features but no sequence. To view them in Argo, you must first load sequence data (for example, a FASTA file) and then superimpose the GFF files onto the sequence.

Examples

Here are some sample GFF2 records:

seq1     BLASTX  similarity   101  235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 Sequence "dJ102G20.C1.1"

Note that tabs have been replaced with spaces here for easier viewing.
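
Because the samples are shown with spaces, they are not valid GFF2 as displayed. The following Python sketch rebuilds a tab-delimited record from such a line. The helper name spaces_to_tabs is hypothetical, and the approach assumes that none of the first eight fields contains whitespace, so everything after them is the attribute field.

```python
def spaces_to_tabs(display_line):
    """Rebuild a tab-delimited GFF2 record from a space-aligned display line.

    Hypothetical helper. Assumes none of the first eight fields contains
    whitespace; everything after them is treated as the attribute field.
    """
    # Split on runs of whitespace at most 8 times, so the attribute
    # field (which legitimately contains spaces) stays in one piece.
    fields = display_line.strip().split(None, 8)
    if len(fields) < 8:
        raise ValueError("expected at least 8 fields, got %d" % len(fields))
    return "\t".join(fields)

shown = 'seq1     BLASTX  similarity   101  235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003'
record = spaces_to_tabs(shown)
print(record.split("\t"))
```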

Field Descriptions

Note: All GFF flavors are identical up to the last field (field 9).

  1. seqname - The name of the sequence, typically a chromosome or a contig. Argo does not care what you put here. It will superimpose GFF features on any sequence you like.
  2. source - The program that generated this feature. Argo displays the value of this field in the inspector but does not do anything special with it.
  3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon". Argo does not do anything with this value except display it.
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000. If there is no score value, enter ".".
  7. strand - Valid entries are '+', '-', and '.' (for don't know/don't care).
  8. frame - If the feature is a coding exon, frame should be a number from 0 to 2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'. Argo does not do anything with this field except display its value.
  9. attributes (GFF2) - Attribute keys and values are separated by spaces. Values containing spaces must be double quoted. Attribute pairs are separated by semicolons. This field is the one important difference between GFF flavors.
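
As an illustration of the field list above, here is a minimal Python sketch that reads one tab-delimited record into typed fields. The names Gff2Record, parse_gff2_line, feature_length and python_slice are hypothetical and have nothing to do with Argo's own code. The sketch treats '.' in the score and frame fields as "no value" and leaves the attribute field as a raw string.

```python
from typing import NamedTuple, Optional

class Gff2Record(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int              # 1-based
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the file has "."
    strand: str             # "+", "-" or "."
    frame: Optional[int]    # 0, 1, 2, or None when the file has "."
    attributes: str         # raw field 9, empty string if absent

def parse_gff2_line(line):
    """Parse one tab-delimited GFF2 record (hypothetical helper)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (8, 9):
        raise ValueError("expected 8 or 9 fields, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    if strand not in ("+", "-", "."):
        raise ValueError("invalid strand: %r" % strand)
    return Gff2Record(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        fields[8] if len(fields) == 9 else "",
    )

def feature_length(rec):
    # Both ends are inclusive, so add 1.
    return rec.end - rec.start + 1

def python_slice(rec):
    # Convert 1-based inclusive coordinates to a 0-based half-open slice.
    return slice(rec.start - 1, rec.end)

rec = parse_gff2_line('dJ102G20\tGD_mRNA\tcoding_exon\t7105\t7201\t.\t-\t2\tSequence "dJ102G20.C1.1"\n')
print(rec.start, rec.end, feature_length(rec))
```

Since start and end are 1-based and inclusive, a feature from 7105 to 7201 covers 97 bases, and python_slice maps it to the 0-based range used when indexing a Python string.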

For More Information

See also the GFF2 spec.


Last Updated: Sept 18 2006
Contact: Reinhard Engels
argo-support@broad.mit.edu