Argo File Formats: GFF1

GFF (General Feature Format) is a simple tab-delimited format for describing genomic features. It is designed for easy sharing of the bare basics. The problem is that there are at least four slightly but significantly different flavors: GFF1, GFF2, GFF3 and GTF2. If you are unfamiliar with GFF and its flavors, it is important that you read the GFF overview to decide which flavor is best suited to your needs (or to identify the flavor of a file you already have). This document describes the GFF1 flavor.

File Extensions

Give your GFF1 files the extension '.gff1' instead of '.gff' so that Argo will interpret them correctly. If you do not use '.gff1', Argo will prompt you to choose a flavor, which is annoying and error-prone.

GFF Records

A GFF file consists of one or more records, each of which represents a simple start-to-stop feature. Records are separated by newlines, one record per row. Each record has 9 tab-separated fields, the last of which is optional. This optional ninth field is the only one that differs among the GFF flavors, but the difference is significant. In GFF1, it is used for grouping: records that share the same value in this field are clustered together, allowing for the creation of complex multi-subfeature features (such as transcripts composed of exons).
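
The grouping behavior described above can be sketched in a few lines of Python. This is only an illustration, not Argo's implementation: the function name `group_records` is hypothetical, and the sketch assumes tab-delimited input with one record per line.

```python
from collections import defaultdict


def group_records(lines):
    """Cluster GFF1 records by their optional ninth (group) field.

    Returns (groups, ungrouped): a dict mapping each group value to the
    list of records that carry it, and a list of records with no group.
    """
    groups = defaultdict(list)
    ungrouped = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) >= 9 and fields[8]:
            groups[fields[8]].append(fields)
        else:
            ungrouped.append(fields)
    return dict(groups), ungrouped
```

Records that share a group value end up in the same list, which is roughly what a viewer would need in order to draw them as one multi-part feature.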

GFF files contain features but no sequence. To view them in Argo, you will have to load sequence data first (for example, a FASTA file) and superimpose the GFF files onto the sequence.

Examples

Here are some sample GFF1 records:

chr22  TeleGene  enhancer  1000000  1001000  500  +  .  touch1
chr22  TeleGene  promoter  1010000  1010100  900  +  .  touch1
chr22  TeleGene  promoter  1020000  1020000  800  -  .  touch2

Note that tabs have been replaced with spaces here for easier viewing.
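
Because the records above are shown with spaces, they would need their tabs restored before being saved as a real GFF1 file. A minimal Python sketch of that conversion follows; the helper name `spaces_to_tabs` is hypothetical, and the approach assumes that none of the first eight fields contains whitespace.

```python
def spaces_to_tabs(line):
    """Turn a space-separated display record back into a tab-delimited one.

    Splits on runs of whitespace at most 8 times, so anything after the
    eighth field is kept together as the ninth (group) field.
    """
    fields = line.strip().split(None, 8)
    return "\t".join(fields)


record = spaces_to_tabs(
    "chr22  TeleGene enhancer  1000000  1001000  500 +  .  touch1"
)
```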

Field Descriptions

Note: All GFF flavors share the first eight fields; they differ only in the last field (field 9).

  1. seqname - The name of the sequence. Typically a chromosome or a contig. Argo does not care what you put here. It will superimpose GFF features on any sequence you like.
  2. source - The program that generated this feature. Argo displays the value of this field in the inspector but does not do anything special with it.
  3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon". Argo does not do anything with this value except display it.
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000. If there is no score value, enter ".".
  7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
  8. frame - If the feature is a coding exon, frame should be 0, 1, or 2, representing the reading frame of the first base. If the feature is not a coding exon, the value should be '.'. Argo does not do anything with this field except display its value.
  9. GFF1: group - All lines with the same group are linked together into a single item. This field is the one important difference between GFF flavors.
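
The field list above maps naturally onto a small record type. The following Python sketch parses one tab-delimited GFF1 line and checks the constraints stated in the descriptions. The names `Gff1Record` and `parse_gff1_line` are hypothetical, the requirement that end is not smaller than start is an assumption rather than something stated above, and the 0 to 1000 score range is deliberately not enforced. It is not how Argo itself reads files.

```python
from typing import NamedTuple, Optional


class Gff1Record(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int               # 1-based
    end: int                 # inclusive
    score: Optional[float]   # None when the file has "."
    strand: str              # "+", "-", or "."
    frame: Optional[int]     # 0, 1, 2, or None when the file has "."
    group: Optional[str]     # optional ninth field

    @property
    def length(self):
        # Both coordinates are inclusive, hence the + 1.
        return self.end - self.start + 1


def parse_gff1_line(line):
    """Parse one tab-delimited GFF1 record; raise ValueError if malformed."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (8, 9):
        raise ValueError(f"expected 8 or 9 fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8] if len(fields) == 9 and fields[8] else None

    start_pos, end_pos = int(start), int(end)
    if start_pos < 1:
        raise ValueError("start must be at least 1")
    if end_pos < start_pos:  # assumption: end is never before start
        raise ValueError("end must not be smaller than start")
    if strand not in ("+", "-", "."):
        raise ValueError(f"invalid strand: {strand!r}")

    frame_value = None if frame == "." else int(frame)
    if frame_value is not None and frame_value not in (0, 1, 2):
        raise ValueError(f"invalid frame: {frame!r}")
    score_value = None if score == "." else float(score)

    return Gff1Record(seqname, source, feature, start_pos, end_pos,
                      score_value, strand, frame_value, group)
```

Because both start and end are inclusive, a record whose start and end are equal describes a feature that is one base long.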

For More Information

GFF1 doesn't really have a formal spec. The UCSC description is about as good as it gets.


Last Updated: Sept 18 2006
Contact: Reinhard Engels
argo-support@broad.mit.edu