Argo File Formats: GFF1
GFF (General Feature Format) is a simple tab-delimited format for describing genomic features. It is designed for easy sharing of the bare basics. The problem is that there are (at least) four slightly but significantly different flavors: GFF1, GFF2, GFF3, and GTF2. If you are unfamiliar with GFF and its flavors, it is important that you read this GFF overview to decide which flavor is best suited to your needs (or to identify the flavor of a file you already have). This document describes the GFF1 flavor.
File Extensions
Give your GFF1 files the extension '.gff1' instead of '.gff' so that Argo will interpret them correctly. If you do not use '.gff1', Argo will prompt you to choose a flavor, which is annoying and error-prone.
GFF Records
A GFF file consists of one or more records, each of which represents a simple start-to-stop feature. Records are separated by newlines, one record per row. Each record has 9 fields, the last of which is optional. This last field is the only one that differs among the GFF flavors, but the difference is significant. In GFF1, the final field is used for grouping. Any value in this field is used to cluster records together, allowing for the creation of complex multi-subfeature features (such as transcripts composed of exons).
GFF files contain features but no sequence. To view them in Argo, you will have to load sequence data first (for example, a FASTA file) and superimpose the GFF files onto the sequence.
Examples
Here are some sample GFF1 records:
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2
Note that tabs have been replaced with spaces here for easier viewing.
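The grouping behavior described above can be sketched in Python. This is only an illustration, not Argo's actual implementation: the function name group_records is hypothetical, and the sketch assumes tab-separated records with the group in field 9. Records that lack a group are kept as standalone features.

```python
from collections import defaultdict

# The sample records from this section, with real tab separators restored.
SAMPLE = "\n".join([
    "\t".join(["chr22", "TeleGene", "enhancer", "1000000", "1001000", "500", "+", ".", "touch1"]),
    "\t".join(["chr22", "TeleGene", "promoter", "1010000", "1010100", "900", "+", ".", "touch1"]),
    "\t".join(["chr22", "TeleGene", "promoter", "1020000", "1020000", "800", "-", ".", "touch2"]),
])


def group_records(text):
    """Cluster GFF1 records by their group field (field 9).

    Returns (groups, ungrouped): a dict mapping each group name to the
    list of records (as field lists) that share it, and a list of the
    records that have no group field.
    """
    groups = defaultdict(list)
    ungrouped = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) >= 9 and fields[8]:
            groups[fields[8]].append(fields)
        else:
            ungrouped.append(fields)
    return dict(groups), ungrouped


groups, ungrouped = group_records(SAMPLE)
for name, members in groups.items():
    # Field 3 (index 2) is the feature type.
    print(name, [m[2] for m in members])
# touch1 ['enhancer', 'promoter']
# touch2 ['promoter']
```

The first two sample records share the group touch1 and so form one composite item, while the third record is the only member of touch2.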
Field Descriptions
Note: Up until the last field (field 9), all GFF flavors are the same.
- seqname - The name of the sequence. Typically a chromosome or a contig. Argo does not care what you put here. It will superimpose GFF features on any sequence you like.
- source - The program that generated this feature. Argo displays the value of this field in the inspector but does not do anything special with it.
- feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon". Argo does not do anything with this value except display it.
- start - The starting position of the feature in the sequence. The first base is numbered 1.
- end - The ending position of the feature (inclusive).
- score - A score between 0 and 1000. If there is no score value, enter ".".
- strand - Valid entries are '+', '-', and '.' (for don't know/don't care).
- frame - If the feature is a coding exon, frame should be a number from 0 to 2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'. Argo does not do anything with this field except display its value.
- GFF1: group - All lines with the same group are linked together into a single item. This field is the one important difference between GFF flavors.
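As a rough illustration of the field list above, a single record can be parsed into typed fields as follows. The names Gff1Record and parse_gff1_line are hypothetical and do not come from Argo. The sketch assumes that start is no greater than end, which this document does not state explicitly, and it converts the score to a number without enforcing the 0 to 1000 range.

```python
from typing import NamedTuple, Optional


class Gff1Record(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int               # 1-based position of the first base
    end: int                 # inclusive end position
    score: Optional[float]   # None when the field is "."
    strand: str              # "+", "-", or "."
    frame: Optional[int]     # 0, 1, or 2; None when the field is "."
    group: Optional[str]     # optional ninth field

    @property
    def length(self):
        # Both coordinates are inclusive, hence the +1.
        return self.end - self.start + 1


def parse_gff1_line(line):
    """Parse one tab-separated GFF1 record (8 or 9 fields)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (8, 9):
        raise ValueError(f"expected 8 or 9 fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8] if len(fields) == 9 and fields[8] else None

    start_i, end_i = int(start), int(end)
    # Assumption: start <= end; the format description does not say so.
    if start_i < 1 or end_i < start_i:
        raise ValueError(f"bad coordinates: {start}..{end}")
    if strand not in ("+", "-", "."):
        raise ValueError(f"bad strand: {strand!r}")
    if frame not in (".", "0", "1", "2"):
        raise ValueError(f"bad frame: {frame!r}")

    return Gff1Record(
        seqname, source, feature, start_i, end_i,
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        group,
    )


rec = parse_gff1_line(
    "chr22\tTeleGene\tenhancer\t1000000\t1001000\t500\t+\t.\ttouch1"
)
print(rec.feature, rec.length)  # enhancer 1001
```

Because positions are 1-based and the end is inclusive, the feature spanning 1000000 to 1001000 covers 1001 bases, not 1000.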
For More Information
GFF1 doesn't really have a formal spec. The UCSC description is about as good as it gets.
