Argo File Formats: GTF2

GTF2 (Gene Transfer Format) is a flavor of GFF (General Feature Format), a simple tab-delimited format for describing genomic features. GTF2 combines the grouping capability of GFF1 with the descriptive attributes of GFF2. If you are unfamiliar with GFF and its flavors, it is important that you read this GFF overview to decide which flavor is best suited to your needs. The current document describes only the GTF2 flavor.

Note: GTF2 is a pretty seriously flawed format. The problem is that the descriptive attributes can only be associated at the level of the subfeature (e.g., exon), which is probably not where you want them. GFF3 is better suited for capturing both grouping and descriptive attributes at arbitrary (e.g., transcript) levels.

Use GTF2 if you do not really care about descriptive attributes but would like a simple and commonly used format for describing basic gene structure (exons, transcripts, genes). Each row represents an exon, and rows can be grouped by transcript id and gene id. If you need more than this, have a look at GFF3.

GTF2 files are directly editable in Argo.

File Extensions

Give your GTF2 files the extension '.gtf2' instead of '.gff' so that Argo will interpret them correctly. If you do not use '.gtf2', Argo will prompt you to choose a flavor. This is annoying and error-prone.

GFF Records

A GFF file consists of one or more records, each of which represents a simple start-to-stop feature. Records are separated by newlines, one record per row. Each record has 9 fields, the last of which is optional. This last optional field is the only field that differs among the GFF flavors, but the difference is significant. In GTF2, this final field holds both descriptive attributes and the special attributes that are used for grouping multiple records into a single composite record.

In the GTF2 flavor of GFF, each record represents an exon. Exons may be associated into transcripts and genes using the special attribute fields in column 9. Because Argo is transcript-centric, other non-grouping exon attributes are ignored.

Attribute keys and values are separated by spaces. Values containing spaces must be double quoted. Attribute pairs are separated by semicolons. By contrast, the plain GFF2 flavor cannot group records into compound, multi-subfeature features; in GTF2, the special grouping attributes make this possible.
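
The attribute syntax just described (space between key and value, double quotes around values, semicolons between pairs) can be sketched in Python as follows. This is only an illustration, not Argo's own parser; the function name parse_attributes and the regular expression are assumptions made for this example.

```python
import re

# One `key value` pair: the value is either double quoted or a bare token.
# (Pattern and function name are illustrative, not taken from Argo.)
_ATTR = re.compile(r'\s*(\w+)\s+(?:"([^"]*)"|([^\s;]+))\s*;?')

def parse_attributes(column9):
    """Return a dict of attribute key -> value from a GTF2 attribute column."""
    attrs = {}
    for match in _ATTR.finditer(column9):
        key, quoted, bare = match.groups()
        # Prefer the quoted form; fall back to the bare token.
        attrs[key] = quoted if quoted is not None else bare
    return attrs

attrs = parse_attributes('gene_id "001"; transcript_id "001.1";')
print(attrs["gene_id"], attrs["transcript_id"])
```

A regular expression is used rather than a plain split on ";" so that a quoted value containing a semicolon or spaces stays in one piece.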

GFF files contain features but no sequence. To view them in Argo, you will have to load sequence data first (for example, a FASTA file) and superimpose the GFF files onto the sequence.

Examples

Here are some sample GTF2 records:

AB000381 Twinscan  CDS          380   401   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          501   650   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          700   707   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  start_codon  380   382   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  stop_codon   708   710   .   +   0  gene_id "001"; transcript_id "001.1";

Note that tabs have been replaced with spaces here for easier viewing.
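
As a rough sketch of how such records come together into one transcript, the Python below groups the sample rows by their gene_id and transcript_id. It is an illustration only, not how Argo itself is implemented. The function name group_by_transcript is hypothetical, and because the listing above uses spaces instead of tabs, the sketch splits on any whitespace; a real tab-delimited file should be split on tabs instead.

```python
from collections import defaultdict

SAMPLE = """\
AB000381 Twinscan  CDS          380   401   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          501   650   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          700   707   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  start_codon  380   382   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  stop_codon   708   710   .   +   0  gene_id "001"; transcript_id "001.1";
"""

def group_by_transcript(text):
    """Map (gene_id, transcript_id) -> list of (feature, start, end)."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumption: whitespace split, because the listing is space aligned.
        # maxsplit=8 keeps the whole attribute column together as field 9.
        fields = line.split(None, 8)
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        attrs = {}
        for pair in fields[8].split(";"):
            pair = pair.strip()
            if pair:
                key, _, value = pair.partition(" ")
                attrs[key] = value.strip().strip('"')
        key = (attrs["gene_id"], attrs["transcript_id"])
        groups[key].append((feature, start, end))
    return dict(groups)

groups = group_by_transcript(SAMPLE)
for (gene_id, transcript_id), features in groups.items():
    print(gene_id, transcript_id, len(features))
```

All five sample rows share the same gene_id and transcript_id, so they end up in a single group.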

Field Descriptions

Note: up until the last field (field 9), all GFF flavors are the same.

  1. seqname - The name of the sequence. Typically a chromosome or a contig. Argo does not care what you put here. It will superimpose GFF features on any sequence you like.
  2. source - The program that generated this feature. Argo displays the value of this field in the inspector but does not do anything special with it.
  3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon". In most cases, Argo does not do anything with this value except display it. start_codon and stop_codon (if present) are the only exceptions, and are used to set the start and stop codons.
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000. If there is no score value, enter ".".
  7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
  8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'. Argo does not do anything with this field except display its value.
  9. GTF2: grouping attributes - Attribute keys and values are separated by spaces. Values containing spaces must be double quoted. Attribute pairs are separated by semicolons. The special grouping attributes are "transcript_id" and "gene_id". Argo will ignore other attributes. This field is the one important difference between GFF flavors.
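
The nine fields listed above can be read from a tab-delimited line with a few lines of Python. This is a minimal sketch under stated assumptions, not Argo's implementation: the names GtfRecord, parse_record and feature_length are hypothetical, and the check that end is not smaller than start is an assumption that this page does not state explicitly.

```python
from typing import NamedTuple, Optional

class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int               # 1-based
    end: int                 # inclusive
    score: Optional[float]   # None when the file has "."
    strand: str              # "+", "-" or "."
    frame: Optional[int]     # 0, 1, 2, or None when the file has "."
    attributes: str          # raw field 9, left unparsed here

def parse_record(line):
    """Split one tab-delimited GTF2 line into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 8:
        fields.append("")    # field 9 is optional
    if len(fields) != 9:
        raise ValueError("expected 8 or 9 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    rec = GtfRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        attrs,
    )
    # The first base is numbered 1; end >= start is an assumption of this sketch.
    if rec.start < 1 or rec.end < rec.start:
        raise ValueError("invalid coordinates")
    if rec.frame is not None and rec.frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, 2 or '.'")
    return rec

def feature_length(rec):
    """Length in bases; +1 because both start and end are inclusive."""
    return rec.end - rec.start + 1

line = "\t".join(["AB000381", "Twinscan", "CDS", "380", "401", ".", "+", "0",
                  'gene_id "001"; transcript_id "001.1";'])
rec = parse_record(line)
print(rec.feature, rec.start, rec.end, feature_length(rec))
```

Because coordinates are 1-based and inclusive, the first sample CDS (380 to 401) is 22 bases long, not 21.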

For More Information

See also the GTF2 spec.


Last Updated: Sept 18 2006
Contact: Reinhard Engels
argo-support@broad.mit.edu