Cufflinks Documentation, v6

Description: Assembles transcripts, estimates abundances, and tests for differential expression and regulation in RNA-seq samples

Author: Cole Trapnell et al, University of Maryland Center for Bioinformatics and Computational Biology

Algorithm Version: Cufflinks 2.0.2

Contact: gp-help@broadinstitute.org

Summary

Cufflinks assembles transcripts and estimates their abundances in RNA-seq samples. It accepts aligned RNA-seq reads, then assembles the alignments into a parsimonious set of transcripts, reporting as few full-length transcript fragments [transfrags] as are needed to explain the data. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one. 

Cufflinks was created at the University of Maryland Center for Bioinformatics and Computational Biology. This document is adapted from the Cufflinks documentation for release 2.0.2.

Usage

Cufflinks takes a file of alignments in SAM or BAM (the binary equivalent of SAM) format as input. For more details on the SAM/BAM format, see the Input Files section and/or the specification. The RNA-seq read mapper TopHat produces BAM output and is recommended for use with Cufflinks. However, Cufflinks will accept SAM/BAM alignments generated by any read mapper, provided they meet the requirements described in the Input Files section.

Optionally, a reference genome annotation file can be supplied as well, through either of two parameters.  If it is supplied through the GTF parameter, Cufflinks will use this file to estimate isoform expression and will not assemble novel transcripts; the program will ignore alignments not structurally compatible with any reference transcript.  If it is instead supplied through the GTF guide parameter, Cufflinks will use the reference annotation based transcript (RABT) assembly algorithm.  In this mode the reference transcripts are tiled with faux-reads so that every reference transcript position is covered by multiple reads, and the information in the faux-reads is merged with the data from the sequenced reads.  For more information, see Roberts et al (2011) or the "How It Works" page on the Cufflinks site.

The Cufflinks tool provides a number of additional options and switches that are not directly available through this module's parameters.  The additional.cufflinks.options parameter is provided to pass these through if you need them.  To use it, specify the extra option(s), along with any arguments, in the input text field, separated by spaces.  At this time, this parameter does not easily support options that require a file argument.  Check the Cufflinks manual for more details of the available options.  Also note that there may be additional undocumented options; running the cufflinks executable at the command line with no arguments may show even more options.  If you feel that a particular missing option would be of broad general interest, please contact the GenePattern team and we will look into adding it.  This parameter is recommended for expert use only; use it at your own discretion.  The GenePattern team does not explicitly test all of the possible options that may be passed through using this parameter and can only provide limited support.
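
As a rough illustration of how a space-separated options string could end up on the command line, the Python sketch below splits the free-text field into tokens and places them before the input file, following the "cufflinks [options] <hits.sam>" usage pattern.  The function name with_extra_options and the file name aligned.bam are hypothetical, and this is not the module's actual wrapper code; it only shows the general idea.

```python
import shlex


def with_extra_options(input_file, module_options, extra_options):
    """Return a cufflinks argument list with pass-through options appended.

    module_options: tokens already produced from the module's own parameters.
    extra_options:  the free-text string typed into the pass-through field.
    """
    # shlex.split breaks the string on whitespace while honouring quotes,
    # so "--seed 42" becomes ["--seed", "42"].
    extra_tokens = shlex.split(extra_options or "")
    # Options come first and the alignment file last, as in the usage line.
    return ["cufflinks"] + list(module_options) + extra_tokens + [input_file]


# Hypothetical example: two options taken from the usage listing.
command = with_extra_options(
    "aligned.bam",
    ["-L", "CUFF"],
    "--upper-quartile-norm --max-bundle-frags 1000000",
)
print(" ".join(command))
```

Because the text field carries only plain tokens, an option whose argument is a file has no way to upload that file through the field, which is consistent with the limitation noted above.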

For more information on using RNA-seq modules in GenePattern, see the RNA-seq Analysis page.

Important Notes:

Cufflinks jobs can be very resource intensive.  If your job does not complete within a day, retry it on a server with more available memory, or, if you are running on the GenePattern public server, see this FAQ.

There are known issues that prevent Cufflinks from running on the Mac Mini and possibly other Mac hardware.

References

Trapnell C, Hendrickson D, Sauvageau S, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology. 2013;31:46-53.

Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7:562–578.

Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011 Sep 1;27(17):2325-9.

Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.  Nat Biotechnol. 2010;28:511-515.

Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105-1111.

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.

Links

Cufflinks website.
Cufflinks manual.  Note that this information may be based on a subsequent version of Cufflinks.
TopHat website.

Parameters

Name  Description
input file *  Input file of RNA-seq read alignments in SAM/BAM format.  Cufflinks has some particular requirements for these inputs; see the Input Files section for more details.

transfrag label  A label for the transcribed fragments (transfrags) in the output files.
GTF  Reference annotation file in GTF/GFF format.  Cufflinks will use this file to estimate isoform expression. It will not assemble novel transcripts, and the program will ignore alignments not structurally compatible with any reference transcript.
GTF guide  Annotation file in GTF/GFF format, used to guide reference annotation based transcript (RABT) assembly. Reference transcripts will be tiled with faux-reads to provide additional information in assembly. Output will include all reference transcripts as well as any novel genes and isoforms that are assembled.
mask file  GTF/GFF file specifying transcripts to be ignored. The Cufflinks team recommends including any annotated rRNA, mitochondrial transcripts, and other abundant transcripts you wish to ignore in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates.
frag bias correct  Providing Cufflinks with a FASTA file via this option instructs it to run a bias detection and correction algorithm that can significantly improve accuracy of transcript abundance estimates.
multi read correct  Tells Cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome.
library type  The library type used to generate reads. The choices are inferred, fr-unstranded, fr-firststrand, fr-secondstrand, ff-unstranded, ff-firststrand, ff-secondstrand, and transfrags.  The default is inferred, meaning that no library type information is passed.
min frags per transfrag  Assembled transfrags supported by fewer than this many aligned RNA-Seq fragments are not reported.

additional cufflinks options  Additional options to be passed along to the Cufflinks program at the command line.  This parameter gives you a means to specify otherwise unavailable Cufflinks options and switches not supported by the module; check the Cufflinks manual for details.  Note that the information at this link may refer to a subsequent version of Cufflinks.  Recommended for experts only; use this at your own discretion.

* - required

Cufflinks pass-through options

The following may be useful for advanced users who wish to use the additional.cufflinks.options parameter.  This is the 'usage' output from running cufflinks at the command-line, which gives a list of all of the available options and switches.  Note that this was generated by Cufflinks v2.0.2 and that the options here may differ from the documentation provided online at the Cufflinks website due to subsequent version updates.


cufflinks v2.0.2
linked against Boost version 104700
-----------------------------
Usage:   cufflinks [options] <hits.sam>
General Options:
  -o/--output-dir              write all output files to this directory              [ default:     ./ ]
  -p/--num-threads             number of threads used during analysis                [ default:      1 ]
  --seed                       value of random number generator seed                 [ default:      0 ]
  -G/--GTF                     quantitate against reference transcript annotations                      
  -g/--GTF-guide               use reference transcript annotation to guide assembly                   
  -M/--mask-file               ignore all alignment within transcripts in this file                     
  -b/--frag-bias-correct       use bias correction - reference fasta required        [ default:   NULL ]
  -u/--multi-read-correct      use 'rescue method' for multi-reads (more accurate)   [ default:  FALSE ]
  --library-type               library prep used for input reads                     [ default:  below ]
 
Advanced Abundance Estimation Options:
  -m/--frag-len-mean           average fragment length (unpaired reads only)         [ default:    200 ]
  -s/--frag-len-std-dev        fragment length std deviation (unpaired reads only)   [ default:     80 ]
  --upper-quartile-norm        use upper-quartile normalization                      [ default:  FALSE ]
  --max-mle-iterations         maximum iterations allowed for MLE calculation        [ default:   5000 ]
  --num-importance-samples     number of importance samples for MAP restimation      [    DEPRECATED   ]
  --compatible-hits-norm       count hits compatible with reference RNAs only        [ default:  FALSE ]
  --total-hits-norm            count all hits for normalization                      [ default:  TRUE  ]
  --num-frag-count-draws       Number of fragment generation samples                 [ default:   1000 ]
  --num-frag-assign-draws      Number of fragment assignment samples per generation  [ default:      1 ]
  --max-frag-multihits         Maximum number of alignments allowed per fragment     [ default: unlim  ]
  --no-effective-length-correction   No effective length correction                  [ default:  FALSE ]
  --no-length-correction       No effective length correction                        [ default:  FALSE ]
 
Advanced Assembly Options:
  -L/--label                   assembled transcripts have this ID prefix             [ default:   CUFF ]
  -F/--min-isoform-fraction    suppress transcripts below this abundance level       [ default:   0.10 ]
  -j/--pre-mrna-fraction       suppress intra-intronic transcripts below this level  [ default:   0.15 ]
  -I/--max-intron-length       ignore alignments with gaps longer than this          [ default: 300000 ]
  -a/--junc-alpha              alpha for junction binomial test filter               [ default:  0.001 ]
  -A/--small-anchor-fraction   percent read overhang taken as 'suspiciously small'   [ default:   0.09 ]
  --min-frags-per-transfrag    minimum number of fragments needed for new transfrags [ default:     10 ]
  --overhang-tolerance         number of terminal exon bp to tolerate in introns     [ default:      8 ]
  --max-bundle-length          maximum genomic length allowed for a given bundle     [ default:3500000 ]
  --max-bundle-frags           maximum fragments allowed in a bundle before skipping [ default: 500000 ]
  --min-intron-length          minimum intron size allowed in genome                 [ default:     50 ]
  --trim-3-avgcov-thresh       minimum avg coverage required to attempt 3' trimming  [ default:     10 ]
  --trim-3-dropoff-frac        fraction of avg coverage below which to trim 3' end   [ default:    0.1 ]
  --max-multiread-fraction     maximum fraction of allowed multireads per transcript [ default:   0.75 ]
  --overlap-radius             maximum gap size to fill between transfrags (in bp)   [ default:     50 ]
 
Advanced Reference Annotation Guided Assembly Options:
  --no-faux-reads              disable tiling by faux reads                          [ default:  FALSE ]
  --3-overhang-tolerance       overhang allowed on 3' end when merging with reference[ default:    600 ]
  --intron-overhang-tolerance  overhang allowed inside reference intron when merging [ default:     30 ]
 
Advanced Program Behavior Options:
  -v/--verbose                 log-friendly verbose processing (no progress bar)     [ default:  FALSE ]
  -q/--quiet                   log-friendly quiet processing (no progress bar)       [ default:  FALSE ]
  --no-update-check            do not contact server to check for update availability[ default:  FALSE ]
 
Supported library types:
ff-firststrand
ff-secondstrand
ff-unstranded
fr-firststrand
fr-secondstrand
fr-unstranded (default)
transfrags
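
To relate the module parameters to the switches in the usage listing above, the Python sketch below builds a cufflinks argument list from parameter-style values.  The pairing of each module parameter with a flag is an assumption inferred from matching names in the listing (for example, GTF guide with -g/--GTF-guide); the function name build_cufflinks_command and the file names are hypothetical, and the module's real wrapper may differ.

```python
def build_cufflinks_command(input_file, gtf=None, gtf_guide=None,
                            mask_file=None, frag_bias_fasta=None,
                            multi_read_correct=False,
                            library_type="inferred", label=None,
                            min_frags_per_transfrag=None):
    """Assemble a cufflinks argument list from module-style parameters."""
    command = ["cufflinks"]
    if gtf:
        command += ["-G", gtf]            # quantitate against the annotation
    if gtf_guide:
        command += ["-g", gtf_guide]      # RABT assembly guided by annotation
    if mask_file:
        command += ["-M", mask_file]      # transcripts to ignore
    if frag_bias_fasta:
        command += ["-b", frag_bias_fasta]
    if multi_read_correct:
        command.append("-u")
    # "inferred" means no library type information is passed at all.
    if library_type != "inferred":
        command += ["--library-type", library_type]
    if label:
        command += ["-L", label]
    if min_frags_per_transfrag is not None:
        command += ["--min-frags-per-transfrag", str(min_frags_per_transfrag)]
    command.append(input_file)            # alignment file goes last
    return command


# Hypothetical example: guided assembly on a strand-specific library.
print(" ".join(build_cufflinks_command(
    "accepted_hits.bam",
    gtf_guide="genes.gtf",
    multi_read_correct=True,
    library_type="fr-firststrand",
)))
```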

 

Input Files

  1. <input.file> (required)
    File of RNA-seq read alignments in SAM (a tab-delimited format) or BAM (a compressed binary version of SAM) format.  SAM is a standard short read alignment format that allows aligners to attach custom tags to individual alignments.  This file is the output of a read mapping application, such as TopHat, and the alignment section contains information regarding the mapped location of each sequenced RNA-seq read on a reference genome.
    For more information on the SAM format, see the specification.

    Cufflinks will accept SAM alignments generated by any read mapper.  These must, however, use the custom 'XS' tag.  This attribute, which must have a value of "+" or "-", indicates which strand the RNA that produced this read came from. While this tag can be applied to any alignment, including unspliced ones, it must be present for all spliced alignment records (those with an 'N' operation in the CIGAR string).

    Also, the SAM file supplied to Cufflinks must be sorted by reference position. If you aligned your reads with TopHat, your alignments will be properly sorted already.  If not, this can be done with the SortSam module.
  2. <GTF> (optional)
    A tab-delimited reference annotation file in GTF format.  This file is used by Cufflinks to estimate abundances of isoforms. These reference annotation files can be downloaded for many genomes from sites like UCSC Genome Browser.  For more information on the GTF format, see the specification.
    The GenePattern FTP site hosts a number of reference annotation GTFs, available in a dropdown selection (requires GenePattern 3.7.0+).

  3. <GTF.guide> (optional)
    A tab-delimited reference annotation file in GTF format.  This file is used by Cufflinks to guide RABT assembly.
    The GenePattern FTP site hosts a number of reference annotation GTFs, available in a dropdown selection (requires GenePattern 3.7.0+).

  4. <mask.file> (optional)
    A tab-delimited GTF file that specifies transcripts to be ignored.

  5. <frag.bias.correct> (optional)
    Reference multi-FASTA file for bias detection and correction algorithm.   For more information on the FASTA format, see this description.
    The GenePattern FTP site hosts a number of reference genomes, available in a dropdown selection (requires GenePattern 3.7.0+).
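
The two requirements on the alignment input described under <input.file> (a strand tag on every spliced record, and sorting by reference position) can be checked before submitting a job.  The Python sketch below works on plain-text SAM lines and assumes the standard SAM layout, with the CIGAR string in the sixth column and optional tags from the twelfth column onward.  The function names are hypothetical.  The sort check is deliberately simplified: it only verifies that each reference forms one contiguous run with non-decreasing positions, and does not compare against the header's reference order.

```python
import re


def missing_xs_tag(sam_line):
    """Return True if a spliced alignment lacks an XS:A:+ or XS:A:- tag."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar = fields[5]
    if "N" not in cigar:
        return False  # unspliced record: the tag is optional
    return not any(re.fullmatch(r"XS:A:[+-]", tag) for tag in fields[11:])


def is_position_sorted(sam_lines):
    """Return True if records are grouped by reference and ordered by position."""
    seen_refs = []
    last_pos = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.split("\t")
        ref, pos = fields[2], int(fields[3])
        if seen_refs and ref == seen_refs[-1]:
            if pos < last_pos:
                return False  # position went backwards within a reference
        else:
            if ref in seen_refs:
                return False  # reference reappears after another one
            seen_refs.append(ref)
        last_pos = pos
    return True
```

A file that fails the second check can be sorted with the SortSam module mentioned above.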

Output Files

  1. transcripts.gtf
    This GTF file contains Cufflinks' assembled isoforms. The first 7 columns are standard GTF, and the last column contains attributes, some of which are also standardized ("gene_id" and "transcript_id"). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript.
  2. genes.fpkm_tracking
    This is a tab-delimited file containing one row per gene; the columns contain the attributes in the GTF file.  This file contains gene-level coordinates and expression values.  Note that since the output for Cufflinks is for a single sample, the "q" numbering format (see the file format information) is not used.
  3. isoforms.fpkm_tracking
    This is a tab-delimited file containing one row per isoform; the columns contain the attributes in the GTF file.  This file contains transcript-level coordinates and expression values.  Note that since the output for Cufflinks is for a single sample, the "q" numbering format (see the file format information) is not used.
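
For downstream scripting, the output files can be read with a few lines of Python.  The sketch below assumes the usual nine tab-separated GTF fields with attributes written as key "value"; pairs, and it reads the tracking files by their own header row rather than hard-coding column names.  The function names and the sample values are hypothetical, chosen only for illustration.

```python
import csv
import io
import re

GTF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame"]


def parse_gtf_record(line):
    """Split one GTF line into its fixed fields plus a dict of attributes."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_FIELDS, fields[:8]))
    # Attributes look like: gene_id "CUFF.1"; transcript_id "CUFF.1.1";
    record["attributes"] = dict(re.findall(r'(\S+) "([^"]*)";', fields[8]))
    return record


def read_tracking(handle):
    """Return a list of dicts, one per row, keyed by the file's header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Illustrative input only; real files contain more columns and attributes.
sample_gtf = ('chr1\tCufflinks\ttranscript\t100\t900\t1000\t+\t.\t'
              'gene_id "CUFF.1"; transcript_id "CUFF.1.1";')
print(parse_gtf_record(sample_gtf)["attributes"]["transcript_id"])

sample_tracking = io.StringIO("tracking_id\tFPKM\nCUFF.1\t12.5\n")
for row in read_tracking(sample_tracking):
    print(row["tracking_id"], row["FPKM"])
```

All values are returned as strings; convert coordinates and expression values to numbers as needed.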

Platform Dependencies

Module Type: RNA-seq
CPU Type: x86_64
OS: Mac, Linux
Language: C++, Perl

GenePattern Module Version Notes

Version  Release Date  Description
6  2014-02-14  Added a parameter to allow the user to pass through extra Cufflinks options and changed the frag.bias.correction fasta parameter to use hosted genomes.
5  2013-09-25  Added dynamic dropdowns and HTML-based docs
4  2013-05-07  Updated to Cufflinks 2.0.2
3  2012-01-13  Updated to Cufflinks 1.3.0
2  2011-12-23  Updated to Cufflinks 1.2.1
1  2010-12-07