How do I run RNA-SeQC?

Running Online

RNA-SeQC can be run online using the GenePattern genomic analysis platform. The program is listed under the RNA-seq category of modules.

Running Locally

RNA-SeQC can also be downloaded and run locally. No installation dependencies are strictly required, with one exception: one of the modes for computing rRNA levels requires a BWA installation.

Pre-run Checklist

  1. Are the contig names consistent across your BAM, reference, and GTF files?
  2. Is your BAM indexed? (use samtools index)
  3. Is your reference indexed? (use samtools faidx)
  4. Does your reference have a dictionary (.dict) file? (use CreateSequenceDictionary.jar)
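
Part of the first checklist item can be checked programmatically. The following is a minimal Python sketch, assuming the reference has already been indexed with samtools faidx (so that a .fai file exists, with the contig name in its first tab-separated column) and that the GTF carries the contig name in its first column. The function names are hypothetical, and contigs in the BAM header are not covered here.

```python
def fai_contigs(fai_text):
    """Contig names from the text of a samtools .fai index (first column)."""
    return {line.split("\t")[0] for line in fai_text.splitlines() if line.strip()}


def gtf_contigs(gtf_text):
    """Contig names from the text of a GTF file (first column, comments skipped)."""
    names = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t")[0])
    return names


def missing_from_reference(gtf_text, fai_text):
    """GTF contigs that do not appear in the reference index, sorted."""
    return sorted(gtf_contigs(gtf_text) - fai_contigs(fai_text))
```

For instance, a GTF that names a contig 1 while the reference index names it chr1 would be reported as missing by this check.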

Usage

java -jar RNA-SeQC.jar <args>

-bwa <arg>

Path to BWA. Set this if BWA is not in your path and -BWArRNA is used.

-BWArRNA <arg>

Use an on-the-fly BWA alignment to estimate rRNA content. The value should be the rRNA reference FASTA. If this flag is absent, rRNA estimation is based on the rRNA transcript intervals provided in the GTF (a faster but less robust method).

-corr <arg>

GCT file for expression correlation comparison. Note that the values must be log-normalized and the identifiers must match those of the GTF file.

-d <arg>

Perform downsampling to the given number of reads.

-e <arg>

Change the definition of a transcript's end (5' or 3') to the given length. Acceptable values are 50, 100, and 200; the default is 200.

-expr <arg>

Use the provided GCT file for expression values instead of the on-the-fly RPKM calculation.

-gc <arg>

File of transcript ID <tab> GC content. Used for stratification.

-n <arg>

Number of top transcripts to use. Default is 1000.

-noDoC

Suppresses GATK Depth of Coverage calculations.

-noReadCounting

Suppresses read count-based metrics.

-o <arg>

Output directory (will be created if it doesn't exist).

-r <arg>

Reference genome in FASTA format.

-rRNA <arg>

Interval file for rRNA loci (must end in .list). This flag is an alternative to the -BWArRNA flag.

-s <arg>

Sample file: a tab-delimited description of samples and their BAMs. The file's header is:
Sample ID    Bam File    Notes
When running on just one sample, this argument can instead be a string of the form
"Sample ID|Bam File|Notes", where Bam File is the path to the input file.

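As an illustration of the two forms the -s argument can take, here is a minimal Python sketch that builds the tab-delimited sample file text and the single-sample string. The helper names are hypothetical, and any sample IDs or BAM paths passed in are placeholders chosen by the caller.

```python
HEADER = ("Sample ID", "Bam File", "Notes")


def sample_file_text(samples):
    """Build tab-delimited sample file text from (id, bam, notes) tuples."""
    rows = [HEADER] + list(samples)
    return "\n".join("\t".join(row) for row in rows) + "\n"


def single_sample_arg(sample_id, bam_path, notes):
    """Build the 'Sample ID|Bam File|Notes' string for a single sample."""
    return "|".join((sample_id, bam_path, notes))
```

The text returned by sample_file_text can be written to a file and passed with -s. The string from single_sample_arg contains | characters, so it needs quoting when typed in a shell.
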
-singleEnd

Indicates that the BAM contains single-end reads.

-strat <arg>

Stratification options. The only currently supported option is 'gc'.

-strictMode <arg>

When counting reads per exon or generating RPKMs, filter out reads that have a mapping quality of zero, have more than 6 non-reference bases, or belong to improper pairs.
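
The filter described above can be read as a simple predicate. The Python sketch below is one interpretation of that description, not RNA-SeQC's actual implementation. The Read class and its fields are hypothetical stand-ins for values that would come from a BAM record.

```python
from dataclasses import dataclass


@dataclass
class Read:
    mapping_quality: int      # mapping quality of the alignment
    non_reference_bases: int  # number of bases differing from the reference
    proper_pair: bool         # whether the read is part of a proper pair


def passes_strict_mode(read, max_non_reference=6):
    """Return True if the read survives the strict-mode filter as described."""
    if read.mapping_quality == 0:
        return False
    if read.non_reference_bases > max_non_reference:
        return False
    return read.proper_pair
```

Under this reading, a read with exactly 6 non-reference bases is kept and one with 7 is dropped.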

-t <arg>

GTF file defining transcripts (must end in '.gtf').

-transcriptDetails  

Provide an HTML report for each transcript.

-ttype <arg>

The column in the GTF to use when looking for the rRNA transcript type. Mainly used for running on Ensembl GTF files (specify "-ttype 2"). For spec-conforming GTF files, this flag can be disregarded.

-rRNAdSampleTarget

Downsamples to calculate rRNA rate more efficiently. Default is 1 million. Set to 0 to disable.

-gcMargin

Used in conjunction with '-strat gc' to specify the GC content, as a fraction, to use as boundaries. For example, 0.25 sets a lower cutoff of 25% and an upper cutoff of 75%. The default is 0.375.
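
The arithmetic behind the margin can be sketched in a few lines of Python. The function name is hypothetical, and the restriction of the margin to the range 0 to 0.5 is an assumption of this sketch, not something the option description states.

```python
def gc_cutoffs(margin=0.375):
    """Return (lower, upper) GC-content cutoffs for a given margin."""
    # Assumption: a margin above 0.5 would put the lower cutoff above the upper one.
    if not 0 <= margin <= 0.5:
        raise ValueError("margin is expected to lie between 0 and 0.5")
    return margin, 1 - margin
```

With a margin of 0.25 this gives cutoffs of 0.25 and 0.75, matching the example above. The default of 0.375 gives 0.375 and 0.625.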

-gld

Gap Length Distribution: if flag is present, the distribution of gap lengths will be plotted.

-gatkFlags

Pass a quoted string of flags directly to the GATK (e.g. -gatkFlags "-DBQ 0" to set missing base qualities to zero).
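
To tie the options together, the following Python sketch assembles a command line from flags listed above. All file and directory names are placeholders, the build_command helper is hypothetical, and the choice of -o, -r, -s, and -t as the minimal set of arguments is an assumption of this example.

```python
import shlex


def build_command(jar, out_dir, reference, samples, gtf, extra=()):
    """Assemble the argument list for a java -jar RNA-SeQC.jar invocation."""
    cmd = ["java", "-jar", jar,
           "-o", out_dir,
           "-r", reference,
           "-s", samples,
           "-t", gtf]
    cmd.extend(extra)
    return cmd


# Placeholder inputs: one single-end sample given in the "ID|BAM|Notes" form.
cmd = build_command(
    "RNA-SeQC.jar", "report_dir", "reference.fasta",
    "S1|s1.bam|control", "transcripts.gtf",
    extra=["-n", "1000", "-singleEnd"],
)
print(shlex.join(cmd))
```

The list can be handed to subprocess.run as is. Printing it with shlex.join shows the shell-quoted form, which matters for the single-sample string because it contains | characters.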

'*' Bold parameters are required

Metrics

The following is a list of metrics provided by RNA-SeQC. For more detailed descriptions, see the GenePattern Help File.

  • Read Metrics
    • Total, unique, duplicate reads
    • Alternative alignment reads
    • Read Length
    • Fragment Length mean and standard deviation
    • Read pairs: number aligned, unpaired reads, base mismatch rate for each pair mate, chimeric pairs
    • Vendor Failed Reads
    • Mapped reads and mapped unique reads
    • rRNA reads
    • Transcript-annotated reads (intragenic, intergenic, exonic, intronic)
    • Expression profiling efficiency (ratio of exon-derived reads to total reads sequenced)
    • Strand specificity
  • Coverage
    • Mean coverage (reads per base)
    • Mean coefficient of variation
    • 5'/3' bias
    • Coverage gaps: count, length
    • Coverage Plots
  • Downsampling
  • GC Bias
  • Correlation:
    • Between sample(s) and a reference expression profile
    • When run with multiple samples, the correlation between every sample pair is reported
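
As a small worked example of one of the metrics above, expression profiling efficiency is defined in the list as the ratio of exon-derived reads to total reads sequenced. A minimal Python sketch, with a hypothetical function name and made-up read counts in the note that follows:

```python
def expression_profiling_efficiency(exonic_reads, total_reads):
    """Ratio of exon-derived reads to total reads sequenced."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return exonic_reads / total_reads
```

For example, 750 exonic reads out of 1000 total reads would give an efficiency of 0.75.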

Example Reports

Some report examples are given in the GenePattern Help File.