How do I run InVEx?

InVEx is mainly written in Python, requiring Python 2.6 (or 2.7) with the Cython, SciPy, NumPy, and PyFasta packages installed. There are four main stages to running InVEx: creating power fasta files, preprocessing, core InVEx, and report generation.

Creating Power FASTA Files

The create_nt_power_file_from_wig.py script converts a WIG file into a FASTA file where the power of each nucleotide is encoded at that nucleotide's location in the full hg19 FASTA file. The output filename must be related to the Tumor_Sample_Barcode entry in this individual's corresponding MAF file. Therefore, the <outputPrefix> input argument must match the Tumor_Sample_Barcode entry in the individual's MAF file, with one exception. If the Tumor_Sample_Barcode ends with "-Tumor", then the outputPrefix should strip off this extension (i.e. <outputPrefix>-Tumor). If the Tumor_Sample_Barcode doesn't contain the "-Tumor" extension, then the outputPrefix must match the Tumor_Sample_Barcode exactly.

Usage: python create_nt_power_file_from_wig.py [options] <referenceGenome> <inputWigPath> <outputPrefix>

Inputs

Required Arguments Description
<referenceGenome> Reference genome FASTA file.
<inputWigPath> Path to Mutector generated WIG.
<outputPrefix> Output filename prefix (i.e. power file is named <outputPrefix>.binary_power.fasta). This prefix must match the individual's Tumor_Sample_Barcode entry in their corresponding MAF file. If the Tumor_Sample_Barcode ends with a "-Tumor" extension, then do not include "-Tumor" in the outputPrefix (i.e. <outputPrefix>-Tumor).

Outputs

Output File Description
<outputPrefix>.binary_power.fasta.gz  Power file (gzipped).

Preprocessing

The InVExPreprocess.py script performs four tasks: generate an InVEx formatted individual set file, generate a concatenated MAF, create a *.pkl file from the concatenated MAF, and create a list of unique genes based on the concatenated MAF. Please note, the MAF file must be formatted correctly for InVEx to work properly. Therefore, the MAF must contain the following columns: Hugo_Symbol, Chromosome, Start_position, Variant_Classification, Variant_Type, Tumor_Seq_Allele1, Annotation_Transcript, Protein_Change, SwissProt_acc_Id, and UniProt_AApos. Please note, these columns must be named exactly as they are here.

Usage: python InVExPreprocess.py [options] <individualSetId> <powerPathsFile>

Inputs

Required Arguments Description
<individualSetId> Name of the individual set being examined.
<powerPathsFile> File containing paths to power files. 
Optional Arguments Description
-h, --help Display help,
--maf1 File containing paths to MAF files. MAF headers must all be similar.
--maf2 File containing paths to MAF files. MAF headers must all be similar.
--maf3 File containing paths to MAF files. MAF headers must all be similar.
--maf4 File containing paths to MAF files. MAF headers must all be similar.

Outputs

Output File Description
<individualSetId>.final_analysis_set.maf Aggregated MAF file.
<individualSetId>.hugo_gene_symbols.txt File containing set of unique genes encountered in MAF.
<individualSetId>.individual_set.txt File listing individuals and their corresponding individual sets.
PatientSet.load_maf.pkl Python MAF pickle file.

Core InVEx

There are three steps to running the core of InVEx. The first step (Prepare.py) copies and uncompresses all the power files into a single directory in a temporary location. It also prints to stdout the command line arguments for a set of jobs that can be run in parallel (InVEx.py) and a final postprocessing step (Gather.py).

Prepare

Usage: python Prepare.py <libdir> <numRunsPerJob> [options] <hugoGeneSymbols> <maf> ...

Required Arguments Description
<libdir> Path to source code (ignored argument).
<numRunsPerJob> Number of InVEx runs (one gene and all its transcripts per run) per job (50-100 recommended).
<hugoGeneSymbols> File containing list of Hugo gene symbols - one per line.
<maf> Concatenated MAF.
<powerFilesFile> File containing tab delimited individual to power file path (one line per individual).
<individualSetName> Individual set name.
<individualSet> Individual set file.
<genome> Reference genome file (i.e. hg19.fasta).
<ntClasses> NT Classes fasta file.
<gaf> GAF file.
<mutationsDir> Mutations directory.
<polyphen2Dir> PolyPhen2 directory.
<cosmicFilePath> COSMIC file path.
<genePeptideFile> File containing predicted peptide linked to predicted gene.
<swissProtRefFile> File containing number of amino acids in the shortest SwissProt transcript for each gene.
<tmpDir> Temporary directory where all the power files (including *.flat and *.gdx files) will be copied. If the power files are are gzipped, they will be gunzipped to this location - ensure that this location has enough space. These files are several orders of magnitude larger uncompressed (approximately 3-5 GB per uncompressed power file).
Optional Arguments Description
-k, --pkl MAF *.pkl file created in Preprocessing.
-p, --minperms Minimum permutations (Defaults to 100).
-m, --maxperms Maximum permutations (Defaults to 10,000,000).
-e, --minevents Minimum events (Defaults to 10): at end of permutations, if less than this many random instances had a score that met or exceeded that of the real instance, do another order of magnitude of permutations, as long as <maxperms> is not exceeded.

The Prepare.py stage prints to standard output the command line arguments for the InVEx.py script. The number of individual command lines will be the total number of genes in the <hugoGeneSymbols> file modulus the <numRunsPerJob> plus a final command line for the Gather.py script. These individual command lines are intended (but not required) to be run in parallel. When these jobs are finished, then the Gather.py script should be run.

Scatter

This is a potentially very long running step (several days for especially big genes). As mentioned above, there may be many scatter jobs to run depending on the <numRunsPerJob> argument prescribed above and the number of genes in your <hugoGeneSymbols> input file. Since the inputs are generated by the Prepare.py step, the usage description is not given here. Scatter's inputs are very similar to those of Prepare.py. One important note is that each scatter job MUST be executed in its own scatter subdirectory named scatter.<jobNumber> where jobNumber should be unique for every Scatter job (e.g. scatter.1, scatter.2, etc.). The Gather step below requires this directory structure so that it can find all the individual scatter jobs results for aggregation.

Gather

This step may also be long running if hundreds of individuals and thousands of genes are examined. Essentially, this step performs three jobs: It aggregates all the scatter results into one raw results file. It filters out any transcripts with no mutations at all (covered exons, introns, UTR) and any transcripts in which the number of amino acids in the given transcript is less than the number of amino acids in the shortest SwissProt transcript for the given gene (to avoid counting fragment transcripts). Finally, it computes and creates functional, loss-of-function, and synonymous mutation burden files. Each file contains only the lowest P-value transcript for every filtered gene.

Outputs

Output File Description
<individualSetName>.rawResults.txt Raw aggregated results file.
<individualSetName>.filteredResults.txt Filtered results file.
<individualSetName>.pph2Results.txt Functional mutation burden results file.
<individualSetName>.lofResults.txt Loss-of-function mutation burden results file.
<individualSetName>.synResults.txt Synonomous mutation burden results file.
InVExVersion.txt File containint InVEx version used in this analysis.

Report Generation

The report generation stage computes Q-values, generates a QQ plot, and a report containing the figure and tables with significant results. This stage is written in R and makes use of two R scripts: InVExReport.R and InVEx_FH_Qval_and_QQplot.R. The <moduleDir> argument is used to locate the InVEx_FH_Qval_and_QQplot.R script. Therefore, you can copy or create a symbolic link to InVExReport.R wherever you wish to run it. Usage and Input/Output details are shown below.

Usage: Rscript -e "source('InVExReport.R')" -e "main('-m<moduleDir>', '-i<individualSet>, '-p../invexCore/Melanoma.pph2Results.txt', '-l../invexCore/Melanoma.lofResults.txt', '-s../invexCore/Melanoma.synResults.txt', '-v../invexCore/InVExVersion.txt')"

Inputs

Required Arguments Description
-m<moduleDir> InVEx source code installation directory (path to InVEx-1.0.0).
-i<individualSet> Individual set file generated in preprocessing.
-p<pph2Results> Functional mutation burden results file.
-l<lofResults> Loss-of-function mutation burden results file.
-s<synResults> Synonomous mutation burden results file.
-v<version> InVEx version fle generated in Core InVEx.

Outputs

Output File Description
<individualSetId>.QQ-plot.png QQ plot.
<individualSetId>.pph2Results_withQvalues.txt Functional mutation burden results file with Q-values appended.
<individualSetId>.lofResults_withQvalues.txt Loss-of-function mutation burden results file with Q-values appended.
<individualSetId>.synResults_withQvalues.txt Synonomous mutation burden results file with Q-values appended.
nozzle.html Report containing figure and result tables of significant results.