How do I run InVEx?
InVEx is mainly written in Python, requiring Python 2.6 (or 2.7) with the Cython, SciPy, NumPy, and PyFasta packages installed. There are four main stages to running InVEx: creating power fasta files, preprocessing, core InVEx, and report generation.
Creating Power FASTA Files
The create_nt_power_file_from_wig.py script converts a WIG file into a FASTA file where the power of each nucleotide is encoded at that nucleotide's location in the full hg19 FASTA file. The output filename must be related to the Tumor_Sample_Barcode entry in this individual's corresponding MAF file. Therefore, the <outputPrefix> input argument must match the Tumor_Sample_Barcode entry in the individual's MAF file, with one exception. If the Tumor_Sample_Barcode ends with "-Tumor", then the outputPrefix should strip off this extension (i.e. <outputPrefix>-Tumor). If the Tumor_Sample_Barcode doesn't contain the "-Tumor" extension, then the outputPrefix must match the Tumor_Sample_Barcode exactly.
Usage: python create_nt_power_file_from_wig.py [options] <referenceGenome> <inputWigPath> <outputPrefix>
The InVExPreprocess.py script performs four tasks: generate an InVEx formatted individual set file, generate a concatenated MAF, create a *.pkl file from the concatenated MAF, and create a list of unique genes based on the concatenated MAF. Please note, the MAF file must be formatted correctly for InVEx to work properly. Therefore, the MAF must contain the following columns: Hugo_Symbol, Chromosome, Start_position, Variant_Classification, Variant_Type, Tumor_Seq_Allele1, Annotation_Transcript, Protein_Change, SwissProt_acc_Id, and UniProt_AApos. Please note, these columns must be named exactly as they are here.
Usage: python InVExPreprocess.py [options] <individualSetId> <powerPathsFile>
There are three steps to running the core of InVEx. The first step (Prepare.py) copies and uncompresses all the power files into a single directory in a temporary location. It also prints to stdout the command line arguments for a set of jobs that can be run in parallel (InVEx.py) and a final postprocessing step (Gather.py).
Usage: python Prepare.py <libdir> <numRunsPerJob> [options] <hugoGeneSymbols> <maf> ...
The Prepare.py stage prints to standard output the command line arguments for the InVEx.py script. The number of individual command lines will be the total number of genes in the <hugoGeneSymbols> file modulus the <numRunsPerJob> plus a final command line for the Gather.py script. These individual command lines are intended (but not required) to be run in parallel. When these jobs are finished, then the Gather.py script should be run.
This is a potentially very long running step (several days for especially big genes). As mentioned above, there may be many scatter jobs to run depending on the <numRunsPerJob> argument prescribed above and the number of genes in your <hugoGeneSymbols> input file. Since the inputs are generated by the Prepare.py step, the usage description is not given here. Scatter's inputs are very similar to those of Prepare.py. One important note is that each scatter job MUST be executed in its own scatter subdirectory named scatter.<jobNumber> where jobNumber should be unique for every Scatter job (e.g. scatter.1, scatter.2, etc.). The Gather step below requires this directory structure so that it can find all the individual scatter jobs results for aggregation.
This step may also be long running if hundreds of individuals and thousands of genes are examined. Essentially, this step performs three jobs: It aggregates all the scatter results into one raw results file. It filters out any transcripts with no mutations at all (covered exons, introns, UTR) and any transcripts in which the number of amino acids in the given transcript is less than the number of amino acids in the shortest SwissProt transcript for the given gene (to avoid counting fragment transcripts). Finally, it computes and creates functional, loss-of-function, and synonymous mutation burden files. Each file contains only the lowest P-value transcript for every filtered gene.
The report generation stage computes Q-values, generates a QQ plot, and a report containing the figure and tables with significant results. This stage is written in R and makes use of two R scripts: InVExReport.R and InVEx_FH_Qval_and_QQplot.R. The <moduleDir> argument is used to locate the InVEx_FH_Qval_and_QQplot.R script. Therefore, you can copy or create a symbolic link to InVExReport.R wherever you wish to run it. Usage and Input/Output details are shown below.
Usage: Rscript -e "source('InVExReport.R')" -e "main('-m<moduleDir>', '-i<individualSet>, '-p../invexCore/Melanoma.pph2Results.txt', '-l../invexCore/Melanoma.lofResults.txt', '-s../invexCore/Melanoma.synResults.txt', '-v../invexCore/InVExVersion.txt')"