Example InVEx run
There are four stages involved in running InVEx: creating power files, preprocessing, running InVEx core, and generating a report. The first and third stages (create power files and InVEx core) should be run in parallel. However, to reduce runtime and obviate the need for parallelising jobs, this example only examines three samples. To start, you will want to set up a test directory structure similar to the one described here. Create a top-level InVEx directory. In this directory, create a ref directory that contains all the reference files described in the How do I get InVEx? page. Next, create a subdirectory for each InVEx stage: createPowerFiles, preprocess, core, and report. Finally, either install the source code to this directory or create a soft link to the InVEx-1.0.0 directory of your source code installation. Now you are ready to begin.
Create Power Files
The first step in running InVEx is the creation of binary power FASTA files. These files are generated from each sample's wiggle file. For this example, we will be analyzing three samples. The code inputs a single wiggle file and generates a single binary power FASTA file. Therefore, it is well suited for parallelization. Start by downloading the three example input wiggle files tarball: wigFiles.tar. Untar this tarball into your createPowerFiles subdirectory. Furthermore, create a symbolic link in this subdirectory to the create_nt_power_file_from_wig.py script in the InVEx source code. Finally, you are ready to run. Issue the following command:
python create_nt_power_file_from_wig.py ../ref/hg19.fasta ME001.coverage.wig.txt.gz ME001
Do the same for the ME002 and ME009 example wiggle files. Your output should match these expected output files: powerFiles.tar. Please note, the gzipped files must be uncompressed to compare.
The preprocessing stage generates four input files for InVEx core: a gene symbols file, individual set file, aggregated MAF, and a patient set pickle. Create a directory named preprocess if you have not already done so, enter the preprocess directory, and add a symbolic link to the InVExPreprocess.py script found in your InVEx source distribution (ln -s ../InVEx-1.0.0/invex/InVExPreprocess.R). Do not copy InVExPreprocess.py, otherwise the script will fail attempting to locate dependencies. Next, download the required inputs file preprocessInputs.tgz. Unpackage them in your preprocess directory: tar xvzf preprocessInputs.tgz. This should reveal a MAFs directory and three text files describing the location of the MAFs and power files: indelsMAFPaths.txt, substitutionsMAFPaths.txt, and powerPaths.txt. The powerPaths.txt file contains relative paths to the power files created in the previous stage. Check that these paths are accurate. Furthermore, examine the MAF paths files to ensure the relative paths point to the MAFs that should have been unpackaged from the preprocessInputs.tgz file. You should now be ready to run the preprocessing stage. Issue the following command:
python InVExPreprocess.py --maf1 substitutionsMAFPaths.txt --maf2 indelsMAFPaths.txt Melanoma powerPaths.txt
This should produce the outputs contained in preprocessOutputs.tgz.
The core InVEx stage is split up into three sequential steps amenable to parallelization (Prepare, Scatter, and Gather). The Prepare step prints to standard output the command line arguments for the the Scatter and Gather steps. Each set of command line arguments is contained in its own line. And each line (Scatter arguments), except for the last (Gather arguments), can be performed in parallel. The last line of arguments are the command line arguments to the Gather step. The final step should be run after all of the parallel jobs have completed. In this test example, we are only running one job. Create a core directory if you have not already done so, enter that directory, and add symbolic links to Prepare.py, InVEx.py, and Gather.py found in your source code installation directory:
ln -s ../InVEx-1.0.0/invex/Prepare.py
ln -s ../InVEx-1.0.0/invex/InVEx.py
ln -s ../InVEx-1.0.0/invex/Gather.py
The first step prepares this stage for running. It may take a long time to run, because it unzips all the power files created in the preprocessor stage to a local directory. Therefore, running InVEx on hundreds of power files may take >1 day to complete. For our example, this step should take on the order of minutes. If you ran the first two stages and downloaded all the reference files (as described in How do I get InVEx?), you should have all but one of the inputs required for running this step. Since we want this test to run fairly quickly, we are only going to run InVEx on a single gene. Therefore, we are going to use a reduced gene symbols file: Melanoma.hugo_gene_symbols _reduced.txt. Add this input to your core working directory. Then, issue the command below. Please note, the following command may require modification if you have not followed the directory structure suggested at the beginning.
python Prepare.py . 50 Melanoma.hugo_gene_symbols_reduced.txt ../preprocess/Melanoma.final_analysis_set.maf ../preprocess/powerPaths.txt Melanoma ../preprocess/Melanoma.individual_set.txt ../ref/hg19.fasta ../ref/hg19_encoded_by_trinucleotide.fasta ../ref/TCGA.hg19.June2011.gaf ../ref/hg19 ../ref/pph2_whpss_reduced ../ref/cosmic_num_times_each_chr_pos_mutated.tab ../ref/knownGenePep.txt ../ref/swissprot_smallest_isoform_protein_length_by_gene .
This should produce something similar to the following two output lines. Notice that the relative paths have been replaced with absolute paths, so your output will be slightly different:
-p 100 -m 10000000 -e 10 PTEN /InVExExample/preprocess/Melanoma.final_analysis_set.maf /InVExExample/core/unzipped_power_files Melanoma /InVExExample/preprocess/Melanoma.individual_set.txt /InVExExample/ref/hg19.fasta /InVExExample/ref/hg19_encoded_by_trinucleotide.fasta /InVExExample/ref/TCGA.hg19.June2011.gaf /InVExExample/ref/hg19 /InVExExample/ref/pph2_whpss_reduced /InVExExample/ref/cosmic_num_times_each_chr_pos_mutated.tab
Melanoma /InVExExample/ref/knownGenePep.txt /InVExExample/ref/swissprot_smallest_isoform_protein_length_by_gene /InVExExample/core/unzipped_power_files
The first line of arguments is input to the InVEx.py script in the Scatter step, and the second line of arguments is the input to the Gather.py script in the Gather step. If you chose to run >50 genes (the 50 in the prepare arguments indicates the number of genes to run per job) then there would be >1 lines of input arguments for the Scatter step. These are intended to be run in parallel. .
Running the scatter step is fairly straightforward. Simply precede the first line of arguments given in the Prepare step with python ../InVEx.py. Notice that InVEx.py is preceded with '../'. This is due to the fact that you must create a subdirectory with the naming scheme scatter.<jobNumber> where you will run the scatter job. Typically, the subdirectories the scatter jobs are run in are named scatter.1, scatter.2, etc. Since we are only running one scatter job, create a single subdirectory named scatter.1 and enter that directory to run. Your command line should look something like this:
python ../InVEx.py -p 100 -m 10000000 -e 10 PTEN /InVExExample/preprocess/Melanoma.final_analysis_set.maf /InVExExample/core/unzipped_power_files Melanoma /InVExExample/preprocess/Melanoma.individual_set.txt /InVExExample/ref/hg19.fasta /InVExExample/ref/hg19_encoded_by_trinucleotide.fasta /InVExExample/ref/TCGA.hg19.June2011.gaf /InVExExample/ref/hg19 /InVExExample/ref/pph2_whpss_reduced /InVExExample/ref/cosmic_num_times_each_chr_pos_mutated.tab
This should produce an Output directory with the contents found in Output.tar. The results will not be an exact match due to the non-deterministic nature of the method. However, they should look similar.
Running the gather step is also straightword. Simply precede the second line of arguments given in the Prepare step with python Gather.py and run in the parent directory where your scatter.# directories live (i.e. your core working directory). Your command line should look something like this:
python Gather.py Melanoma /InVExExample/ref/knownGenePep.txt /InVExExample/ref/swissprot_smallest_isoform_protein_length_by_gene /InVExExample/core/unzipped_power_files
This should produce the result files invexCoreOutputs.tgz. Again, the non-deterministic nature of this algorithm will cause the results to not be exactly the same. However, they should be similar.
The last stage computes Q-values, generates a QQ plot, and a report containing tables with significant results. Create a report directory if you have not already done so, enter that direcdtory, and create a soft link to the InVExReport.py script in your source code distribution (ln -s ../InVEx-1.0.0/invex/InVExReport.R). The run the following command.
Rscript -e "source('InVExReport.R')" -e "main('-m../InVEx-1.0.0/invex/', '-i../preprocess/Melanoma.individual_set.txt', '-p../core/Melanoma.pph2Results.txt', '-l../core/Melanoma.lofResults.txt', '-s../core/Melanoma.synResults.txt', '-v../core/InVExVersion.txt')"
This should produce the result files in reportOutputs.tgz. Unfortunately, running InVEx on three individuals and a single gene doesn't produce exciting outputs. You will notice that the report is virtually empty. Feel free to try running the report on the supplied example outputs on the How do I run InVEx? page to generate a more interesting report.