SnpSelector.py
From GSA
Contents |
Introduction
snpSelector.py is an experimental feature-based optimizer for SNP calls. The tool accepts [VCF] formatted calls and recalibrates their QUAL field based on the relationship between ti/tv and arbitrary features in the VCF columns for INFO fields.
Running the tool
Downloading the files
Check out the [latest GATK from SVN to obtain a copy of the python/ directory. snpSelector.py lives there. It requires Python-2.4 (installed by default at the Broad).
Setting up the PYTHONPATH
You need to include the path to the GATK python directory in your PYTHONPATH variable. In TCSH, you do:
setenv PYTHONPATH ${PYTHONPATH}:<STING-TRUNK>/python
Where <STING-TRUNK> is the path to the local path where you checked out the Sting trunk. Equivalently, this is the parent directory where snpSelector.py lives.
snpSelector.py command line
Here's an example command-line that recalibrates the 1000 Genomes pilot 2 (deep coverage) from our newest Unified Genotyper using the SNP quality score field (QUAL), strand bias (SB), and depth of coverage (DP). It divides the data up into 20 bins and uses the relationship between the SNPs in these bins and their transition/transversion ratio to rescore the quality-score of the SNP itself. It assumes that whole-genome SNPs should have have a ti/tv of 2.1, which is a fairly robust assumption. It writes out the recalibrated SNP calls to recalCalls.vcf.
python python/snpSelector.py \ -f QUAL,SB,DP \ -o recalCalls.vcf \ --titv 2.1 \ -p 20 \ NA12878.SLX.chr1.good.ug.vcf
The snpSelector.py script accepts a variety of standard command-line argument:
- -f FIELDS, --f=FIELDS
- Comma-separated list of fields (either in the VCF columns of as INFO keys) to use during optimization [default: QUAL]
- -p PARTITIONS, --partitions=PARTITIONS
- Number of partitions to use for each feature. Don't use so many that the number of variants per bin is very low. [default: 25]
- --maxRecordsForCovariates=MAXRECORDSFORCOVARIATES
- Derive covariate information from up to this many VCF records. For files with more than this number of records, the system downsamples the reads [default: 200000]
- -q MAXQSCORE, --qMax=MAXQSCORE
- The maximum Q score allowed for both a single covariate and the overall QUAL score [default: 30]
- -o OUTPUTVCF, --outputVCF=OUTPUTVCF
- If provided, a VCF file will be written out to this file [default: none]
- --titv=TITVTARGET
- If provided, we will optimize calls to the targeted : ti/tv rather than that calculated from known calls [default: none]
- -b BOOTSTRAP, --bootstrap=BOOTSTRAP
- If provided, the % of the calls used to generate the recalibration tables. [default: none]
Example impact
This is from recalibrating the 1000 Genomes deep coverage CEU individual NA12878, using QUAL, SB, and DP from the GATK UnifiedGenotyper. The comparison is to trio-aware, SLX/454/SOLiD calls made by the 1000 Genomes project.
