SnpSelector.py

From GSA

Jump to: navigation, search

Contents

Introduction

snpSelector.py is an experimental feature-based optimizer for SNP calls. The tool accepts [VCF] formatted calls and recalibrates their QUAL field based on the relationship between ti/tv and arbitrary features in the VCF columns for INFO fields.

Running the tool

Downloading the files

Check out the [latest GATK from SVN to obtain a copy of the python/ directory. snpSelector.py lives there. It requires Python-2.4 (installed by default at the Broad).

Setting up the PYTHONPATH

You need to include the path to the GATK python directory in your PYTHONPATH variable. In TCSH, you do:

setenv PYTHONPATH ${PYTHONPATH}:<STING-TRUNK>/python

Where <STING-TRUNK> is the path to the local path where you checked out the Sting trunk. Equivalently, this is the parent directory where snpSelector.py lives.

snpSelector.py command line

Here's an example command-line that recalibrates the 1000 Genomes pilot 2 (deep coverage) from our newest Unified Genotyper using the SNP quality score field (QUAL), strand bias (SB), and depth of coverage (DP). It divides the data up into 20 bins and uses the relationship between the SNPs in these bins and their transition/transversion ratio to rescore the quality-score of the SNP itself. It assumes that whole-genome SNPs should have have a ti/tv of 2.1, which is a fairly robust assumption. It writes out the recalibrated SNP calls to recalCalls.vcf.

python python/snpSelector.py \
  -f QUAL,SB,DP \
  -o recalCalls.vcf \
   --titv 2.1 \ 
   -p 20 \
   NA12878.SLX.chr1.good.ug.vcf

The snpSelector.py script accepts a variety of standard command-line argument:

-f FIELDS, --f=FIELDS 
Comma-separated list of fields (either in the VCF columns of as INFO keys) to use during optimization [default: QUAL]
-p PARTITIONS, --partitions=PARTITIONS 
Number of partitions to use for each feature. Don't use so many that the number of variants per bin is very low. [default: 25]
--maxRecordsForCovariates=MAXRECORDSFORCOVARIATES 
Derive covariate information from up to this many VCF records. For files with more than this number of records, the system downsamples the reads [default: 200000]
-q MAXQSCORE, --qMax=MAXQSCORE  
The maximum Q score allowed for both a single covariate and the overall QUAL score [default: 30]
-o OUTPUTVCF, --outputVCF=OUTPUTVCF 
If provided, a VCF file will be written out to this file [default: none]
--titv=TITVTARGET  
If provided, we will optimize calls to the targeted : ti/tv rather than that calculated from known calls [default: none]
-b BOOTSTRAP, --bootstrap=BOOTSTRAP 
If provided, the % of the calls used to generate the recalibration tables. [default: none]

Example impact

This is from recalibrating the 1000 Genomes deep coverage CEU individual NA12878, using QUAL, SB, and DP from the GATK UnifiedGenotyper. The comparison is to trio-aware, SLX/454/SOLiD calls made by the 1000 Genomes project.

SNP error covariates quality score, strand bias, and depth of coverage vs. the transition/transversion ratio
Standard Specificity / Sensitivity plots for an experimental set of SNP calls from our Unified Genotyper before and after SNP quality recalibration.
TP, FP, and FN rates as a function of the recalibrated quality score
Personal tools