Creating Variant Validation Sets
Posted in Methods and Workflows | Last updated on 2012-09-28 17:49:21


Introduction

ValidationSiteSelectorWalker is intended for experiments that sample randomly from a set of variants, for example to choose sites for a follow-up validation study. Sites are selected at random, subject to two kinds of restriction: sample restrictions and frequency restrictions. Sample restrictions alter the polymorphic/monomorphic status of sites by restricting the analysis to a given subset of samples. Frequency restrictions control how sites are drawn: either uniformly, or in accordance with the allele frequency spectrum of the input VCF.

GATK Documentation

For example command lines and a full list of arguments, please see the GATK documentation for this tool at Validation Site Selector.

Sample and Frequency Restrictions

-sampleMode

The -sampleMode argument controls the mode of sample-based site consideration. The options are:

  • None: All sites are included for consideration, including reference sites
  • Poly_based_on_gt: Site is included if it has a variant genotype in at least one of the selected samples
  • Poly_based_on_gl: Site is included if it is likely to be variant based on the genotype likelihoods of the selected samples
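
As a rough illustration of the first two modes, the sketch below shows the kind of check that "has a variant genotype in at least one of the selected samples" implies. It is not the tool's actual code: the function names, the genotype representation (tuples of allele indices, with 0 as reference and -1 as a missing call), and the sample names are all assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical representation: allele indices per sample,
# 0 = reference allele, >0 = alternate allele, -1 = missing call.
Genotype = Tuple[int, ...]


def is_poly_based_on_gt(genotypes: Dict[str, Genotype],
                        selected: Iterable[str]) -> bool:
    """Return True if any selected sample carries a non-reference allele."""
    for sample in selected:
        gt = genotypes.get(sample)
        if gt is None:
            continue  # sample absent at this site
        if any(allele > 0 for allele in gt):
            return True
    return False


def include_site(mode: str, genotypes: Dict[str, Genotype],
                 selected: Iterable[str]) -> bool:
    """Dispatch on an illustrative mode string (genotype-based modes only)."""
    if mode == "NONE":
        return True  # every site is considered, including reference sites
    if mode == "POLY_BASED_ON_GT":
        return is_poly_based_on_gt(genotypes, selected)
    raise ValueError(f"mode not covered by this sketch: {mode}")


# Made-up site: one heterozygous sample, one hom-ref, one missing.
site = {"NA12878": (0, 1), "NA12891": (0, 0), "NA12892": (-1, -1)}
```

With this made-up site, restricting to NA12878 makes the site polymorphic, while restricting to NA12891 and NA12892 makes it monomorphic, which is the sense in which the sample set changes a site's status.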

-samplePNonref

The -samplePNonref argument sets the probability threshold used by Poly_based_on_gl. That mode uses the exact allele frequency calculation model to estimate P[site is nonref], and a site is considered for validation if P[site is nonref] exceeds the value of this argument. For example, to validate sites that are more than 95% likely to be nonref (based on the likelihoods), set -sampleMode POLY_BASED_ON_GL -samplePNonref 0.95.
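
To make the thresholding concrete, here is a deliberately simplified sketch. It does not implement the exact allele frequency calculation model. Instead it assumes the samples are independent and puts a flat prior on genotypes, so that P[site is nonref] is one minus the probability that every selected sample is homozygous reference. The function names and the input layout (per-sample log10 genotype likelihoods, hom-ref first) are assumptions for illustration only.

```python
import math
from typing import Sequence


def p_hom_ref(log10_gls: Sequence[float]) -> float:
    """Posterior P(hom-ref) for one sample under a flat genotype prior.

    log10_gls holds log10 genotype likelihoods with hom-ref first,
    e.g. (hom-ref, het, hom-alt) for a biallelic site.
    """
    top = max(log10_gls)  # subtract the max for numerical stability
    weights = [10 ** (gl - top) for gl in log10_gls]
    return weights[0] / sum(weights)


def p_site_nonref(per_sample_log10_gls: Sequence[Sequence[float]]) -> float:
    """Simplified stand-in for P[site is nonref]: 1 - P(all samples hom-ref)."""
    p_all_ref = math.prod(p_hom_ref(gls) for gls in per_sample_log10_gls)
    return 1.0 - p_all_ref


def passes_p_nonref(per_sample_log10_gls: Sequence[Sequence[float]],
                    threshold: float) -> bool:
    """Keep the site only if the estimate is strictly above the threshold."""
    return p_site_nonref(per_sample_log10_gls) > threshold


# Made-up likelihoods: one confident het plus one confident hom-ref sample,
# versus two confident hom-ref samples.
likely_variant = [(-5.0, 0.0, -5.0), (0.0, -5.0, -5.0)]
likely_reference = [(0.0, -5.0, -5.0), (0.0, -5.0, -5.0)]
```

Under these assumptions, a threshold of 0.95 keeps the first made-up site and rejects the second. The real model shares information across samples through the allele frequency, so its estimates will generally differ from this approximation.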

-frequencySelectionMode

The -frequencySelectionMode argument controls the mode of frequency matching for site selection. The options are:

  • Uniform: Choose variants uniformly, without regard to their allele frequency.
  • Keep AF Spectrum: Choose variants so that the allele frequency spectrum of the selected sites matches that of the input VCF as closely as possible.
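
The difference between the two modes can be sketched as follows. This is an illustration of the general idea rather than the tool's implementation: it assumes each site is an (id, allele frequency) pair, and it approximates "keep the spectrum" by binning sites by allele frequency and drawing from each bin in proportion to its size. The function names and the bin count are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Site = Tuple[str, float]  # (site identifier, allele frequency in [0, 1])


def select_uniform(sites: List[Site], n: int, rng: random.Random) -> List[Site]:
    """Draw n sites uniformly at random, ignoring allele frequency."""
    return rng.sample(sites, min(n, len(sites)))


def select_keep_af_spectrum(sites: List[Site], n: int, rng: random.Random,
                            num_bins: int = 10) -> List[Site]:
    """Draw about n sites so each AF bin keeps its share of the input.

    Per-bin quotas are rounded, so the total can differ slightly from n.
    """
    bins: Dict[int, List[Site]] = defaultdict(list)
    for site in sites:
        # Clamp AF == 1.0 into the last bin.
        idx = min(int(site[1] * num_bins), num_bins - 1)
        bins[idx].append(site)

    chosen: List[Site] = []
    for idx in sorted(bins):
        members = bins[idx]
        quota = round(n * len(members) / len(sites))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return chosen


# Made-up input: 80 rare sites (AF 0.05) and 20 common sites (AF 0.55).
example_sites = ([(f"rare{i}", 0.05) for i in range(80)]
                 + [(f"common{i}", 0.55) for i in range(20)])
picked = select_keep_af_spectrum(example_sites, 10, random.Random(0))
```

With the made-up input above, the spectrum-preserving draw always returns 8 rare and 2 common sites, mirroring the 80/20 split, while the split returned by the uniform draw varies from one run to the next.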
