GenotypeAndValidate

Genotypes a dataset and validates the calls of another dataset using the Unified Genotyper.

Category: Validation Utilities

Traversal: LocusWalker

PartitionBy: LOCUS


Overview

Genotype and Validate is a tool to evaluate the quality of a dataset for calling SNPs and Indels given a secondary (validation) data source. The data sources are BAM or VCF files. You can use them interchangeably (i.e. a BAM to validate calls in a VCF or a VCF to validate calls on a BAM).

The simplest scenario is when you have a VCF of hand-annotated SNPs and Indels, and you want to know how well a particular technology performs at calling these variants. With a dataset (BAM file) generated by the technology under test and the hand-annotated VCF, you can run GenotypeAndValidate to assess the accuracy of the calls made from the new technology's dataset.

Another option is to validate the calls in a VCF file using a deep-coverage BAM file whose calls you trust. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare them to the calls in the VCF file and produce a truth table.

Input

A BAM file to make calls on and a VCF file to use as the truth (validation) dataset. You can also invert the roles of the two files using the -bt command-line option listed below.

Output

GenotypeAndValidate has two outputs: the truth table and an optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this:

              ALT                    REF                    Predictive Value
called alt    True Positive (TP)     False Positive (FP)    Positive PV
called ref    False Negative (FN)    True Negative (TN)     Negative PV

The positive predictive value (PPV) is the proportion of sites called alt that are truly alt: PPV = TP / (TP + FP).

The negative predictive value (NPV) is the proportion of sites called ref that are truly ref: NPV = TN / (TN + FN).
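
As a minimal sketch of how the two predictive values follow from the four cells of the truth table, the Python snippet below computes them from raw counts. The function name and the example counts are illustrative assumptions, not output from GenotypeAndValidate.

```python
def predictive_values(tp, fp, fn, tn):
    """Return (PPV, NPV) for a 2x2 truth table.

    A value is None when its row of the table is empty, since the
    ratio is undefined in that case.
    """
    called_alt = tp + fp  # first row of the table: sites called alt
    called_ref = fn + tn  # second row of the table: sites called ref
    ppv = tp / called_alt if called_alt else None
    npv = tn / called_ref if called_ref else None
    return ppv, npv


# Illustrative counts only (not real results).
ppv, npv = predictive_values(tp=90, fp=10, fn=5, tn=95)
print(f"PPV={ppv:.2f} NPV={npv:.2f}")
```

Returning None for an empty row, rather than 0, keeps "no sites called" distinguishable from "every call was wrong".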

The VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters. This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).

Here is an example of an annotated VCF file (INFO field clipped for clarity):

 #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
 1   20568807    .   C   T   0    HapMapHet        AC=1;AF=0.50;AN=2;DP=0;GV=T  GT  0/1
 1   22359922    .   T   C   282  WG-CG-HiSeq      AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ  1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99    ./.
 13  102391461   .   G   A   341  Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ  ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99   ./.
 1   175516757   .   C   G   655  SnpCluster,WG    AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ  ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99   ./.
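
To inspect the annotations in a record like the ones above, the INFO column can be split into key/value pairs. The sketch below is a generic VCF INFO parser written for illustration; the helper name is an assumption, and it only extracts the GV key seen in the example without interpreting it.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue  # skip empty entries and the missing-value marker
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


# First record of the example above. Real VCF files are tab-delimited;
# splitting on any whitespace handles both that and the aligned layout here.
record = "1   20568807    .   C   T   0    HapMapHet        AC=1;AF=0.50;AN=2;DP=0;GV=T  GT  0/1"
columns = record.split()
info = parse_info(columns[7])  # INFO is the eighth VCF column
print(columns[0], columns[1], info["GV"])
```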
 

Additional Details

  • Always pass your VCF track to -L so that the GATK only looks at the sites in the VCF file. This speeds up the process considerably.
  • The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, which trigger one call per inserted or deleted base (e.g. ACTG/- counts as 4 genotyper calls but is only one line in the VCF).

Examples

  1. Genotyping a BAM file from a new technology, using the VCF as the truth dataset:

       java \
          -jar /GenomeAnalysisTK.jar \
          -T GenotypeAndValidate \
          -R human_g1k_v37.fasta \
          -I myNewTechReads.bam \
          -alleles handAnnotatedVCF.vcf \
          -L handAnnotatedVCF.vcf

  2. Using a BAM file as the truth dataset:

       java \
          -jar /GenomeAnalysisTK.jar \
          -T GenotypeAndValidate \
          -R human_g1k_v37.fasta \
          -I myTruthDataset.bam \
          -alleles callsToValidate.vcf \
          -L callsToValidate.vcf \
          -bt \
          -o gav.vcf
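
When these runs are scripted, it can help to assemble the argument list programmatically so that the same VCF is always passed to both -alleles and -L. The Python sketch below only builds and prints the command; it does not run GATK. The function name and its parameters are hypothetical, and the flags are limited to those shown in the examples above.

```python
import shlex


def build_gav_command(jar, reference, bam, alleles, bam_truth=False, out=None):
    """Assemble a GenotypeAndValidate command line as a list of arguments."""
    cmd = [
        "java", "-jar", jar,
        "-T", "GenotypeAndValidate",
        "-R", reference,
        "-I", bam,
        "-alleles", alleles,
        "-L", alleles,  # restrict traversal to the sites in the same VCF
    ]
    if bam_truth:
        cmd.append("-bt")  # treat the calls made on the BAM as truth
    if out is not None:
        cmd += ["-o", out]  # optional annotated VCF output
    return cmd


cmd = build_gav_command(
    "/GenomeAnalysisTK.jar", "human_g1k_v37.fasta",
    "myTruthDataset.bam", "callsToValidate.vcf",
    bam_truth=True, out="gav.vcf",
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the paths point at real files.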
     

    Additional Information

    Read filters

    These Read Filters are automatically applied to the data by the Engine before processing by GenotypeAndValidate.

    Parallelism options

    This tool can be run in multi-threaded mode using this option.

    Downsampling settings

    This tool applies the following downsampling settings by default.

    • Mode: BY_SAMPLE
    • To coverage: 1,000

    Window size

    This tool uses a sliding window on the reference.

    • Window start: 200 bp before the locus
    • Window stop: 200 bp after the locus
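
For a concrete sense of the window, the sketch below computes the reference span around a locus. It assumes 1-based coordinates and clamps the window to the contig boundaries; the clamping and the function name are assumptions made for illustration, not a description of the engine's actual behavior.

```python
def reference_window(locus, before=200, after=200, contig_length=None):
    """Return the (start, stop) reference span around a 1-based locus."""
    start = max(1, locus - before)  # assumed: never extend before base 1
    stop = locus + after
    if contig_length is not None:
        stop = min(stop, contig_length)  # assumed: never extend past the contig end
    return start, stop


# Position taken from the first record of the example VCF above.
print(reference_window(20568807))
```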

    Command-line Arguments

    Inherited arguments

    The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

    GenotypeAndValidate specific arguments

    This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.

    Argument name(s)                                                      Default value   Summary

    Required Inputs
    --alleles                                                             NA              The set of alleles at which to genotype

    Optional Outputs
    --out / -o                                                            stdout          Generate a VCF file with the variants considered by the walker, with a new annotation "callStatus" which will carry the value called in the validation VCF or BAM file

    Optional Parameters
    --condition_on_depth / -depth                                         -1              Condition validation on a minimum depth of coverage by the reads
    --maximum_deletion_fraction / -deletions                              -1.0            Maximum deletion fraction for calling a genotype
    --minimum_base_quality_score / -mbq                                   -1              Minimum base quality score for calling a genotype
    --standard_min_confidence_threshold_for_calling / -stand_call_conf    -1.0            The minimum Phred-scaled Qscore threshold to separate high confidence from low confidence calls
    --standard_min_confidence_threshold_for_emitting / -stand_emit_conf   -1.0            The minimum Phred-scaled Qscore threshold to emit low confidence calls

    Optional Flags
    --set_bam_truth / -bt                                                 false           Use the calls on the reads (BAM file) as the truth dataset and validate the calls on the VCF

    Argument details

    Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


    --alleles / -alleles

    The set of alleles at which to genotype
    The callset to be used as truth (default) or validated (if BAM file is set to truth).

    --alleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

    Required  RodBinding[VariantContext]


    --condition_on_depth / -depth

    Condition validation on a minimum depth of coverage by the reads
    Only validate sites that have at least a given depth

    int  -1  [ [ -∞  ∞ ] ]


    --maximum_deletion_fraction / -deletions

    Maximum deletion fraction for calling a genotype
    The maximum deletion fraction allowed in a site for calling a genotype. This argument is passed to the Unified Genotyper.

    double  -1.0  [ [ -∞  ∞ ] ]


    --minimum_base_quality_score / -mbq

    Minimum base quality score for calling a genotype
    The minimum base quality score necessary for a base to be considered when calling a genotype. This argument is passed to the Unified Genotyper.

    int  -1  [ [ -∞  ∞ ] ]


    --out / -o

    Generate a VCF file with the variants considered by the walker, with a new annotation "callStatus" which will carry the value called in the validation VCF or BAM file
    The optional output file that will contain all the variants used in the Genotype and Validate assay.

    VariantContextWriter  stdout


    --set_bam_truth / -bt

    Use the calls on the reads (bam file) as the truth dataset and validate the calls on the VCF
    Makes the Unified Genotyper calls on the BAM file the truth dataset and validates the callset bound to the alleles ROD binding.

    boolean  false


    --standard_min_confidence_threshold_for_calling / -stand_call_conf

    The minimum Phred-scaled Qscore threshold to separate high confidence from low confidence calls
    Calls with a Phred-scaled Qscore at or above this threshold are treated as high confidence; calls below it are treated as low confidence. This argument is passed to the Unified Genotyper.

    double  -1.0  [ [ -∞  ∞ ] ]


    --standard_min_confidence_threshold_for_emitting / -stand_emit_conf

    The minimum Phred-scaled Qscore threshold to emit low confidence calls
    The minimum Phred-scaled Qscore a low confidence call must reach to be emitted. This argument is passed to the Unified Genotyper.

    double  -1.0  [ [ -∞  ∞ ] ]
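
Both confidence thresholds are expressed on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). The sketch below converts between the two to help pick thresholds; the function names are illustrative, and the scores 10 and 30 are example values, not the tool's defaults.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred-scaled score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability p (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)


# Example scores only; these are not the defaults used by GenotypeAndValidate.
for q in (10, 30):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")
```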



    GATK version 3.2-2-gec30cee built at 2014/07/17 17:54:48. GTD: NA