ValidationAmplicons

Creates FASTA sequences for use in Sequenom or PCR utilities for site amplification and subsequent validation

Category Validation Utilities

Traversal LocusWalker

PartitionBy LOCUS


Overview

ValidationAmplicons consumes a VCF and an interval list and produces FASTA sequences from which PCR primers or probe sequences can be designed. In addition, it uses BWA to check the specificity of tracts of bases within the output amplicon and lower-cases non-specific tracts, lets users provide sites to mask out, and reports reasons why a site may fail validation (nearby variation, for example).

Input

Requires a VCF containing the alleles to design amplicons for, a VCF of variants to mask out of the amplicons, and an interval list defining the size of the amplicons around the sites to be validated.

Output

Output is a FASTA-formatted file with some modifications at probe sites. For instance:

 >20:207414 INSERTION=1,VARIANT_TOO_NEAR_PROBE=1, 20_207414
 CCAACGTTAAGAAAGAGACATGCGACTGGGTgcggtggctcatgcctggaaccccagcactttgggaggccaaggtgggc[A/G*]gNNcacttgaggtcaggagtttgagaccagcctggccaacatggtgaaaccccgtctctactgaaaatacaaaagttagC
 >20:792122 Valid 20_792122
 TTTTTTTTTagatggagtctcgctcttatcgcccaggcNggagtgggtggtgtgatcttggctNactgcaacttctgcct[-/CCC*]cccaggttcaagtgattNtcctgcctcagccacctgagtagctgggattacaggcatccgccaccatgcctggctaatTT
 >20:994145 Valid 20_994145
 TCCATGGCCTCCCCCTGGCCCACGAAGTCCTCAGCCACCTCCTTCCTGGAGGGCTCAGCCAAAATCAGACTGAGGAAGAAG[AAG/-*]TGGTGGGCACCCACCTTCTGGCCTTCCTCAGCCCCTTATTCCTAGGACCAGTCCCCATCTAGGGGTCCTCACTGCCTCCC
 >20:1074230 SITE_IS_FILTERED=1, 20_1074230
 ACCTGATTACCATCAATCAGAACTCATTTCTGTTCCTATCTTCCACCCACAATTGTAATGCCTTTTCCATTTTAACCAAG[T/C*]ACTTATTATAtactatggccataacttttgcagtttgaggtatgacagcaaaaTTAGCATACATTTCATTTTCCTTCTTC
 >20:1084330 DELETION=1, 20_1084330
 CACGTTCGGcttgtgcagagcctcaaggtcatccagaggtgatAGTTTAGGGCCCTCTCAAGTCTTTCCNGTGCGCATGG[GT/AC*]CAGCCCTGGGCACCTGTNNNNNNNNNNNNNTGCTCATGGCCTTCTAGATTCCCAGGAAATGTCAGAGCTTTTCAAAGCCC
are amplicon sequences resulting from running the tool. The flags (preceding the sequence itself) can be:
 Valid                     // amplicon is valid
 SITE_IS_FILTERED=1        // validation site is not marked 'PASS' or '.' in its filter field ("you are trying to validate a filtered variant")
 VARIANT_TOO_NEAR_PROBE=1  // there is a variant too near to the variant to be validated, potentially shifting the mass-spec peak
 MULTIPLE_PROBES=1,        // multiple variants to be validated found inside the same amplicon
 DELETION=6,INSERTION=5,   // 6 deletions and 5 insertions found inside the amplicon region (from the "mask" VCF), will be potentially difficult to validate
 DELETION=1,               // deletion found inside the amplicon region, could shift mass-spec peak
 START_TOO_CLOSE,          // variant is too close to the start of the amplicon region to give Sequenom a good chance to find a suitable primer
 END_TOO_CLOSE,            // variant is too close to the end of the amplicon region to give Sequenom a good chance to find a suitable primer
 NO_VARIANTS_FOUND,        // no variants found within the amplicon region
 INDEL_OVERLAPS_VALIDATION_SITE, // an insertion or deletion interferes directly with the site to be validated (i.e. an insertion directly preceding or following the site, or a deletion that spans the site itself)
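
For downstream scripting, the header flags and the bracketed allele block can be pulled apart with a few lines of Python. This is a minimal sketch based only on the sample records shown above, not part of GATK: the names `Amplicon` and `parse_amplicon` are hypothetical, and the order of the two alleles inside the brackets is not stated here, so they are simply reported as first and second.

```python
import re
from typing import Dict, NamedTuple


class Amplicon(NamedTuple):
    locus: str             # e.g. "20:207414"
    valid: bool            # True when the status field is exactly "Valid"
    flags: Dict[str, str]  # e.g. {"INSERTION": "1"}; value is "" for bare flags
    name: str              # e.g. "20_207414"
    left: str              # bases before the bracketed allele block
    first_allele: str      # allele before the "/" (meaning assumed, not documented)
    second_allele: str     # allele after the "/", marked with "*" in the output
    right: str             # bases after the bracketed allele block


# Matches blocks such as [A/G*], [-/CCC*] or [AAG/-*].
ALLELE_BLOCK = re.compile(r"\[([^/\]]+)/([^*\]]+)\*\]")


def parse_amplicon(header: str, sequence: str) -> Amplicon:
    """Split one header line and its sequence line into structured fields."""
    fields = header.lstrip(">").split()
    locus, name = fields[0], fields[-1]
    status = "".join(fields[1:-1])

    flags: Dict[str, str] = {}
    for token in status.split(","):
        if token and token != "Valid":  # skip the empty token after a trailing comma
            key, _, value = token.partition("=")
            flags[key] = value

    match = ALLELE_BLOCK.search(sequence)
    if match is None:
        raise ValueError("no bracketed allele block found in sequence")

    return Amplicon(
        locus=locus,
        valid=(status == "Valid"),
        flags=flags,
        name=name,
        left=sequence[:match.start()],
        first_allele=match.group(1),
        second_allele=match.group(2),
        right=sequence[match.end():],
    )
```

Lower-case and N bases in `left` and `right` are left untouched, so a caller can still count masked or non-specific positions in each flank.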
 

Examples

    java \
      -jar GenomeAnalysisTK.jar \
      -T ValidationAmplicons \
      -R /humgen/1kg/reference/human_g1k_v37.fasta \
      -L:table interval_table.table \
      -ProbeIntervals:table interval_table.table \
      -ValidateAlleles:vcf sites_to_validate.vcf \
      -MaskAlleles:vcf mask_sites.vcf \
      --virtualPrimerSize 30 \
      -o probes.fasta
 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by ValidationAmplicons.

Downsampling settings

This tool applies the following downsampling settings by default.

  • Mode: BY_SAMPLE
  • To coverage: 1,000

Command-line Arguments

Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

ValidationAmplicons specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

Argument name(s)                   Default value  Summary

Required Inputs
--MaskAlleles                      NA             A VCF containing the sites you want to MASK from the designed amplicon (e.g. by Ns or lower-cased bases)
--ProbeIntervals                   NA             A collection of intervals in table format with optional names that represent the intervals surrounding the probe sites amplicons should be designed for
--ValidateAlleles                  NA             A VCF containing the sites and alleles you want to validate. Restricted to *BI-Allelic* sites

Optional Outputs
--out / -o                         stdout         An output file created by the walker. Will overwrite contents if file exists

Optional Parameters
--target_reference / -target_ref   NA             The reference to which reads in the source file should be aligned. Alongside this reference should sit index files generated by bwa index -a bwtsw. If unspecified, will default to the reference specified via the -R argument.
--virtualPrimerSize                20             Size of the virtual primer to use for lower-casing regions with low specificity

Optional Flags
--doNotUseBWA                      false          Do not use BWA, lower-case repeats only
--filterMonomorphic                false          Monomorphic sites in the mask file will be treated as filtered
--ignoreComplexEvents              false          Ignore complex genomic records.
--lowerCaseSNPs                    false          Lower case SNPs rather than replacing with 'N'
--onlyOutputValidAmplicons         false          Only output valid sequences.

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--doNotUseBWA

Do not use BWA, lower-case repeats only

boolean  false


--filterMonomorphic

Monomorphic sites in the mask file will be treated as filtered

boolean  false


--ignoreComplexEvents

Ignore complex genomic records.
If ignoreComplexEvents is true, the output fasta file will contain only sequences coming from SNPs and Indels. Complex substitutions will be ignored.

boolean  false


--lowerCaseSNPs

Lower case SNPs rather than replacing with 'N'

boolean  false


--MaskAlleles

A VCF containing the sites you want to MASK from the designed amplicon (e.g. by Ns or lower-cased bases)
A VCF file containing variants to be masked. A mask variant overlapping a validation site will be ignored at the validation site.

--MaskAlleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

R RodBinding[VariantContext]


--onlyOutputValidAmplicons

Only output valid sequences.
If onlyOutputValidAmplicons is true, the output fasta file will contain only valid sequences. Useful for producing delivery-ready files.

boolean  false


--out / -o

An output file created by the walker. Will overwrite contents if file exists

PrintStream  stdout


--ProbeIntervals

A collection of intervals in table format with optional names that represent the intervals surrounding the probe sites amplicons should be designed for
A Table-formatted file listing amplicon contig, start, stop, and a name for the amplicon (or probe)

--ProbeIntervals binds reference ordered data. This argument supports ROD files of the following types: BEDTABLE, TABLE

R RodBinding[TableFeature]
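
As an illustration of what the interval list carries, the sketch below derives the four fields (contig, start, stop, name) from a list of validation sites. It is an assumption-laden example rather than anything shipped with GATK: the function name `amplicon_intervals` and the `flank` size are hypothetical, the `contig_position` naming merely mirrors the names seen in the sample output, and the exact on-disk layout of the table file is not described on this page, so the code stops at producing the field values.

```python
from typing import Iterable, List, Tuple


def amplicon_intervals(
    sites: Iterable[Tuple[str, int]], flank: int = 100
) -> List[Tuple[str, int, int, str]]:
    """Return (contig, start, stop, name) for a window of `flank` bases per side.

    `flank` is a hypothetical choice; pick a value that suits the assay.
    Starts are clamped to 1. Stops are not clamped to the contig length,
    because contig lengths are not known to this sketch.
    """
    rows = []
    for contig, pos in sites:
        start = max(1, pos - flank)
        stop = pos + flank
        rows.append((contig, start, stop, f"{contig}_{pos}"))
    return rows
```

For example, `amplicon_intervals([("20", 207414)])` yields a single 201-base window centred on the site and named `20_207414`.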


--target_reference / -target_ref

The reference to which reads in the source file should be aligned. Alongside this reference should sit index files generated by bwa index -a bwtsw. If unspecified, will default to the reference specified via the -R argument.

File


--ValidateAlleles

A VCF containing the sites and alleles you want to validate. Restricted to *BI-Allelic* sites
A VCF file containing the bi-allelic sites for validation. Filtered records will prompt a warning, and will be flagged as filtered in the output FASTA.

--ValidateAlleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

R RodBinding[VariantContext]
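
Because the input is restricted to bi-allelic sites and filtered records are flagged, it can be useful to screen a VCF before running the tool. The following is a rough pre-check in plain Python, not a GATK feature: `check_validation_sites` is a hypothetical helper, it reads only the standard tab-separated VCF columns (ALT is column 5, FILTER is column 7), and it treats any comma in ALT as "more than one alternate allele". How the tool itself reacts to a multi-allelic record is not described here.

```python
from typing import Iterable, List, Tuple


def check_validation_sites(vcf_lines: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Report records that are multi-allelic or not marked PASS / '.'."""
    problems = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the column header, and blank lines
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 7:
            raise ValueError(f"expected at least 7 tab-separated columns: {line!r}")
        chrom, pos, alt, filt = cols[0], int(cols[1]), cols[4], cols[6]
        if "," in alt:
            problems.append((chrom, pos, "not bi-allelic"))
        if filt not in ("PASS", "."):
            problems.append((chrom, pos, "filtered"))
    return problems
```

A typical use would be `check_validation_sites(open("sites_to_validate.vcf"))`, reviewing the returned list before designing amplicons.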


--virtualPrimerSize

Size of the virtual primer to use for lower-casing regions with low specificity
BWA single-end alignment is used as a primer specificity proxy. Low-complexity regions (that don't align back to themselves as a best hit) are lowercased. This argument sets the size of the k-mer used for that alignment.

int  20  [ [ -∞  ∞ ] ]
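
To make the effect of the k-mer size concrete, here is a toy model of window-based lower-casing. It is not the tool's implementation: the real check relies on BWA alignment, whereas this sketch substitutes a simple, hypothetical rule (a k-mer counts as specific only if it occurs exactly once in a small reference string). The function names are invented for the example.

```python
from typing import Callable


def count_occurrences(text: str, kmer: str) -> int:
    """Count occurrences of kmer in text, including overlapping ones."""
    count, start = 0, text.find(kmer)
    while start != -1:
        count += 1
        start = text.find(kmer, start + 1)
    return count


def lowercase_nonspecific(seq: str, k: int, is_specific: Callable[[str], bool]) -> str:
    """Lower-case every base covered by at least one non-specific k-mer window."""
    to_lower = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if not is_specific(seq[i:i + k].upper()):
            for j in range(i, i + k):
                to_lower[j] = True
    return "".join(b.lower() if low else b for b, low in zip(seq, to_lower))


# Toy stand-in for the alignment check: unique in this tiny "reference".
reference = "AAAAAAAACGTGCATT"
unique_in_reference = lambda kmer: count_occurrences(reference, kmer) == 1

print(lowercase_nonspecific("AAAACGTG", 4, unique_in_reference))  # aaaaCGTG
```

In this model a larger `k` makes windows more likely to be unique, so fewer bases end up lower-cased; whether the same trend holds for a given genome and BWA settings has to be checked empirically.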



GATK version 3.1-1-g07a4bf8 built at 2014/03/18 07:00:36. GTD: NA