Creates FASTA sequences for use in Sequenom or PCR utilities for site amplification and subsequent validation.
ValidationAmplicons consumes a VCF and an interval list and produces FASTA sequences from which PCR primers or probe sequences can be designed. In addition, ValidationAmplicons uses BWA to check the specificity of tracts of bases within the output amplicon and lower-cases non-specific tracts, allows users to provide sites to mask out, and reports reasons why a site may fail validation (nearby variation, for example).
Requires a VCF containing alleles to design amplicons towards, a VCF of variants to mask out of the amplicons, and an interval list defining the size of the amplicons around the sites to be validated
Output is a FASTA-formatted file with some modifications at probe sites. For instance:
>20:207414 INSERTION=1,VARIANT_TOO_NEAR_PROBE=1, 20_207414
CCAACGTTAAGAAAGAGACATGCGACTGGGTgcggtggctcatgcctggaaccccagcactttgggaggccaaggtgggc[A/G*]gNNcacttgaggtcaggagtttgagaccagcctggccaacatggtgaaaccccgtctctactgaaaatacaaaagttagC
>20:792122 Valid 20_792122
TTTTTTTTTagatggagtctcgctcttatcgcccaggcNggagtgggtggtgtgatcttggctNactgcaacttctgcct[-/CCC*]cccaggttcaagtgattNtcctgcctcagccacctgagtagctgggattacaggcatccgccaccatgcctggctaatTT
>20:994145 Valid 20_994145
TCCATGGCCTCCCCCTGGCCCACGAAGTCCTCAGCCACCTCCTTCCTGGAGGGCTCAGCCAAAATCAGACTGAGGAAGAAG[AAG/-*]TGGTGGGCACCCACCTTCTGGCCTTCCTCAGCCCCTTATTCCTAGGACCAGTCCCCATCTAGGGGTCCTCACTGCCTCCC
>20:1074230 SITE_IS_FILTERED=1, 20_1074230
ACCTGATTACCATCAATCAGAACTCATTTCTGTTCCTATCTTCCACCCACAATTGTAATGCCTTTTCCATTTTAACCAAG[T/C*]ACTTATTATAtactatggccataacttttgcagtttgaggtatgacagcaaaaTTAGCATACATTTCATTTTCCTTCTTC
>20:1084330 DELETION=1, 20_1084330
CACGTTCGGcttgtgcagagcctcaaggtcatccagaggtgatAGTTTAGGGCCCTCTCAAGTCTTTCCNGTGCGCATGG[GT/AC*]CAGCCCTGGGCACCTGTNNNNNNNNNNNNNTGCTCATGGCCTTCTAGATTCCCAGGAAATGTCAGAGCTTTTCAAAGCCC
are amplicon sequences resulting from running the tool. The flags (preceding the sequence itself) can be:
Valid // amplicon is valid
SITE_IS_FILTERED=1 // validation site is not marked 'PASS' or '.' in its filter field ("you are trying to validate a filtered variant")
VARIANT_TOO_NEAR_PROBE=1 // there is a variant too near to the variant to be validated, potentially shifting the mass-spec peak
MULTIPLE_PROBES=1, // multiple variants to be validated found inside the same amplicon
DELETION=6,INSERTION=5, // 6 deletions and 5 insertions found inside the amplicon region (from the "mask" VCF), will be potentially difficult to validate
DELETION=1, // deletion found inside the amplicon region, could shift mass-spec peak
START_TOO_CLOSE, // variant is too close to the start of the amplicon region to give sequenom a good chance to find a suitable primer
END_TOO_CLOSE, // variant is too close to the end of the amplicon region to give sequenom a good chance to find a suitable primer
NO_VARIANTS_FOUND, // no variants found within the amplicon region
INDEL_OVERLAPS_VALIDATION_SITE, // an insertion or deletion interferes directly with the site to be validated (i.e. insertion directly preceding or postceding, or a deletion that spans the site itself)
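The header and flag conventions described above lend themselves to simple post-processing. The following is a minimal Python sketch, not part of GATK, that assumes each record occupies exactly two lines (a header of the form `>locus flags name`, then the sequence) and that the site under validation appears as `[ref/alt*]` in the sequence. The function names `parse_header` and `parse_amplicons` are hypothetical.

```python
import re

# Matches the bracketed validation site, e.g. [A/G*], [-/CCC*] or [AAG/-*].
ALLELE_RE = re.compile(r"\[([^/\]]*)/([^*\]]*)\*\]")


def parse_header(header):
    """Split a '>locus flags name' header into (locus, [flags], name)."""
    locus, flags, name = header.lstrip(">").split()
    # Flags are comma-separated and may carry a trailing comma.
    flag_list = [f for f in flags.split(",") if f]
    return locus, flag_list, name


def parse_amplicons(text):
    """Parse two-line amplicon records into a list of dictionaries."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    for header, seq in zip(lines[0::2], lines[1::2]):
        locus, flags, name = parse_header(header)
        match = ALLELE_RE.search(seq)
        records.append({
            "locus": locus,
            "flags": flags,
            "name": name,
            "valid": flags == ["Valid"],
            "alleles": (match.group(1), match.group(2)) if match else None,
            "sequence": seq,
        })
    return records


# Sequences shortened here for readability; headers are taken from the example output.
example = (
    ">20:207414 INSERTION=1,VARIANT_TOO_NEAR_PROBE=1, 20_207414\n"
    "ggtgggc[A/G*]gNNcac\n"
    ">20:792122 Valid 20_792122\n"
    "ctgcct[-/CCC*]cccagg\n"
)
for rec in parse_amplicons(example):
    print(rec["name"], rec["valid"], rec["flags"], rec["alleles"])
```

A header with more or fewer than three whitespace-separated fields would raise a `ValueError` in this sketch, which is a deliberate choice so that unexpected layouts are noticed rather than silently misparsed.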
java -jar GenomeAnalysisTK.jar -T ValidationAmplicons -R /humgen/1kg/reference/human_g1k_v37.fasta -L:table interval_table.table -ProbeIntervals:table interval_table.table -ValidateAlleles:vcf sites_to_validate.vcf -MaskAlleles:vcf mask_sites.vcf --virtualPrimerSize 30 -o probes.fasta
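When the tool is driven from a script, it can help to assemble the invocation as an argument list. The sketch below is only an illustration: the helper `build_command` is hypothetical, and it uses only the flags that appear in the example command above. It builds the list and does not run anything.

```python
def build_command(reference, intervals, validate_vcf, mask_vcf, output,
                  virtual_primer_size=None, gatk_jar="GenomeAnalysisTK.jar"):
    """Return the ValidationAmplicons invocation as a list of arguments."""
    cmd = [
        "java", "-jar", gatk_jar,
        "-T", "ValidationAmplicons",
        "-R", reference,
        # The example command passes the same table to -L and -ProbeIntervals.
        "-L:table", intervals,
        "-ProbeIntervals:table", intervals,
        "-ValidateAlleles:vcf", validate_vcf,
        "-MaskAlleles:vcf", mask_vcf,
    ]
    if virtual_primer_size is not None:
        cmd += ["--virtualPrimerSize", str(virtual_primer_size)]
    cmd += ["-o", output]
    return cmd


cmd = build_command(
    reference="/humgen/1kg/reference/human_g1k_v37.fasta",
    intervals="interval_table.table",
    validate_vcf="sites_to_validate.vcf",
    mask_vcf="mask_sites.vcf",
    output="probes.fasta",
    virtual_primer_size=30,
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where Java and the GATK jar are available.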
These Read Filters are automatically applied to the data by the Engine before processing by ValidationAmplicons.
This tool applies the following downsampling settings by default.
The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
|Argument name(s)||Default value||Summary|
||NA||A VCF containing the sites you want to MASK from the designed amplicon (e.g. by Ns or lower-cased bases)|
||NA||A collection of intervals in table format with optional names that represent the intervals surrounding the probe sites amplicons should be designed for|
||NA||A VCF containing the sites and alleles you want to validate. Restricted to *BI-Allelic* sites|
||stdout||An output file created by the walker. Will overwrite contents if file exists|
||NA||The reference to which reads in the source file should be aligned. Alongside this reference should sit index files generated by bwa index -a bwtsw. If unspecified, will default to the reference specified via the -R argument.|
||20||Size of the virtual primer to use for lower-casing regions with low specificity|
||false||Do not use BWA, lower-case repeats only|
||false||Monomorphic sites in the mask file will be treated as filtered|
||false||Ignore complex genomic records.|
||false||Lower case SNPs rather than replacing with 'N'|
||false||Only output valid sequences.|
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
Do not use BWA, lower-case repeats only
Monomorphic sites in the mask file will be treated as filtered
Ignore complex genomic records.
If ignoreComplexEvents is true, the output fasta file will contain only sequences coming from SNPs and Indels. Complex substitutions will be ignored.
Lower case SNPs rather than replacing with 'N'
A VCF containing the sites you want to MASK from the designed amplicon (e.g. by Ns or lower-cased bases)
A VCF file containing variants to be masked. A mask variant overlapping a validation site will be ignored at the validation site.
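The masking behavior described here (replace masked bases with 'N', or lower-case them, and ignore a mask variant that overlaps the validation site) can be illustrated with a small Python sketch. This is not GATK's implementation: `apply_mask` is a hypothetical helper, it works on 0-based offsets within a plain sequence string, and it covers single-base mask sites only.

```python
def apply_mask(sequence, mask_offsets, validation_offset, lower_case=False):
    """Mask single bases at the given 0-based offsets.

    A mask offset equal to validation_offset is skipped, mirroring the rule
    that a mask variant overlapping the validation site is ignored there.
    With lower_case=True the base is lower-cased instead of replaced by 'N'.
    """
    bases = list(sequence)
    for offset in mask_offsets:
        if offset == validation_offset:
            continue
        bases[offset] = bases[offset].lower() if lower_case else "N"
    return "".join(bases)


print(apply_mask("ACGTACGT", [1, 4, 6], validation_offset=4))
print(apply_mask("ACGTACGT", [1, 4, 6], validation_offset=4, lower_case=True))
```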
Only output valid sequences.
If onlyOutputValidAmplicons is true, the output fasta file will contain only valid sequences. Useful for producing delivery-ready files.
An output file created by the walker. Will overwrite contents if file exists
A collection of intervals in table format with optional names that represent the intervals surrounding the probe sites amplicons should be designed for
A Table-formatted file listing amplicon contig, start, stop, and a name for the amplicon (or probe)
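As an illustration of the four fields listed above, the following Python sketch derives an amplicon interval around a validation site. It is a sketch under stated assumptions: `amplicon_interval` and its `flank` parameter are hypothetical, the `contig_position` naming simply mirrors names such as `20_207414` seen in the example output, and the exact on-disk layout of the table file is not specified here and is not produced by this code.

```python
def amplicon_interval(contig, pos, flank):
    """Return (contig, start, stop, name) for a site with `flank` bases on each side.

    Coordinates are 1-based and inclusive; the start is clamped to 1 so that
    sites near the beginning of a contig do not yield a non-positive start.
    """
    start = max(1, pos - flank)
    stop = pos + flank
    name = f"{contig}_{pos}"
    return contig, start, stop, name


for site in [("20", 207414), ("20", 792122)]:
    print(amplicon_interval(site[0], site[1], flank=100))
```

Keeping the site away from either end of the interval matters because the tool flags sites that are too close to the amplicon boundaries (START_TOO_CLOSE, END_TOO_CLOSE).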
The reference to which reads in the source file should be aligned. Alongside this reference should sit index files generated by bwa index -a bwtsw. If unspecified, will default to the reference specified via the -R argument.
A VCF containing the sites and alleles you want to validate. Restricted to *BI-Allelic* sites
A VCF file containing the bi-allelic sites for validation. Filtered records will prompt a warning, and will be flagged as filtered in the output FASTA.
Size of the virtual primer to use for lower-casing regions with low specificity
BWA single-end alignment is used as a proxy for primer specificity. Low-complexity regions (those that do not align back to themselves as a best hit) are lower-cased. This argument sets the size of the k-mer used for alignment.
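The idea of lower-casing tracts whose k-mers are not specific can be sketched in Python as follows. This is a toy illustration only: the details of GATK's procedure are not given here, `lowercase_nonspecific` is a hypothetical helper, and the `is_specific` callback is a stand-in for the BWA alignment check (the example uses a trivial homopolymer test, which is an assumption made purely for demonstration).

```python
def lowercase_nonspecific(sequence, k, is_specific):
    """Lower-case every base covered by at least one non-specific k-mer.

    `is_specific` is a callable taking an upper-case k-mer and returning True
    when that k-mer is considered specific. All other bases are upper-cased.
    """
    lower = [False] * len(sequence)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k].upper()
        if not is_specific(kmer):
            for j in range(i, i + k):
                lower[j] = True
    return "".join(
        base.lower() if flag else base.upper()
        for base, flag in zip(sequence, lower)
    )


def not_homopolymer(kmer):
    """Toy specificity test: a k-mer made of a single repeated base fails."""
    return len(set(kmer)) > 1


print(lowercase_nonspecific("GATTACAAAAAGCT", 4, not_homopolymer))
```

In this sketch a larger `k` makes each window cover more bases, so a single non-specific window lower-cases a longer tract.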
int 20 [ [ -∞ ∞ ] ]
GATK version 3.0-0-g6bad1c6 built at 2014/03/06 06:38:04. GTD: NA