Creating Amplicon Sequences
Posted in Methods and Workflows | Last updated on 2012-09-28 16:56:14


Comments (0)

Creating Amplicon Sequences

Note that earlier versions of the GATK used a different tool.

For a complete, detailed argument reference, refer to the GATK document page here

Contents

Introduction

This tool generates amplicon sequences for use with the Sequenom primer design tool. The output of this tool is fasta-formatted, where the characters [A/B] specify the allele to be probed (see Validation Amplicons Output further below). It can mask nearby variation (either by 'N' or by lower-casing characters), and can try to restrict sequenom design to regions of the amplicon likely to generate a highly specific primer. This tool will also flag sites with properties that could shift the mass-spec peak from its expected value, such as indels in the amplicon sequence, SNPs within 4 bases of the variant attempting to be probed, or multiple variants selected for validation falling into the same amplicon.

Lowercase and Ns

Ns in the amplicon sequence instructs primer design software (such as Sequenom) not to use that base in the primer: any primer will fall entirely before, or entirely after, that base. Lower-case letters instruct the design software to try to avoid using the base (presumably by applying a penalty for doing so), but will not prevent it from doing so if a good primer (i.e. a primer with suitable melting temperature and low probability of hairpin formation) is found.

BWA Bindings

ValidationAmplicons relies on the GATK Sting BWA/C bindings to assess the specificity of potential primers. The wiki page for Sting BWA/C bindings contains required information about how to download the appropriate version of BWA, how to create a BWT reference, and how to set your classpath appropriately to run this tool. If you have not followed the directions to set up the BWA/C bindings, you will not be able to create validation amplicon sequences using the GATK. There is an argument (see below) to disable the use of BWA, and lower repeats within the amplicon only. Use of this argument is not recommended.

Running Validation Amplicons

Validation Amplicons requires three input files: a VCF of alleles you want to validate, a VCF of variants you want to mask, and a Table of intervals around the variants describing the size of the amplicons. For instance:

Alleles to Validate

##fileformat=VCFv4.0
#CHROM      POS     ID      REF     ALT     QUAL    FILTER  INFO
20          207414  .       G    A       85.09   PASS    .   // SNP to validate
20          792122  .    TCCC    T       22.24   PASS    .   // DEL to validate
20          994145  .       G    GAAG    48.21   PASS    .   // INS to validate
20          1074230 .       C    T       2.29     QD     .   // SNP to validate (but filtered)
20          1084330 .      AC   GT       42.21   PASS    .   // MNP to validate 

Interval Table

HEADERpos               name
20:207334-207494        20_207414
20:792042-792202        20_792122
20:994065-994225        20_994145
20:1074150-1074310      20_1074230
20:1084250-1084410      20_1084330

Alleles to Mask

##fileformat=VCFv4.1
#CHROM  POS      ID        REF         ALT     QUAL       FILTER         INFO
20      207414   .         G           A       77.12      PASS            .
20      207416   .         A           AGGC    49422.34   PASS            .
20      792076   .         A           G       2637.15    HaplotypeScore  .
20      792080   .         T           G       161.83     PASS            .
20      792087   .         CGGT        C       179.84     ReadPosRankSum  .
20      792106   .         C           G       32.59      PASS            .
20      792140   .         C           G       409.75     PASS            .
20      1084319  .         T          A,C      22.24      PASS            .
20      1084348  .  TACCACCCCACACA     T      482.84      PASS            .

Validation Amplicons Output

The output from Validation Amplicons is a fasta-formatted file, with a small adaptation to represent the site being probed. Using the test files above, the output of the command

java -jar $GATK/dist/GenomeAnalysisTK.jar \
-T ValidationAmplicons \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-BTI ProbeIntervals \
--ProbeIntervals:table interval_table.table \
--ValidateAlleles:vcf sites_to_validate.vcf \
--MaskAlleles:vcf mask_sites.vcf \
--virtualPrimerSize 30 \
-o probes.fasta \
-l WARN

is

>20:207414 INSERTION=1,VARIANT_TOO_NEAR_PROBE=1, 20_207414
CCAACGTTAAGAAAGAGACATGCGACTGGGTgcggtggctcatgcctggaaccccagcactttgggaggccaaggtgggc[A/G*]gNNcacttgaggtcaggagtttgagaccagcctggccaacatggtgaaaccccgtctctactgaaaatacaaaagttagC
>20:792122 Valid 20_792122
TTTTTTTTTagatggagtctcgctcttatcgcccaggcNggagtgggtggtgtgatcttggctNactgcaacttctgcct[-/CCC*]cccaggttcaagtgattNtcctgcctcagccacctgagtagctgggattacaggcatccgccaccatgcctggctaatTT
>20:994145 Valid 20_994145
TCCATGGCCTCCCCCTGGCCCACGAAGTCCTCAGCCACCTCCTTCCTGGAGGGCTCAGCCAAAATCAGACTGAGGAAGAAG[AAG/-*]TGGTGGGCACCCACCTTCTGGCCTTCCTCAGCCCCTTATTCCTAGGACCAGTCCCCATCTAGGGGTCCTCACTGCCTCCC
>20:1074230 SITE_IS_FILTERED=1, 20_1074230
ACCTGATTACCATCAATCAGAACTCATTTCTGTTCCTATCTTCCACCCACAATTGTAATGCCTTTTCCATTTTAACCAAG[T/C*]ACTTATTATAtactatggccataacttttgcagtttgaggtatgacagcaaaaTTAGCATACATTTCATTTTCCTTCTTC
>20:1084330 DELETION=1, 20_1084330
CACGTTCGGcttgtgcagagcctcaaggtcatccagaggtgatAGTTTAGGGCCCTCTCAAGTCTTTCCNGTGCGCATGG[GT/AC*]CAGCCCTGGGCACCTGTNNNNNNNNNNNNNTGCTCATGGCCTTCTAGATTCCCAGGAAATGTCAGAGCTTTTCAAAGCCC

Note that SNPs have been masked with 'N's, filtered 'mask' variants do not appear, the insertion has been flanked by Ns, the unfiltered deletion has been replaced by Ns, and the filtered site in the validation VCF is not marked as valid. In addition, bases that fall inside at least one non-unique 30-mer (meaning no multiple MQ0 alignments using BWA) are lower-cased. The identifier for each sequence is the position of the allele to be probed, a 'validation status' (defined below), and a string representing the amplicon. Validation status values are:

Valid                     // amplicon is valid
SITE_IS_FILTERED=1        // validation site is not marked 'PASS' or '.' in its filter field ("you are trying to validate a filtered variant")
VARIANT_TOO_NEAR_PROBE=1  // there is a variant too near to the variant to be validated, potentially shifting the mass-spec peak
MULTIPLE_PROBES=1,        // multiple variants to be validated found inside the same amplicon
DELETION=6,INSERTION=5,   // 6 deletions and 5 insertions found inside the amplicon region (from the "mask" VCF), will be potentially difficult to validate
DELETION=1,               // deletion found inside the amplicon region, could shift mass-spec peak
START_TOO_CLOSE,          // variant is too close to the start of the amplicon region to give sequenom a good chance to find a suitable primer
END_TOO_CLOSE,            // variant is too close to the end of the amplicon region to give sequenom a good chance to find a suitable primer
NO_VARIANTS_FOUND,        // no variants found within the amplicon region
INDEL_OVERLAPS_VALIDATION_SITE, // an insertion or deletion interferes directly with the site to be validated (i.e. insertion directly preceding or postceding, or a deletion that spans the site itself)

Warnings During Traversal

The files provided to Validation Amplicons should be such that all generated amplicons are valid. That means:

There are no variants within 4bp of the site to be validated
There are no indels in the amplicon region
Amplicon windows do not include other sites to be probed
Amplicon windows are not too short, and the variant therein is not within 50bp of either edge
All amplicon windows contain a variant to be validated
Variants to be validated are unfiltered or pass filters

The tool will warn you each time any of these conditions are not met.


Return to top Comment on this article in the forum