Creating Sequenom Probe Files
Introduction
This tool creates a FASTA sequence file in Sequenom design format from a reference, intervals, one or more rods (reference-ordered data) of variants, and (optionally) a set of mask sites.
Note that if there are multiple variants at a site, it takes the first one seen.
The sequence for each variant site is written as a separate FASTA record.
Example
java -jar path/to/GenomeAnalysisTK.jar \
  -T PickSequenomProbes \
  -R path/to/reference.fasta \
  -o path/to/output.fasta \
  -snp_mask path/to/snpMask.bed \
  [-B:<unique name>,<rod_type> <path/to/rod>]
You can specify any number of rods. Positions from the snp mask will be turned into N's on the new reference. SNPs from rods will be converted onto the new reference and indels will be appropriately annotated.
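The masking step described above can be illustrated with a minimal Python sketch. This is not the tool's implementation: the function name `apply_snp_mask`, the 1-based coordinate convention, and the example positions are all assumptions made for illustration.

```python
def apply_snp_mask(sequence, window_start, mask_positions):
    """Return `sequence` with every masked position replaced by 'N'.

    sequence       -- reference bases covering the probe window
    window_start   -- reference coordinate of sequence[0] (assumed 1-based)
    mask_positions -- reference coordinates to mask (same convention)
    """
    bases = list(sequence)
    for pos in mask_positions:
        offset = pos - window_start
        # Ignore mask sites that fall outside the probe window.
        if 0 <= offset < len(bases):
            bases[offset] = "N"
    return "".join(bases)


# Hypothetical window starting at reference position 101.
masked = apply_snp_mask("ACGTACGTAC", 101, [103, 108, 250])
print(masked)  # ACNTACGNAC
```

Mask sites outside the window (position 250 here) are simply skipped, so the same mask list can be reused for every probe.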
Important note: it is crucial to mask out known variant sites when designing Sequenom probes; otherwise, unmasked variants can cause probes to fail, which in turn leads to spurious results. For validation on human samples, for example, one should supply a dbSNP mask to PickSequenomProbes.
The SNP mask is optional, but using one currently requires a small pre-processing step to create it. The following command creates such a mask from any number of input rods:
java -jar path/to/GenomeAnalysisTK.jar \
  -T CreateSequenomMask \
  -o path/to/snpMask.bed \
  [-B:<unique name>,<rod_type> <path/to/rod>]
It is possible to un-mask variant sites from your mask track near the event (SNP or indel) you are designing a probe for. This is particularly useful for tracks with known SNP-near-indel errors. To do this, add the --noMaskWindow # flag (-nmw #), which leaves bases within # bases of the beginning or end of each event unmasked.
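As a rough illustration of the un-masking window, the Python sketch below drops mask sites that lie close to an event. It assumes that "within #" means an inclusive distance from either the event's start or its end; the tool's exact boundary handling may differ, and the function name `filter_mask_near_event` is hypothetical.

```python
def filter_mask_near_event(mask_positions, event_start, event_end, no_mask_window):
    """Return the mask positions that remain masked.

    A position is un-masked (dropped) when it lies within `no_mask_window`
    bases of the event's start or end. Inclusive distance is an assumption.
    """
    kept = []
    for pos in mask_positions:
        near_start = abs(pos - event_start) <= no_mask_window
        near_end = abs(pos - event_end) <= no_mask_window
        if not (near_start or near_end):
            kept.append(pos)
    return kept


# Hypothetical SNP at position 100 with a window of 3 bases.
remaining = filter_mask_near_event([95, 98, 100, 103, 110], 100, 100, 3)
print(remaining)  # [95, 110]
```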
Naming Convention
Naming Convention for SNPs
In the context of genotyping, Sequenom data is a matrix where the columns pertain to genotypes across multiple samples. For use with our downstream conversion tool, when associating an identifier with each SNP, please use the following naming convention:
project_identifier|c[chromosome]_p[position]_other_data
For example:
1KG_Pilot1_Validation|c1_p16787432_Q_22
indicates that the genotyped SNP comes from 1KG Pilot 1 validation, and is on chromosome 1, position 16,787,432.
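A small helper can build such identifiers consistently. The Python sketch below follows the layout of the example identifier above; the function name `make_snp_name` and its validation rule are assumptions for illustration, not part of the GATK.

```python
def make_snp_name(project, chromosome, position, other_data=None):
    """Build a SNP identifier of the form project|c<chrom>_p<pos>[_other_data]."""
    if "|" in project:
        # The pipe separates the project from the locus, so it must be unique.
        raise ValueError("project identifier must not contain '|'")
    name = f"{project}|c{chromosome}_p{int(position)}"
    if other_data:
        name += f"_{other_data}"
    return name


print(make_snp_name("1KG_Pilot1_Validation", "1", 16787432, "Q_22"))
# 1KG_Pilot1_Validation|c1_p16787432_Q_22
```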
Naming Convention for Indels
The indel naming convention requires an additional piece of information besides the chromosome and position: whether the call was an insertion or deletion. Therefore the indel naming convention is:
project_identifier|c[chromosome]_p[position]_g[I or D][length of indel]_other_data
For instance:
1KG_Pilot2_Validation|c1_p122878_gI4
indicates that the genotyped indel comes from 1KG Pilot 2 validation, is on chromosome 1, position 122,878, and is an insertion of length 4.
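For checking identifiers before they are passed downstream, the indel convention can be parsed with a regular expression. The Python sketch below is one possible reading of the convention, assuming chromosome names contain no underscores; the names `INDEL_NAME` and `parse_indel_name` are hypothetical.

```python
import re

# project|c<chrom>_p<pos>_g<I or D><length>[_other_data]
INDEL_NAME = re.compile(
    r"^(?P<project>[^|]+)\|c(?P<chrom>[^_]+)_p(?P<pos>\d+)"
    r"_g(?P<kind>[ID])(?P<length>\d+)(?:_(?P<other>.*))?$"
)


def parse_indel_name(name):
    """Split an indel identifier into its fields, or raise ValueError."""
    match = INDEL_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid indel identifier: {name!r}")
    return {
        "project": match.group("project"),
        "chromosome": match.group("chrom"),
        "position": int(match.group("pos")),
        "type": "insertion" if match.group("kind") == "I" else "deletion",
        "length": int(match.group("length")),
        "other_data": match.group("other"),  # None when absent
    }


fields = parse_indel_name("1KG_Pilot2_Validation|c1_p122878_gI4")
print(fields["type"], fields["length"])  # insertion 4
```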
Conversion of Sequenom .ped files to VCF
See PlinkToVCF
