Creating Sequenom Probe Files

Introduction

This tool creates a FASTA sequence file in Sequenom design format from a reference, a set of intervals, one or more rods of variants, and (optionally) a set of mask sites.
Note that if there are multiple variants at a site, only the first one seen is used.
The sequence for each variant site is output as a separate FASTA record.

Example

 java -jar path/to/GenomeAnalysisTK.jar \
  -T PickSequenomProbes \
  -R path/to/reference.fasta \
  -o path/to/output.fasta \
  -snp_mask path/to/snpMask.bed \
  [-B:<unique name>,<rod_type> <path/to/rod>]

You can specify any number of rods. Positions from the SNP mask are turned into N's on the new reference. SNPs from the rods are converted onto the new reference, and indels are annotated appropriately.
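To illustrate what masking does to a probe sequence, here is a minimal Python sketch. It is not the tool's implementation: the function name `mask_sequence` and its parameters are hypothetical, and it assumes the mask intervals follow the usual BED convention (0-based, half-open).

```python
def mask_sequence(sequence, region_start, mask_intervals):
    """Return `sequence` with every masked base replaced by 'N'.

    sequence       -- reference bases for the region, as a string
    region_start   -- 0-based reference coordinate of sequence[0]
    mask_intervals -- iterable of (start, end) pairs, assumed to be
                      0-based and half-open, as in a BED file
    """
    bases = list(sequence)
    region_end = region_start + len(bases)
    for start, end in mask_intervals:
        # Clip each mask interval to the region before applying it.
        for pos in range(max(start, region_start), min(end, region_end)):
            bases[pos - region_start] = "N"
    return "".join(bases)


# Region covers reference positions 100..109; the mask hits 102 and 108..119.
print(mask_sequence("ACGTACGTAC", 100, [(102, 103), (108, 120)]))  # ACNTACGTNN
```

Mask intervals that extend past the region are simply clipped, so a genome-wide mask can be applied to a short probe window.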

Important note: it is crucial to mask out known variant sites when designing Sequenom probes. Unmasked variants cause probes to fail, which in turn leads to spurious results. For validation on human samples, for example, supply a dbSNP mask to PickSequenomProbes.

The SNP mask is optional, but using one currently requires a small pre-processing step to build it. The following command creates such a mask from any number of input rods:

 java -jar path/to/GenomeAnalysisTK.jar \
  -T CreateSequenomMask \
  -o path/to/snpMask.bed \
  [-B:<unique name>,<rod_type> <path/to/rod>]

You can un-mask variant sites in your mask track that lie near the event (SNP or indel) you are designing a probe for. This is particularly useful for tracks with known SNP-near-indel errors. To do this, add the --noMaskWindow # flag (-nmw #), which leaves bases within # of the beginning or end of each event unmasked.
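The effect of the no-mask window can be sketched in Python as follows. This is an illustration only, not the tool's code: the function name `filter_mask_positions` is hypothetical, and it assumes that "within #" is inclusive and is measured separately from the first and last base of the event.

```python
def filter_mask_positions(mask_positions, event_start, event_end, no_mask_window):
    """Return the mask positions that remain masked after un-masking.

    A position is dropped from the mask when it lies within
    `no_mask_window` bases (assumed inclusive) of either the
    beginning or the end of the event.
    """
    kept = []
    for pos in mask_positions:
        near_start = abs(pos - event_start) <= no_mask_window
        near_end = abs(pos - event_end) <= no_mask_window
        if not (near_start or near_end):
            kept.append(pos)
    return kept


# Event spans positions 100..102, window of 2 bases.
print(filter_mask_positions([95, 99, 101, 104, 110], 100, 102, 2))  # [95, 110]
```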

Naming Convention

Naming Convention for SNPs

In the context of genotyping, Sequenom data is a matrix in which each column holds the genotypes at one site across multiple samples. For use with our downstream conversion tool, name each SNP using the following convention:

project_identifier|c[chromosome]_p[position]_other_data

For example:

1KG_Pilot1_Validation|c1_p16787432_Q_22

indicates that the genotyped SNP comes from 1KG Pilot 1 validation, and is on chromosome 1, position 16,787,432.
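As a sketch of how such names could be built and checked before submitting a design, the Python below formats and parses SNP identifiers. The helper names `format_snp_name` and `parse_snp_name` are hypothetical and are not part of the downstream conversion tool. The pattern assumes the project identifier contains no `|` and the chromosome name contains no underscore.

```python
import re

# project|c<chromosome>_p<position>[_other_data]
SNP_NAME = re.compile(
    r"^(?P<project>[^|]+)\|c(?P<chrom>[^_]+)_p(?P<pos>\d+)(?:_(?P<other>.*))?$"
)


def format_snp_name(project, chrom, pos, other=""):
    """Build a SNP identifier; `other` is appended only when non-empty."""
    name = f"{project}|c{chrom}_p{pos}"
    return f"{name}_{other}" if other else name


def parse_snp_name(name):
    """Split a SNP identifier into its fields, or raise ValueError."""
    match = SNP_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid SNP name: {name!r}")
    return {
        "project": match.group("project"),
        "chrom": match.group("chrom"),
        "pos": int(match.group("pos")),
        "other": match.group("other") or "",
    }


print(parse_snp_name("1KG_Pilot1_Validation|c1_p16787432_Q_22"))
```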

Naming Convention for Indels

The indel naming convention requires additional information besides the chromosome and position: whether the call was an insertion or a deletion, and the length of the indel. Therefore the indel naming convention is:

project_identifier|c[chromosome]_p[position]_g[I or D][length of indel]_other_data

For instance:

1KG_Pilot2_Validation|c1_p122878_gI4

indicates that the genotyped indel comes from 1KG Pilot 2 validation, is on chromosome 1, position 122,878, and is an insertion of length 4.
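The indel convention can be handled the same way. The following Python sketch uses hypothetical helper names (`format_indel_name`, `parse_indel_name`) and assumes, as for SNPs, that the project identifier contains no `|` and the chromosome name contains no underscore. The trailing other_data field is treated as optional, matching the example above.

```python
import re

# project|c<chromosome>_p<position>_g<I or D><length>[_other_data]
INDEL_NAME = re.compile(
    r"^(?P<project>[^|]+)\|c(?P<chrom>[^_]+)_p(?P<pos>\d+)"
    r"_g(?P<kind>[ID])(?P<length>\d+)(?:_(?P<other>.*))?$"
)


def format_indel_name(project, chrom, pos, kind, length, other=""):
    """Build an indel identifier; `kind` is 'I' (insertion) or 'D' (deletion)."""
    if kind not in ("I", "D"):
        raise ValueError("kind must be 'I' or 'D'")
    name = f"{project}|c{chrom}_p{pos}_g{kind}{length}"
    return f"{name}_{other}" if other else name


def parse_indel_name(name):
    """Split an indel identifier into its fields, or raise ValueError."""
    match = INDEL_NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid indel name: {name!r}")
    return {
        "project": match.group("project"),
        "chrom": match.group("chrom"),
        "pos": int(match.group("pos")),
        "kind": match.group("kind"),
        "length": int(match.group("length")),
        "other": match.group("other") or "",
    }


print(parse_indel_name("1KG_Pilot2_Validation|c1_p122878_gI4"))
```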

Conversion of Sequenom .ped files to VCF

See PlinkToVCF
