How do I run HAPSEG?

 

To run HAPSEG, you'll need R version 2.12.0 or higher with the following packages installed:

  • Rcpp
  • numDeriv
  • geneplotter
  • foreach

These packages can be installed from the R command line with the commands:so

     source("http://www.bioconductor.org/biocLite.R")

     biocLite(c("Rcpp", "numDeriv", "foreach", "geneplotter"))

You will also need to install the HAPSEG package supplied in the release bundle. The subdirectory R_packages contains both source and Windows binary versions. Instructions on how to install these packages can be found at http://cran.r-project.org/doc/manuals/R-admin.html#Installing-packages.

Some portions of HAPSEG can be parallelized via the use of the foreach package. This can be done before running by registering a foreach parallel backend. Please see the foreach documentation on how to do this if you would like this functionality. 

The HAPSEG package also requires a Java 6 runtime environment, with the command java available in the user's PATH. 

You will also need the following files to process your sample:

  • A SNP intensity file containing this sample. This can either be per-sample (default) or multi-sample, see below.
  • A birdseed clusters files, either processed by SNPPipeline (default) or raw from Birdseed, see below.
  • A segmented copy number file (e.g. from GLAD, CBS, etc).
  • Phased BEAGLE array files for the build of the genome you are working with. Default array files are available in the release bundle in the subdirectory etc/phasedBGL. If you would like to use your own files, they must be named in the format chrX.phased.bgl  where X is the chromosome number.

To run HAPSEG with all optional arguments using defaults, the following call is made:

 

RunHapSeg(plate.name, array.name, seg.fn, snp.fn, genome.build, 
      results.dir, platform, use.pop, impute.gt, plot.segfit, 
      merge.small, merge.close, min.seg.size, normal, out.p, 
      seg.merge.thresh, use.normal, adj.atten, phased.bgl.dir)

 

An explanation of these arguments follows:

Argument

Description

plate.name

Name of the sample plate

array.name

Name of the chip that was run

seg.fn

Segmented copy numer data file for this sample. If NULL, HAPSEG will segment the data for you.

snp.fn

SNP intensity file for this sample

genome.build

Which build of the human genome to use, supported values are currently 'hg18' and 'hg19'

results.dir

Path to put output results. If this does not already exist it will be created

platform

The chip type used. Supported values are currently 'SNP_250K_STY' and 'SNP_6.0'

use.pop

Population to run with (currently supported values: CEPH, CH, JA, YOR)

impute.gt

If TRUE will impute genotypes via BEAGLE. We recommend this be TRUE unless there's a reason to do otherwise

plot.segfit

If TRUE will plot JPG images of the segmentations fits

merge.small

If TRUE will merge small segments

merge.close

if TRUE will merge close segments

min.seg.size

Minimum segment size

normal

If TRUE this sample will be treated as a normal

out.p

Outlier probability

seg.merge.thresh

Segmentation threshold

use.normal

If TRUE will use a matched normal sample if one is provided

adj.atten

This argument has been disabled and will be removed from the function signature in the next major release.

phased.bgl.dir

A directory containing phased imputation reference panels for BEAGLE.

 

There are also several optional arguments. To use this, specify them by name e.g. verbose=TRUE.

Argument

Description

drop.x

If TRUE will remove the X chromosome from the calculation

drop.y

If TRUE will remove the Y chromosome from the calculation

calls.fn

If using a matched normal sample, a Birdseed calls file needs to be supplied

mn.sample

If using a matched sample, the name of that matched normal sample

out.file

The name of the output file. By default it will be plate.name_array.name_segdat.RData

calibrate.data

If TRUE will perform a calibration on the input data. If left on the default value (NULL), the calibration status will be inferred from the snp.file.parser argument.

clusters.fn

If calibrate.data is TRUE the user must supply a Birdseed clusters file

prev.theta.fn

An optional file storing the previous theta values

snp.file.parser

A function used to parse the snp.file argument, see below

clusters.file.parser

A function used to parse the clusters.fn, see below

verbose

If you would like verbose output, supply this argument as being TRUE

 

Input File Options

As mentioned above, some input files take an optional argument to tell HAPSEG how to parse them. For the snp.fn argument the default situation assumes a tab delimited file with two columns named A and B with the rownames corresponding to the chip's probeset IDs. This file can either be a text file or a saved RData file containing that data as the object dat. The FullSnpFileParser parser can be used if the user is inputting a .snp file created by SNPFileCreator. Lastly the AptSnpFileParser parser is provided for those users using the Affymetrix Power Tools. This mechanism allows HAPSEG to support any input file that can be made to conform to the default structure, users can write & pass in a function to handle any arbitrary situation.

For the clusters.fn file the default parser also assumes a tab delimited file with rownames being the probeset IDs. In this case there are 6 columns: AA.a, AB.a, BB.a, BB.b, AB.b and AA.b. This file structure is the same as what the SNPPipeline outputs as birdclusters.RData. An alternate parser is supplied, BirdseedClustersFileParser which assumes that the input is a raw clusters file output by Birdseed. As with snp.file.parser, any arbitrary format can be supported via the use of user supplied functions.

Using Affy Power Tools

For users outside of the Broad, Birdseed is available via the Affymetrix power tools with documentation available here. To process APT data properly you will need HAPSEG version 1.1.

To get APT data working with HAPSEG, you'll need to run both apt-probeset-summarize and apt-probeset-genotype. As an example, assume that the APT binaries are installed in ./bin, the library files are in ./LibFiles and your CEL files are in ./CEL. First run the following (You might want to tweak the arguments to --analysis).

apt-probeset-summarize

 

./bin/apt-probeset-summarize --cdf-file LibFiles/GenomeWideSNP_6.cdf --analysis quant-norm.sketch=50000,pm-only,med-polish,expr.genotype=true --out-dir aptSnpOutput ./CEL/*.CEL 

The important output file for this command will be aptSnpOutput/quant-norm.pm-only.med-polish.expr.summary.txt (note that this filename will differ if you changed the --analysis argument). This file will be your snp.fn argument to HAPSEG.

Next you will need to run birdseed:

./bin/apt-probeset-genotype 
     -o birdseed 
     -c ./LibFiles/GenomeWideSNP_6.cdf 
     --set-gender-method cn-probe-chrXY-ratio 
     --chrX-probes ./LibFiles/GenomeWideSNP_6.chrXprobes 
     --chrY-probes ./LibFiles/GenomeWideSNP_6.chrYprobes 
     --special-snps ./LibFiles/GenomeWideSNP_6.specialSNPs 
     --read-models-birdseed ./LibFiles/GenomeWideSNP_6.birdseed.models 
     -a birdseed-v1 
     --write-models ./*.CEL

This will provide two output files of note, birdseed/birdseed-v1.calls.txt and birdseed/birdseed-v1.snp-models.txt which will be your calls.fn and clusters.fn arguments, respectively. 

Lastly you will need to specify snp.file.parser=AptSnpFileParser and cluster.file.parser=BirdseedClustersFileParser.

If you are using APT, it is likely that you will not have an external segmentation file for your samples. If this is the case make sure to use the value NULL for the seg.fn argument.

Example

Using the example data in the release bundle we can run the data to recreate the mixing results as follows. This can be run on a multicore system by uncommenting the registerDoMC call and specifying the number of cores that you wish to use. Besides ABSOLUTE this code also requires the use of foreach and optionally doMC. This code will create a directory output which will contain a per-sample subdirectory containing all output as well as a subdirectory hapseg_logs which will contain per-sample text files containing the standard out being emitted from R.

DoHapseg <- function(scan, sif) {
  registerDoSEQ()
  library(HAPSEG)
  plate.name <- "DRAWS"
  snp.file.base <- "./paper_example"
  snp.file.post <- "snp_byAllele.RData"
  seg.fn <- "./paper_example/mix250K_seg_out.txt"
  clusters.fn <- "./paper_example/birdclusters.RData"
  
  results.dir.base <- "./output"
  phased.bgl.dir <- "./paper_example/phasedBGL/hg18"
  genome <- "hg18"
  platform <- "SNP_250K_STY"
  
  pop <- "CEPH"
  impute.gt <- FALSE
  plot.segfit <- TRUE
  merge.small <- TRUE
  merge.close <- TRUE
  min.seg.size <- 5
  normal <- FALSE
  out.p <- 0.001
  seg.merge.thresh <- 1e-10
  use.normal <- TRUE
  adj.atten <- FALSE
  snp.fn <- file.path(snp.file.base,
                      paste(scan, snp.file.post, sep="."))
  sif.info <- as.character(sif[scan, ])
  results.dir <- file.path(results.dir.base, scan, "hapseg")
  if (!file.exists(results.dir)) {
    dir.create(results.dir)
  }
  print(paste("Starting scan", scan, "at", results.dir))
  log.dir <- file.path(results.dir.base, "hapseg_logs")
  if (!file.exists(log.dir)) {
    dir.create(log.dir)
  }
  sink(file=file.path(log.dir, paste(scan, ".hapseg.out.txt", sep="")))
  RunHapSeg(plate.name, scan, seg.fn, snp.fn, genome, results.dir, platform, 
            pop,impute.gt, plot.segfit, merge.small, merge.close, 
            min.seg.size, normal, out.p, seg.merge.thresh, use.normal, 
            adj.atten, phased.bgl.dir, calibrate.data=TRUE, 
            clusters.fn=clusters.fn, drop.x=TRUE, verbose=TRUE)
  sink()
}

arrays.txt <- "./paper_example/mix250K_arrays.txt"
sif.txt <- "./paper_example/mix_250K_SIF.txt"
## read in array names
scans <- readLines(arrays.txt)[-1]
sif <- read.delim(sif.txt, as.is=TRUE)
library(foreach)
##library(doMC)
##registerDoMC(20)
foreach (scan=scans, .combine=c) %dopar% {
  DoHapseg(scan, sif)
}