CNV Pipeline Overview

CNV Discovery Pipeline

The CNVDiscoveryPipeline Queue script implements a pipeline for discovering and genotyping copy number variants by seeding on read depth of coverage.

The CNV discovery pipeline is large and complex and recursively invokes the Queue workflow manager during many of the processing steps. The pipeline is divided into 12 stages, and processing within many of the stages is further parallelized per chromosome. In many cases, stage N+1 on a given chromosome only depends on stage N for the same chromosome, so different stages of the pipeline may be running for different chromosomes simultaneously.

The pipeline state is maintained in a run directory. The run directory has a tree structure that matches the structure of the pipeline processing. There are subdirectories named cnv_stageN (where N is 1 to 12), and within these there are subdirectories named seq_S (where S is the name of a chromosome or contig from the reference genome).
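
As a rough illustration of this layout (a sketch only; the run directory name and the sequence names below are examples, not values required by the pipeline), the per-stage, per-chromosome directories can be enumerated like this:

    import os

    run_dir = "cnv_run1"                                      # illustrative run directory
    sequences = [str(c) for c in range(1, 23)] + ["X", "Y"]   # illustrative sequence names

    for stage in range(1, 13):
        stage_dir = os.path.join(run_dir, "cnv_stage%d" % stage)
        for seq in sequences:
            # Chromosome-parallel stages work in one seq_S subdirectory per sequence.
            print(os.path.join(stage_dir, "seq_%s" % seq))

    # Genome-wide results live in other subdirectories, for example the stage 5
    # sample evaluation outputs under cnv_stage5/eval (see stage 5 below).
    print(os.path.join(run_dir, "cnv_stage5", "eval", "SelectedSamples.list"))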

The 12 stages of the current CNV pipeline are:

  • Stage 1: Define initial seed windows
  • This stage creates VCF files with records that correspond to the initial seed windows.

    This stage is parallel per-chromosome.

  • Stage 2: Genotyping of initial seed windows
  • This stage runs initial genotyping on the seed windows from stage 1.

    This stage runs parallel per-chromosome and does parallel genotyping within each chromosome. (Parallel genotyping processes many VCF records in parallel based on the -genotypingParallelRecords parameter.)

  • Stage 3: Merging seed windows
  • This stage promotes seed windows with evidence of potential polymorphism. Adjacent or overlapping windows that are compatible (each input sample appears to have the same copy number in both windows) are merged to increase power, as illustrated in the sketch after this stage list.

    This stage runs parallel per-chromosome.

  • Stage 4: Genotyping the candidate windows
  • The seed windows and merged windows from stage 3 are genotyped. Sites that appear to be duplicates are removed and sites with evidence of polymorphism are promoted to stage 5.

    This stage runs parallel per-chromosome and does parallel genotyping within each chromosome.

  • Stage 5: Sample evaluation
  • This stage evaluates the number of variants called in each input sample (genome-wide) to detect samples with excessive call rates. Outlier samples with excessive "noise" in the read depth signal can create many false calls. Samples with too many calls are eliminated from the subsequent stages of the pipeline (but are re-introduced during genotyping in stage 10). Samples are removed from discovery if the number of variants called per sample is more than 3 MADs (median absolute deviations) above the median; a sketch of this cutoff appears after this stage list.

    The main output from stage 5 is a list of samples to be used in subsequent stages (located in cnv_stage5/eval/SelectedSamples.list). This directory also contains a tab-delimited report on the number of variants called in each input sample and a plot showing the distribution.

    This stage is single-threaded. All jobs from stage 4 must complete before stage 5 will run.

  • Stage 6: Genotyping candidate windows in the filtered discovery samples
  • This stage re-genotypes the current set of candidate windows using the filtered list of discovery samples from stage 5. Only sites with evidence of polymorphism in the filtered discovery samples are promoted to stage 7.

    This stage runs parallel per-chromosome and does parallel genotyping within each chromosome.

  • Stage 7: Boundary refinement
  • This stage considers each candidate site independently and runs a hill-climbing algorithm to attempt to determine the optimal boundaries for each CNV segment. This method is sometimes referred to as BRIG (boundary refinement by iterative genotyping).

    This stage runs parallel per-chromosome, and within each chromosome boundary refinement is run in parallel on each site individually.

  • Stage 8: Genotyping refined sites
  • This stage re-genotypes the final site boundaries determined from stage 7 in the discovery samples. Any duplicate sites (after boundary refinement) are removed. Sites with evidence of polymorphism are promoted to stage 9.

    This stage runs parallel per-chromosome and does parallel genotyping within each chromosome.

  • Stage 9: Merging adjacent sites
  • This stage looks for adjacent or nearby sites that may represent redundant calls of the same CNV. All sites within 1 Mb of each other are tested for compatibility and prospective merged sites are created (and then evaluated in stage 10).

    This stage runs parallel per-chromosome.

  • Stage 10: Genotyping the merged sites
  • This stage re-genotypes the candidate sites from stage 9, including the prospective merged sites. This stage uses all of the input samples (not just the discovery samples selected in stage 5). Any duplicate sites are removed and sites with evidence of polymorphism are promoted to stage 11.

    This stage runs parallel per-chromosome and does parallel genotyping within each chromosome.

  • Stage 11: Initial site filtering
  • This stage applies a default set of site filtering criteria. Sites passing these filters are promoted to stage 12.

    This stage runs parallel per-chromosome.

  • Stage 12: Final genotyping
  • The sites selected in stage 11 are merged into one genome-wide VCF, genotyped, and annotated with the default set of QC annotations.

    This stage runs parallel genotyping on the merged VCF file.
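
The window merging in stage 3 (and the site merging in stage 9) follows a familiar interval-merging pattern: walk the windows in coordinate order and extend the current run of windows only when the next window is close enough and compatible. The Python sketch below shows the general idea only; the window representation and the windows_compatible test are hypothetical placeholders, not the pipeline's actual criteria.

    # Sketch of the merging idea used in stages 3 and 9; the dict keys and the
    # compatibility test are illustrative placeholders.

    def windows_compatible(w1, w2):
        # Placeholder: in the pipeline, windows are merged only when each input
        # sample appears to have the same copy number in both windows.
        return w1["copy_numbers"] == w2["copy_numbers"]

    def merge_windows(windows, max_gap=0):
        merged = []
        for w in sorted(windows, key=lambda w: (w["chrom"], w["start"])):
            if (merged
                    and w["chrom"] == merged[-1]["chrom"]
                    and w["start"] - merged[-1]["end"] <= max_gap
                    and windows_compatible(merged[-1], w)):
                merged[-1]["end"] = max(merged[-1]["end"], w["end"])
            else:
                merged.append(dict(w))
        return merged

With max_gap=0 this merges only adjacent or overlapping windows (as in stage 3); with max_gap=1000000 it would consider sites within 1 Mb of each other (as in stage 9).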

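The stage 5 cutoff (median plus 3 MADs of the per-sample call counts) is simple to write down. Below is a minimal sketch in Python; the sample names and call counts are made up for illustration:

    from statistics import median

    def mad_cutoff(calls_per_sample, n_mads=3):
        # Median + n_mads * MAD of the genome-wide call counts (the stage 5 cutoff).
        counts = list(calls_per_sample.values())
        m = median(counts)
        mad = median(abs(c - m) for c in counts)
        return m + n_mads * mad

    def select_samples(calls_per_sample):
        cutoff = mad_cutoff(calls_per_sample)
        return [s for s, c in calls_per_sample.items() if c <= cutoff]

    # Illustrative per-sample genome-wide call counts.
    counts = {"sampleA": 120, "sampleB": 131, "sampleC": 118, "noisy1": 560}
    print(select_samples(counts))   # noisy1 exceeds the cutoff and is dropped
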
Job dependencies and avoiding pipeline starvation

The job dependencies in the CNV pipeline are designed to allow a high degree of parallel processing. For most of the pipeline, within each chromosome each processing stage depends only on the previous stage for that chromosome (for example, stage 4 for chromosome 1 can start as soon as stage 3 for chromosome 1 completes, regardless of the state of the other chromosomes). The main exception to this rule is stage 5, which does genome-wide assessment of the call rate in each sample. Stage 5 will not start until all chromosomes have finished stage 4.
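
As a rough illustration of this dependency structure (a sketch only, not the pipeline's actual scheduling code), each job can be treated as a (stage, chromosome) pair:

    # Rough sketch of the dependency rule described above; genome-wide stages
    # are represented with chrom=None. Not the pipeline's actual scheduler.
    CHROMS = [str(c) for c in range(1, 23)] + ["X", "Y"]   # illustrative

    def prerequisites(stage, chrom):
        if stage == 1:
            return []
        if stage == 5:
            # Genome-wide sample evaluation: waits for stage 4 on every chromosome.
            return [(4, c) for c in CHROMS]
        if stage == 6:
            # Per-chromosome again, but downstream of the genome-wide stage 5.
            return [(5, None)]
        if stage == 12:
            # Final genotyping runs on one merged, genome-wide VCF.
            return [(11, c) for c in CHROMS]
        # Otherwise: depend only on the previous stage for the same chromosome.
        return [(stage - 1, chrom)]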

For scalability, the CNV pipeline recursively invokes Queue scripts as parallel jobs. Because of this, a certain number of job slots are occupied by Queue scripts that are dependent on other subsidiary jobs to finish. If the number of available job slots in your compute cluster is too small, the pipeline can become starved (i.e. all available job slots are running Queue jobs, but none of the running jobs can make progress until a job slot becomes available). The minimum number of job slots needed to avoid starvation is the number of chromosomes being processed plus 2. One useful technique to reduce problems is to pass an -intervalList argument to limit processing to the "main" chromosomes (1-22 and possibly X and Y). This limits the starvation scenarios to cases where fewer than 30 job slots are available.
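
As a back-of-envelope check of that formula (a sketch, not part of the pipeline):

    # Minimum job slots to avoid starvation = number of sequences processed + 2.
    sequences = [str(c) for c in range(1, 23)] + ["X", "Y"]   # -intervalList limited to the main chromosomes
    print(len(sequences) + 2)   # 26 slots when processing chromosomes 1-22 plus X and Y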