How should I pre-process data from multiplexed sequencing and multi-library designs?
Posted in FAQs on 2013-08-02 20:23:38 | Last updated on 2015-08-13 23:14:55

Comments (2)

Our Best Practices Pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.

However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs. At that point there are several different valid strategies for implementing the pre-processing workflow. At the Broad Institute, we run the entire pre-processing workflow separately on each individual read group, then we merge the data to produce a single BAM file for each sample, then we re-run two steps (Mark Duplicates and Indel Realignment) on the per-sample BAM files. See the worked-out example below and this presentation for more details.

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.


Let's say we have this example data:

  • sample1_lane1.fq
  • sample1_lane2.fq
  • sample2_lane1.fq
  • sample2_lane2.fq

1. Run all core steps per-lane once

At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.

The example data becomes:

  • sample1_lane1.dedup.realn.recal.bam
  • sample1_lane2.dedup.realn.recal.bam
  • sample2_lane1.dedup.realn.recal.bam
  • sample2_lane2.dedup.realn.recal.bam

2. Merge lanes per sample

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.

The example data becomes:

  • sample1.merged.bam
  • sample2.merged.bam

3. Per-sample refinement

You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.

The example data becomes:

  • sample1.merged.dedup.realn.bam
  • sample2.merged.dedup.realn.bam

This gets you two big wins:

  • Dedupping per-sample eliminates PCR duplicates across all lanes in addition to optical duplicates (which are by definition only per-lane)
  • Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.

Note that BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID).

Return to top Comment on this article in the forum