How should I pre-process data from multiplexed sequencing and multi-library designs?
Posted in FAQs on 2013-08-02 20:23:38 | Last updated on 2015-12-22 13:49:47

Comments (13)

Our Best Practices Pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.

However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs. At that point there are several different valid strategies for implementing the pre-processing workflow. At the Broad Institute, we run the initial steps of the pre-processing workflow (mapping and sorting) separately on each individual read group, then we merge the data to produce a single BAM file for each sample (aggregation). Then we re-run Mark Duplicates, run Indel Realignment and run Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below and this presentation for more details.

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.


Let's say we have this example data:

  • sample1_lane1.fq
  • sample1_lane2.fq
  • sample2_lane1.fq
  • sample2_lane2.fq

1. Run initial steps per-lane once

Assuming that you received one FASTQ file per lane of sequence data, run each file through mapping, sorting and marking duplicates (dedup).

During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the GATK Dictionary on read groups got guidance.

The example data becomes:

  • sample1_lane1.dedup.bam
  • sample1_lane2.dedup.bam
  • sample2_lane1.dedup.bam
  • sample2_lane2.dedup.bam

Technically this first run of marking duplicates is not necessary because we will run it a second time per-sample, and that per-sample would be enough to achieve the desired result. However, running it once per-lane allows us to estimate library complexity as a quality control step.

2. Merge lanes per sample (aggregation)

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.

The example data becomes:

  • sample1.merged.bam
  • sample2.merged.bam

3. Per-sample pre-processing

At this point you perform an extra round of marking duplicates. This eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).

The example data becomes:

  • sample1.merged.dedup.bam
  • sample2.merged.dedup.bam

Then you run indel realignment and base recalibration (BQSR).

Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

Base recalibration will be applied per-lane (as it should be) if you assigned appropriate read group information in your data.

The example data becomes:

  • sample1.merged.dedup.realn.recal.bam
  • sample2.merged.dedup.realn.recal.bam

This works because the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine.

Note that BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID).

Finally, people often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Return to top Comment on this article in the forum