Tagged with #mapping
1 documentation article | 0 announcements | 4 forum discussions

Created 2013-06-17 20:58:04 | Updated 2015-09-24 13:06:53 | Tags: official bwa picard mapping markduplicates
Comments (69)


Map the read data to the reference and mark duplicates.


  • This tutorial assumes adapter sequences have been removed.


  1. Identify read group information
  2. Generate a SAM file containing aligned reads
  3. Convert to BAM, sort and mark duplicates

1. Identify read group information

The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, take a look at the SAM specification.


Compose the read group identifier in the following format:


where the \t stands for the tab character.

2. Generate a SAM file containing aligned reads


Run the following BWA command:

In this command, replace read group info by the read group identifier composed in the previous step.

bwa mem -M -R ’<read group info>’ -p reference.fa raw_reads.fq > aligned_reads.sam 

replacing the <read group info> bit with the read group identifier you composed at the previous step.

The -M flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).

Expected Result

This creates a file called aligned_reads.sam containing the aligned reads from all input files, combined, annotated and aligned to the same reference.

Note that here we are using a command that is specific for pair ended data in an interleaved fastq file, which is what we are providing to you as a tutorial file. To map other types of datasets (e.g. single-ended or pair-ended in forward/reverse read files) you will need to adapt the command accordingly. Please see the BWA documentation for exact usage and more options for these commands.

3. Convert to BAM, sort and mark duplicates

These initial pre-processing operations format the data to suit the requirements of the GATK tools.


Run the following Picard command to sort the SAM file and convert it to BAM:

java -jar picard.jar SortSam \ 
    INPUT=aligned_reads.sam \ 
    OUTPUT=sorted_reads.bam \ 

Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.


Run the following Picard command to mark duplicates:

java -jar picard.jar MarkDuplicates \ 
    INPUT=sorted_reads.bam \ 
    OUTPUT=dedup_reads.bam \

Expected Result

This creates a sorted BAM file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt containing (can you guess?) metrics.


Run the following Picard command to index the BAM file:

java -jar picard.jar BuildBamIndex \ 

Expected Result

This creates an index file for the BAM file called dedup_reads.bai.

No posts found with the requested search criteria.

Created 2015-06-01 10:54:07 | Updated | Tags: best-practices picard mapping
Comments (4)

I am trying to follow the best practices for mapping my (Paired-end Illumina HiSeq) reads to the reference, by following this presentation:

From what I understand, I should use MergeBamAlignment to clean up the output from bwa, and then use this cleaned up output for the rest of the analysis. However, when I run ValidateSamFile after running MergeBamAlignment I get a lot of errors, and running CleanSam on the file does not resolve any of them. What am I doing wrong? I've tried searching the web for more details about MergeBamAlignment but I haven't been able to find much. Please let me know if you require any additional information.

How I ran MergeBamAlignment picard-tools MergeBamAlignment \ UNMAPPED_BAM=unmapped_reads.sam \ ALIGNED_BAM=aligned_reads.sam \ OUTPUT=aligned_reads.merged.bam \ REFERENCE_SEQUENCE=/path/to/reference.fasta \ PAIRED_RUN=true # Why is this needed?

Error report from ValidateSamFile ## HISTOGRAM java.lang.String Error Type Count ERROR:INVALID_CIGAR 5261 ERROR:MATES_ARE_SAME_END 30 ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 30

Created 2014-12-01 18:47:12 | Updated | Tags: mapping
Comments (5)

Hello, when I use BWA to map reads generated from targeted sequencing data (Agilent SureSelect kit), how to prepare reference, better use whole genome or selected subset (targeted region) ?

Thanks !

Created 2014-01-20 19:28:37 | Updated | Tags: bwa mapping rna-seq
Comments (1)

Hi all,

My question is on bwa software when one want to map RNA-seq data on the entire human genome. What should be the specific settings to use to get maximum mapping? Should it be effective if no options are used in the command line?

Thank you for your time

Created 2012-11-29 08:16:23 | Updated | Tags: bam walker summary mapping reads
Comments (4)

Hi, Does GATK2 provide a walker/option to summarize the read alignment in a given BAM file? The summary including total reads, reads mapped/%, reads uniquely mapped/%, reads uniquely mapped with 0mm/%, reads mapped on-target/%, reads uniquely mapped on-target%, etc is of great use to assess the mapping quality for whole genome or targeted analysis. Please advice me on how I can obtain this using any of the walkers available. Thanks, Raj