## (howto) Fix a badly formatted BAM

Posted in Tutorials on 2013-07-03 00:53:53 | Last updated on 2015-09-24 12:57:53

#### Objective

Fix a BAM file that is not sorted or indexed, has not had duplicates marked, or lacks read group information. The steps can be performed independently of one another, but the order given here is recommended.

#### Prerequisites

• Installed Picard tools

#### Steps

1. Sort the aligned reads by coordinate order
2. Mark duplicate reads
3. Add read group information
4. Index the BAM file

#### Note

You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

### 1. Sort the aligned reads by coordinate order

#### Action

Run the following Picard command:

java -jar picard.jar SortSam \
INPUT=unsorted_reads.bam \
OUTPUT=sorted_reads.bam \
SORT_ORDER=coordinate


#### Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.
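
One way to confirm the result is to look at the `@HD` line of the BAM header, where the SAM format records the declared sort order in the `SO` tag. The sketch below is only an illustration: the function name `is_coordinate_sorted` is made up, and it assumes the header text arrives on standard input (for example from `samtools view -H sorted_reads.bam`, if samtools happens to be installed). It checks what the header declares, not the actual order of the records.

```shell
# Hypothetical helper: succeeds if the SAM header read from stdin
# declares coordinate sort order on its @HD line.
is_coordinate_sorted() {
    grep -q '^@HD.*[[:space:]]SO:coordinate'
}

# Example with an inline header line; with samtools installed, one could
# pipe `samtools view -H sorted_reads.bam` into the function instead.
if printf '@HD\tVN:1.4\tSO:coordinate\n' | is_coordinate_sorted; then
    echo "sorted by coordinate"
fi
```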

### 2. Mark duplicate reads

#### Action

Run the following Picard command:

java -jar picard.jar MarkDuplicates \
INPUT=sorted_reads.bam \
OUTPUT=dedup_reads.bam \
METRICS_FILE=metrics.txt


#### Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also creates a file called metrics.txt that contains metrics regarding duplication of the data.

#### More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.
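
In the BAM itself, the mark is a bit in each read's FLAG field: the SAM specification reserves 0x400 (decimal 1024) for PCR or optical duplicates. As a small illustration, the following sketch tests that bit for a given FLAG value; the function name `is_duplicate_flag` and the example flag values are invented for this example.

```shell
# Hypothetical helper: succeeds if the given SAM FLAG value has the
# duplicate bit (0x400 = 1024) set.
is_duplicate_flag() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# 163 is a FLAG without the duplicate bit; 1187 is 163 + 1024.
for flag in 163 1187; do
    if is_duplicate_flag "$flag"; then
        echo "$flag: duplicate"
    else
        echo "$flag: not a duplicate"
    fi
done
```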

### 3. Add read group information

#### Action

Run the following Picard command:

java -jar picard.jar AddOrReplaceReadGroups \
INPUT=dedup_reads.bam \
OUTPUT=addrg_reads.bam \
RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1


#### Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.
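
The `RG*` arguments correspond to the tags of an `@RG` header line in the SAM format (`ID`, `LB`, `PL`, `PU`, `SM`). A minimal sketch for checking that a header carries a read group for a given sample follows. It assumes the header text is supplied on standard input, and the function name `has_read_group_for_sample` is hypothetical.

```shell
# Hypothetical helper: succeeds if the SAM header on stdin has an @RG
# line whose SM tag equals the given sample name.
has_read_group_for_sample() {
    awk -F '\t' -v sample="$1" '
        $1 == "@RG" {
            for (i = 2; i <= NF; i++)
                if ($i == "SM:" sample) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Header line built from the same values as the command in this step.
printf '@RG\tID:group1\tLB:lib1\tPL:illumina\tPU:unit1\tSM:sample1\n' |
    has_read_group_for_sample sample1 && echo "read group found"
```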

### 4. Index the BAM file

#### Action

Run the following Picard command:

java -jar picard.jar BuildBamIndex \
INPUT=addrg_reads.bam


#### Expected Results

This creates an index file called addrg_reads.bai. The BAM file addrg_reads.bam, together with its index, is now ready to be used in the Best Practices workflow.

Since Picard tools do not systematically create an index file when they output a new BAM file (unlike GATK tools, which will always output indexed files), it is best to keep the indexing step for last.
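
The four steps can also be chained in a small script. The sketch below reuses the commands from this tutorial unchanged; the function names `run` and `fix_bam` and the `PICARD` and `DRY_RUN` variables are assumptions introduced for illustration and are not part of Picard. With `DRY_RUN=1` the commands are only printed, which makes it possible to review them before anything is executed.

```shell
# Sketch only: PICARD, DRY_RUN, run and fix_bam are illustrative names,
# not part of Picard.
PICARD="${PICARD:-java -jar picard.jar}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: fix_bam unsorted_reads.bam
# Each step runs only if the previous one succeeded.
fix_bam() {
    input=$1
    # $PICARD is left unquoted on purpose so that it splits into words.
    run $PICARD SortSam INPUT="$input" OUTPUT=sorted_reads.bam SORT_ORDER=coordinate &&
    run $PICARD MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt &&
    run $PICARD AddOrReplaceReadGroups INPUT=dedup_reads.bam OUTPUT=addrg_reads.bam \
        RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1 &&
    run $PICARD BuildBamIndex INPUT=addrg_reads.bam
}

# Preview the commands without running Picard.
DRY_RUN=1 fix_bam unsorted_reads.bam
```

Indexing stays last in the chain, in line with the recommendation above.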
