## (howto) Fix a badly formatted BAMPosted in Tutorials on 2013-07-03 00:53:53 | Last updated on 2015-01-26 18:11:45

#### Objective

Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other but this order is recommended.

#### Prerequisites

• Installed Picard tools

#### Steps

1. Sort the aligned reads by coordinate order
2. Mark duplicates
4. Index the BAM file

#### Note

You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

### 1. Sort the aligned reads by coordinate order

#### Action

Run the following Picard command:

java -jar SortSam.jar \
SORT_ORDER=coordinate


#### Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

#### Action

Run the following Picard command:

java -jar MarkDuplicates.jar \
METRICS_FILE=metrics.txt


#### Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also creates a file called metrics.txt that contains metrics regarding duplication of the data.

#### More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.

#### Action

Run the following Picard command:

java -jar AddOrReplaceReadGroups.jar  \
RGID=group1 RGLB= lib1 RGPL=illumina RGPU=unit1 RGSM=sample1


#### Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.

### 4. Index the BAM file

#### Action

Run the following Picard command:

java -jar BuildBamIndex.jar \

This creates an index file called addrg_reads.bai, which is ready to be used in the Best Practices workflow.