# Tagged with #markduplicates 1 documentation article | 0 announcements | 10 forum discussions

#### Objective

Map the read data to the reference and mark duplicates.

#### Prerequisites

• This tutorial assumes adapter sequences have been removed.

#### Steps

1. Identify read group information
2. Generate a SAM file containing aligned reads
3. Convert to BAM, sort and mark duplicates

### 1. Identify read group information

The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, take a look at the SAM specification.

#### Action

Compose the read group identifier in the following format:

@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1


where the \t stands for the tab character.
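
Since the \t placeholders must become literal tab characters in the SAM @RG header line, a small sketch can make the construction concrete (illustrative Python; the field values are the tutorial's example values, not anything your data requires):

```python
# Build the read group string. Python's "\t" escape produces a literal tab,
# which is exactly what the @RG header line must contain in the SAM file.
fields = {"ID": "group1", "SM": "sample1", "PL": "illumina", "LB": "lib1", "PU": "unit1"}
read_group = "@RG\t" + "\t".join(f"{key}:{value}" for key, value in fields.items())
print(read_group)
```

When passing this string to bwa on the command line (the -R option in the next step), the shell string may contain the two-character sequence \t instead; bwa expands it to a tab itself.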

### 2. Generate a SAM file containing aligned reads

#### Action

Run the following BWA command, replacing <read group info> with the read group identifier you composed in the previous step:

bwa mem -M -R '<read group info>' -p reference.fa raw_reads.fq > aligned_reads.sam

The -M flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).

#### Expected Result

This creates a file called aligned_reads.sam containing the aligned reads from all input files, combined, annotated and aligned to the same reference.

Note that here we are using a command that is specific for pair ended data in an interleaved fastq file, which is what we are providing to you as a tutorial file. To map other types of datasets (e.g. single-ended or pair-ended in forward/reverse read files) you will need to adapt the command accordingly. Please see the BWA documentation for exact usage and more options for these commands.

### 3. Convert to BAM, sort and mark duplicates

These initial pre-processing operations format the data to suit the requirements of the GATK tools.

#### Action

Run the following Picard command to sort the SAM file and convert it to BAM:

java -jar SortSam.jar \
    INPUT=aligned_reads.sam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate


#### Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

#### Action

Run the following Picard command to mark duplicates:

java -jar MarkDuplicates.jar \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=metrics.txt


#### Expected Result

This creates a sorted BAM file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt containing (can you guess?) metrics.
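
Conceptually, duplicate marking groups reads by their unclipped 5' alignment position and strand, keeps the highest-quality read in each group, and flags the rest. A simplified Python sketch of that grouping idea (toy record tuples, not Picard's actual implementation, which also considers library, mate position, and optical coordinates):

```python
from collections import defaultdict

# Each toy record: (name, reference, unclipped 5' position, strand, base-quality sum)
reads = [
    ("r1", "chr1", 100, "+", 120),
    ("r2", "chr1", 100, "+", 140),  # same position and strand as r1 -> duplicate pair
    ("r3", "chr1", 205, "-", 130),
]

# Group reads that share reference, 5' position, and strand.
groups = defaultdict(list)
for rec in reads:
    name, ref, pos, strand, qual = rec
    groups[(ref, pos, strand)].append(rec)

# In each group of size > 1, keep the read with the highest base-quality
# sum and mark the others as duplicates.
duplicates = set()
for recs in groups.values():
    if len(recs) > 1:
        best = max(recs, key=lambda r: r[4])
        duplicates.update(r[0] for r in recs if r[0] != best[0])

print(sorted(duplicates))  # prints ['r1']
```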

#### Action

Run the following Picard command to index the BAM file:

java -jar BuildBamIndex.jar \
    INPUT=dedup_reads.bam


#### Expected Result

This creates an index file for the BAM file called dedup_reads.bai.


Hi, when I followed the GATK protocol I ran the commands below: bwa mem -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:C2U2AACXX' ucsc.hg19.fasta ../Unaligned/Project_DefaultProject/Sample_1/1_R1.fastq ../Unaligned/Project_DefaultProject/Sample_1/1_R2.fastq > sample1.sam

java -jar /data/software/picard/MarkDuplicates.jar INPUT=sample1.sam OUTPUT=sample1_dedup.bam SO=coordinate

I got some error like "ERROR: Unrecognized option: SO"

Why, and how do I fix it?

Thanks, Min

Hi,

I'm having trouble removing duplicates using Picard tools on SOLiD data. I get a regex not matching error.

The reads have the following names:

22_758_632_F3

604_1497_576

124_1189_1519_F5

358_1875_702_F5-DNA

And I don't think Picard is able to parse these read names with its default regex.

I tried to change the default regex. This time it does not throw an error, but it takes too long and times out (out of memory). I suspect I'm not giving the right regex. Here is my command:

java -jar $PICARD_TOOLS_HOME/MarkDuplicates.jar I=$FILE O=$BAMs/MarkDuplicates/$SAMPLE.MD.bam M=$BAMs/MarkDuplicates/$SAMPLE.metrics READ_NAME_REGEX="([0-9]+)([0-9]+)([0-9]+).*"

Any help is appreciated. Thanks!
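
A quick way to check a candidate pattern against those names is to test it directly (illustrative Python, not Picard's internals). Note that the separators must appear in the pattern; a pattern like ([0-9]+)([0-9]+)([0-9]+).* has no underscores between the capture groups, so it cannot split the coordinates as intended. The underscore-separated pattern below is an assumption about the SOLiD name layout:

```python
import re

# The SOLiD-style read names from the post above.
names = ["22_758_632_F3", "604_1497_576", "124_1189_1519_F5", "358_1875_702_F5-DNA"]

# Hypothetical pattern with explicit underscore separators between the three
# numeric fields (the tile/x/y analogues used for optical-duplicate detection).
pattern = re.compile(r"([0-9]+)_([0-9]+)_([0-9]+).*")

for n in names:
    m = pattern.fullmatch(n)
    assert m is not None  # every name yields three coordinate groups
    print(n, "->", m.groups())
```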

I was planning to call variants on my pooled bacteria sample using GATK UnifiedGenotyper and was wondering if removing duplicates should be in the best practices since with pooled samples, the reads might not be PCR duplicates but actually reads from different strains that I have in the pool. I would appreciate any input from your side. Thank you

Eva

Hello, I am a graduate student in lab that studies evolution, and I am relatively new to NGS. I have been given reads from pooled moth samples, and I am hoping to identify variants with the ultimate goal of quantifying the genetic differentiation between two strains of moths. I am wondering 1) if it is appropriate/recommended to remove duplicates with pooled data and 2) more broadly, are there particular situations in which removing duplicates is not suggested? For example, I have another data set in which the fragments were not generated by random shearing but rather by multiplex PCR of 17 particular amplicons for 42 different individual moths (not pooled). I'm guessing that removing duplicates doesn't make sense in this case because there will be lots of reads that start at the exact same position relative to the reference. Is this right?

Thanks a bunch!

Referring to broadinstitute.org/gatk/guide/article?id=3060, does removing duplicates need to be done twice, once per lane and then per sample?

Is it not enough to just mark the duplicates in the final BAM file with all the lanes merged, which should remove both optical and PCR duplicates (I am using Picard MarkDuplicates.jar)? So specifically, in the link above what is wrong with generating -

• sample1_lane1.realn.recal.bam
• sample1_lane2.realn.recal.bam
• sample2_lane1.realn.recal.bam
• sample2_lane2.realn.recal.bam

Then, merging them to get

• sample1.merged.bam
• sample2.merged.bam

and finally, include "de-dupping" only for the merged BAM file.

• sample1.merged.dedup.realn.bam
• sample2.merged.dedup.realn.bam

Hi,

So I've finally taken the plunge and migrated our analysis pipeline to Queue. With some great feedback from @johandahlberg, I have gotten to a state where most of the stuff is running smoothly on the cluster.

I'm trying to add Picard's CalculateHSMetrics to the pipeline, but am having some issues. This code:

case class hsmetrics(inBam: File, baitIntervals: File, targetIntervals: File, outMetrics: File) extends CalculateHsMetrics with ExternalCommonArgs with SingleCoreJob with OneDayJob {
  @Input(doc = "Input BAM file") val bam: File = inBam
  @Output(doc = "Metrics file") val metrics: File = outMetrics
  this.input :+= bam
  this.targets = targetIntervals
  this.baits = baitIntervals
  this.output = metrics
  this.reference = refGenome
  this.isIntermediate = false
}


Gives the following error message:

ERROR 06:56:25,047 QGraph - Missing 2 values for function:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp' null 'INPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.bam'  'TMP_DIR=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp'  'VALIDATION_STRINGENCY=SILENT'  'OUTPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.preMarkDupsHsMetrics.metrics'  'BAIT_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'TARGET_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'REFERENCE_SEQUENCE=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/bwaindex0.6/exampleFASTA.fasta'  'METRIC_ACCUMULATION_LEVEL=SAMPLE'
ERROR 06:56:25,048 QGraph -   @Argument: jarFile - jar
ERROR 06:56:25,049 QGraph -   @Argument: javaMainClass - Main class to run from javaClasspath


And yeah, it seems that the jar file is currently set to null in the command line. However, MarkDuplicates runs fine without setting the jar:

case class dedup(inBam: File, outBam: File, metricsFile: File) extends MarkDuplicates with ExternalCommonArgs with SingleCoreJob with OneDayJob {
  @Input(doc = "Input bam file") var inbam = inBam
  @Output(doc = "Output BAM file with dups removed") var outbam = outBam
  this.REMOVE_DUPLICATES = true
  this.input :+= inBam
  this.output = outBam
  this.metrics = metricsFile
  this.memoryLimit = 3
  this.isIntermediate = false
}


Why does CalculateHSMetrics need the jar, but not MarkDuplicates? Both are imported with import org.broadinstitute.sting.queue.extensions.picard._.

Hello,

I am having trouble calling variants using HaplotypeCaller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices pre-processing pipeline. The pre-processed data worked fine with UnifiedGenotyper, but with HaplotypeCaller, though it runs without errors and seems to walk across the genome, it only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference: when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.

Any help or guidance would be greatly appreciated!

Command Used for HaplotypeCaller:

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG

Commands Used for pre-processing (run in sequence using a Perl script):

java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf

java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf

java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam

samtools index exome.realigned.dedup

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate

java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam

I have been using GATK (v2.2) UnifiedGenotyper to generate VCFs. I did a multisample realignment around indels, which generated a multisample BAM of size ~500Gb. After looking at some of the SNP calls I decided to try removing duplicates. I used MarkDuplicates with "REMOVE_DUPLICATES=true" and, although only 10% of reads were duplicates, the BAM reduced to ~75Gb. This did not seem to affect the depth of reads at a site more than the expected ~10%, but now the AD field in the genotype columns is missing, i.e. GT:AD:GQ 0/1:.:30. When I run UnifiedGenotyper with the old BAM prior to MarkDuplicates, the AD field is present.

I am currently running MarkDuplicates on each sample prior to realignment, because I think this makes the most sense, but it isn't clear why this should matter.

Any ideas on what is happening here?

I was frustrated by the .metrics file from MarkDuplicates getting deleted as an intermediate file, so I set isIntermediate=false for that step in the DataProcessingPipeline. But now I'm getting tired of manually deleting the intermediate bams.

So my request is, could that field be changed from an @Output to an @Argument? This would be on line 50 of org.broadinstitute.sting.queue.extensions.picard.MarkDuplicates.scala. I also made that a required field in my local copy, since it is required to run the Picard tool.

A similar but opposite problem is that the bai file from the IndelRealigner step is not deleted - but that looks like it would require either special handling for that walker in Queue or for the index file to be an argument to the Java walker. Neither is a particularly appealing solution.

I have the following queries on running the RealignerTargetCreator module in GATK 1.4.

1) Is it recommended to provide the target capture BED file to RealignerTargetCreator in the case of targeted/exome experiments? Without the BED file, the tool takes a long time (~6-7 hrs). What's the optimal way here?

2) Does running mark duplicates before or after RealignerTargetCreator have any effect on the number of SNPs/indels? What is recommended?