# Tagged with #markduplicates 1 documentation article | 0 announcements | 19 forum discussions

Created 2016-01-08 21:06:41 | Updated 2016-02-08 17:17:59 | Tags: best-practices bam markduplicates duplicates

Here we discuss two tools, MarkDuplicates and MarkDuplicatesWithMateCigar, that flag duplicates. We provide example data and example commands so you can follow along with the tutorial (section 1), and we include tips on estimating library complexity for PCR-free samples and patterned flow cell technologies. In section 2, we point out special memory considerations for these tools. In section 3, we highlight the similarities and differences between the two tools. Finally, in section 4, we cover details that may interest some readers, including comments on the metrics file.

To mark duplicates in RNA-Seq data, use MarkDuplicates. Reasons are explained in section 2 and section 3. And if you are considering using MarkDuplicatesWithMateCigar for your DNA data, be sure insert lengths are short and you have a low percentage of split or multi-mapping records.

Obviously, expect more duplicates for samples prepared with PCR than for PCR-free preparations. Duplicates arise from various sources, including within the sequencing run. As such, even PCR-free data can give rise to duplicates, albeit at low rates, as illustrated here with our example data.

#### Prerequisites

• Installed Picard tools
• Coordinate-sorted and indexed BAM alignment data. Secondary/supplementary alignments are flagged appropriately (256 and 2048 flags) and additionally with the mate unmapped (8) flag. See the MergeBamAlignment section (3C) of Tutorial#6483 for a description of how MergeBamAlignment ensures such flagging.
• For MarkDuplicatesWithMateCigar, pre-computed Mate CIGAR (MC) tags. Data produced according to Tutorial#6483 will have the MC tags added by MergeBamAlignment. Alternatively, see tools RevertOriginalBaseQualitiesAndAddMateCigar and FixMateInformation.
• Appropriately assigned Read Group (RG) information. Read Group library (RGLB) information is factored for molecular duplicate detection. Optical duplicates are limited to those from the same RGID.

• Use the advanced tutorial bundle's human_g1k_v37_decoy.fasta as reference. This same reference is available to load in IGV.
• tutorial_6747.tar.gz data contain human paired 2x150 whole genome sequence reads originally aligning at ~30x depth of coverage. The sample is a PCR-free preparation of the NA12878 individual run on the HiSeq X platform. This machine type, along with HiSeq 4000, has the newer patterned flow cell that differs from the typical non-patterned flow cell. I took the reads aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) and sanitized and realigned the reads (BWA-MEM -M) to the entire genome according to the workflow presented in Tutorial#6483 to produce snippet.bam. The data has (i) no supplementary records; (ii) secondary records flagged with the 256 flag and the mate-unmapped (8) flag; and (iii) unmapped records (4 flag) with mapped mates (mates have 8 flag), zero MAPQ (column 5) and asterisks for CIGAR (column 6). The notation allows read pairs where one mate maps and the other does not to sort and remain together when we apply genomic intervals such as in the generation of the snippet.

## 1. Commands for MarkDuplicates and MarkDuplicatesWithMateCigar

The following commands take a coordinate-sorted and indexed BAM and return (i) a BAM with the same records in coordinate order and with duplicates marked by the 1024 flag, (ii) a duplication metrics file, and (iii) an optional matching BAI index.

For a given file with all MC (mate CIGAR) tags accounted for:

• and where all mates are accounted for, each tool--MarkDuplicates and MarkDuplicatesWithMateCigar--examines the same duplicate sets but prioritizes differently which inserts get marked duplicate. This situation is represented by our snippet example data.
• but with missing mate records, MarkDuplicates ignores the records, while MarkDuplicatesWithMateCigar still considers them for duplicate marking, using the MC tag for mate information. Again, the duplicate scoring methods differ for each tool.

Use the following commands to flag duplicates for 6747_snippet.bam. These commands produce qualitatively different data.

Score duplicate sets based on the sum of base qualities using MarkDuplicates:

    # INPUT can be specified multiple times to merge files.
    # OPTICAL_DUPLICATE_PIXEL_DISTANCE is changed from the default of 100.
    # CREATE_INDEX is optional.
    java -Xmx32G -jar picard.jar MarkDuplicates \
        INPUT=6747_snippet.bam \
        OUTPUT=6747_snippet_markduplicates.bam \
        METRICS_FILE=6747_snippet_markduplicates_metrics.txt \
        OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
        CREATE_INDEX=true \
        TMP_DIR=/tmp

Score duplicate sets based on total mapped reference length using MarkDuplicatesWithMateCigar:

    # INPUT can be specified multiple times to merge files.
    # OPTICAL_DUPLICATE_PIXEL_DISTANCE is changed from the default of 100.
    # CREATE_INDEX is optional.
    java -Xmx32G -jar picard.jar MarkDuplicatesWithMateCigar \
        INPUT=6747_snippet.bam \
        OUTPUT=6747_snippet_markduplicateswithmatecigar.bam \
        METRICS_FILE=6747_snippet_markduplicateswithmatecigar_metrics.txt \
        OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
        CREATE_INDEX=true \
        TMP_DIR=/tmp

• Each tool has a distinct default DUPLICATE_SCORING_STRATEGY. For MarkDuplicatesWithMateCigar it is TOTAL_MAPPED_REFERENCE_LENGTH and this is the only scoring strategy available. For MarkDuplicates you can switch the DUPLICATE_SCORING_STRATEGY between the default SUM_OF_BASE_QUALITIES and TOTAL_MAPPED_REFERENCE_LENGTH. Both scoring strategies use information pertaining to both mates in a pair, but in the case of MarkDuplicatesWithMateCigar the information for the mate comes from the read's MC tag and not from the actual mate.
• To merge multiple files into a single output, e.g. when aggregating a sample from across lanes, specify the INPUT parameter once for each file. The tools merge the read records from the multiple files into the single output file. The tools mark duplicates for the entire library (RGLB) and account for optical duplicates per RGID. INPUT files must be coordinate sorted and indexed.
• The Broad's production workflow increases OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500, to better estimate library complexity. The default setting for this parameter is 100. Changing this parameter does not alter duplicate marking. It only changes the count for optical duplicates and the library complexity estimate in the metrics file in that whatever is counted as an optical duplicate does not factor towards library complexity. The increase has to do with the fact that our example data was sequenced in a patterned flow cell of a HiSeq X machine. Both HiSeq X and HiSeq 4000 technologies decrease pixel unit area by 10-fold and so the equivalent pixel distance in non-patterned flow cells is 250. You may ask why are we still counting optical duplicates for patterned flow cells that by design should have no optical duplicates. We are hijacking this feature of the tools to account for other types of duplicates arising from the sequencer. Sequencer duplicates are not limited to optical duplicates and should be differentiated from PCR duplicates for more accurate library complexity estimates.
• By default the tools flag duplicates and retain them in the output file. To remove the duplicate records from the resulting file, set the REMOVE_DUPLICATES parameter to true. However, given you can set GATK tools to include duplicates in analyses by adding -drf DuplicateRead to commands, a better option for value-added storage efficiency is to retain the resulting marked file over the input file.
• To optionally create a .bai index, add and set the CREATE_INDEX parameter to true.
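The multiple-INPUT usage described in the list above can be sketched as follows. This is a minimal illustration only: the lane-level file names and output names are hypothetical, and the script prints the assembled command for review rather than running Picard.

```shell
# Hypothetical lane-level BAMs for one sample; replace with your own files.
lane_bams="lane1.bam lane2.bam lane3.bam"

# Build one INPUT= argument per file.
input_args=""
for bam in $lane_bams; do
  input_args="$input_args INPUT=$bam"
done

# Print the command for review instead of running it.
echo "java -Xmx32G -jar picard.jar MarkDuplicates$input_args" \
  "OUTPUT=sample_markduplicates.bam" \
  "METRICS_FILE=sample_markduplicates_metrics.txt" \
  "TMP_DIR=/tmp"
```

Once the printed command looks right, it can be pasted into the terminal or the `echo` can be dropped.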

For snippet, the duplication metrics are identical whether marked by MarkDuplicates or MarkDuplicatesWithMateCigar. We have 13.4008% duplication, with 255 unpaired read duplicates and 18,254 paired read duplicates. However, as the screenshot at the top of this page illustrates, and as section 4 explains, the data qualitatively differ.

## 2. Slow or out of memory error? Special memory considerations for duplicate marking tools

The seemingly simple task of marking duplicates is one of the most memory hungry processes, especially for paired end reads. Both tools are compute-intensive and require upping memory compared to other processes.

MarkDuplicatesWithMateCigar is a single-pass tool: it streams the duplicate marking routine in a manner that allows for piping. Because of this, for a given file its memory requirements can be greater than for MarkDuplicates. Due to these memory constraints for MarkDuplicatesWithMateCigar, we recommend MarkDuplicates for alignments that have large reference skips, e.g. spliced RNA alignments.

For large files, (1) use the Java -Xmx setting and (2) set the TMP_DIR parameter to a temporary directory. These options help the tool run without slowing down and without causing an out of memory error. For the purposes of this tutorial, commands are given as if the example data is a large file, which we know it is not.

    java -Xmx32G -jar picard.jar MarkDuplicates \
        ... \
        TMP_DIR=/tmp

These options can be omitted for small files such as the example data and the equivalent command is as follows.

    java -jar picard.jar MarkDuplicates ...   

### Set the java maxheapsize, specified by the -Xmx#G option, to the maximum your system allows.

The high memory cost, especially for MarkDuplicatesWithMateCigar, is in part because the tool systematically traverses genomic coordinate intervals for inserts in question, and for every read it marks as a duplicate it must keep track of the mate, which may or may not map nearby, so that reads are marked as pairs with each record emitted in its coordinate turn. In the meanwhile, this information is held in memory, which is the first choice for faster processing, until the memory limit is reached, at which point memory spills to disk. We set this limit high to minimize instances of memory spilling to disk.

In the example command, the -Xmx32G Java option caps the maximum heap size, or memory usage, to 32 gigabytes, which is the limit on the server I use. This is in contrast to the 8G setting I use for other processes on the same sample data--a 75G BAM file. To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize.
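As a convenience, the heap-size lookup mentioned above can be wrapped in a few lines of shell. This is a sketch that assumes the `-XX:+PrintFlagsFinal` output has the flag name in the second column and the value in the fourth; the helper name `bytes_to_gib` is ours.

```shell
# Convert a byte count to whole gibibytes.
bytes_to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

# Report the JVM's default maximum heap size, if java is on the PATH.
if command -v java >/dev/null 2>&1; then
  heap_bytes=$(java -XX:+PrintFlagsFinal -version 2>/dev/null \
    | awk '$2 == "MaxHeapSize" { print $4 }')
  if [ -n "$heap_bytes" ]; then
    echo "Default MaxHeapSize: $(bytes_to_gib "$heap_bytes") GiB"
  fi
fi
```

For reference, `bytes_to_gib 34359738368` prints 32, which corresponds to the -Xmx32G setting used in the example commands.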

### Set an additional temporary directory with the TMP_DIR parameter for memory spillage.

When the tool hits the memory limit, memory spills to disk. This causes data to traverse in and out of the processor's I/O device, slowing the process down. Disk is a location you specify with the TMP_DIR parameter. If you work on a server separate from where you read and write files to, setting TMP_DIR to the server's local temporary directory (typically /tmp) can reduce processing time compared to setting it to the storage disk. This is because the tool then additionally avoids traversing the network file system when spilling memory. Be sure the TMP_DIR location you specify provides enough storage space. Use df -h to see how much is available.
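The space check above can be scripted before launching a long run. The following is a minimal sketch; the 1 GiB threshold is an arbitrary placeholder, not a recommendation, and `available_kib` is a helper name we made up.

```shell
# Print the space available (in KiB) on the file system holding a directory.
available_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

tmp_dir=/tmp          # the directory you pass as TMP_DIR
required_kib=1048576  # hypothetical threshold: 1 GiB
if [ "$(available_kib "$tmp_dir")" -lt "$required_kib" ]; then
  echo "Warning: less than ${required_kib} KiB free in ${tmp_dir}" >&2
fi
```

The `-P` option asks df for the portable one-line-per-file-system format, which keeps the awk column lookup stable.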

## 3. Conceptual overview of duplicate flagging

The aim of duplicate marking is to flag all but one of a duplicate set as duplicates and to use duplicate metrics to estimate library complexity. Duplicates have a higher probability of being non-independent measurements from the exact same template DNA. Duplicate inserts are marked by the 0x400 bit (1024 flag) in the second column of a SAM record, for each mate of a pair. This allows downstream GATK tools to exclude duplicates from analyses (most do this by default). Certain duplicates, i.e. PCR and sequencer duplicates, violate assumptions of variant calling and also potentially amplify errors. Removing these, even at the cost of removing serendipitous biological duplicates, allows us to be conservative in calculating the confidence of variants.

GATK tools allow you to disable the duplicate read filter with -drf DuplicateRead so you can include duplicates in analyses.
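To make the duplicate flag concrete, here is a small sketch that tests whether a SAM FLAG value (column 2) has the 0x400 bit set. The function name `is_duplicate` is illustrative; the example FLAG values are made up for demonstration.

```shell
# Succeed when the SAM FLAG value has the duplicate bit (0x400 = 1024) set.
is_duplicate() {
  [ $(( $1 & 1024 )) -ne 0 ]
}

for flag in 163 1187; do
  if is_duplicate "$flag"; then
    echo "FLAG $flag: duplicate"
  else
    echo "FLAG $flag: not duplicate"
  fi
done
```

FLAG 1187 is 163 plus 1024, so it is the same record type with the duplicate bit added. If samtools is installed, `samtools view -c -f 1024` on a marked BAM counts the records carrying this bit.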

For a whole genome DNA sample, duplicates arise from three sources: (i) in DNA shearing from distinct molecular templates identical in insert mapping, (ii) from PCR amplification of a template (PCR duplicates), and (iii) from sequencing, e.g. optical duplicates. The tools cannot distinguish between these types of duplicates with the exception of optical duplicates. In estimating library complexity, the latter two types of duplicates are undesirable and should each factor differently.

When should we not care about duplicates? Given duplication metrics, we can make some judgement calls on the quality of our sample preparation and sequencer run. Of course, we may not expect a complex library if our samples are targeted amplicons. Also, we may expect minimal duplicates if our samples are PCR-free. Or it may be that because of the variation inherent in expression level data, e.g. RNA-Seq, duplicate marking becomes ritualistic. Unless you are certain of your edge case (amplicon sequencing, RNA-Seq allele-specific expression analysis, etc.) where duplicate marking adds minimal value, you should go ahead and mark duplicates. You may find yourself staring at an IGV session trying to visually calculate the strength of the evidence for a variant. We can pat ourselves on the back for having the forethought to systematically mark duplicates and turn on the IGV duplicate filter.

The Broad's Genomics Platform uses MarkDuplicates twice for multiplexed samples. Duplicates are flagged first per sample per lane to estimate lane-level library complexity, and second to aggregate data per sample while marking all library duplicates. In the second pass, duplicate marking tools again assess all reads for duplicates and overwrite any prior flags.

Our two duplicate flagging tools share common features but differ at the core. As the name implies, MarkDuplicatesWithMateCigar uses the MC (mate CIGAR) tag for mate alignment information. Unlike MarkDuplicates, it is a single-pass tool that requires pre-computed MC tags.

• For RNA-Seq data mapped against the genome, use MarkDuplicates. Specifically, MarkDuplicatesWithMateCigar will refuse to process data with large reference skips frequent in spliced RNA transcripts where the gaps are denoted with an N in the CIGAR string.
• Both tools only consider primary mappings, even if mapped to different contigs, and ignore secondary/supplementary alignments (256 flag and 2048 flag) altogether. Because of this, before flagging duplicates, be sure to mark primary alignments according to a strategy most suited to your experimental aims. See MergeBamAlignment's PRIMARY_ALIGNMENT_STRATEGY parameter for strategies the tool considers for changing primary markings made by an aligner.
• Both tools identify duplicate sets identically with the exception that MarkDuplicatesWithMateCigar additionally considers reads with missing mates. Missing mates occur for example when aligned reads are filtered using an interval list of genomic regions. This creates divorced reads whose mates aligned outside the targeted intervals.
• Both tools identify duplicates as sets of read pairs that have the same unclipped alignment start and unclipped alignment end. The tools intelligently factor for discordant pair orientations given these start and end coordinates. Within a duplicate set, with the exception of optical duplicates, read pairs may have either pair orientation--F1R2 or F2R1. For optical duplicates, pairs in the set must have the same orientation. Section 4 explains why.
• Both tools take into account clipped and gapped alignments and singly mapping reads (mate unmapped and not secondary/supplementary).
• Each tool flags duplicates according to different priorities. MarkDuplicatesWithMateCigar prioritizes which pair to leave as the representative nondup based on the total mapped length of a pair while MarkDuplicates can prioritize based on the sum of base qualities of a pair (default) or the total mapped length of a pair. Duplicate inserts are marked at both ends.

## 4. Details of interest to some

To reach a high target coverage depth, some fraction of sequenced reads will by stochastic means be duplicate reads.

Let us hope the truth of a variant never comes down to so few reads that duplicates should matter so. Keep in mind the better evidence for a variant is the presence of overlapping reads that contain the variant. Also, take estimated library complexity at face value--an estimate.

### Don't be duped by identical numbers. Data from the two tools qualitatively differ.

First, let me reiterate that secondary and supplementary alignment records are skipped and never flagged as duplicate.

Given a file with no missing mates, each tool identifies the same duplicate sets from primary alignments only and therefore the same number of duplicates. To reiterate, the number of identical loci or duplicate sets and the records within each set are the same for each tool. However, each tool differs in how it decides which insert(s) within a set get flagged and thus which insert remains the representative nondup. Also, if there are ties, the tools may break them differently in that tie-breaking can depend on the sort order of the records in memory.

• MarkDuplicates by default prioritizes the sum of base qualities for both mates of a pair. The pair with the highest sum of base qualities remains as the nondup.
• As a consequence of using the mate's CIGAR string (provided by the MC tag), MarkDuplicatesWithMateCigar can only prioritize the total mapped reference length, as provided by the CIGAR string, in scoring duplicates in a set. The pair with the longest mapping length remains as the nondup.
• If there are ties after applying each scoring strategy, both tools break the ties down a chain of deterministic factors starting with read name.
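To illustrate the mapped-reference-length idea from the list above, the sketch below sums the reference-consuming CIGAR operations (M, D, N, = and X, per the SAM specification) for a read and for its mate's CIGAR as it would appear in an MC tag. This is an assumption-laden illustration: the function names are ours, and whether Picard applies any further adjustment when scoring is not covered here.

```shell
# Sum the lengths of reference-consuming CIGAR operations (M, D, N, =, X).
cigar_ref_length() {
  printf '%s\n' "$1" | awk '{
    total = 0
    cigar = $0
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
      len = substr(cigar, 1, RLENGTH - 1) + 0   # numeric part
      op = substr(cigar, RLENGTH, 1)            # operation letter
      if (op ~ /[MDN=X]/) total += len
      cigar = substr(cigar, RLENGTH + 1)
    }
    print total
  }'
}

# Score a pair from the read CIGAR and the mate CIGAR (e.g. from the MC tag).
pair_ref_length() {
  echo $(( $(cigar_ref_length "$1") + $(cigar_ref_length "$2") ))
}

pair_ref_length 150M 5S145M
```

Soft clips (S) and insertions (I) do not consume reference bases, so a clipped pair scores lower than an otherwise identical unclipped pair.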

### Duplicate metrics in brief

We can break down the metrics file into two parts: (1) a table of metrics that counts various categories of duplicates and gives the library complexity estimate, and (2) histogram values in two columns.

See DuplicationMetrics for descriptions of each metric. For paired reads, duplicates are considered for the insert. For single end reads, duplicates are considered singly for the read, increasing the likelihood of being identified as a duplicate. Given the lack of insert-level information for these singly mapping reads, the insert metrics calculations exclude these.

The library complexity estimate only considers the duplicates that remain after subtracting out optical duplicates. For the math to derive estimated library size, see formula (1.2) in Mathematical Notes on SAMtools Algorithms.
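As a rough numerical sketch of the estimate, assume it follows the Lander-Waterman relation C/X = 1 - exp(-N/X) that Picard's DuplicationMetrics documentation describes, where X is the number of distinct molecules in the library, N is the number of read pairs examined after subtracting optical duplicate pairs, and C is the number of distinct read pairs observed. The function name and the bisection solver below are illustrative and are not Picard's code.

```shell
# Solve C/X = 1 - exp(-N/X) for X by bisection.
#   $1 = n: read pairs examined minus optical duplicate pairs
#   $2 = c: distinct (non-duplicate) read pairs observed
estimate_library_size() {
  awk -v n="$1" -v c="$2" 'BEGIN {
    # Without any duplicates (n <= c) the estimate is undefined.
    if (c <= 0 || n <= c) { print "NA"; exit }
    lo = c; hi = c
    # Expand the upper bound until x * (1 - exp(-n/x)) reaches c.
    while (hi * (1 - exp(-n / hi)) < c) hi *= 2
    for (i = 0; i < 200; i++) {
      mid = (lo + hi) / 2
      if (mid * (1 - exp(-n / mid)) < c) lo = mid; else hi = mid
    }
    printf "%.0f\n", (lo + hi) / 2
  }'
}

estimate_library_size 1000 632
```

With 1000 pairs examined and 632 distinct pairs, the estimate lands close to 1000 molecules, i.e. a library that has been sampled about once over.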

The histogram values extrapolate the calculated library complexity to a saturation curve plotting the gains in complexity if you sequence additional aliquots of the same library. The first bin's value represents the current complexity.

### Pair orientation F1R2 is distinct from F2R1 for optical duplicates

Here we refer you to a five minute video illustrating what happens at the molecular level in a typical sequencing by synthesis run.

What I would like to highlight is that each strand of an insert has a chance to seed a different cluster. I will also point out, due to sequencing chemistry, F1 and R1 reads typically have better base qualities than F2 and R2 reads.

Optical duplicate designation requires the same pair orientation.

Let us work out the implications of this for a paired end, unstranded DNA library. During sequencing, within the flow cell, for a particular insert produced by sample preparation, the strands of the insert are separated and each strand has a chance to seed a different cluster. Say InsertAB seeds ClusterA and ClusterB, and InsertCD seeds ClusterC and ClusterD. InsertAB and InsertCD are identical in sequence and length and map to the same loci. It is possible InsertAB and InsertCD are PCR duplicates and also possible they represent original inserts. Each strand is then sequenced in the forward and reverse to give four pieces of information in total for the given insert, e.g. ReadPairA and ReadPairB for InsertAB. The pair orientations of these two pairs are reversed--one cluster will give F1R2 and the other will give F2R1 pair orientation. Both read pairs map exactly to the same loci. Our duplicate marking tools consider ReadPairA and ReadPairB in the same duplicate set for regular duplicates but not for optical duplicates. Optical duplicates require identical pair orientation.


Created 2016-01-20 16:24:38 | Updated | Tags: commandlinegatk markduplicates

Dear GATK team,

Am I right that since MarkDuplicates considers only 5' coordinates of reads, it should work properly on reads (both paired-end and single-end) that have different lengths (due to quality trimming from 3')?

Created 2016-01-14 16:12:55 | Updated | Tags: commandlinegatk markduplicates rmdup

Dear GATK team,

I'm going to do variant calling for several tens of samples using hg38 reference with GATK. I have several questions about this process. They are partially covered on forums and in FAQs, but I'd like to clarify some points:

1) Am I right that MarkDuplicates can process a BAM file that contains both paired-end and single-end reads? (Picard FAQ hints it can, but just to be sure.)

2) Am I right that MarkDuplicates is significantly slower than samtools rmdup (because of its algorithm that marks not only dupes from the same chromosome, but also dupes from different chromosomes)?

3) Is there any evidence that use of MarkDuplicates is significantly better for the downstream analysis with GATK than use of samtools rmdup? (Of course, MarkDuplicates is used in the Best Practices, but Picard tools are used everywhere in that guide.)

Remarks:

1) I use bowtie2 --very-sensitive for read mapping.

2) I'd like to get a gVCF file for each sample.

Created 2015-10-23 18:15:22 | Updated | Tags: picard markduplicates

Hello, I've been using the Picardtools MarkDuplicates tool. I'd like to identify which reads are duplicates of each other (ie. if read.1234 is a duplicate of read.5678, I want to be able to retrieve this relationship). Does the MarkDuplicates output indicate this in any way? While I could group reads together if they share the same start coordinate listed in the BAM file, this gets a little tricky if the reads align to the minus strand, or if there are mismatches in the first couple of nucleotides in the read. I think the MarkDuplicates program must be collecting this information behind the scenes when it's finding duplicates. Thank you very much for your help.

I am using the best practices RNA-Seq pipeline for 6 libraries. Four have completed without any problem. Two (from the same project) have gotten snagged. The errors occur at "add or replace read groups" and at "mark duplicates." The errors:

Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Read name HWI-D00273:94:C6GFHANXX:8:1312:12804:32959, CIGAR M operator maps off end of reference

and

Exception in thread "main" net.sf.samtools.SAMFormatException: Did not inflate expected amount

I know picard tools is not part of GATK, but wondered if anyone has thoughts about what's going on. I have tried starting from scratch with trimmed reads, running cleansam, checking that all pairs are intact...nothing helps. I'm especially puzzled that the other libraries have no issues.

Created 2015-09-27 12:23:40 | Updated | Tags: picard markduplicates libraries

I was hoping this had been addressed already on the forum, but I've not seen a definitive answer although I have seen a similar question posed on this and other forums.

Our current mark duplicate procedure using Picard MarkDuplicates is to run merges across lane data generated from the same library. I believe this makes sense, and once duplicates are marked, then library level merges are combined to create a sample level, multi-library bam file. Any duplicates found across libraries would not be expected to be PCR duplicates but instead just identical fragments.

It's not clear though whether Picard MarkDuplicates is library aware....ie. when it does mark duplicates does it account for read pairs only from the same library, or if run against a bam merge generated from multiple libraries, will it mark any duplicates it finds.

I don't see this addressed in the documentation, so I assume that is not the case, but I have seen suggestions elsewhere that it might be so.

Created 2015-09-22 16:01:09 | Updated | Tags: bam gatk error markduplicates merged-bams

Hello,

I am relatively new to GATK and stuck on this problem.

After downloading the bam file that I wish to analyze, I am cutting portions of the file because I only wish to analyze variants at specific genes. Once I have the smaller files for each gene, I am using samtools merge to merge all of the smaller files back into one full sliced bam file. Next, I am realigning this file using bwa aln and sampe. After this step, I am attempting to use GATK best practices to mark the variants in these genes.

First, I am using GATK AddOrReplaceReadGroups to modify read group information as necessary. Then, I am using Picard's MarkDuplicates to mark duplicates in the re-aligned bam file. However, I get the following error.

    Exception in thread "main" net.sf.picard.PicardException: Value was put into PairInfoMap more than once. 156: TLL:HWI-ST1222:5:2308:12532:76745#0
        at net.sf.picard.sam.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:124)
        at net.sf.picard.sam.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:78)
        at net.sf.picard.sam.DiskReadEndsMap.remove(DiskReadEndsMap.java:61)
        at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:418)
        at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:161)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
        at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:145)

Looking at Picard's FAQ page (https://broadinstitute.github.io/picard/faq.html), it suggests that I try using Picard's MergeBamAlignment which also fails giving an error suggesting that the error is in the record ID of the bam file.

    Exception in thread "main" net.sf.picard.PicardException: Program Record ID already in use in unmapped BAM file.
        at net.sf.picard.sam.SamAlignmentMerger.<init>(SamAlignmentMerger.java:131)
        at net.sf.picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:226)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
        at net.sf.picard.sam.MergeBamAlignment.main(MergeBamAlignment.java:205)

Continuing on, trying to change the options on samtools merge to change the record ID's when merging together the single files into the full sliced file also failed.

Some more work into these issues found that using samtools fixmate and then Picard SortSam before calling MarkDuplicates resolves the issue, but the total amount of variants called by GATK is different when I run the sliced file through against a single gene file (for the same locations).

Are there any other options to explore to resolve this error?

Created 2015-09-03 07:52:38 | Updated | Tags: picard markduplicates

Hi there,

Hope someone can shed some light on this issue.

I have a problem running picard-tools MarkDuplicates. I get the error "No space left on device". After a bit of searching, I found people mention that it might be an issue with the tmpdir folder specified. However, the folder I'm using for tmpdir is massive (72GB). Looking a bit more at the error log, I found the line about retaining data points before spilling to disk.

It had a number that matched very closely to the number of records read before the error message. (28872640 vs 29,000,000)

    INFO 2015-09-03 15:53:32 MarkDuplicates Will retain up to 28872640 data points before spilling to disk.
    ...
    INFO 2015-09-03 15:55:50 MarkDuplicates Read 29,000,000 records. Elapsed time: 00:02:18s. Time for last 1,000,000: 4s. Last read position: chr7:39,503,936
    INFO 2015-09-03 15:55:50 MarkDuplicates Tracking 195949 as yet unmatched pairs. 13309 records in RAM.
    [Thu Sep 03 15:55:53 EST 2015] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 2.35 minutes.
    Runtime.totalMemory()=6107234304
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
        at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:245)
        at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:165)
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:281)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:114)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
    Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:318)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:127)
        at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100)
        at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137)
        at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:236)
        ... 6 more

I had a play around with the memory option of java (-Xmx??g) when I issue my MarkDuplicates call, and I see that an increase in memory increases the number of data points before spilling to disk. This in turn increases the number of records read before my "No space left on device" error.

eg -Xmx16g gave me 59674689 data points before spilling to disk and I got up to 60,000,000 records read before "no space left on device" error.

I know I can increase my memory to allow for more records, but there is a limit to doing that if I have a huge bam.

What I would like to know is what "Will retain up to 28872640 data points before spilling to disk." actually means. I thought it was a safeguard for memory usage, where if the number of records/data points is exceeded then some will be written to file, thus allowing more records to be read. This would mean you can still process a large bam with only a small amount of memory. But it does not seem to work that way from what I'm seeing.

2) I am running the MarkDuplicates script as follows:

java -Xmx2g -jar /home/.apps/picard-tools/1.119/MarkDuplicates.jar INPUT=inputfile.bam OUTPUT=outputfile.bam METRICS_FILE=stats/filename.sorted AS=TRUE REMOVE_DUPLICATES=false VALIDATION_STRINGENCY=SILENT TMP_DIR=temp/CREATE_INDEX=true

In the .out file I get the following error: OpenJDK 64-Bit Server VM warning: You have loaded library /home/.apps/picard-tools/1.119/libIntelDeflater.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.

And in the .err file I get:

    [Thu Aug 27 09:14:37 SAST 2015] picard.sam.MarkDuplicates INPUT=[home/anandie/BWA/7DLAetAln.bam] OUTPUT=7DLAetAln.sorted.markdup.bam METRICS_FILE=stats/7DLAetAln_Markdup.sorted REMOVE_DUPLICATES=false ASSUME_SORTED=true TMP_DIR=[temp/CREATE_INDEX=true] VALIDATION_STRINGENCY=SILENT PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Thu Aug 27 09:14:37 SAST 2015] Executing as anandie@n01.hpc on Linux 2.6.32-573.3.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_85-mockbuild_2015_07_15_13_21-b00; Picard version: 1.119(d44cdb51745f5e8075c826430a39d8a61f1dd832_1408991805) IntelDeflater
    [Thu Aug 27 09:14:37 SAST 2015] picard.sam.MarkDuplicates done. Elapsed time: 0.00 minutes.
    Runtime.totalMemory()=2058354688
    To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
    Exception in thread "main" htsjdk.samtools.SAMException: Cannot read non-existent file: /home/anandie/BWA/home/anandie/BWA/7DLAetAln.bam
        at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:280)
        at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:171)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
        at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

I have tried running ValidateSamFile in a similar manner and also get the same error.

Can someone please point me in the right direction? Is it Java or Picard tools causing the problem? Where can I start my search?
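The two oddities in the log above can be reproduced outside of Picard. The sketch below uses only the Python standard library and is an illustration, not a diagnosis of the cluster setup: the working directory `/home/anandie/BWA` is an assumption inferred from the doubled path in the exception. It shows how a relative INPUT value resolves against the working directory, and how the missing space before CREATE_INDEX leaves a single TMP_DIR argument.

```python
import posixpath
import shlex

# Assumption: the job ran from /home/anandie/BWA (inferred from the exception text).
working_dir = "/home/anandie/BWA"

# INPUT as echoed in the .err log: no leading slash, so it is a relative path.
input_arg = "home/anandie/BWA/7DLAetAln.bam"
resolved = posixpath.normpath(posixpath.join(working_dir, input_arg))
print(resolved)

# Tail of the posted command: there is no space between "temp/" and "CREATE_INDEX".
tail = "VALIDATION_STRINGENCY=SILENT TMP_DIR=temp/CREATE_INDEX=true"
tokens = shlex.split(tail)
print(tokens)

# Splitting each token at the first "=" shows what value TMP_DIR would receive.
options = dict(token.split("=", 1) for token in tokens)
print(options["TMP_DIR"])
```

Both results are consistent with the log: the exception names the doubled path, and the echoed options show TMP_DIR=[temp/CREATE_INDEX=true] alongside CREATE_INDEX=false. That points towards the INPUT path and the missing space as things to check first, rather than Java or the Picard installation.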

Created 2015-08-20 17:51:21 | Updated | Tags: picard markduplicates

Is it possible to retrieve Picard duplication metrics without running MarkDuplicates? I would like to get these metrics for a merged bam file where the original bams already have dups flagged, and did not go through the same PCR.
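One rough proxy, which is not the same as Picard's duplication metrics, is to count records that already carry the duplicate flag (0x400 in the SAM specification). The sketch below is a minimal Python illustration on SAM-formatted text lines. The function name `duplicate_fraction` and the toy records are hypothetical, and skipping unmapped, secondary and supplementary records is an assumption about what should be counted.

```python
# SAM flag bits, as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def duplicate_fraction(sam_lines):
    """Return (duplicates, examined, fraction) over mapped primary records.

    Hypothetical helper: a flag-based proxy, not Picard's PERCENT_DUPLICATION.
    """
    examined = 0
    duplicates = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        examined += 1
        if flag & DUPLICATE:
            duplicates += 1
    fraction = duplicates / examined if examined else 0.0
    return duplicates, examined, fraction


# Toy records, made up for illustration; only the FLAG column matters here.
toy = [
    "@HD\tVN:1.4\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",    # mapped, not duplicate
    "r2\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",  # 99 + 1024: duplicate
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",                # unmapped, skipped
    "r4\t2147\tchr1\t300\t60\t4M\t=\t200\t-104\tACGT\tIIII", # 99 + 2048: supplementary
]
result = duplicate_fraction(toy)
print(result)
```

On real data the lines could be streamed from `samtools view -h` on the merged BAM. Because the count simply reflects flags set earlier, it would respect the per-BAM marking already done, but it would not reproduce the library-level estimates in a Picard metrics file.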

Created 2015-02-03 22:23:52 | Updated | Tags: markduplicates

Hi, when I followed the GATK protocol, I ran the commands below:

bwa mem -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:C2U2AACXX' ucsc.hg19.fasta ../Unaligned/Project_DefaultProject/Sample_1/1_R1.fastq ../Unaligned/Project_DefaultProject/Sample_1/1_R2.fastq > sample1.sam

java -jar /data/software/picard/MarkDuplicates.jar INPUT=sample1.sam OUTPUT=sample1_dedup.bam SO=coordinate

I got an error like "ERROR: Unrecognized option: SO".

Why? How can I fix it?

Thanks, Min

Created 2014-12-05 15:12:16 | Updated 2014-12-05 15:12:48 | Tags: solid picard markduplicates

Hi,

I'm having trouble removing duplicates using Picard tools on SOLiD data. I get a regex not matching error.

The reads have the following names:

22_758_632_F3

604_1497_576

124_1189_1519_F5

358_1875_702_F5-DNA

And I don't think Picard tools is able to parse these read names with its default regex.

I tried to change the default regex. This time it does not throw an error, but it takes too long and times out (out of memory). I suspect I'm not giving the right regex. Here is my command:

java -jar $PICARD_TOOLS_HOME/MarkDuplicates.jar I=$FILE O=$BAMs/MarkDuplicates/$SAMPLE.MD.bam M=$BAMs/MarkDuplicates/$SAMPLE.metrics READ_NAME_REGEX="([0-9]+)_([0-9]+)_([0-9]+).*"

Any help is appreciated. Thanks!
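As a quick sanity check on the pattern itself, the sketch below tests a three-group regex against the read names listed above. It uses Python's `re` module rather than the Java regex engine that Picard runs on, so it only checks the shape of the pattern. Treating the three underscore-separated numbers as the fields the three capture groups should pick up is an assumption, not something the post confirms.

```python
import re

# Candidate pattern: three numeric fields separated by underscores, then anything.
# Assumption: these three fields are the ones the capture groups should pick up.
pattern = re.compile(r"([0-9]+)_([0-9]+)_([0-9]+).*")

read_names = [
    "22_758_632_F3",
    "604_1497_576",
    "124_1189_1519_F5",
    "358_1875_702_F5-DNA",
]

parsed = {}
for name in read_names:
    match = pattern.fullmatch(name)
    # Store the three captured fields as integers, or None if the name does not match.
    parsed[name] = tuple(int(g) for g in match.groups()) if match else None
    print(name, parsed[name])
```

If every name parses into three numbers here, the pattern's shape is at least plausible, and the long run time and memory exhaustion may have a different cause than the regex.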

Created 2014-08-14 06:06:29 | Updated | Tags: markduplicates pooled-calls

I was planning to call variants on my pooled bacteria sample using GATK UnifiedGenotyper and was wondering if removing duplicates should be in the best practices since with pooled samples, the reads might not be PCR duplicates but actually reads from different strains that I have in the pool. I would appreciate any input from your side. Thank you

Eva

Created 2014-05-21 20:44:45 | Updated | Tags: markduplicates

Hello, I am a graduate student in a lab that studies evolution, and I am relatively new to NGS. I have been given reads from pooled moth samples, and I am hoping to identify variants with the ultimate goal of quantifying the genetic differentiation between two strains of moths. I am wondering 1) if it is appropriate/recommended to remove duplicates with pooled data and 2) more broadly, whether there are particular situations in which removing duplicates is not suggested. For example, I have another data set in which the fragments were not generated by random shearing but rather by multiplex PCR of 17 particular amplicons for 42 different individual moths (not pooled). I'm guessing that removing duplicates doesn't make sense in this case because there will be lots of reads that start at the exact same position relative to the reference. Is this right?

Thanks a bunch!

Created 2014-03-18 14:12:40 | Updated 2014-03-18 14:13:47 | Tags: pipeline markduplicates lanes

Referring to broadinstitute.org/gatk/guide/article?id=3060, does duplicate removal need to be done twice, once per lane and then once per sample?

Is it not enough to just mark the duplicates in the final BAM file with all the lanes merged, which should remove both optical and PCR duplicates (I am using Picard MarkDuplicates.jar)? Specifically, with reference to the link above, what is wrong with generating:

• sample1_lane1.realn.recal.bam
• sample1_lane2.realn.recal.bam
• sample2_lane1.realn.recal.bam
• sample2_lane2.realn.recal.bam

Then, merging them to get

• sample1.merged.bam
• sample2.merged.bam

and finally, include "de-dupping" only for the merged BAM file.

• sample1.merged.dedup.realn.bam
• sample2.merged.dedup.realn.bam

Created 2013-11-20 06:41:02 | Updated 2013-11-20 06:41:55 | Tags: queue qscript picard markduplicates

Hi,

So I've finally taken the plunge and migrated our analysis pipeline to Queue. With some great feedback from @johandahlberg, I have gotten to a state where most of the stuff is running smoothly on the cluster.

I'm trying to add Picard's CalculateHSMetrics to the pipeline, but am having some issues. This code:

case class hsmetrics(inBam: File, baitIntervals: File, targetIntervals: File, outMetrics: File) extends CalculateHsMetrics with ExternalCommonArgs with SingleCoreJob with OneDayJob {
  @Input(doc="Input BAM file") val bam: File = inBam
  @Output(doc="Metrics file") val metrics: File = outMetrics
  this.input :+= bam
  this.targets = targetIntervals
  this.baits = baitIntervals
  this.output = metrics
  this.reference = refGenome
  this.isIntermediate = false
}

Gives the following error message:

ERROR 06:56:25,047 QGraph - Missing 2 values for function:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp' null 'INPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.bam'  'TMP_DIR=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp'  'VALIDATION_STRINGENCY=SILENT'  'OUTPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.preMarkDupsHsMetrics.metrics'  'BAIT_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'TARGET_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'REFERENCE_SEQUENCE=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/bwaindex0.6/exampleFASTA.fasta'  'METRIC_ACCUMULATION_LEVEL=SAMPLE'
ERROR 06:56:25,048 QGraph -   @Argument: jarFile - jar
ERROR 06:56:25,049 QGraph -   @Argument: javaMainClass - Main class to run from javaClasspath 

And yeah, it seems that the jar file is currently set to null in the command line. However, MarkDuplicates runs fine without setting the jar:

case class dedup(inBam: File, outBam: File, metricsFile: File) extends MarkDuplicates with ExternalCommonArgs with SingleCoreJob with OneDayJob {
  @Input(doc = "Input bam file") var inbam = inBam
  @Output(doc = "Output BAM file with dups removed") var outbam = outBam
  this.REMOVE_DUPLICATES = true
  this.input :+= inBam
  this.output = outBam
  this.metrics = metricsFile
  this.memoryLimit = 3
  this.isIntermediate = false
}

Why does CalculateHSMetrics need the jar, but not MarkDuplicates? Both are imported with import org.broadinstitute.sting.queue.extensions.picard._.

Hello,

I am having trouble calling variants using HaplotypeCaller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices Pre-Processing pipeline. The pre-processed data worked fine with UnifiedGenotyper. HaplotypeCaller, however, runs without errors and seems to walk across the genome, but only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference: when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.

Any help or guidance would be greatly appreciated!

Command Used for HaplotypeCaller:

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG

Commands Used for pre-processing (run in sequence using a Perl script):

java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf

java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf

java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam

samtools index exome.realigned.dedup

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate

java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam

Created 2013-02-04 14:08:41 | Updated | Tags: unifiedgenotyper gatk2 markduplicates

I have been using GATK (v2.2) UnifiedGenotyper to generate VCFs. I did a multisample realignment around indels, which generated a multisample BAM of size ~500GB. After looking at some of the SNP calls I decided to try removing duplicates. I used MarkDuplicates with "REMOVE_DUPLICATES=true" and although only 10% of reads were duplicates, the BAM was reduced to ~75GB. This did not seem to affect the depth of reads at a site more than the expected ~10%, but now the AD field in the genotype columns is missing, i.e. GT:AD:GQ 0/1:.:30. When I run UnifiedGenotyper with the old BAM prior to MarkDuplicates the AD field is present.

I am currently running MarkDuplicates on each sample prior to realignment because I think this makes the most sense, but it isn't clear why this should matter.

Any ideas on what is happening here?

Created 2013-01-24 22:01:36 | Updated | Tags: queue markduplicates

I was frustrated by the .metrics file from MarkDuplicates getting deleted as an intermediate file, so I set isIntermediate=false for that step in the DataProcessingPipeline. But now I'm getting tired of manually deleting the intermediate bams.

So my request is, could that field be changed from an @Output to an @Argument? This would be on line 50 of org.broadinstitute.sting.queue.extensions.picard.MarkDuplicates.scala. I also made that a required field in my local copy, since it is required to run the Picard tool.

A similar but opposite problem is that the bai file from the IndelRealigner step is not deleted - but that looks like it would require either special handling for that walker in Queue or for the index file to be an argument to the Java walker. Neither is a particularly appealing solution.

Created 2012-12-06 10:59:01 | Updated 2012-12-06 15:45:06 | Tags: realignertargetcreator markduplicates

I have the following questions about running the RealignerTargetCreator module in GATK 1.4.

1) Is it recommended to provide the target capture BED file to RealignerTargetCreator in the case of targeted/exome experiments? Without the BED file, the tool takes a long time (~6-7 hrs). What's the optimal way here?

2) Does running mark duplicates before or after RealignerTargetCreator have any effect on the number of SNPs/indels? What is recommended?