Tagged with #markduplicates
0 documentation articles | 0 events or announcements | 4 forum discussions


Sorry, there are no publicly available documents of this type with the tag #markduplicates. Try one of the other types.
Sorry, there are no publicly available documents of this type with the tag #markduplicates. Try one of the other types.

I've the following queries on running RealignerTargetCreator module in GATK1.4.

1) Is it recommended to provide the target capture BED file to RealignerTargetCreator in case of targeted/exome experiments? Without the bed file, the tool is taking long time (~6-7 hrs). What's the optimal way here?

2) Does running mark duplicates before or after 'RealignerTargetCreator' have any effect on the # of snps/indels? What is recommended?

Look forward to your comments. Raj

I have been using GATK (v2.2) UnifiedGenotyper to generate VCFs. I did a multisample realignment around indels which generated a multisample BAM of size ~500Gb. After looking at some of the SNP calls I decided to try removing duplicates. I used MarkDuplicates with "REMOVE_DUPLICATES=true" and although only 10% of reads were duplicates, the BAM reduced to ~75Gb. This did not seem to affect the depth of reads at a site more than the expected ~10% but now the AD field in the genotype columns is missing. ie GT:AD:GQ 0/1:.:30 When I run UnifiedGenotyper with the old BAM prior to MarkDuplicates the AD field is present.

I am currently running the MarkDuplicates on each sample prior to realignment - because I think this makes the most sense, but isn't clear why this should matter,

Any ideas on what is happening here?

Hello,

I am having trouble calling variants using Haplotype Caller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices Pre-Processing pipeline. The pre-processed data worked fine with UnifiedGenotyper but with HaplotypeCaller, though it runs without errors and seems to walk across the genome, only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference - when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.

Any help or guidance would be greatly appreciated!

Command Used for HaplotypeCaller:

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG

Commands Used for pre-processing (run in sequence using a Perl script):

java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf

java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf

java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam

samtools index exome.realigned.dedup

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate

java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam

I was frustrated by the .metrics file from MarkDuplicates getting deleted as an intermediate file, so I set isIntermediate=false for that step in the DataProcessingPipeline. But now I'm getting tired of manually deleting the intermediate bams.

So my request is, could that field be changed from an @Output to an @Argument? This would be on line 50 of org.broadinstitute.sting.queue.extensions.picard.MarkDuplicates.scala. I also made that a required field in my local copy, since it is required to run the Picard tool.

A similar but opposite problem is that the bai file from the IndelRealigner step is not deleted - but that looks like it would require either special handling for that walker in Queue or for the index file to be an argument to the Java walker. Neither is a particularly appealing solution.