Tagged with #gatk-best-practices
0 documentation articles | 0 announcements | 4 forum discussions

Comments (4)

I am pre-processing exome sequencing data according to the GATK Best Practices workflows, and I have a problem with the BQSR (base quality score recalibration) step. According to https://broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php, a dbSNP file needs to be provided. But there are many dbSNP files, such as All.vcf, common_all.vcf, and common_and_clinical.vcf. I have spent some time trying to understand these files, but I still do not know which one I should choose for BQSR. Could you please give me some advice? Thanks very much for your time.
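For context, here is a minimal Python sketch of where the known-sites file enters a GATK 3.x BaseRecalibrator call. It only assembles the argument list. The file names (reference.fasta, sample.bam, dbsnp.vcf, recal_data.table) and the helper name are placeholders, not taken from the post, and the sketch does not settle which dbSNP file is the right one.

```python
def base_recalibrator_cmd(reference, bam, known_sites, out_table,
                          gatk_jar="GenomeAnalysisTK.jar"):
    """Build a GATK 3.x BaseRecalibrator command as an argument list.

    All paths are hypothetical placeholders supplied by the caller.
    """
    cmd = ["java", "-jar", gatk_jar, "-T", "BaseRecalibrator",
           "-R", reference, "-I", bam]
    # -knownSites may be given more than once, one VCF per occurrence.
    for vcf in known_sites:
        cmd += ["-knownSites", vcf]
    cmd += ["-o", out_table]
    return cmd


cmd = base_recalibrator_cmd("reference.fasta", "sample.bam",
                            ["dbsnp.vcf"], "recal_data.table")
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run` once the placeholder paths are replaced with real ones.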

Created 2015-10-05 21:42:13 | Updated | Tags: gatk-best-practices addorreplacereadgroups picard-markduplicates
Comments (13)

I am using the best practices RNA-Seq pipeline for 6 libraries. Four have completed without any problem. Two (from the same project) have gotten snagged. The errors occur at "add or replace read groups" and at "mark duplicates." The errors:

Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Read name HWI-D00273:94:C6GFHANXX:8:1312:12804:32959, CIGAR M operator maps off end of reference


Exception in thread "main" net.sf.samtools.SAMFormatException: Did not inflate expected amount

I know Picard is not part of GATK, but I wondered if anyone has thoughts about what is going on. I have tried starting from scratch with trimmed reads, running CleanSam, and checking that all pairs are intact, but nothing helps. I am especially puzzled that the other libraries have no issues.
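As a reading aid for the first error, the sketch below shows the condition that message describes, as I understand it: an aligned block of the read (CIGAR M, = or X) ends beyond the last base of the contig it is mapped to. This is a simplified illustration, not Picard's implementation. The function name and the example positions and contig length are made up for the demonstration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
ALIGNED = set("M=X")          # operators that align read bases to the reference
REF_CONSUMING = set("MDN=X")  # operators that advance the reference position


def maps_off_reference(pos, cigar, ref_length):
    """Return True if any aligned block ends past the end of the contig.

    pos is the 1-based leftmost mapping position, as in the SAM POS field.
    """
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in ALIGNED and ref_pos + length - 1 > ref_length:
            return True
        if op in REF_CONSUMING:
            ref_pos += length
    return False


# Hypothetical 100 bp contig.
print(maps_off_reference(91, "10M", 100))     # block covers 91..100
print(maps_off_reference(95, "10M", 100))     # block covers 95..104
print(maps_off_reference(90, "5M3D5M", 100))  # second block covers 98..102
```

Soft-clipped bases (S) do not consume the reference, so a read that hangs over the contig end only in its clipped portion does not trigger the condition in this sketch.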

Created 2015-02-21 22:46:50 | Updated 2015-02-21 22:48:08 | Tags: combinevariants catvariants combinegvcfs gatk-best-practices
Comments (6)

Currently I am following the GATK Best Practices for HC 3.0+; however, I am splitting my calls by chromosomal region (-L). Below are the steps I perform up to GenotypeGVCFs, followed by my question.

1 - I use CatVariants (following HC) to merge all 25 per-chromosome gVCF files into a single gVCF file per individual.
2 - I use CombineGVCFs to merge 2 .. n individuals together. This is done because some analyses have 300+ individuals.
3 - I then use CombineGVCFs again to merge all the files from step 2 into one large gVCF file for one large joint GenotypeGVCFs step.
4 - GenotypeGVCFs is again run by chromosomal region (-L), followed by an additional CatVariants before VQSR.
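Steps 2 and 3 above amount to a two-level merge, which can be sketched in Python as follows. The sketch only plans GATK 3.x CombineGVCFs commands as argument lists. The batch size of 100, the file names (sampleN.g.vcf, batchN.g.vcf, cohort.g.vcf, reference.fasta) and the helper names are assumptions for illustration, not values from the post.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def combine_gvcfs_cmd(reference, gvcfs, out_gvcf,
                      gatk_jar="GenomeAnalysisTK.jar"):
    """Build a GATK 3.x CombineGVCFs command as an argument list."""
    cmd = ["java", "-jar", gatk_jar, "-T", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    cmd += ["-o", out_gvcf]
    return cmd


def plan_combines(reference, sample_gvcfs, batch_size):
    """Plan step 2 (one merge per batch) and step 3 (merge of the batches)."""
    cmds = []
    batch_outputs = []
    for n, batch in enumerate(chunk(sample_gvcfs, batch_size), start=1):
        out = f"batch{n}.g.vcf"  # hypothetical intermediate file name
        cmds.append(combine_gvcfs_cmd(reference, batch, out))
        batch_outputs.append(out)
    final = "cohort.g.vcf"       # hypothetical final file name
    cmds.append(combine_gvcfs_cmd(reference, batch_outputs, final))
    return cmds, final


samples = [f"sample{i}.g.vcf" for i in range(1, 301)]
cmds, final = plan_combines("reference.fasta", samples, batch_size=100)
print(len(cmds))  # 3 batch merges plus 1 final merge
```

The last command in the plan is the step 3 merge whose runtime the question below is concerned with.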

My question is this: given the size of the analysis, I have noticed that the CombineGVCFs run in step 3 can take anywhere from 4 to 8 hours. I was wondering whether I could change this step to use CombineVariants and get the same result (with no data lost). The main reason is that GATK currently allows CombineVariants to use the -nt option.

Thanks for your time and work.


Created 2014-03-18 14:12:40 | Updated 2014-03-18 14:13:47 | Tags: pipeline markduplicates lanes gatk-best-practices
Comments (2)

Referring to broadinstitute.org/gatk/guide/article?id=3060, does duplicate removal really need to be done twice, once per lane and then once per sample?

Is it not enough to just mark the duplicates in the final BAM file with all the lanes merged, which should remove both optical and PCR duplicates (I am using Picard MarkDuplicates.jar)? Specifically, with reference to the link above, what is wrong with generating:

  • sample1_lane1.realn.recal.bam
  • sample1_lane2.realn.recal.bam
  • sample2_lane1.realn.recal.bam
  • sample2_lane2.realn.recal.bam

Then, merging them to get

  • sample1.merged.bam
  • sample2.merged.bam

and finally, "de-dupping" only the merged BAM files:

  • sample1.merged.dedup.realn.bam
  • sample2.merged.dedup.realn.bam
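The alternative proposed above can be sketched in Python as follows: group the per-lane BAMs by sample, then build one Picard MarkDuplicates call per sample. The sketch relies on MarkDuplicates accepting the INPUT argument more than once, so merging and duplicate marking happen in a single call rather than as the two separate steps described in the post. The output and metrics file names and the helper names are hypothetical, and the sketch says nothing about whether per-lane de-dupping can safely be skipped.

```python
import re
from collections import defaultdict

# File name pattern taken from the examples in the post.
LANE_BAM = re.compile(r"^(?P<sample>.+)_lane\d+\.realn\.recal\.bam$")


def group_by_sample(bams):
    """Map each sample name to the list of its per-lane BAM files."""
    groups = defaultdict(list)
    for bam in bams:
        match = LANE_BAM.match(bam)
        if match is None:
            raise ValueError(f"unexpected file name: {bam}")
        groups[match.group("sample")].append(bam)
    return dict(groups)


def mark_duplicates_cmd(sample, lane_bams, picard_jar="MarkDuplicates.jar"):
    """Build one Picard MarkDuplicates call covering all lanes of a sample."""
    cmd = ["java", "-jar", picard_jar]
    cmd += [f"INPUT={bam}" for bam in lane_bams]
    cmd += [f"OUTPUT={sample}.merged.dedup.bam",           # hypothetical name
            f"METRICS_FILE={sample}.dedup.metrics.txt"]    # hypothetical name
    return cmd


bams = ["sample1_lane1.realn.recal.bam", "sample1_lane2.realn.recal.bam",
        "sample2_lane1.realn.recal.bam", "sample2_lane2.realn.recal.bam"]
for sample, lanes in sorted(group_by_sample(bams).items()):
    print(" ".join(mark_duplicates_cmd(sample, lanes)))
```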