Tagged with #pipeline
0 documentation articles | 1 announcement | 7 forum discussions


No posts found with the requested search criteria.
Comments (2)

Register now for a spot at the upcoming GATK workshop, which will be held in Cambridge, MA on October 21-22.

http://www.cvent.com/events/broade-workshop-gatk-best-practices-building-analysis-pipelines-with-queue/event-summary-de1eaa027413404ba6dc04c128d52c63.aspx

This workshop will cover the following topics:

  • GATK Best Practices for Variant Detection
  • Building Analysis Pipelines with Queue

The workshop is scheduled right before ASHG Boston, so if you're going to be in town for the conference, make sure you come a couple of days early and attend the GATK workshop!

Comments (2)

Referring to broadinstitute.org/gatk/guide/article?id=3060, is removing duplicates necessary to be done twice, once per-lane and then per-sample?

Is it not enough to just mark the duplicates in the final BAM file with all the lanes merged, which should remove both optical and PCR duplicates (I am using Picard MarkDuplicates.jar)? So specifically, in the link above what is wrong with generating -

  • sample1_lane1.realn.recal.bam
  • sample1_lane2.realn.recal.bam
  • sample2_lane1.realn.recal.bam
  • sample2_lane2.realn.recal.bam

Then, merging them to get

  • sample1.merged.bam
  • sample2.merged.bam

and finally, include "de-dupping" only for the merged BAM file.

  • sample1.merged.dedup.realn.bam
  • sample2.merged.dedup.realn.bam
Comments (1)

I'm wondering if there's any way to skip the GATKCommandLine line in the vcf-header (in a vcf file generated by UnifiedGenotyper). I thought that the --remove_program_records would do this but it doesn't seem to do the trick. I'm still seeing the header line.

##GATKCommandLine=<ID=UnifiedGenotyper,Version=2.7-2-g6bda569,Date="Fri Sep 27 15:17:59 CEST 2013", [...] >

The reason this is important to me is that I'm using the Pipeline test code provided in Queue and, as you know, this is based on md5 sums, and as the time when the tests was run is included, the md5 hash changes for each run. So, if there is no way to skip the header, is there any other, better way to do this.

Cheers, Johan

Comments (9)

Hello all,

We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.

We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:

scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.

In this case, individual 3 has a homozygous reference genotype, despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous.Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.

A reverse example:

scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.

Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.

I've also uploaded screenshots of IGV if that I hope will help clarify the problems we're having. We have used default parameters of GATK in almost all cases, and we did not used VQSR, as we did not have a list of high confidence SNPs at the time.

Comments (5)

Hello I'm a developer in Korea. Recently, I have been developed about Bioinformatics pipeline. I'm using BWA, Samtools, Picard, GATK. And then I wanna make this tool on hadoop. The reason is why Using MR is efficient to speed or memory something like that. So, I know GATK is made by MR. If so, did you test GATK on MR? In theory, that is more efficient than just GATK.

And, If GATK needs indexed and sorted SAM, with using hadoop-BAM library do I just make index and sort??

Because I am novice in Bioinformatics, this issue is too complicated to me.

Sincere.

e-mail : leoniz127@gmail.com phone : +821027266808

Comments (3)

Hi to all

I have just started using GATK and I have few question about some tools and about the general workflow.

I have 3 exome-seq data from a trio and I have to detect rare or private variants that segregate with the disease.

From the 3 aligned bam file I procedeed with the GATK pipeline (ADDgroupInfo, MarkDup, Realign, BQSR, Unified Genotyper and variant filtration) and I generated 3 VCF file.

As now I have to use the PhaseByTrasmission tool, should I merge the 3 VCF file?

Or it was better to merge the BAM file after adding the group info and proceed with the other analysis?

And should I create my .ped file,(I visited http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped, but I couln't understand how ped file is generated) based on the read group that I have assigned?

Thanks!!!

Comments (1)

I have a pipeline someone gave me; in it, it runs the following obsolete GATK command:

java -Xms5g -Xmx5g -jar src/GenomeAnalysisTK-1.0.5506/GenomeAnalysisTK.jar -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -R data/ucsc.hg19.fasta --DBSNP data/hg19/snp131.rod -I runs/nwp/run-000/chrUn_gl000228.realn.bam -recalFile runs/nwp/run-000/chrUn_gl000228.recal.csv

How does the following differ from above?

java -Xms5g -Xmx5g -jar src/GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar -T BaseRecalibrator -I runs/nwp/run-000/chrUn_gl000228.realn.bam -R data/ucsc.hg19.fasta -knownSites data/dbsnp_135.hg19.vcf -o runs/nwp/run-000/chrUn_gl000228.recal.grp

Then there is another step at this stage of the pipeline:

java -Xms5g -Xmx5g -jar src/GenomeAnalysisTK-1.0.5506/GenomeAnalysisTK.jar -R data/ucsc.hg19.fasta -I runs/nwp/run-000/chrUn_gl000228.realn.bam -o runs/nwp/run-000/chrUn_gl000228.recal.bam -T TableRecalibration -baq RECALCULATE --doNotWriteOriginalQuals -recalFile runs/nwp/run-000/chrUn_gl000228.recal.csv

How does one run this last step map in 2.1-8 version of GATK?

Comments (13)

Hi there, I wanted to reproduce in my variant calling Queue script the same conditional you have in MethodsDevelopmenCallingPipeline, i.e. including InbreedingCoeff depending on the number of samples. However, in that script the number of samples is passed to the Target object as an integer, and I would like to count it from the bam file list passed as an input to the script.

Therefore I followed the method in DataProcessingPipeline, i.e.

import org.broadinstitute.sting.queue.util.QScriptUtils
[...]
@Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
var bamFile: File = _
[...]
val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size

But unfortunately, despite DataProcessingPipeline works just fine, when I put these lines in my other script I get the following error:

INFO  12:48:08,616 HelpFormatter - Date/Time: 2012/11/08 12:48:08 
INFO  12:48:08,616 HelpFormatter - ---------------------------------------------------------------------- 
INFO  12:48:08,616 HelpFormatter - ---------------------------------------------------------------------- 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.sting.utils.exceptions.DynamicClassResolutionException: Could not create module HaplotypeCallerStep because      Cannot instantiate class (Invocation failure) caused by exception null
at org.broadinstitute.sting.utils.classloader.PluginManager.createByType(PluginManager.java:306)
at org.broadinstitute.sting.utils.classloader.PluginManager.createAllTypes(PluginManager.java:317)
at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:126)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-2-gf44cc4e):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null
##### ERROR ------------------------------------------------------------------------------------------

I tried several alternatives looking at the imports in DataProcessingPipeline but maybe I am missing something. Could you please advise?

thanks very much Francesco