# Tagged with #gatk 0 documentation articles | 1 announcement | 83 forum discussions


Heads up, cancer researchers! Appistry (our commercial licensing partner for GATK and now MuTect) is announcing an upcoming webinar on best practices for somatic mutation studies using GATK and MuTect. Registration for the webinar is open to all (not just Appistry customers) so be sure to sign up for this. See the announcement on Appistry's website for more detailed information.

Hi,

After using VQSR, I have a VCF output that contains sites labeled "." in the FILTER field. The VCF documentation (1000 Genomes) says those are sites where filters have not been applied. Is this correct? I would like to know exactly what these sites mean.

An example of such a site in my data is:

1 10439 . AC A 4816.02 . AC=10;AF=0.185;AN=54;BaseQRankSum=-4.200e-01;ClippingRankSum=-2.700e-01;DP=1690;FS=6.585;GQ_MEAN=111.04;GQ_STDDEV=147.63;InbreedingCoeff=-0.4596;MLEAC=17;MLEAF=0.315;MQ=36.85;MQ0=0;MQRankSum=-8.340e-01;NCC=0;QD=11.39;ReadPosRankSum=-8.690e-01;SOR=1.226 GT:AD:DP:GQ:PGT:PID:PL 0/1:22,14:36:99:0|1:10439_AC_A:200,0,825 0/0:49,0:49:0:.:.:0,0,533 0/0:92,0:92:0:.:.:0,0,2037 0/1:20,29:49:99:.:.:634,0,340 0/0:11,0:16:32:.:.:0,32,379 0/1:21,17:38:99:.:.:273,0,616 0/0:57,0:57:0:.:.:0,0,1028 0/0:58,0:58:0:.:.:0,0,1204 0/0:52,0:52:0:.:.:0,0,474 0/0:86,0:86:27:.:.:0,27,2537 0/1:13,24:37:99:.:.:596,0,220 0/1:14,34:48:99:.:.:814,0,263 0/0:86,0:86:0:.:.:0,0,865 0/0:61,0:61:0:.:.:0,0,973 0/0:50,0:50:0:.:.:0,0,648 0/0:40,0:40:0:.:.:0,0,666 0/0:79,0:79:0:.:.:0,0,935 0/0:84,0:84:0:.:.:0,0,1252 0/1:22,27:49:99:.:.:618,0,453 0/0:39,0:39:0:.:.:0,0,749 0/0:74,0:74:0:.:.:0,0,1312 0/1:13,18:31:99:.:.:402,0,281 0/0:41,0:44:99:.:.:0,115,1412 0/1:30,9:39:99:.:.:176,0,475 0/1:26,23:49:99:.:.:433,0,550 0/1:13,34:47:99:.:.:736,0,185 0/0:44,0:44:0:.:.:0,0,966
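For what it's worth, in the VCF format a "." in FILTER means no filter was applied at that site, as opposed to PASS (passed all filters) or a filter name (failed that filter). This can be checked mechanically; a minimal sketch with two made-up records, not real data:

```shell
# Two minimal, hypothetical VCF body records; column 7 is FILTER.
printf '1\t10439\t.\tAC\tA\t4816.02\t.\tAC=10\n'  > example_body.vcf
printf '1\t20000\t.\tG\tT\t50.0\tPASS\tAC=2\n'   >> example_body.vcf

# Tally the FILTER values ("." = filters not applied at that site).
cut -f7 example_body.vcf | sort | uniq -c
```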

Thanks, Alva

Hello

I am using PhaseByTransmission to phase variants called from a trio (mother, father and son) contained in a VCF file. It's my first time using this tool. Do I need to follow any convention when naming my samples? Do the sample names in the VCF file need to match the Family ID, the individual ID, or a combination of the two in the PED file? Do they need to be in the same order?
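For reference, a PED file is whitespace-delimited with six mandatory columns: Family ID, Individual ID, Paternal ID, Maternal ID, Sex (1 = male, 2 = female), and Phenotype. As I understand the convention (worth verifying against the PhaseByTransmission documentation), it is the Individual ID in column 2 that must match the sample names in the VCF header, and the sample order does not need to match. A hypothetical trio (all IDs made up):

```
FAM01  father01  0         0         1  1
FAM01  mother01  0         0         2  1
FAM01  son01     father01  mother01  1  1
```

Here the VCF header would need samples named father01, mother01, and son01.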

Hi,

I have discovered some unusual calls in my VCF file after using HaplotypeCaller (GATK version 3.3). I applied VQSR as well as the Genotype Refinement workflow (CalculateGenotypePosteriors, etc.) to refine the calls, but the unusual calls were not removed. I also calculated the number of Mendelian errors in just the biallelic SNPs in my final VCF file (using PLINK) and found an unusually high percentage for each of the 3 families I am studying: 0.153%, 0.167%, and 0.25%. The percentage of triallelic SNPs is also very high: 0.111%. Why are the error rates so high?

I used the following commands to call the variants and generate the initial VCF file:

# CatVariants (generate 1 vcf file with all inds and all chrs)

java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted

# VQSR

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches

# Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches

# Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf

# Genotype Refinement Workflow

java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf
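For clarity, the -G_filter step above annotates genotypes rather than removing them; the logic is roughly "mark any genotype whose GQ is below 20 as lowGQ". A toy sketch of that rule in plain awk on made-up GT:GQ pairs (this mimics the logic, it is not GATK itself):

```shell
# Made-up GT:GQ pairs standing in for per-sample genotype fields.
printf '0/1:99\n0/0:0\n1/1:14\n' > gts.txt

# Tag genotypes with GQ < 20 as lowGQ, mirroring -G_filter "GQ < 20.0".
awk -F: '{ print $0 "\t" ($2 < 20 ? "lowGQ" : "PASS") }' gts.txt
```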

Again, the first genotype in this example (an indel) passed VariantFiltration even though its depth was zero (2/2:0,0,0:0:PASS).

The entire example is below:

1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41 ./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0

Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high?
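Genotypes like the one quoted can be found mechanically. A sketch that flags sample fields whose DP is 0 but whose FT is PASS, assuming the GT:AD:DP:FT:... layout shown in the record above (subfield 3 = DP, subfield 4 = FT); the sample strings below are shortened copies for illustration, not real output:

```shell
# Shortened, hypothetical per-sample fields in GT:AD:DP:FT:GQ shape.
printf '2/2:0,0,0:0:PASS:22\n0/2:4,0,4:8:PASS:90\n2/2:0,0,1:1:lowGQ:5\n' > genotypes.txt

# Flag genotypes whose DP (field 3) is 0 yet whose FT (field 4) is PASS.
awk -F: '$3 == 0 && $4 == "PASS" { print "zero-depth PASS genotype: " $0 }' genotypes.txt
```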

Thank you very much in advance, Alva

Hi,

If I understand correctly, anomalous read pairs are those where only one read of the pair is mapped. I wonder whether GATK would ever use these reads to do the calling.

Thank you!

Hi, one minor change that would make GATK much easier to use with large data sets would be slightly more detailed error messages, specifically for cases where the error relates to a specific file.

For example, the following error message tells me that one of my GVCF files has been damaged or corrupted at some point, but not which one, which is rather important information: as a result, I now have to work my way through ~30 files to find the damaged one and replace it with a good copy. If GATK had reported the file name along with the error, I could have gone straight to that file and replaced it.

This has been a problem with the GVCF, VCF, & BAM files, and it would be greatly appreciated if it could be changed!
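As a stopgap for locating the damaged file by hand: block-compressed (bgzipped) VCFs and GVCFs are valid gzip streams, so each file can be integrity-tested directly. A sketch with made-up file names, including a deliberately truncated file to trigger the failure:

```shell
# Build one valid and one deliberately truncated .g.vcf.gz for the demo.
printf 'chr1\t1\t.\tA\tC\n' | gzip > sample1.g.vcf.gz
head -c 5 sample1.g.vcf.gz > sample2.g.vcf.gz   # truncated on purpose

# Report any file whose compressed stream does not decompress cleanly.
for f in *.g.vcf.gz; do
    gzip -t "$f" 2>/dev/null || echo "corrupt: $f"
done
```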

java.lang.RuntimeException: java.io.IOException: Unexpected compressed block length: 1 at htsjdk.tribble.readers.TabixIteratorLineReader.readLine(TabixIteratorLineReader.java:45) at htsjdk.tribble.TabixFeatureReader$FeatureIterator.readNextRecord(TabixFeatureReader.java:161) at htsjdk.tribble.TabixFeatureReader$FeatureIterator.(TabixFeatureReader.java:149) at htsjdk.tribble.TabixFeatureReader.query(TabixFeatureReader.java:124) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrack.query(RMDTrack.java:119) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createIteratorFromResource(ReferenceOrderedDataSource.java:241) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createIteratorFromResource(ReferenceOrderedDataSource.java:185) at org.broadinstitute.gatk.engine.datasources.rmd.ResourcePool.iterator(ResourcePool.java:93) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.seek(ReferenceOrderedDataSource.java:168) at org.broadinstitute.gatk.engine.datasources.providers.RodLocusView.(RodLocusView.java:83) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.getLocusView(TraverseLociNano.java:129) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:80) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.io.IOException: Unexpected compressed block length: 1 at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:377) at 
htsjdk.samtools.util.BlockCompressedInputStream.seek(BlockCompressedInputStream.java:292) at htsjdk.tribble.readers.TabixReader$IteratorImpl.next(TabixReader.java:382) at htsjdk.tribble.readers.TabixIteratorLineReader.readLine(TabixIteratorLineReader.java:43) ... 18 more ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: java.io.IOException: Unexpected compressed block length: 1 ##### ERROR ------------------------------------------------------------------------------------------

Hi,

I used CalculateGenotypePosteriors with the supporting file ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf, obtained from 1000 Genomes. It contains both indels and SNPs, and I was able to use it for the first step of the Genotype Refinement workflow. My question is: is it a problem that I didn't remove the indels from the supporting file? I would presume not: first, both the indels and the SNPs from Phase 3 of 1000G should be high confidence, and second, my recalibrated VCF file includes both indels and SNPs, so it is in my interest to supply as much information as possible. I just want to check whether my reasoning is correct.

Thanks, Alva

I'm sorry I keep posting these questions, but this error has me baffled!
##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to retrieve result at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) Caused by: java.lang.IllegalArgumentException: No data found. at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:399) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:143) at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226) at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183) ... 5 more ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. 
##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Unable to retrieve result ##### ERROR ------------------------------------------------------------------------------------------

Here is the input I use to get it:

java -d64 -Xmx8g -jar $GATKDIR/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-input k_family_out.vcf \
-R $REF \
-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 $HAPMAP \
-resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 $OMNI \
-resource:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0 $DBSNP \
-resource:mills,VCF,known=true,training=true,truth=true,prior=12.0 $MILLS \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum \
-mode BOTH \
-U ALL \
-recalFile k_family.recal \
-tranchesFile k_family.tranches \
-rscriptFile k_family.R \
-nt 6 \
--TStranche 100.0 --TStranche 99.9 --TStranche 99.5 --TStranche 99.0 \
--TStranche 98.0 --TStranche 97.0 --TStranche 96.0 --TStranche 95.0 --TStranche 94.0 \
--TStranche 93.0 --TStranche 92.0 --TStranche 91.0 --TStranche 90.0

I thought this was similar to a multithreading error, so I reduced my pipeline to execute on a single node, but the problem persists. Any help will be appreciated.

Hi,

With GATK 3.3-0 I am confronted with an error that was present in a much older version but seemed resolved about a year ago:

ERROR MESSAGE: There is no space left on the device, so writing failed

There is 8 TB left on the drive and no user quota. Sometimes re-running the exact same job works, sometimes not. Some jobs keep failing despite requesting an insane amount of memory on the cluster, even though these are RNA-seq BAM files, the largest being less than 7 GB.
For example: qsub -b y -cwd -N step3_145 -o step3_145.o -e step3_145.e -V -l h_vmem=40G /share/apps/java/oracle/1.8.0_11/bin/java -Xmx35G -jar /data/home/hhx037/GATK-3.3.0/GenomeAnalysisTK.jar -T SplitNCigarReads -R Homo_sapiens.GRCh37.75.dna.1-22XYMT.fa -I Analyses/file_dedup.bam -o Analyses/file_splittedcigar.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS Here is the log: INFO 10:50:51,568 HelpFormatter - Executing as hhx037@panos1 on Linux 2.6.32-431.1.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_11-b12. INFO 10:50:51,571 HelpFormatter - Date/Time: 2015/02/13 10:50:51 INFO 10:50:51,571 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:51,576 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:52,503 GenomeAnalysisEngine - Strictness is SILENT INFO 10:50:52,827 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 10:50:52,861 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 10:50:52,876 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 10:50:53,021 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 10:50:53,027 GenomeAnalysisEngine - Done preparing for traversal INFO 10:50:53,030 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:50:53,030 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:50:53,030 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime INFO 10:50:53,047 ReadShardBalancer$1 - Loading BAM index data INFO 10:50:53,050 ReadShardBalancer$1 - Done loading BAM index data INFO 10:51:23,404 ProgressMeter - 1:1477348 702953.0 30.0 s 43.0 s 0.0% 17.5 h 17.5 h INFO 10:52:32,660 ProgressMeter - 1:16909108 1202983.0 99.0 s 82.0 s 0.5% 5.0 h 5.0 h INFO 10:53:09,769 ProgressMeter - 1:21069702 1302985.0 2.3 m 104.0 s 0.7% 5.6 h 5.5 h INFO 10:53:49,083 
ProgressMeter - 1:27951393 1803181.0 2.9 m 97.0 s 0.9% 5.4 h 5.4 h INFO 10:54:29,275 ProgressMeter - 1:32739969 2103299.0 3.6 m 102.0 s 1.1% 5.7 h 5.6 h INFO 10:55:09,177 ProgressMeter - 1:36643589 2203300.0 4.3 m 116.0 s 1.2% 6.0 h 5.9 h INFO 10:55:45,643 ProgressMeter - 1:39854010 2303302.0 4.9 m 2.1 m 1.3% 6.3 h 6.2 h INFO 10:56:25,147 ProgressMeter - 1:40542516 2403303.0 5.5 m 2.3 m 1.3% 7.0 h 6.9 h INFO 10:57:10,934 ProgressMeter - 1:40654849 2503322.0 6.3 m 2.5 m 1.3% 8.0 h 7.9 h INFO 10:57:54,084 ProgressMeter - 1:43162895 2503322.0 7.0 m 2.8 m 1.4% 8.4 h 8.3 h INFO 10:58:24,149 ProgressMeter - 1:45244391 2703426.0 7.5 m 2.8 m 1.5% 8.6 h 8.4 h INFO 10:58:56,749 ProgressMeter - 1:53716450 2803427.0 8.1 m 2.9 m 1.7% 7.7 h 7.6 h INFO 10:59:38,928 ProgressMeter - 1:86821106 3103432.0 8.8 m 2.8 m 2.8% 5.2 h 5.1 h INFO 11:00:11,337 ProgressMeter - 1:93301870 3303437.0 9.3 m 2.8 m 3.0% 5.1 h 5.0 h INFO 11:01:13,113 ProgressMeter - 1:115252321 3803590.0 10.3 m 2.7 m 3.7% 4.6 h 4.5 h INFO 11:02:02,172 ProgressMeter - 1:145441389 4303778.0 11.2 m 2.6 m 4.7% 4.0 h 3.8 h INFO 11:02:38,237 ProgressMeter - 1:150547232 4703871.0 11.8 m 2.5 m 4.9% 4.0 h 3.8 h INFO 11:03:09,693 ProgressMeter - 1:153362937 5003904.0 12.3 m 2.5 m 5.0% 4.1 h 3.9 h INFO 11:03:39,934 ProgressMeter - 1:155984762 5403968.0 12.8 m 2.4 m 5.0% 4.2 h 4.0 h INFO 11:04:05,477 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. 
##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: There is no space left on the device, so writing failed ##### ERROR ------------------------------------------------------------------------------------------

I understand temporary files may be large, but not that large. Are the temporary files written in the working directory (as I believe should be the case), or are they written in the GATK installation directory? Also, note that I never ran into this problem with the previous version. Any idea?

Cheers, Stephane

Hi,

I have generated VCF files using GenotypeGVCFs; each file contains the variants for a different chromosome. I would like to use VQSR to recalibrate all these data combined (for maximum power), but VQSR seems to take only a single VCF file, so I would have to combine my VCF files using CombineVariants. Looking at the documentation for CombineVariants, this tool always produces a union of VCFs; since each of my files is chromosome-specific, there are no identical sites across files. My questions are: Is CombineVariants indeed the appropriate tool to merge chromosome-specific VCF files, and is there anything additional I should specify on the command line when doing so? Do I need to run VariantAnnotator afterwards? (I would assume not, since these VCFs were generated with GenotypeGVCFs and the best practices workflow more generally.) I just want to be completely sure that I am proceeding correctly.

Thank you very much in advance, Alva

Hi,

I am combining GVCF files into a single GVCF per chromosome, using CombineGVCFs, in order to run GenotypeGVCFs.
When I checked the first GVCF file generated by CombineGVCFs, I noticed that at each position all the genotypes were missing. For example, at position 16050036, this is what comes up in the final GVCF file:

22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0

But if we take one of the precursor GVCF files (one individual), we clearly see the genotype at that site:

22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0

The command I'm using to generate these files is:

java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L ${numchr}

where numchr is a variable defined previously (indicating the chromosome number).

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?
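For debugging, the GT subfield can be pulled out of every sample column to see the pattern at a glance. A sketch on a made-up, whitespace-separated line shaped like the merged record above (sample columns start at field 10, with GT as the first colon-separated subfield):

```shell
# Hypothetical merged record: fixed columns 1-9, then three sample columns.
printf '22\t16050036\t.\tA\tC,<NON_REF>\t.\t.\tDP=4\tGT:AD:DP\t./.:1,2,0:3\t./.:.:1\t./.:.:0\n' > line.vcf

# Print the GT subfield of each sample column.
awk '{ for (i = 10; i <= NF; i++) { split($i, a, ":"); print "sample " i-9 ": GT=" a[1] } }' line.vcf
```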

Thanks, Alva

Hi,

I ran VariantRecalibrator and ApplyRecalibration, and everything seems to have worked fine. I just have one question: if there are no reference alleles besides "N" in my recalibrate_SNP.recal and recalibrate_INDEL.recal files, and the "alt" field appears empty, does that mean that none of my variants were recalibrated? I just wanted to be completely sure. My original file (after running GenotypeGVCFs) has the same number of variants as the recalibrated VCFs.

Thanks, Alva

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families); each BAM file contains one chromosome for one individual. I was wondering whether it is best to pass all the files for a single individual together when running HaplotypeCaller (whether doing so will increase the accuracy of the calling), or whether I can just run HaplotypeCaller on each BAM file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Alva

I have a VCF in which I want to annotate the INFO DP field, matching on the rs IDs. According to the documentation, this should be doable via VariantAnnotator, using -A for the field I want.

java -jar GenomeAnalysisTK.jar \
-T  VariantAnnotator \
-R ref.fasta \
--variant input.vcf \
-L input.vcf \
-o output.vcf \
--dbsnp dbSnp.vcf \
-U LENIENT_VCF_PROCESSING \
-A DepthOfCoverage

It runs and annotates the INFO field with the DB annotation, but with no value. Why is the DP annotation not being added? Thanks!

Hi, I didn't find the answer anywhere, so I'm asking here. I'm using GATK 2.6-4. I used GenotypeConcordance to compare two datasets, and it was great, exactly what I needed. But then I was interested in the discordant sites, and it looked like printInterestingSites should be a good solution. I can't make it work, though: just adding "-sites filename" to my command line doesn't work, and I've tried to be creative in varying the option. Then I thought that maybe this option isn't available in my GATK version. Or have I just done something wrong?

Hi, I'm currently trying to use GATK to call variants from human RNA-seq data.

So far, I've managed to do variant calling in all my samples following the GATK best practice guidelines. (using HaplotypeCaller in DISCOVERY mode on each sample separately)

But I'd like to go further and get the genotype, in every sample, of each variant found in at least one sample. This is to differentiate, for each variant, samples where the variant is absent (homozygous for the reference allele) from samples where it is not covered (and therefore not genotyped).

To do so, I first used CombineVariants to merge the variants from all my samples and create the list of variants to be genotyped, ${ALLELES}.vcf. I then try to re-genotype my samples with HaplotypeCaller using the GENOTYPE_GIVEN_ALLELES mode and the same settings as before. My command is the following:

java -jar ${GATKPATH}/GenomeAnalysisTK.jar -T HaplotypeCaller -R ${GENOMEFILE}.fa -I ${BAMFILE_CALIB}.bam --genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles ${ALLELES}.vcf -out_mode EMIT_ALL_SITES -dontUseSoftClippedBases -stand_emit_conf 20 -stand_call_conf 20 -o ${SAMPLE}_genotypes_all_variants.vcf -mbq 25 -L ${CDNA_BED}.bed --dbsnp ${DBSNP}.vcf

In doing so I invariably get the same error after calling 0.2% of the genome.

##### ERROR stack trace

java.lang.IndexOutOfBoundsException: Index: 3, Size: 3 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

Because the problem seemed to originate in getAlternateAllele, I tried to play with --max_alternate_alleles, setting it to 2 or 10, without success. I also checked my ${ALLELES}.vcf file for malformed alternate alleles in the region where GATK crashes (chr 1, somewhere after 78 Mb), but I couldn't identify any (I searched for alternate alleles that do not match the extended regular expression '[ATGC,]+'). I would be grateful for any help you can provide. Thanks.

Hi,

I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former as my guide since it is more recent, but I am curious why the two differ. The first step is the same in both documents, but in the second step the output is a recal.bam file in the Brussels document instead of a post_recal.table file in the (howto) document. I am also confused because in the (howto) document the fourth step seems to be exactly the same as the second step, except that it generates a recal.bam file. Is the Brussels document presenting steps that are more efficient (since -BQSR is used only once to generate the recalibrated BAM file we need, as opposed to twice)?

Thanks, Alva

Hi,

I encounter the "RScript exited with 1. Run with -l DEBUG for more info" error when trying to make recalibration plots with Rscript. However, I do have all the required packages installed on my computer, including gsalib, which seems to be the package at issue. When I run with -l DEBUG, I get the following information:

Error in library(gsalib) : there is no package called ‘gsalib’

The thing is that I am using my computer to work on a cluster, so I am not running directly on my computer. Should the packages be installed on the cluster, or is it fine that I've installed them on my computer?
I'm a bit new to this, so I apologize if this question is a bit basic.

Thanks, Alva

Hi,

I used HaplotypeCaller in GVCF mode to generate a single-sample GVCF, but when I check my VCF file I see that the reference allele is not showing up:

22 1 . N <NON_REF> . . END=16050022 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
22 16050023 . C <NON_REF> . . END=16050023 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,37
22 16050024 . A <NON_REF> . . END=16050026 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,73
22 16050027 . A <NON_REF> . . END=16050035 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,110
22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0
22 16050037 . G <NON_REF> . . END=16050037 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,109
22 16050038 . A <NON_REF> . . END=16050039 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,153

I am not sure where to start troubleshooting, since none of the steps prior to HaplotypeCaller generated any obvious errors. The basic command I used was:

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Have you encountered this problem before? Where should I start troubleshooting?

Thanks very much in advance, Alva

Hi,

I want to use HaplotypeCaller to call families together. I have BAM files for each individual in the 4 families I am studying, as well as a PED file describing the pedigree information. The problem is that these families have complex pedigrees: the parents (mother and father), the children, and then one grandchild for each child (I do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all 4 families together to maximize the power of the calling.
However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work to feed it all my bams as inputs. Are there any other GATK tools that I could use to call complex pedigrees? The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but I'm not sure how to do that either with just -ped. What would you recommend? And finally, I get an error message saying that --emitRefConfidence only works in single-sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)? Thanks, Alva

Chaps, I am a newbie to GATK. I have started the best practices pipeline but am stuck at bwa. Where can I get the reference.fa file? The GATK bundle has .fasta, .bam and .fai files, but no .fa. Anyone, please!

I would like to obtain the coverage distribution on all exome intervals (human reference genome) using the BaseCoverageDistribution function in GATK. The problem I'm having is finding an appropriate reference file to associate with the -L argument. Specifically, the standard UCSC exome coordinate file doesn't have a format that GATK will accept (header and first row given below, as opposed to chr1:start-stop):

# name chrom strand cdsStart cdsEnd exonCount exonStarts exonEnds name2 exonFrames
NM_017582 chr1 - 154522913 154531029 13 154521050,154523413,154523872,154524246,154524396,154524568,154524879,154525211,154525507,154527210,154527903,154528335,154530702, 154522945,154523480,154523968,154524295,154524455,154524659,154524940,154525296,154525648,154527261,154528008,154528440,154531120, UBE2Q1 1,0,0,2,0,2,1,0,0,0,0,0,0,

Where can I get the fasta and ref files for experimenting on GATK?
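Not an official converter, but a minimal Python sketch (the function name and column handling are my own assumptions) of how the UCSC exonStarts/exonEnds columns above could be turned into the chr:start-stop intervals GATK expects. Note that UCSC exon coordinates are 0-based and half-open, while GATK intervals are 1-based and inclusive:

```python
def refgene_to_intervals(chrom, exon_starts, exon_ends):
    """Convert one refGene row's exonStarts/exonEnds columns (comma-separated,
    0-based half-open UCSC coordinates) into GATK-style 1-based inclusive
    'chrom:start-end' interval strings."""
    starts = [int(s) for s in exon_starts.rstrip(",").split(",")]
    ends = [int(e) for e in exon_ends.rstrip(",").split(",")]
    # Shift the 0-based start up by one; the half-open end is already the
    # last included base when read as a 1-based inclusive coordinate.
    return ["%s:%d-%d" % (chrom, s + 1, e) for s, e in zip(starts, ends)]

# First two exons of the NM_017582 row quoted above:
print(refgene_to_intervals("chr1", "154521050,154523413,", "154522945,154523480,"))
# ['chr1:154521051-154522945', 'chr1:154523414-154523480']
```

Whether the resulting chr-prefixed contig names match your reference depends on which fasta you use; b37-style references drop the chr prefix.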
I am wondering if it would be okay to use the "SelectVariants" tool to select mutations specific to the tumor sample in a tumor/normal pair. I tried the following command, and it seems to work. But I am not sure whether it distinguishes genotype differences (e.g. 0/1 in the tumor sample and 0/0 in the normal sample) from coverage differences (e.g. 0/1 in the tumor sample and ./. in the normal sample). Could someone help me understand "SelectVariants" in this regard? Thanks a lot! java -Xmx10g -Djava.io.tmpdir=$tempSpillFolder -jar $CLASSPATH/GenomeAnalysisTK.jar \ -R $GenomeReference \
-T SelectVariants \
--variant $file \ -o $OutputFolder/filename.selected.vcf \ -select 'set=="tumor"'

Hi! I am attempting to go through the process of applying a recalibration to an input vcf. I tried following the steps outlined in the tutorial Howto: Recalibrate Variant Quality Scores. Step 2 runs great; I used all the tranches and resources that the howto mentions. Unfortunately, when I attempt to apply the recalibration (step 3), I receive the following error: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC input @ 1:14907 Q523.77 of type=SNP alleles=[A*, G] attr={AC=1, AF=0.500, AN=2, BaseQRankSum=-1.863, ClippingRankSum=-0.420, DP=86, FS=0.984, MLEAC=1, MLEAF=0.500, MQ=36.93, MQ0=0, MQRankSum=0.338, QD=6.09, ReadPosRankSum=2.064} GT=GT:AD:DP:GQ:PL 0/1:57,29:86:99:552,0,1482 When I check the input file, I notice that this is the first variant call in my VCF. I tried this process with other VCF files, and each time ApplyRecalibration gets stuck on the first variant call. Here is the command I used to generate the recalibration: java -Djava.io.
-Xmx4g -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \ -T VariantRecalibrator -R /human_g1k_v37.fasta \ -L file.intervals \ -input input_file.vcf \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \ -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 Mills_and_1000G_gold_standard.indels.b37.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \ -an DP \ -an QD \ -an FS \ -an MQRankSum \ -an ReadPosRankSum \ -mode SNP \ -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ -recalFile recal_file.recal \ -tranchesFile tranches_file.tranches \ -rscriptFile rscipt_file.plots.R \

And here is my command for applying the recalibration: java -Xmx4g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R human_g1k_v37.fasta -input input_file.vcf -mode SNP --ts_filter_level 99.0 -recalFile recal_file.recal -tranchesFile tranches_files.tranches -o recalibrated_output.vcf Any suggestions to help me troubleshoot the issue are very welcome. I should also mention that I am using GATK 3.2.2. Thank you in advance!

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing). Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real, but we found no trace of the indel in the Sanger traces. When I look at the indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever, and the variant is not called in any other samples in this project.
indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538 missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125 I went back and recalled just this one sample (single-sample calling), which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, was called. I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode, and I'd like to have a better understanding of what is going on.

Hello, I'm currently working on creating a module file on my cluster for GATK. I set the prepend path to the directory in which I saved my .jar file, but when I attempt to use said file (command: java -jar GenomeAnalysisTK.jar -h), java cannot find it. It does, however, work when I am in the directory in which it is saved, but not when I add it as a module file. Any ideas? Thanks

Hi there, I have been using GATK to identify variants recently. I saw that BQSR is highly recommended, but I don’t know whether it is still appropriate for de novo mutation calling. For example, I want to identify de novo mutations generated in progenies by single seed descent methods in plants. As in the paper “The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana”, these spontaneously arising mutations may not be included in the known sites of variation. Based on the documentation posted on the GATK website, BQSR assumes that all reference mismatches we see are errors and indicative of poor base quality. Under this assumption, these de novo mutations may be missed in the variant calling step. So in this situation, what should I do? Should I skip the BQSR step? Also, what should I do when I reach the VQSR step? I hope some GATK developers can help me on this. Thanks.

Hi, I have performed BaseRecalibrator using GATK 3.2-2. The size of the bam file is double the size of realigned_fixmate.bam.
I read in one of the posts in this forum that it could be because the original and recalibrated quality scores are both retained, and that the latest version will by default exclude the original scores and keep only the recalibrated ones. I'm using the latest version and still get double the size of the original bam. Could someone comment on this? Thanks

Hi all, I tried to apply the following command to my raw vcf file to filter it using the filtering expression specified in the command: java -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4 -jar GenomeAnalysisTK.jar -T VariantFiltration -R human_g1k_v37.fa --variant human_g1k_v37.CHL124.vcf_snps.vcf -o CHL124.vcf_snps.vcf_filter_marked.vcf --filterExpression "QD < 2.0 && MQ < 40.0 && FS > 60.0 && HaplotypeScore > 13.0 && MQRankSum < -12.5 && ReadPosRankSum < -8.0" --filterName "very_small_SNPs_default_filter" After that, I checked my result file, CHL124.vcf_snps.vcf_filter_marked.vcf, and found that all sites are marked as "PASS", whether their QD is > 2.0 or < 2.0. Obviously, the command doesn't work, but I cannot find why; everything seems to go well, and no error is reported. bless~ XL

Hi, I suppose I have a problem with my bam file, but I don't know what I should look for. Any help appreciated. ##### ERROR stack trace java.lang.IllegalArgumentException: -1 does not represent a valid base character at org.broadinstitute.gatk.utils.genotyper.DiploidGenotype.createDiploidGenotype(DiploidGenotype.java:115) at org.broadinstitute.gatk.tools.walkers.genotyper.SNPGenotypeLikelihoodsCalculationModel.getLikelihoods(SNPGenotypeLikelihoodsCalculationModel.java:177) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoods(UnifiedGenotypingEngine.java:305) cf. attached file for the complete stack.

Hi all, I have been successfully using GATK tools for variant calling on exome sequences, but now I have to do it for a personal genome.
Since the genome has been sequenced in two runs, using 7 lanes per run, I now have 28 fastq files (paired-end reads (2) * 7 lanes * 2 runs). I haven't dealt with such a large number of files at once before. My suggested approach is to: 1) align, dedup, realign and recalibrate per lane (so I get 14 aligned, deduped, realigned and recalibrated bam files); 2) merge the bam files in order to produce a single bam file; 3) call variants on the single bam file using HaplotypeCaller. Do you think my approach is feasible? Or do you have any alternative approaches? Furthermore, what is the best tool to merge the bam files? Thanks in advance. Regards!

Hi, I am trying to do a base quality score recalibration of my BAM files. I am working with individual-based RADseq data where the reads were cleaned and aligned against a reference genome using bowtie2. The reference genome comes from a relatively closely related taxon (~25 Mio years of divergence); however, the species in this taxonomic family underwent a recent genome duplication, hence I am running bowtie2 with the -k option (which retains multiple alignments rather than randomly picking one if multiple matches of equal quality occur) and only retain the uniquely aligned reads for the downstream analyses. The downside to this is that the MAPQ values become artificial. After generating a recalibration data file, I run GATK PrintReads (version 3.1-1-g07a4bf8) as follows: java -Xmx16G -jar GenomeAnalysisTK.jar -T PrintReads -R Ref_genome.fasta -I Individual.bam -BQSR Library_recaldata.grp -o Individual.rc.bam Which yields the following error after running through 80% of the file: # ERROR MESSAGE: Exception when processing alignment for BAM index HWI-ST1206:32:C3VGTACXX:5:1104:14976:28487 84b aligned read.
Here are the reads in question (the flagged one is in the middle):

HWI-ST1206:32:C3VGTACXX:5:2313:18281:8513 16 chrUn 536921741 0 84M * 0 0 CCCAGAGTGCAAAGAACCTTGGCAGGACACTAGACAACACCCTGTTGTTTTCTGCAAACATCAAAGCAGTGACTCTGTCCTGCA DDDDEEEEEEFFFD@DHFGHHJHIIIIIGGJIJJIGGHHJJJJJJJJIJJJJJJIJJJJJIIJJJJJJJJJJJJJJJHHHHHFF AS:i:-29 XS:i:-35 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:11T18G1A14G28C7 YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:2314:6453:46729 16 chrUn 536921741 0 84M * 0 0 CCCAGAGTGCAAAGAACCTTGGCAGGACACTAGACAACACCCTGTTGTTTTCTGCAAACATCAAAGCAGTGACTCTGTCCTGCA DDDDEEEEEEFFFEDFHHHHHJJIJJIJIIIIIGJJIFFJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIHJJJJJHHHHHFF AS:i:-29 XS:i:-35 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:11T18G1A14G28C7 YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1104:14976:28487 16 chrUn 536961523 255 84M * 0 0 ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTACTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA DDDDDDDDDDDDDDDDDDDDDCEEEEEFFFFFHFFJJJJJJJJJIIHD8JJJJIHIJJJJJJIJJJJJJIJJIJJJJHHHHHFF AS:i:-9 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:20G27T35 YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1104:10760:63491 16 chrUn 536961523 255 84M * 0 0 ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTTCTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA DDDDCCCDDDDDDBCDCDCDC@DDEEECEEBFEFEHJJJJJJIHGJJIJJJJIHFJJJJIIJIJJIJJJIJJJJJJJHHHHHFF AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:20G63 YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1104:16496:63907 16 chrUn 536961523 255 84M * 0 0 ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTTCTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA DDDDDDDDDDDDDDDDDDCCCCDDEEECEEFFFFFHJJJIJJJJJJJJJJJJJIHJJJJJJJJJJIJJJJJJJJJIJHHHHHFF AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:20G63 YT:Z:UU RG:Z:70130.satrFamilyX

I subsequently verified my BAM files with Picard, which yields the following error # "bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned" for 84554 reads that are
almost consecutive in the same genomic region (chrUn - which is composed of sequences for which no chromosome anchoring information exist). However, the reads flagged by Picard occur about 2 million base pairs before the region in question for GATK and PrintReads seemed to have no problem with them. For the sake of completeness, here the beginning and end of the section that was flagged by Picard. It starts with the second of the following reads: HWI-ST1206:32:C3VGTACXX:5:2315:1142:72568 256 chrUn 997157147 2 84M * 0 0 TGCAGGACCACAGTCTTTTCACCAGCTCCTCAAAATGGTTGAAGAAGTAGTATTTTGGGGGCTCGTATAAACTCGTATGCCGTC FFHHHHHJJIJJJDFHHGIHIGIJJIJJJIIJJJIIJGIJIGIFHDHCGHCGGHIJIIIIEFDDBDDCDEDAAC?ABCDCCBB@ AS:i:-50 XS:i:-50 XN:i:0 XM:i:10 XO:i:0 XG:i:0 NM:i:10 MD:Z:2T41T8C7A7G0G1A2G3A2G1 YT:Z:UU RG:Z:70130.satrFamilyX HWI-ST1206:32:C3VGTACXX:5:1102:4043:34769 16 chrUn 997184926 255 84M * 0 0 TCGCCTACAAACAAAACCCCACCCATGCACACCTGATTGGTGCACGTAAAACTGGGAAAGAAGTTACCTCCCTGGAAGCCTGCA DDDDDEDDDDDDDB@BDDEECHHIIIJHHIHJIHJIGJJJJJJIIJIJJJJIJJJIJJJJJJJIGCJJIJJJJJIJJHHHHHFF AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:41T15G26 YT:Z:UU RG:Z:70130.satrFamilyX HWI-ST1206:32:C3VGTACXX:5:1105:2829:22242 16 chrUn 997184926 255 84M * 0 0 TCGCCTACAAACAAAACCCCACCCATGCACACCTGATTGGTGCACGTAAAACTGGGAAAGAAGTTACCTCCCTGGAAGCCTGCA DDDDDEDDDDDDB@?;DDEE?HHIIGIIHHGJIIJJJJJJJJIIJJJJJJJJJJJJJJJIJIIHE?JIGJJJJJJJJHHHHHFF AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:41T15G26 YT:Z:UU RG:Z:70130.satrFamilyX And stops at the following read (note the second read here is ok): HWI-ST1206:32:C3VGTACXX:5:2314:10564:44125 16 chrUn 1110931683 255 84M * 0 0 GCTAGCTGACTACTGTAGCCCCAGCCCTTATTAGCCTGCACTGTTATAAGGTGTTATGATTCACACACATTGAAATGGCCTGCA DCDDDCECCCCCBDFFFHHHHGCIHHHFGF@FGHFIIGGGHJIJJJJJJJIJJJJJIJJJIHFIIJIJJJJJJJJJJHHHGHFF AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:37A39A6 YT:Z:UU RG:Z:70130.satrFamilyX HWI-ST1206:32:C3VGTACXX:5:1101:4103:100272 256 chrUn 1110931763 0 84M * 0 0 
TGCAGGACAATGCCAGTGAATGGAGCCTAATTACCCTCTGTAATGAAATGTCACATTTTTGTTTCTGGGTCCTGAATAGATGTG FFHHHHHJJJJJJJJJGIIJJJJIJJJJJJIJJJJJJJJJJJJJJJJJJIJJJJIIJJJJIJJJJJJIHHHFHFFFFFFEEEEE AS:i:-35 XS:i:-35 XN:i:0 XM:i:6 XO:i:0 XG:i:0 NM:i:6 MD:Z:13A34A5G4C1C11A10 YT:Z:UU RG:Z:70130.satrFamilyX

I personally cannot find anything specific that would explain why PrintReads does not work. The pipeline that I use works perfectly when I use an assembly based on my RADseq data, yet not with the reference genome. Thus, I would like to ask if you could perhaps help me. Thanks!

Does GATK read the reference, such as hg19, from disk each time it runs a tool? Since GATK runs many tools (RealignerTargetCreator, IndelRealigner, BaseRecalibrator, ...) in a variant detection pipeline, reading hg19 from disk every time would be time-consuming...

Hi, I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read, so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

Hi, I have just started using Ubuntu. I want to install GATK for my work, but I can't find anything about how to install GATK via the Ubuntu Software Center or the terminal. Please help me. Thanks.

HI, I'd like to report a weird result from HaplotypeCaller. We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole-gene deletion of HNF1B, so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B. However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters).
Neither of these variants was called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in the HNF1B region were all homozygous, as we expected). Please see the VCF table: There are three things I want to highlight: 1. The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion and very low for the substitution. If there were a strand bias, it should affect both variants, given how close they are to each other. 2. The ReadPosRankSumTest score of the deletion didn't pass the filter because the deletion is always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3. The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 reads supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype. Then I looked into the bam files. It turns out that all the alternate alleles of the substitution came from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype. Therefore I'm really confused about why the substitution was given a 0/1 genotype by HaplotypeCaller and why it passed the filter. Many Thanks Betty

Hi, I am hitting the following error when I try to use UnifiedGenotyper. I'm wondering if it's a bug or something wrong with my bam file? Thanks for your help in advance.
##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalArgumentException: Illegal base [K] seen in the allele at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:205) at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:213) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.fillQualsFromPileup(RankSumTest.java:159) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.annotate(RankSumTest.java:113) at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateGenotypes(UnifiedGenotyperEngine.java:560) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:234) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNanoTraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNanoTraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at 
org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Illegal base [K] seen in the allele ##### ERROR ------------------------------------------------------------------------------------------ Cheers Scott..

Hi, How many samples, and of what file/genome size, can the UnifiedGenotyper process simultaneously? I will have about 250 samples, genome size 180Mb, bam sizes 3-5GB, reduced bam sizes ~1GB. Also, how long would this take, and can it be multi-threaded? Cheers, Blue

Hi, we observe an unexpected deletion call (GATK UnifiedGenotyper 3.0 and 2.5) in a setup with three samples called together. In the file call.txt you find the call. From the pileup (pileup.txt, position 19:57325555) we would expect that the indel would only be called for the first sample. Is there another source of information that makes GATK believe that the deletion also occurs in the second and third samples?
Thanks in advance, Johannes

In my Picard/GATK pipeline, I already include the 1000G_gold_standard and dbsnp files in my VQSR step, and I am wondering if I should further filter the final vcf files. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, downloaded from the GATK resource bundle. I recently came across the NHLBI exome seq data http://evs.gs.washington.edu/EVS/#tabs-7 and the more complete 1000G variants ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ These made me wonder if I should use these available VCFs to further filter my VCF files to remove the common SNPs. If so, can I use the "--mask" parameter in VariantFiltration of GATK to do the filtration? Example below copied from the documentation page: java -Xmx2g -jar GenomeAnalysisTK.jar \ -R ref.fasta \ -T VariantFiltration \ -o output.vcf \ --variant input.vcf \ --filterExpression "AB < 0.2 || MQ0 > 50" \ --filterName "Nov09filters" \ --mask mask.vcf \ --maskName InDel
Hello, I'm a little confused about the format for multi-sample SNP calling using UnifiedGenotyper. The example says: -I sample1.bam [-I sample2.bam ...] \ but the presence of the dots is confusing. Is the right format as in example a) or b)? a) -I sample1.bam [-I sample2.bam] [-I sample3.bam] [-I sample4.bam] b) -I sample1.bam [-I sample2.bam sample3.bam sample4.bam] I tried both ways and got an error both times: "invalid argument "[-I". Thanks

I have in a database 11 vcf and bam files for individuals we've sequenced. I have been trying to merge the 11 individual vcf files into one combined vcf file using CombineVariants in GATK. While it does combine the vcf files, it does something odd that I'm sure has been solved by other users, and I am looking for input on it. A singleton SNP in individual 1 will be given "./." in all other 10 individuals instead of "0/0". Is there a way to fix this? The genotypes are not missing; they are reference. That said, some of them will be missing and are rightly called "./.", but I don't know how to incorporate this information into a merged VCF file. Your help is most appreciated, and apologies if this has been asked before; I couldn't find this exact topic.

One of my samples has the entry "1/1:0,1:1:3:36,3,0" in the genotype field, and from my understanding the call is a homozygous variant, because it has 1/1. However, I do not understand why. Since it has 0 REF reads and only 1 ALT read, how does GATK tell that this variant is homozygous?
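For the genotype question above: the call is driven by the PL field rather than by AD alone. A toy Python sketch (my own code, not GATK's) of how a biallelic GT and GQ follow from PL, where PL lists the phred-scaled likelihoods of 0/0, 0/1 and 1/1 and the smallest value wins:

```python
def genotype_from_pl(pl):
    """Toy biallelic genotype call from a PL triple. PL holds phred-scaled,
    normalized genotype likelihoods in the order 0/0, 0/1, 1/1; the called
    genotype is the one with the smallest PL, and GQ is the gap to the
    second-smallest (capped at 99 in real GATK output)."""
    genotypes = ["0/0", "0/1", "1/1"]
    ranked = sorted(range(3), key=lambda i: pl[i])
    gq = min(99, pl[ranked[1]] - pl[ranked[0]])
    return genotypes[ranked[0]], gq

# The sample field 1/1:0,1:1:3:36,3,0 from the question has PL = 36,3,0:
print(genotype_from_pl([36, 3, 0]))  # ('1/1', 3)
```

For "1/1:0,1:1:3:36,3,0", PL = 36,3,0 makes 1/1 the likeliest genotype, but the GQ of 3 signals that the call is very low-confidence, which is the practical answer to "why 1/1 with a single read".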
I am using this command to calculate the depth of coverage, with three different formats for the EXOME interval list: java -Xmx2048m -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I /home/vsinha/software/bam/group1.READS.bam.list -L /home/vsinha/software/EXOME.interval_list -R /home/vsinha/software/human_g1k_v37.fasta -dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable --minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 --includeRefNSites --countType COUNT_FRAGMENTS -o /home/vsinha/software/group1.DATA

1st format:
1 69114 69332
1 69383 69577
1 69644 69940
1 69951 70028
1 566170 566275
1 801895 801983
1 802332 802416
1 861272 861420

Error: ERROR MESSAGE: Badly formed genome loc: Contig '1 69114 69332' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

2nd format:
1:69114-69332
1:69383-69577
1:69644-69940
1:69951-70028
1:566170-566275
1:801895-801983
1:802332-802416
1:861272-861420

Error: Badly formed genome loc: Contig '1' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?
3rd format:
chr1:69114-69332
chr1:69383-69577
chr1:69644-69940
chr1:69951-70028
chr1:566170-566275
chr1:801895-801983
chr1:802332-802416
chr1:861272-861420

Error: Contig chr1 not present in sequence dictionary for merged BAM header: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]

Can you please let me know what the problem is with my file? Thank you in advance. Regards Vishal

I have been asked to reach out to you to clarify the GATK license. We are hoping to package our CNV/SNP calling and annotation pipeline that includes GATK and make it available as an Amazon Machine Image (AMI) for use on the Amazon EC2 cloud, for anyone who wants to replicate our protocol using their own data. Of all the tools that we use, only GATK carries license restrictions. It is our intent that our AMI be used by academic/non-commercial users, and it is not intended to be sold. We can list the AMI in the AWS Marketplace as a Bring Your Own License (BYOL) instance. However, Amazon provides no checks that users do in fact have a license. Is this permissible?

Hi, I would like to attend the workshops by Broad on GATK, either online or at a seminar in New York. Let me know if there is anything planned for the future.
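On the interval-list question a few posts up: the recurring theme in all three errors is that the contig names in the -L file (and in the BAM header) must match the reference's sequence dictionary exactly — a b37-style reference uses '1', not 'chr1'. A small Python sketch (the helper name and contig set are my own assumptions) of normalizing interval lines against a known contig set:

```python
def normalize_interval(line, contigs):
    """Turn either 'chrom start end' (whitespace-separated) or 'chrom:start-end'
    into a 'chrom:start-end' string whose contig name exists in the reference
    sequence dictionary, represented here as a set of contig names."""
    line = line.strip()
    if ":" in line:
        chrom, span = line.split(":")
    else:
        chrom, start, end = line.split()
        span = "%s-%s" % (start, end)
    # b37-style references name contigs '1'..'22', 'X', 'Y', 'MT' without 'chr'.
    if chrom not in contigs and chrom.startswith("chr"):
        chrom = chrom[3:]
    if chrom not in contigs:
        raise ValueError("contig %r not in sequence dictionary" % chrom)
    return "%s:%s" % (chrom, span)

b37 = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}
print(normalize_interval("chr1:69114-69332", b37))  # 1:69114-69332
print(normalize_interval("1 69114 69332", b37))     # 1:69114-69332
```

Whitespace-delimited interval files additionally need the right extension and header conventions for GATK to parse them, so the colon-dash form is usually the simplest to debug.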
My exome-seq read quality is pretty good (120M reads), and the bwa alignment rate is more than 95%. I use GATK to do the variant calling; however, the vcf output from VQSR contains many './.' fields, which I think means not detected? I am wondering what the possible causes of ./. are. Are my conditions too stringent during the GATK run?

I was given reduced.bam files and asked to rerun GATK on our targeted sequencing project (both for learning purposes and to run some additional downstream analyses), and I'm having difficulty replicating exactly what the previous person did. I traced the root of the problem to the UnifiedGenotyper step: when comparing the vcf files output, I see some (not too many, but a few) discordances; namely, values for QUAL are different (although mostly correlated), and values for AD, DP and PL under INFO are different as well. Genotype calls and genotype qualities, however, are mostly concordant, except for a very few indel/MNP sites. The new run also lists a larger number of variants that pass the quality filtering. Comparing the old run (from log files) to my run, I can spot only a few differences, so I was wondering if you have any insights into whether any of these, or which, could cause the observed discrepancies. First, the old analyses were done in December 2012, and I think GATK 2.0 was used, whereas I use GATK 2.7. Is it possible that newer versions of GATK calculate the quality scores differently, particularly for indels? Second, the person who ran it before had scattered the intervals into 40 counts (and gathered them after UnifiedGenotyper), whereas I ran it as one piece. Also, he had -dcov 75 in his command line, whereas I omitted that. Any insight is much appreciated! Thanks a lot, Gulum

I am reading through the most recent workshop slides on Queue and trying to write a scala script to connect the GATK walkers.
However, I'm confused about how to use the output of the last walker as input for the next walker, especially when the last walker has multiple outputs. For example, I wrote the following script to connect RealignerTargetCreator and IndelRealigner, and I have a list of bam files as input to RealignerTargetCreator. I don't know whether I should have multiple outputs from RealignerTargetCreator, nor how to use those multiple outputs as input for IndelRealigner. My confusion is highlighted in the comments below:

def script() {
  val bams = QScriptUtils.createSeqFromFile(qscript.input)
  if (nContigs < 0) nContigs = QScriptUtils.getNumberOfContigs(bams(0))
  val baseName = ""
  val outputDir = if (qscript.outputDir.isEmpty()) baseName else qscript.outputDir + "/" + baseName

  val realigner = new RealignerTargetCreator
  realigner.reference_sequence = qscript.referenceFile
  realigner.input_file = bams
  realigner.out = new File(outputDir + "/Realigner") // should I have multiple outputs? how do you name individual output files?

  val indelRealigner = new IndelRealigner
  indelRealigner.input_file :+= realigner.out // do I need a for-loop to go through each input file from the last walker?
  indelRealigner.targetIntervals = swapExt(realigner.out, "bam", "intervals")
  indelRealigner.nWayOut = new File(".realigned.bam")

  add(realigner)
  add(indelRealigner)
}

I have the following entries in my vcf files output from VQSR. What does the "VQSRTrancheINDEL99.00to99.90" string mean? Did these variants fail the recalibration? PASS VQSRTrancheINDEL99.00to99.90 VQSRTrancheINDEL99.00to99.90 VQSRTrancheINDEL99.00to99.90 PASS VQSRTrancheINDEL99.00to99.90 PASS PASS VQSRTrancheINDEL99.90to100.00 VQSRTrancheINDEL99.90to100.00 VQSRTrancheINDEL99.90to100.00 PASS VQSRTrancheINDEL99.00to99.90 VQSRTrancheINDEL99.00to99.90 Below is the command I used: java -Xmx6g -jar $CLASSPATH/GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R GATK_ref/hg19.fasta \
-nt 5 \
--input ../GATK/VQSR/parallel_batch/combined_raw.snps_indels.vcf \
-mode INDEL \
--ts_filter_level 99.0 \
-recalFile ../GATK/VQSR/parallel_batch/Indels/exome.indels.vcf.recal \
-tranchesFile ../GATK/VQSR/parallel_batch/Indels/exome.indels.tranches \
-o ../GATK/VQSR/parallel_batch/Indels/exome.indels.filtered.vcf
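For later readers: the tranche strings are sensitivity bands rather than a simple pass/fail flag. A record labeled VQSRTrancheINDEL99.00to99.90 fell between the 99.0 and 99.9 tranches, so with --ts_filter_level 99.0 it is filtered out; only records with FILTER equal to "PASS" are within the requested sensitivity. A minimal sketch (the FILTER values are made up):

```python
# Sketch: with --ts_filter_level 99.0, records labeled
# VQSRTrancheINDEL99.00to99.90 or VQSRTrancheINDEL99.90to100.00 did NOT
# pass the chosen tranche; keeping only FILTER == "PASS" selects the
# calls that survived recalibration at that sensitivity level.
filters = ["PASS", "VQSRTrancheINDEL99.00to99.90", "PASS",
           "VQSRTrancheINDEL99.90to100.00"]
passing = [f for f in filters if f == "PASS"]
print(len(passing))  # → 2
```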

I would like to know: what are the common standards for filtering VCF files for exome sequencing? Do you filter by depth of coverage or anything else?
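As a point of comparison, when VQSR is not applicable the GATK docs suggest hard filters on site-level annotations (e.g. QD < 2.0, FS > 60.0); evaluating such a filter is just a predicate over the INFO fields. A toy sketch (the DP threshold is illustrative, not a recommendation):

```python
# Toy hard-filter check over INFO annotations. QD and FS thresholds follow
# commonly cited GATK examples; the DP cutoff is an arbitrary illustration.
def fails_hard_filter(info):
    return info["QD"] < 2.0 or info["FS"] > 60.0 or info["DP"] < 10

site = {"QD": 1.5, "FS": 6.5, "DP": 50}
print(fails_hard_filter(site))  # → True (QD below 2.0)
```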

With the vcf output from GATK, I used SelectVariants to select variants with the following conditions:

java -Xmx8g -jar $CLASSPATH/GenomeAnalysisTK.jar \
-T SelectVariants \
-R GATK_ref/hg19.fasta \
-nt 5 \
-V ../GATK/VQSR/parallel_batch/Indels/exome.indels.filtered.vcf \
--excludeNonVariants \
-o ../GATK/VQSR/parallel_batch/Indels/exome.indels.filtered.selected.vcf \
-selectType INDEL \
-select "DP > 30.0"

In the output file exome.indels.filtered.selected.vcf, however, I find some variants with DP < 30, for example: 1/1:0,3:3:12:113,12,0. The highlighted 3 is the DP. Does this mean SelectVariants did not work on my VCF?

I was expecting ApplyRecalibration to reduce the size of the VCF files output by HaplotypeCaller. Below are my command lines for VariantRecalibrator and ApplyRecalibration. I was wondering if I did anything wrong, or does the VCF file size not always get smaller? Any suggestions to improve my command lines? java -Xmx4g -Djava.io.tmpdir=/Volumes/tempdata1/tonywang/GATK_temp -jar $CLASSPATH/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R GATK_ref/hg19.fasta \
--input ../GATK/raw_variants_snps_indels-3.vcf \
-nt 6 \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 GATK_ref/hapmap_3.3.hg19.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 GATK_ref/1000G_omni2.5.hg19.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 GATK_ref/1000G_phase1.snps.high_confidence.hg19.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 GATK_ref/dbsnp_137.hg19.vcf \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP \
--maxGaussians 4 \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-log ../GATK/VQSR/log/raw_variants_snps-3_snps_recal.log \
-recalFile ../GATK/VQSR/SNPs/snps-3_snp.recal.vcf \
-tranchesFile ../GATK/VQSR/SNPs/snps-3_snp.tranches \
-rscriptFile ../GATK/VQSR/SNPs/snps-3_snp_recal.plots.R

java -Xmx6g -Djava.awt.headless=true -jar $CLASSPATH/GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R GATK_ref/hg19.fasta \
-nt 5 \
--input ../GATK/raw_variants_snps_indels.vcf \
-mode SNP \
--ts_filter_level 99.0 \
-recalFile ../GATK/VQSR/SNPs/snps-3_snp.recal.vcf \
-tranchesFile ../GATK/VQSR/SNPs/snps-3_snp.tranches \
-log ../GATK/VQSR/SNPs/filtered/snps-3_snp.recal_filtered.log \
-o ../GATK/VQSR/SNPs/filtered/snps-3_snp.recal_filtered.vcf

Hi, I have two draft genomes that I would like to compare, and I was wondering whether GATK supports such studies? Thanks for your help!

I'm using GATK 2.7-4 on my server with Java 1.7 on a Mac. I'm running many .bam files that were produced by the upstream PrintReads through ReduceReads, using a for-loop in a shell script. Some of these bam files were not compressed by GATK ReduceReads, which gives 0B .bam output files. The command line I use is the following:

java -Xmx10g -Djava.awt.headless=true -jar $CLASSPATH/GenomeAnalysisTK.jar \
-T ReduceReads \
-R ./GATK_ref/hg19.fasta \
-S LENIENT \
-log ../GATK/BQSR/log/file1.compress.log \
-I ../GATK/BQSR/file1.recal.bam \
-o ../GATK/BQSR/file1.compressed.bam

I'm copying the tail of the log for one of the failed .bam files. From this log, I don't see any error; maybe the ReduceReads walker just terminated early?

INFO 10:28:13,654 ProgressMeter - chr4:48710213 2.47e+07 19.2 m 46.0 s 23.6% 81.4 m 62.2 m
INFO 10:28:43,657 ProgressMeter - chr4:83322785 2.54e+07 19.7 m 46.0 s 24.7% 79.8 m 60.1 m
INFO 10:29:13,659 ProgressMeter - chr4:112743605 2.60e+07 20.2 m 46.0 s 25.6% 78.8 m 58.6 m
INFO 10:29:43,662 ProgressMeter - chr4:151294343 2.67e+07 20.7 m 46.0 s 26.8% 77.1 m 56.4 m
INFO 10:30:13,664 ProgressMeter - chr4:184999795 2.72e+07 21.2 m 46.0 s 27.9% 75.9 m 54.7 m
INFO 10:30:43,667 ProgressMeter - chr5:14341390 2.79e+07 21.7 m 46.0 s 28.6% 75.9 m 54.2 m
INFO 10:31:13,796 ProgressMeter - chr5:37088471 2.85e+07 22.2 m 46.0 s 29.3% 75.7 m 53.6 m
INFO 10:31:43,798 ProgressMeter - chr5:67907077 2.92e+07 22.7 m 46.0 s 30.3% 74.9 m 52.3 m
INFO 10:32:13,914 ProgressMeter - chr5:90107126 2.97e+07 23.2 m 46.0 s 31.0% 74.8 m 51.7 m
INFO 10:32:44,969 ProgressMeter - chr5:127450350 3.03e+07 23.7 m 46.0 s 32.2% 73.7 m 50.0 m

I recently upgraded from GATK 2.5 to the latest 2.7-4 stable release, but ReduceReads throws the following error when I try to run it with a BAM file that was produced by PrintReads (3 samples merged in one BAM file).

MESSAGE: Bad input: Reduce Reads is not meant to be run for more than 1 sample at a time except for the specific case of tumor/normal >pairs in cancer analysis

java -Xmx6g -Djava.awt.headless=true -jar $CLASSPATH/GenomeAnalysisTK.jar \
-T ReduceReads \
-R ../GATK_ref/hg19.fasta \
-I ../GATK/BQSR/all3Samples2.recal.bam \
-o ../GATK/BQSR/all3Samples.recal.compressed.bam

It used to work with the old version of GATK, but it does not work now. Where could it be wrong?

Hi there! I tried to analyse the depth of coverage of my exome data using GATK DepthOfCoverage:

java -Xmx${heap}m \
-Djava.io.tmpdir=$processed_bam/tmp_folder_dept \
-jar $gatk \
-T DepthOfCoverage \
-omitBaseOutput \
-omitLocusTable \
-R $GRCh37_ref_WholeGenome \
-I $file \
-o $coverage/${samplename}.coverage.dept

I am not interested in particular genomic regions, so I don't have a target list file. How were the intervals determined in the "_interval_statistics" output file? Where can I obtain an interval target list file including all exons in the human genome? Thank you very much!

Fulvio

Hello,

I'm currently using DepthOfCoverage with the -L option and a BED intervals file. I would like to get a gene report as well, but I'm not familiar with the RefSeq format for the gene list.

I work with HGNC and Ensembl genes. I could get a gene list file with the following format:

HGNC chr start end

But this doesn't seem to work. Any suggestions on how I could achieve that, or on how I could generate a valid gene list file from my current intervals?

Thank you.

root@GR0001:~# java -Xmx2g -Djava.io.tmpdir=/nG/Data/1265/vcf1/node1/9/AnalysisTemp/ -jar /nG/Process/Tools/GATK/GenomeAnalysisTK-2.7-1-g42d771f/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 3 -L 9 -R '/nG/Reference/CommonName/dog/FASTA/chrAll.fa' -I '/nG/Data/1265/vcf1/node1/9/Databind/9.bam' -o '/nG/Data/1265/vcf1/node1/9/Databind/9.bam.intervals'
INFO 11:10:17,730 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:10:17,732 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/21 23:02:55
INFO 11:10:17,733 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 11:10:17,733 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:10:17,737 HelpFormatter - Program Args: -T RealignerTargetCreator -nt 3 -L 9 -R /nG/Reference/CommonName/dog/FASTA/chrAll.fa -I /nG/Data/1265/vcf1/node1/9/Databind/9.bam -o /nG/Data/1265/vcf1/node1/9/Databind/9.bam.intervals
INFO 11:10:17,738 HelpFormatter - Date/Time: 2013/08/28 11:10:17
INFO 11:10:17,738 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:10:17,738 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:10:17,879 GenomeAnalysisEngine - Strictness is SILENT
INFO 11:10:18,189 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 11:10:18,198 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 11:10:18,325 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.13
INFO 11:10:24,257 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

I can't understand why this error is raised. Please check it; is it a bug?

Update: when I try with chromosome 9 (numerical), I have the problem. There is no problem for the other chromosome numbers (like 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, ...).

Dear GATK Team,

I am receiving the following error while running GATK 1.6. Unfortunately, for project consistency I cannot update to a more recent version of GATK and would at least wish to understand the source of the error. I ran ValidateSamFile on the input bam files and they appear to be OK.

Any insight or advice would be greatly appreciated:

##### ERROR ------------------------------------------------------------------------------------------

##### ERROR ------------------------------------------------------------------------------------------

Abbreviated commandline used:

GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -et NO_ET \ -R "Saccharomyces_cerevisiae/UCSC/sacCer2/Sequence/WholeGenomeFasta/genome.fa" \ -dcov 5000 -I "someFile.bam" --output_mode EMIT_ALL_SITES -gvcf -l OFF \ -stand_call_conf 1 -L chrIV:1-1531919

For some reason I don't know, I keep getting a "network error" when trying to download the latest GATK 2.6-5 from the broadinstitute.org webpage. It hangs at 12.3 MB out of the total size of 12.5 MB. So I was wondering if I could download the latest GATK 2.6-5 from the FTP server or Git instead? I need the pre-compiled version. Thanks.

I finally got the filtered VCF files from a BWA + Picard + GATK pipeline, and have 11 exome-seq data files which were processed as a list of inputs to GATK. In the process of getting the VCF, I did not see an option for separating the 11 samples. Now I've got two VCF files (one for SNPs and the other for indels), each with 11 samples. My question is how to proceed from here.

Should I separate the 11 samples before annotation, or annotate first and then split the 11 samples into individual files? The big question here is how to split the samples from the VCF files. Thanks.
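For later readers: GATK's SelectVariants can extract one sample per run via -sn. Conceptually, splitting a multi-sample VCF just means keeping the shared columns plus one sample column, as in this toy sketch (not a replacement for SelectVariants, which also handles the header and can drop sites that become non-variant):

```python
# Sketch: split one multi-sample VCF data line into per-sample lines,
# keyed by the sample names in the #CHROM header line.
def split_samples(header_line, data_line):
    samples = header_line.rstrip("\n").split("\t")[9:]   # sample columns start at index 9
    data = data_line.rstrip("\n").split("\t")
    fixed, calls = data[:9], data[9:]                    # CHROM..FORMAT are shared
    return {name: "\t".join(fixed + [call])
            for name, call in zip(samples, calls)}

hdr = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
rec = "chr1\t100\t.\tA\tG\t50\tPASS\tDP=20\tGT\t0/1\t1/1"
per_sample = split_samples(hdr, rec)
print(per_sample["S2"].split("\t")[-1])  # → 1/1
```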

I'm a bit confused about some of the output of GATK's DepthOfCoverage tool. In the *.GATK.covdepth.sample_summary file, what does the "total" column signify? Is it the total number of base pairs in the data? My raw read data has ~37.2 billion bp, yet GATK reports over 92 billion bp. Does 'total' mean something else, or will I have to rerun DepthOfCoverage?
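If it helps: the sample_summary "total" is the sum of per-locus depths over the territory the tool traversed (after its read filters), not the raw base count of the FASTQ data, so the two need not match. A toy illustration with made-up per-locus depths:

```python
# Sketch: "total" sums the aligned depth at each locus considered;
# "mean" is that total divided by the number of loci. Depths invented.
depths = [10, 20, 30, 40]     # depth at four loci
total = sum(depths)           # what the "total" column reports
mean = total / len(depths)    # what the "mean" column reports
print(total, mean)  # → 100 25.0
```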

I have 15 exome-seq samples, and have been using a BWA-Picard-GATK pipeline to do the variant calling. I did not realize GATK is so slow until I had to analyze this large number of samples. In the HaplotypeCaller step, each sample seems to take at least 2 days (48+ hours). Is this normal, or is there something I did wrong? Below is my command. Is there anything wrong, or is GATK HaplotypeCaller known to be this slow?

java -Xmx10g -Djava.awt.headless=true -jar /Library/Java/Extensions/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-minPruning 3 \
-dcov 10 \
-R ./GATK_ref/hg19.fasta \
-I ./GATK/BQSR/sample1_realign.recal.compressed.bam \
-o ./GATK/VQSR/sample1_realign.raw.snps_indels.vcf

Is there anything I did wrong with this command, or anything I can change to make it run faster? I am using GATK 2.5, by the way. Thanks!

Hi all,

I am comparing the variant calls from samtools and GATK. For samtools, I have been using quality score cut-offs of 100 for SNPs and 1000 for indels (quite stringent), and as a result many variants are excluded after filtering. For GATK, I have been using our default setting (99% sensitivity for SNPs and 95% sensitivity for indels) and included only the variants with FILTER field "PASS". I was wondering if there are any more stringent filters I can apply that would be roughly equivalent to the samtools quality-score thresholds, since it does not look like a fair comparison otherwise. Any suggestions will be appreciated.

Hi, I ran the exact same command line twice with the exact same files and parameters, and the output files are different:

gatk -T RealignerTargetCreator -nt 8 -R /home/ngs/data/tools/gatk/hg/broad_bundle_hg19_v2.2/ucsc.hg19.fasta -I $HOME/jobout/nogroupid_$JOB_ID//JFP0435_02_R2.JFP.lane5.120817FCA_sorted_remdup.bam --known /home/ngs/data/tools/gatk/hg/broad_bundle_hg19_v2.2//1000G_phase1.indels.hg19.vcf --known /home/ngs/data/tools/gatk/hg/broad_bundle_hg19_v2.2//Mills_and_1000G_gold_standard.indels.hg19.vcf --filter_mismatching_base_and_quals -o $HOME/jobout/nogroupid_$JOB_ID//JFP0435_02_R2.JFP.lane5.120817FCA_sorted_remdup.intervals

(GATK version 2.3.)

The sizes of the interval files are different, and when I run a 'diff' it shows some differences; not huge, but I wonder whether they are due to the algorithm.

here is the diff result:

158138c158138
< chr2:905714-905956
---
> chr2:905685-905956
452144c452144
< chr3:195511953-195512064
---
> chr3:195511916-195512064
461418c461418
< chr4:9241955-9242059
---
> chr4:9241966-9242059
605566,605567c605566,605569
< chr5:21481723-21482150
< chr5:21482294-21482726
---
> chr5:21481723-21481740
> chr5:21481909-21482150
> chr5:21482294-21482563
> chr5:21482690-21482726
605569c605571,605573
< chr5:21484233-21484258
---
> chr5:21483481-21483649
> chr5:21483821-21484114
> chr5:21484233-21484265
615246c615250
< chr5:34189680-34190048
---
> chr5:34189714-34190048
615248a615253
> chr5:34191846-34192088
909440,909441c909445
< chr7:100643452-100643460
< chr7:100643595-100643794
---
> chr7:100643452-100643794
1008760c1008764
< chr8:86572070-86572117
---
> chr8:86572070-86572085
1008763c1008767
< chr8:86573453-86573764
---
> chr8:86573453-86573707
1478683c1478687
< chr14:19553519-19553562
---
> chr14:19553519-19553559
1478828c1478832
< chr14:20019712-20019990
---
> chr14:20019712-20019951
1994633c1994637
< chrUn_gl000212:6736-6842
---
> chrUn_gl000212:6736-6843

Kind regards

Didier

Hi,

I have two SNP calls as shown below:

The first SNP is categorized as 1/1 and the second SNP as 0/1. For both SNPs, the variant allele ratios are 30/32 = 0.9375 and 30/33 = 0.909, which are approximately equal and above 0.9. On what criteria is one SNP determined to be 0/1 and the other 1/1?

As far as I know, both SNPs should be 1/1. Could anyone comment on the reason for this discrepancy?
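One point worth noting for readers: the genotype is chosen from the Phred-scaled genotype likelihoods (PL), not from the raw allele ratio, so two sites with similar ratios can receive different calls if their likelihoods differ. A toy sketch with invented PL values:

```python
# Sketch: pick the genotype with the best (lowest) Phred-scaled
# likelihood. PLs are ordered [hom-ref, het, hom-alt]; values invented.
def call_genotype(pl):
    names = ["0/0", "0/1", "1/1"]
    return names[pl.index(min(pl))]

print(call_genotype([900, 0, 50]))   # → 0/1 (het most likely despite high alt fraction)
print(call_genotype([1100, 80, 0]))  # → 1/1
```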

In working with a mapping that has very large coverage in certain places, I have noticed that the UnifiedGenotyper often calls SNPs even if the alternate allele is present in only a minuscule (<1%) fraction of the reads at a position. It is difficult to filter these SNPs out, because QUAL is high for these SNPs, and QD, which is low in these situations, is also low for many other (good) SNPs.

There is already a fractional threshold option for indel calling, -minIndelFrac. Is there any similar option for SNPs? If not, what is your reasoning for omitting this option, and what steps do you recommend for filtering out these SNPs?
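Pending an official option, one workaround is a downstream filter on the alternate-allele fraction derived from AD. A minimal sketch (the helper name and the 0.05 threshold are mine, not a GATK feature):

```python
# Sketch: drop calls whose alternate-allele fraction (from the AD
# "ref,alt" pair) falls below a chosen threshold. Data invented.
def alt_fraction(ad):
    ref, alt = (int(x) for x in ad.split(","))
    return alt / (ref + alt) if ref + alt else 0.0

calls = {"siteA": "995,5",   # 0.5% alt: likely a high-coverage artifact
         "siteB": "60,40"}   # 40% alt: plausible het
kept = {k: v for k, v in calls.items() if alt_fraction(v) >= 0.05}
print(sorted(kept))  # → ['siteB']
```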

Greetings GATK team!

I hope I'm not making a duplicate question here, but I couldn't find anything regarding this in the forum.

Basically, what I want to do is to use SelectVariants to filter against another call set, but I do not want to be as strict as using -discordance (i.e. 100% discordance rate between the two call sets). I want to say for example: "filter call set A against variants that occur in >90% of call set B".

Is there a way to do this with JEXL expressions maybe?
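One conceivable approach outside JEXL: compute, per site, the fraction of call set B samples carrying a non-reference genotype, then drop from A the sites above your chosen cutoff. A toy sketch (all names, the site keys, and the 0.9 cutoff are illustrative):

```python
# Sketch: "filter call set A against variants present in >90% of B".
# Sites are keyed by (chrom, pos, ref, alt); genotypes invented.
b_genotypes = {
    ("chr1", 100, "A", "G"): ["0/1", "1/1", "0/1", "1/1", "0/1",
                              "1/1", "0/1", "1/1", "0/1", "1/1"],  # 100% carriers
    ("chr1", 200, "C", "T"): ["0/1", "0/0", "0/0", "0/0", "0/0",
                              "0/0", "0/0", "0/0", "0/0", "0/0"],  # 10% carriers
}

def carrier_fraction(gts):
    if not gts:                       # site absent from B: treat as 0%
        return 0.0
    return sum(gt not in ("0/0", "./.") for gt in gts) / len(gts)

a_sites = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]
kept = [s for s in a_sites if carrier_fraction(b_genotypes.get(s, [])) <= 0.9]
print(kept)  # → [('chr1', 200, 'C', 'T')]
```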

Kind regards

Hello,

I have spent many hours trying to run GATK DepthOfCoverage, but it never works. Last try:

-T DepthOfCoverage -R /home/remi/Analyse/CNV/ERR125905/bam/Chloroplastgenomebarley.fa -I /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam -L /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bed -o /home/remi/Analyse/CNV/FishingCNV_1.5.3/out/ERfiltre.bam.coverage --minMappingQuality 15 --minBaseQuality 10 --omitDepthOutputAtEachBase --logging_level INFO --summaryCoverageThreshold 5 --summaryCoverageThreshold 7 --summaryCoverageThreshold 10 --summaryCoverageThreshold 15 --summaryCoverageThreshold 20 --summaryCoverageThreshold 30 --summaryCoverageThreshold 50

My BAM header seems to be malformed.

ERROR MESSAGE: SAM/BAM file /home/remi/Analyse/CNV/ERR125905/bam/ERfiltre.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

Here are the first lines of the header, followed by two reads:

@SQ SN:Chloroplastgenomebarley LN:136462
@PG ID:bwa PN:bwa VN:0.5.9-r16
ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4
ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??

I have searched the forum and docs about it. I have tried to reorder my header with Picard:

@HD VN:1.4 SO:unsorted
@SQ SN:Chloroplastgenomebarley LN:136462 UR:file:/home/remi/Analyse/REFGEN/Chloroplastgenomebarley.fa M5:7a7b36ef01cc1a2af1c8451ca3800f93
@PG ID:bwa PN:bwa VN:0.5.9-r16
ERR125905.35 99 Chloroplastgenomebarley 69543 29 101M = 69854 412 TTTGATCCCTCTGATCCTGTTCTGGATCCAATGTGGAGACAAGGTATGTTCGTAATTCCCTTCATGACTCGTTTAGGAATAACGGATCCTTGGGGTGGTTG D-:D?BDDDDCC-?ADCBBBDDDDD:BDD= :6 C-4<9@62@@<:?=B??B=DC28=B&?:AA:4
ERR125905.35 147 Chloroplastgenomebarley 69854 29 101M = 69543 -412 GGCTTTCTGTCGCTTGTGGGCTTTTCCTATAACGGCTTTTTATGTTCCTGGGATATGGGTATCCGATCCTTATGGACTAACTGGAAAAGTACAAGCTGTAA #################################################A-B49= @@2>+:CCC:@@ 66DD@-@DDD?B::@-CA:5?:ADD?ADBB??

but nothing changed.

Can someone help me, please?

Regards, Remi


Hi,

I am using GATK v2.4.9 and i am trying to liftover from one build to another. Below are the arguments for LiftoverVariants:

Arguments for LiftoverVariants:
-V,--variant Input VCF file
-o,--out File to which variants should be written
-chain,--chain Chain file
-dict,--newSequenceDictionary Sequence .dict file for the new build
-recordOriginalLocation,--recordOriginalLocation Should we record what the original location was in the INFO field?

Here is the command I used:

java -jar GenomeAnalysisTK.jar -T LiftoverVariants -chain canFam2ToCanFam3.over.chain -dict canFam3_dog_genome.dict -V canFam2_snp131.vcf -o canFam3_snp131.vcf

It showed an error message: ERROR MESSAGE: Walker requires a reference but none was provided.

The reference argument was not listed in the documentation, but the tool apparently needs one. If it is required, which reference file should I use: the new build or the old build?

Thanks.

Hi,

I am using the UnifiedGenotyper utility of GATK. I don't exactly know how I should create the dbSNP file it requires. Thank you very much for any help.

Regards, Homa

Hi,

I am trying to convert a dbSNP text file into a vcf file with the following command: java -jar GenomeAnalysisTK.jar -R canFam2_dog_genome.fa -T VariantsToVCF --variant:OLDDBSNP snp131.txt -o canFam2.vcf

Below is the message:

INFO 13:27:28,085 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:27:28,088 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-9-g532efad, Compiled 2013/03/19 07:35:36
INFO 13:27:28,088 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 13:27:28,088 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 13:27:28,095 HelpFormatter - Program Args: -R canFam2_dog_genome.fa -T VariantsToVCF --variant:OLDDBSNP snp131.txt -o canFam2.vcf
INFO 13:27:28,095 HelpFormatter - Date/Time: 2013/04/25 13:27:28
INFO 13:27:28,095 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:27:28,095 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:27:29,123 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:27:29,292 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 13:27:29,315 RMDTrackBuilder - Creating Tribble index in memory for file snp131.txt
INFO 13:28:05,113 IndexDictionaryUtils - Track /csc/lohi/dog_tools/GenomeAnalysisTK-2.4-9-g532efad/../../canFam3_ref_dogData/snp131.txt doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 13:28:05,125 RMDTrackBuilder - Writing Tribble index to disk for file GenomeAnalysisTK-2.4-9-g532efadsnp131.txt.idx
INFO 13:28:05,348 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files
INFO 13:28:05,365 GenomeAnalysisEngine - Done creating shard strategy
INFO 13:28:05,365 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:28:05,367 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 13:28:05,875 ProgressMeter - done 0.00e+00 0.0 s 5.9 d 100.0% 0.0 s 0.0 s
INFO 13:28:05,876 ProgressMeter - Total runtime 0.51 secs, 0.01 min, 0.00 hours
INFO 13:28:07,459 GATKRunReport - Uploaded run statistics report to AWS S3

Everything looks normal and no error message is shown, yet the log reports zero sites processed. Could anyone help me fix this?

Thanks

Hello,

I am using IndelRealigner, and I get this error:

ERROR MESSAGE: Badly formed genome loc: Contig ENSGALG00000024490|ENSGALT00000041403|3|651|852 given as location, but this contig isn't present in the Fasta sequence dictionary

I checked in the .dict and .bam files, indeed, the last line of .bam file which contains the above contig is missing from .dict file:

.bam file:
@SQ SN:ENSGALG00000024480|ENSGALT00000041393|22|309|366 LN:2220
@SQ SN:ENSGALG00000024481|ENSGALT00000041394|1|231|231 LN:1946
@SQ SN:ENSGALG00000024482|ENSGALT00000041395|25|342|342 LN:3194
@SQ SN:ENSGALG00000024483|ENSGALT00000041396|33|191|213 LN:1252
@SQ SN:ENSGALG00000024484|ENSGALT00000041397|139|203|288 LN:1146
@SQ SN:ENSGALG00000024487|ENSGALT00000041400|1|245|258 LN:294
@SQ SN:ENSGALG00000024488|ENSGALT00000041401|336|531|531 LN:2755
@SQ SN:ENSGALG00000024490|ENSGALT00000041403|3|651|852 LN:1693
@RG ID:1 PL:Illumina PU:3 LB:SX165_1.1 SM:1
@PG ID:bwa PN:bwa VN:0.6.2-r126

and

.dict file:
@SQ SN:ENSGALG00000024480|ENSGALT00000041393|22|309|366 LN:2220 UR:file:/proj/b2011033/private/assembly/GG_gap/scamelus_oases31_newbler_gg_gap.fa M5:a1d6d352f572b6c869d5d4196ba818b1
@SQ SN:ENSGALG00000024481|ENSGALT00000041394|1|231|231 LN:1946 UR:file:/proj/b2011033/private/assembly/GG_gap/scamelus_oases31_newbler_gg_gap.fa M5:9a90cffed9374c371cad87b962e2d9e0
@SQ SN:ENSGALG00000024482|ENSGALT00000041395|25|342|342 LN:3194 UR:file:/proj/b2011033/private/assembly/GG_gap/scamelus_oases31_newbler_gg_gap.fa M5:997affbb2c6e0e90de28446927c5b1cf
@SQ SN:ENSGALG00000024483|ENSGALT00000041396|33|191|213 LN:1252 UR:file:/proj/b2011033/private/assembly/GG_gap/scamelus_oases31_newbler_gg_gap.fa M5:d03d3553e20bb0fe30b87ac359a681f6
@SQ SN:ENSGALG00000024484|ENSGALT00000041397|139|203|288 LN:1146 UR:file:/proj/b2011033/private/assembly/GG_gap/scamelus_oases31_newbler_gg_gap.fa M5:1e9db17274efd2c147be4a339cba7ffd
@SQ SN:ENSGALG00000024487|ENSGALT00000041400|1|245|258 LN:294 UR:file:/proj/b2011033/private/assembly/GG_gap/scamelus_oases31_newbler_gg_gap.fa M5:8d2f3543543c8b3dfb62df3e89fa9791
@SQ SN:ENSGALG00000024488|ENSGALT00000041401|336|531|531 LN:2755 UR:file:/proj/b2011033/private/assembly/GG_gap/scamelus_oases31_newbler_gg_gap.fa M5:fe62d9e99cc89b45fe43d8522ab589cc

My question is: how is it possible that my .bam file contains a contig that does not exist in the .dict file, and how can I solve this problem? I used the Picard utility CreateSequenceDictionary to recreate the .dict file, but the problem is still there.
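A quick way to confirm this kind of mismatch is to compare the @SQ contig names of the two headers directly. A small sketch (inline header text stands in for the real files; in practice you would feed it `samtools view -H` output and the .dict contents):

```python
# Sketch: collect the SN: names from @SQ lines and diff the two sets.
def sq_names(header_text):
    names = set()
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t"):
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names

bam_hdr = "@SQ\tSN:contigA\tLN:100\n@SQ\tSN:contigB\tLN:200"
dict_hdr = "@SQ\tSN:contigA\tLN:100"
missing = sq_names(bam_hdr) - sq_names(dict_hdr)
print(missing)  # → {'contigB'}
```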

Thank you very much in advance for any help.

Homa- Uppsala University

Hi,

I was using GATK UnifiedGenotyper for SNPs/indels. My command is as follows:

java -Xmx16g -jar $gatkPath/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R $humanRefSequence \
-I "$sampleID".recal.bam \
--dbsnp $humanDbsnpFile \
-stand_call_conf 30.0 \
-stand_emit_conf 10.0 \
-dcov 250 \
-l INFO -log "$sampleID".GATKvariants.log \
-o "$sampleID".GATKvariants.raw.vcf \
--output_mode EMIT_ALL_SITES

When I am selecting Indels from the vcf file, it's producing an empty vcf file (which is working fine with SNPs). My command for selecting Indels are as follows:

java -Xmx10g -jar $gatkPath/GenomeAnalysisTK.jar \
-T SelectVariants \
-R $humanRefSequence \
--variant "$sampleID".GATKvariants.raw.vcf \
-o "$sampleID".GATKindels.raw.vcf \
-selectType INDEL \
-log "$sampleID".SelectIndels.log

Do I need to use "-glm BOTH" in UnifiedGenotyper to output both SNPs and indels? From the documentation, I understand that the default value of -glm is SNP. Many thanks in advance.

Hi all, I'm trying to perform BQSR on a bam file I have. Unfortunately I get this error:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: START (0) > (-1) STOP -- this should never happen, please check read: HWI-ST1296:110:C1P16ACXX:8:2116:15747:68300 2/2 101b aligned read. (CIGAR: 5D74M27S)

The culprit appears to be this read pair:

HWI-ST1296:110:C1P16ACXX:8:2116:15747:68300 147 chr2 230419662 24 5D74M27S = 230419667 -74 GAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGTGAAAGGAAGGGAAAAGAAAAAGGAAAGGAAGGCAATCCCTGCCCAGGTTCTTAATTTTC #####A>:3CC>;=>DC@;>66;.3@DAA>CA@77.)((./))@FBB.<<??/9*0*******00*?1***1*1)1)<)B3++2+2+22++2222+4++B= PG:Z:MarkDuplicates RG:Z:GB_L008.1 NM:i:9 AS:i:43 XS:i:45
HWI-ST1296:110:C1P16ACXX:8:2116:15747:68300 99 chr2 230419667 53 83M18S = 230419662 74 GAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAGAAAGGAAGGAAAAAGAGAAAGGGAAGGAAGGAAATTCATGCTCAGTATCTAATTTTTA ??;ADDDDHBF3DA=GBGB@D;EFC<3CDHBHICDEGEGIE=@@A@D@;C;;2?@7;=7>>;5>;;@B1<@CB<?>>::4++>@@>CC44>>@DC(:CA## PG:Z:MarkDuplicates RG:Z:GB_L008.1 NM:i:0 AS:i:83 XS:i:50

Program arguments were:

-R /lustre1/genomes/hg19/fa/hg19.fa -knownSites /lustre1/genomes/hg19/annotation/dbSNP-137.chr.vcf -I GB_dedup.realign.bam -T BaseRecalibrator --covariate QualityScoreCovariate --covariate CycleCovariate --covariate ContextCovariate --covariate ReadGroupCovariate --unsafe ALLOW_SEQ_DICT_INCOMPATIBILITY -nct 64 -o jobdir/GB.grp

I'm running the latest nightly build of GATK.
Any hint is very much appreciated.

Hi guys, I have Googled my problem with no luck, so I am asking you directly. I am currently testing an established pipeline on BAMs from a new source, so I advance step by step, and the last step, the calling with UnifiedGenotyper, seems to have trouble. My BAM went, in order, through Picard AddOrReplaceReadGroups, Picard MarkDuplicates, GATK RealignerTargetCreator, GATK IndelRealigner, Picard FixMateInformation, GATK BaseRecalibrator, and GATK PrintReads. Arrived at UnifiedGenotyper, this is my (stuck) output (I removed file names for privacy):

INFO 14:54:29,692 ArgumentTypeDescriptor - Dynamically determined type of /scratch/appli57_local_duplicates/reference/exome_target_intervals.bed to be BED
INFO 14:54:29,748 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:54:29,748 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-11-g13c0244, Compiled 2012/09/29 06:03:05
INFO 14:54:29,749 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:54:29,749 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 14:54:29,750 HelpFormatter - Program Args: -T UnifiedGenotyper -nt 6 -R /scratch/appli57_local_duplicates/reference/Homo_sapiens_assembly19.fasta -I /scratch/user/FILE.marked.realigned.fixed.recal.bam --dbsnp /scratch/appli57_local_duplicates/dbsnp/dbsnp_132.b37.vcf -L /scratch/appli57_local_duplicates/reference/exome_target_intervals.bed --metrics_file /scratch/user/FILE.snps.metrics -o /scratch/user/FILE.vcf
INFO 14:54:29,750 HelpFormatter - Date/Time: 2013/03/20 14:54:29
INFO 14:54:29,750 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:54:29,751 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:54:29,783 ArgumentTypeDescriptor - Dynamically determined type of /scratch/appli57_local_duplicates/dbsnp/dbsnp_132.b37.vcf to be VCF
INFO 14:54:29,799 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:54:29,906 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:54:29,943 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO 14:54:29,959 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/appli57_local_duplicates/dbsnp/dbsnp_132.b37.vcf
WARN 14:54:30,190 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A -- descriptions disagree; header has 'Allele Frequency' but standard is 'Allele Frequency, for each ALT allele, in the same order as listed'
INFO 14:54:32,484 MicroScheduler - Running the GATK in parallel mode with 6 concurrent threads

And it does not move from there. In my destination folder, a bamschedule.*.tmp file appears every 5 minutes or so, and in top, the program seems to be running:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29468 valleem 19 0 20.6g 3.9g 10m S 111.5 4.2 28:45.46 java

Can you help me?

Hi all, I would like to know which best practices are available for BQSR only and, before that, I need some detail on how BQSR works. In particular, suppose I have exome data for multiple samples and a set of intervals that covers all exome baits.

One way to perform BQSR is to run the BaseRecalibrator walker on the whole file, using all BAM files available, to produce a single covariate table; then I run PrintReads on each BAM file using that covariate table. Another way is to run in the very same fashion but feeding in the file containing intervals; this speeds things up because the walker doesn't have to traverse the whole genome space. A third, faster way is to run a single BaseRecalibrator process for each interval, which results in a number of covariate tables equal to the number of intervals; I then run the same number of PrintReads jobs and merge the results. If the last approach were sound, I could run really fast, but I'm afraid I may get biased covariate tables. I might apply the same procedure to whole genomes, choosing a proper set of intervals.

Hi,

I have used GATK for multi-sample SNP calling, and the total depth DP at a variant site does not always equal the sum of the individual sample depths. For example, for the site shown below, DP=24 in the INFO column, while summing the FORMAT-column depths gives 6 + 10 = 16 (summing the AD values, 3+3+8+2, also gives 16):

chr1 3015369 . C A 39.22 AC=4;AF=1.00;AN=4;DP=24;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=4;MLEAF=1.00;MQ=6.96;MQ0=21;QD=2.45;SB=-2.321e+01 GT:AD:DP:GQ:PL ./. 1/1:3,3:6:6:48,6,0 1/1:8,2:10:3:25,3,0

Could someone give an explanation for differences in depths?
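For reference: the INFO-field DP is the raw depth across all samples, including reads that were filtered out of individual genotype calls (and reads from the ./. sample), whereas each FORMAT DP is the filtered depth used for that sample's call, so the two sums can legitimately differ. A sketch over a simplified version of the record above (GQ and PL subfields omitted for brevity):

```python
# Sketch: compare the INFO DP to the sum of per-sample FORMAT DPs.
record = ("chr1\t3015369\t.\tC\tA\t39.22\t.\tDP=24\tGT:AD:DP\t"
          "./.\t1/1:3,3:6\t1/1:8,2:10")
fields = record.split("\t")
info_dp = int(dict(kv.split("=") for kv in fields[7].split(";"))["DP"])
fmt_keys = fields[8].split(":")
sample_dp = 0
for sample in fields[9:]:
    parts = sample.split(":")
    if len(parts) == len(fmt_keys):          # the ./. sample carries no DP
        sample_dp += int(parts[fmt_keys.index("DP")])
print(info_dp, sample_dp)  # → 24 16
```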

I have read this on GATK's documents:

Human sequence

If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references' canonical orderings. These are defined by historical karyotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error.

That says the reference order must be: chr1, chr2, chr3, ..., chr22, chrX, chrY, chrM. But after I downloaded the bundle from GATK's FTP site and checked the reference, it has the order:

>chrM
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr20
>chr21
>chr22
>chrX
>chrY
...
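A quick way to verify a reference's contig order is to inspect its FASTA index rather than the FASTA itself: the `.fai` file produced by `samtools faidx` lists contigs in file order. The snippet below fakes a tiny hg19-style index for illustration; with a real reference you would simply run `cut -f1 hg19.fa.fai`.

```shell
# Simulated first few lines of an hg19-style .fai (name, then length);
# a real one comes from: samtools faidx hg19.fa
printf 'chrM\t16571\nchr1\t249250621\nchr2\t243199373\n' > hg19.fa.fai

# Column 1 of the .fai is the contig name, in the order they appear
# in the FASTA, so this prints the contig ordering directly.
cut -f1 hg19.fa.fai
```

For the bundle's hg19 reference this prints chrM first, which matches the quoted documentation ("chrM appears first instead of last" for the hg1x references).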

Hi all, I'm running VariantRecalibrator on a SNP set (47 exomes) and I get this error:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.2-3-gde33222):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
##### ERROR ------------------------------------------------------------------------------------------

this is the command line:

java -Djava.io.tmpdir=/lustre2/scratch/  -Xmx32g -jar /lustre1/tools/bin/GenomeAnalysisTK-2.2-3.jar \
-T VariantRecalibrator \
-R /lustre1/genomes/hg19/fa/hg19.fa \
-input /lustre1/workspace/Ferrari/Carrera/Analysis/UG/bpd_ug.SNP.vcf \
-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 /lustre1/genomes/hg19/annotation/hapmap_3.3.hg19.sites.vcf.gz \
-resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 /lustre1/genomes/hg19/annotation/1000G_omni2.5.hg19.sites.vcf.gz \
-resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 /lustre1/genomes/hg19/annotation/dbSNP-137.chr.vcf \
-an QD \
-an HaplotypeScore \
-an MQRankSum \
-an FS \
-an MQ \
-an DP \
-an InbreedingCoeff \
-mode SNP \
-recalFile /lustre2/scratch/Carrera/Analysis2/snp.ug.recal.csv \
-tranchesFile /lustre2/scratch/Carrera/Analysis2/snp.ug.tranches \
-rscriptFile /lustre2/scratch/Carrera/Analysis2/snp.ug.plot.R \
-U ALLOW_SEQ_DICT_INCOMPATIBILITY \
--maxGaussians 6

I've already tried decreasing the --maxGaussians option to 4, and I've also added the --percentBad option (setting it to 0.12, as for indels), but I still get the error. I added the -debug option to see what's happening, but apparently it has been removed from GATK 2.2. Any help is appreciated... thanks

Hello, In a talk on MuTect from a while ago, I found a reference to filtering out "noisy" reads. This seems like a great idea, and I would like to do it for other standard GATK-based variant calling as well. How exactly is this kind of "noisy read" filtering done? From the graphic in the talk, it looks like reads with many mismatches relative to the others in a region, after indel realignment, are candidates for being noisy reads. How exactly is this defined, and how could I do something similar with GATK? Thanks, John
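One common, if crude, proxy for a read's noisiness is its edit distance to the reference, which most aligners record in the NM tag. The sketch below is not MuTect's actual heuristic, just an illustration of the idea: the toy SAM records and the max_nm threshold are made up, and with real data you would pipe `samtools view in.bam` through the same awk (printing the whole record and handling header lines) rather than a hand-written file.

```shell
# Two made-up SAM alignment records: r1 has edit distance 1, r2 has 7.
printf 'r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*\tNM:i:1\nr2\t0\tchr1\t120\t60\t50M\t*\t0\t0\t*\t*\tNM:i:7\n' > toy.sam

# Keep only reads whose NM (edit distance) is at or below max_nm;
# optional tags start at SAM column 12. Here we print the surviving
# read names to show which alignments pass the filter.
awk -F'\t' -v max_nm=3 '{
  keep = 1
  for (i = 12; i <= NF; i++)
    if ($i ~ /^NM:i:/ && substr($i, 6) + 0 > max_nm) keep = 0
  if (keep) print $1
}' toy.sam
```

With max_nm=3 this keeps only r1. Note that NM also counts mismatches caused by real variants, which is why a per-region, relative criterion (as the talk's graphic suggests) is more principled than a fixed global cutoff.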