# Tagged with #gatk 1 documentation article | 0 announcements | 85 forum discussions

Created 2015-07-06 18:30:35 | Updated 2015-07-07 15:32:56 | Tags: queue gatk build

TL;DR: mvn -Ddisable.shadepackage verify

### Background

In addition to Queue's GATK-wrapper codegen, relatively slow scala compilation, etc. there's still a lot of legacy compatibility from our ant days in the Maven scripts. Our mvn verify behaves more like when one runs ant, and builds everything needed to bundle the GATK.

As of GATK 3.4, by default the build for the "protected" code generates jar files that contains every class needed for running, one for the GATK and one for Queue. This is done by the Maven shade plugin, and are each called the "package jar". But, there's a way to generate a jar file that only contains META-INF/MANIFEST.MF pointers to the dependency jar files, instead of zipping/shading them up. These are each the "executable jar", and FYI are always generated as it takes seconds, not minutes.

### Instructions for fast compilation

While developing and recompiling Queue, disable the shaded jar with -Ddisable.shadepackage. Then run java -jar target/executable/Queue.jar ... If you need to transfer this jar to another machine / directory, you can't copy (or rsync) just the jar, you'll need the entire executable directory.

# Total expected time, on a local disk, with Queue:
#   ~5.0 min from clean
#   ~1.5 min per recompile

# always available
java -jar target/executable/Queue.jar --help

java -jar target/package/Queue.jar --help

If one is only developing for the GATK, skip Queue by adding -P\!queue also.

mvn -Ddisable.shadepackage -P\!queue verify

# always available
java -jar target/executable/GenomeAnalysisTK.jar --help

java -jar target/executable/Queue.jar --help
No posts found with the requested search criteria.

Created 2016-02-02 19:26:18 | Updated | Tags: gatk gvcf merging

Hi Team,

I need to know something and hopefully is simple to implement. I have to run GATK HaplotypeCaller on a large BAM file, thus I have to run this in batches of 4 hours top. I've done some test to determine the appropriate size for the subsampled BAM. So let's say I need to run 10 jobs. Each one of those will output a gVCG file, which all belong to the same individual in this experiment. I have read that there are several methods to parse and merge VCF files into a single one like CatVariants, CombineGVCFs and CombineVariants. The question is that I'm inclined to use CombineGVCFs since is the output I have, but I also have the understanding that this is for merging different individuals which is not what I have to do. So which approach should I use? Thanks for the inputs!!! Alejandro

Created 2016-01-29 21:19:34 | Updated | Tags: depthofcoverage gatk

Hi,

I was comparing the output from GATK DepthOfCoverage and samtools mpileup and noticed that for chromosome 1, for instance, the depth values differ per individual:

GATK DepthOfCoverage sample (4th column is for individual of interest):

1:10001 1056 39.11 43 1:10002 1973 73.07 91 1:10003 2728 101.04 120

samtools mpileup (the last field is for the individual of interest):

1 10001 . T 0 . DP=298;I16=246,10,0,0,8031,252873,0,0,266,4956,0,0,216,3484,0,0;QS=8,0;MQSB=0.882047;MQ0F=0.855705 PL:DP:DV:SP:DPR 0,120,16:40:0:0:40,0 1 10002 . A 0 . DP=605;I16=492,13,0,0,15628,487360,0,0,583,11687,0,0,480,4056,0,0;QS=8,0;MQSB=0.82856;MQ0F=0.86281 PL:DP:DV:SP:DPR 0,244,26:81:0:0:81,0 1 10003 . A C, 0 . DP=862;I16=727,19,1,1,22640,694394,40,808,1252,27124,0,0,1102,6936,4,8;QS=7.98193,0.018075,0;VDB=0.02;SGB=-2.77779;RPB=0.485255;MQB=0.873995;MQSB=0.836835;BQB=0.0589812;MQ0F=0.845708 PL:DP:DV:SP:DPR 0,255,48,255,51,48:112:1:0:111,1,0

Any thoughts as to what could be happening?

Thanks, Alva

Created 2016-01-29 02:39:02 | Updated | Tags: gatk mutect2 gatk3-5

Using the standard mutect2 commands from the docs I get the error seen below. My reference file dictionaries were all made automatically from mutect version 1.4. Is there a workaround for this md5 problem. I tried an awk command to take out the m5 column from the fast.dict file, but that didnt work. Is there a flag that I'm not seeing to avoid this MD5 column check because for my purposes if the length is the same its likely that they are the same contig? Any help would be greatly appreciated.

##### ERROR contig reference is named 3 with length 198022430 and MD5 641e4338fa8d52a5b781bd2a2c08d3c3.

Created 2016-01-28 08:48:49 | Updated 2016-01-28 09:12:56 | Tags: fastareference dbsnp gatk mutect2

I have a cancer rna-seq bam file that I would like to use to call variants using MuTect2, but I ran into a couple of errors that I would appreciate some assistance with. Thank you.

1. The chr notations in my bam file is absent in the reference fasta (Homo_sapiens_assembly19.fasta). I'd rather not change my bam file header using sed and samtools reheader and rather append chr notation into the reference file. Will this mess up the corresponding snp (00-All.vcf)?

2. Is there a karytypically sorted Ensembl reference (and its corresponding dbSNP vcf) that I can use for GATK workflow?

3. The biggest issue is "Error: Discordant contig lengths: read MT LN=16571, ref MT LN=16569" when I use picard's reordersam function. This is because the read was aligned against Ensembl reference. Is there a quick fix to resolve this?

EDIT: I should mention that realigning my seq against GATK's reference is not an option because I don't have the original fastq files available.

Created 2016-01-23 05:08:59 | Updated | Tags: gatkdocs gatk error

Hi,

I tried to look up the documentation for certain tools, such as CalculateGenotypePosteriors, but when I try to access it, I just get a blank page. This is the link I am trying to access: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CalculateGenotypePosteriors.php If this is not the correct link, please direct me to the right one. I noticed this was also happening to some other tools.

Thanks, Alva

Created 2016-01-23 05:03:33 | Updated | Tags: variantfiltration snp gatk

Hi,

I am doing an analysis on chimp variants, and do not have enough variants to train the model in order to perform VQSR. I am using hard filters at this point, as per the recommendation. I keep reading that the recommended cutoff values are only recommendations and that I should expect to tweak the parameters further. However, how do I do this in an effective manner? I'm just not sure how to approach exploring the parameter space.

Should I try out different values around the recommended cutoffs and, down the line, analyze how many mutations I get/calculate certain values for quality control and check that they agree with what is expected (CpG transition rate, and so on)? Is there a way to inspect the data to see whether it is behaving well, or do I have to wait until down the line to see how things are going? Any insight into this would help tremendously.

Thanks! Alva

Created 2016-01-14 19:02:12 | Updated | Tags: commandlinegatk haplotypecaller multi-sample gatk error

Hi,

I would like to force call a list of variants across my cohort using HaplotypeCaller to get more accurate QC metrics for each variant. I am using the following command:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -et NO_ET -K my.key -I my.cohort.list --alleles my.vcf -L my.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -stand_call_conf 30.0 -stand_emit_conf 0.0 -dt NONE -o final_my.vcf

Here is a link to the input VCF file: VCF File

Unfortunately, I keep running into the following error (I've tried GATK ver3.3 and ver3.5):

INFO 18:49:21,288 ProgressMeter - chr1:11177077 21138.0 49.5 m 39.0 h 69.4% 71.3 m 21.8 m ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IndexOutOfBoundsException: Index: 3, Size: 3 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3.0-mssm-0-gaa95802): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Index: 3, Size: 3 ##### ERROR ------------------------------------------------------------------------------------------

Would appreciate your help in solving this issue.

Created 2016-01-07 06:21:51 | Updated | Tags: unifiedgenotyper java gatk

Hi ,

I am using the UnifiedGenotyper command but got the following error. Can some one put a light what it could mean!!

INFO 17:17:23,650 HelpFormatter - ----------------------------------------------------------------------------------- INFO 17:17:23,653 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.0-6121-g8a78414, Compiled 2011/07/11 16:05:10 INFO 17:17:23,654 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:17:23,654 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki INFO 17:17:23,655 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa INFO 17:17:23,655 HelpFormatter - Program Args: -T UnifiedGenotyper -R /projects/u71/nandan/Sven_Uthicke_populationGenomics/analysis/SNP_SFG/seaurchin_topHit_Ids.fasta -I merged_realigned_woUnsafe_1.bam -stand_call_conf 20.0 -stand_emit_conf 20.0 -o raw_snps_indels_Q20_all.vcf -nt 12
INFO 17:17:23,655 HelpFormatter - Date/Time: 2016/01/07 17:17:23 INFO 17:17:23,656 HelpFormatter - ----------------------------------------------------------------------------------- INFO 17:17:23,656 HelpFormatter - ----------------------------------------------------------------------------------- INFO 17:17:23,699 GenomeAnalysisEngine - Strictness is SILENT INFO 17:17:26,575 MicroScheduler - Running the GATK in parallel mode with 12 concurrent threads WARN 17:17:29,044 RestStorageService - Error Response: PUT '/GATK_Run_Reports/Ar4GP2YBVmBmWH6kz3hZjIAP6ikSNlHP.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 1993, Content-MD5: ykLRiVk7ELWpIyvrZz9dyg==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: ca42d189593b10b5a9232beb673f5dca, Date: Thu, 07 Jan 2016 06:17:27 GMT, Authorization: AWS AKIAJXU7VIHBPDW4TDSQ:Ul22+ARecieW4VAUerP/AzNX3qQ=, User-Agent: JetS3t/0.8.0 (Linux/3.0.76-0.11-default; amd64; en; JVM 1.6.0_30), Host: s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 6FAAE2E0B5A437DF, x-amz-id-2: oJ9rYHwvwv7LZ2xpsni+V3YQF+6emAiSko8wGj5d09v7CX9nG7E+Duz0KEXFKj2u7jJNNFPOJ4Y=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Thu, 07 Jan 2016 06:17:28 GMT, Connection: close, Server: AmazonS3]

##### ERROR ------------------------------------------------------------------------------------------

Regards,

Nandan

Created 2016-01-06 13:01:18 | Updated | Tags: combinevariants haplotypecaller best-practices dbsnp gatk combinegvcfs gvcf

Hi guys, I have recently jointly called 27 full genome data using GenotypeGVCFs approach. While i was trying to extract some chromosomes from the final file, i got this error The provided VCF file is malformed at approximately line number 16076: Unparsable vcf record with allele *.

I look into the file and I found some of the multi-allellic sites having * as seen in the attached picture.

I feel the problem could be that the program realised that more than one allele is present at that position but could not ascertain which allele. I may be wrong but please what do you think I can do to solve this problem. LAWAL

Created 2015-12-21 15:10:30 | Updated | Tags: realignertargetcreator dbsnp realignment gatk random-chromosomes

I'm re-aligning some bam files using RealignerTargetCreator. I downloaded the hg38 dbSNP file and hg38 reference genome. however, the dbSNP file contains the canonical chromosomes [1,2...X,Y,MT] while the reference (and my bam file) contain also the random chromosomes.

here's where I get the error: MESSAGE: Input files and reference have incompatible contigs: No overlapping contigs found. 142_hg38.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT] reference contigs = [chr1, chr1_GL383518v1_alt, chr1_GL383519v1_alt, chr1_GL383520v2_alt ... [etc etc etc.....] ... chrX_KI270881v1_alt, chrX_KI270913v1_alt, chrY, chrY_KI270740v1_random, chrM]

so...I would like to keep the random chromosomes in the reference/bam file, so how can I have this run without tossing an error? or is this not possible and I need to keep ONLY the chrs in the dbSNP file?

thanks Seb

Created 2015-12-17 19:38:46 | Updated | Tags: indelrealigner unifiedgenotyper haplotypecaller gatk

Does anyone know of an effective way to determine haplotypes or phasing data for SNPs and STRs? I understand that STRs are inherently difficult for aligners; however, I'm trying to determine haplotypes for a large number of STRs (including the flanking region information...SNPs) on a large number of samples. So, manual verification is not really an option. We've developed an in-house perl script that calls STRs accurately; however, it currently does not include flanking region information.

Any help is greatly appreciated.

Created 2015-12-04 06:45:03 | Updated | Tags: vqsr gatk wgs

Hi. I ran GATK(version GATK3.4-46) using whole genome sequencing data of 60 samples. Using bwa, I aligned my WGS data to human reference genome (GRCh 37) including autosome, X, Y, MT, and GL**. I ran HaplotypeCaller, and then using GenotypeGVCFs, VCF was extracted for chr1 - chr22, chrX, chrY. I wonder what do mean WGS that GATK' Bestpractices recommend. when GATK (VQSR, HC et al) was used, to get the best result**, should WGS be consisted of autosome, X, Y, MT and GL*****, ? Or could it contain only the autosome excluding X ,Y?

Created 2015-12-02 23:31:39 | Updated | Tags: unifiedgenotyper ploidy gatk variant-calling

I am using the Gatk v3.1 UnifiedGenotyper for variant calling of some GBS type data in the plant genus Boechera. I have diploids as well as triploids, where I have the ploidy from a prior microsat run. The diploids work out well (I get some 150k variable loci), but I can't seem to get variants for the triploids.

Here's my initial call for the triploids:

Program Args: -T UnifiedGenotyper -R /home/A01768064/assembly/Bstricta_278_v1.fa -I mytripbam.list -o boechera_triploids.vcf -nt 22 -glm SNP -hets 0.001 -mbq 20 -ploidy 3 -pnrm EXACT_GENERAL_PLOIDY -stand_call_conf 50 -maxAltAlleles 2

I was able to call the triploids with the diploid option, but - obviously - this is not what I want. I additionally ran the following relaxed stringency options.

hets 0.01 mbq 10 hets 0.1 mbq 10

None of them return any variants.

The read group for one example file:

@RG ID:CR1308 LB:CR1308 SM:CR1308 PL:ILLUMINA (with unique values for each sample; note that in the output vcf, the samples are appearing in the header line)

Additionally, I attached the slurm log file for the last mentioned run.

I am at a complete loss, what am I missing here?

Created 2015-11-25 21:38:23 | Updated 2015-11-25 21:40:23 | Tags: selectvariants vcf gatk

Hi,

I have multi-sample vcf file and an example variant is shown below: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 03-071 04-051 04-071 06-044 07-085 10-009

chr1 6526093 . T C 197.77 . AC1=1;AC=1;AF1=0.5 GT:GQ:DP:PL:AD 0/1 1/1 0/0 1/1 1/1 0/1

For each variant i would need to retrieve the sample names based on genotype. If the genotype is "0/1" could it be possible to select only the samples for which the genotype is "0/1" and similarly for the genotype "1/1"?

SelectVariants filters variants based on the different quality values provided. Here i would need to subset the vcf by sample for each variant based on the genotype. Are there any tools which can do this?

Created 2015-10-21 10:08:51 | Updated | Tags: haplotypecaller gatk

I am running HaplotypeCaller (GATK 3.2.2) and gets the error message below for some of the samples. I have seen this error posted before at the Forum and the recommendation has often been to change to a newer version of GATK. The problem is we have been using the version 3.2.2 for over 2000 samples in the same project and those will be analysed together so I am afraid to switch version for just a few samples.

I would greatly appreciate any help!

Best, Lina

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 125 at org.broadinstitute.gatk.utils.sam.AlignmentUtils.calcNumHighQualitySoftClips(AlignmentUtils.java:437) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceModel.calcGenotypeLikelihoodsOfRefVsAny(ReferenceConfidenceModel.java:291) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isActive(HaplotypeCaller.java:839) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.addIsActiveResult(TraverseActiveRegions.java:618) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.access$800(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$ActiveRegionIterator.hasNext(TraverseActiveRegions.java:378) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

##### ERROR ------------------------------------------------------------------------------------------

Created 2015-10-19 13:43:07 | Updated 2015-10-19 13:44:06 | Tags: gatk broadies

Hi, I'm just about to try to get started using GATK for the first time, and I haven't been able to find a guide or any information on running GATK on the Broad server. I'm assuming there's a general use GATK version (akin to /seq/picard) and copies of gold standard variant files etc. already on the Broad, and I'd rather use these than re-invent the wheel. Have I simply missed a post? Are my assumptions incorrect (and everyone separately installs GATK and downloads gold standard vcfs)? Any leads would be very much appreciated. Avery p.s. I tried searching the forum, guide, and BITS wiki - what am I missing?

Created 2015-10-09 20:43:20 | Updated | Tags: haplotypecaller bug gatk variant-calling

This SNP has 30 reads supporting it with most of them being mapq60 and good fred scores.

using version 3.4-46-gbc02625 and my command was java -jar ~/GenomeAnalysisTK.jar -R /mnt/opt/refdata/fasta/hg19/hg19.fa -I /mnt/park/gemcode/haynes/rfa/RFA_phaser5/_SNPINDEL_PHASER_BAM/ATTACH_SERAFIM/fork0/files/output.bam -T HaplotypeCaller --genotyping_mode DISCOVERY -L chr3:33100000-33300000 -stand_emit_conf 10 -stand_call_conf 30

Let me know if you want more info or want me to submit a detailed bug report with relevant files.

Thanks, Haynes

Created 2015-10-06 20:13:14 | Updated | Tags: vcf indels gatk

Have a long standing stable pipeline, which I need to tweak as can be explained and justified. What is the difference between the old file of known indels (from Mills paper(s)) - Homo_sapiens_assembly19.kown_indels.vcf And the current version of known indels with 1000G added - Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz

I know the newer reference is smaller - just sites. It also has significantly fewer records/line counts. What else was removed or added from the old version until now?

Created 2015-09-25 23:52:49 | Updated | Tags: best-practices bam gatk

Hi,

So far, I have only tried running Best Practices on bam files where each bam file corresponds to a certain chromosome. How do I go about partitioning the bam files further and running Best Practices on parts of bam files? I want to make my pipeline even more efficient.

Created 2015-09-22 18:10:21 | Updated | Tags: combinevariants gatk error

Hi there, I ran the following command java -jar $gatk_jar -T CombineVariants -R${hs37d5} \ --variant ${name1} --variant${name2} \ -o combined.vcf \ -genotypeMergeOptions UNIQUIFY I followed the best practices and added 30 bam samples from the 1000g to my 4 samples. I tried also the b37 fasta but the following error persist. I've seen other treads with a similar error but not for CombineVariants. Could it be a resilient bug?

Thanks

Horacio

##### ERROR stack trace

java.lang.IllegalArgumentException: Unexpected base in allele bases '*T' at htsjdk.variant.variantcontext.Allele.(Allele.java:162) at htsjdk.variant.variantcontext.Allele.create(Allele.java:234) at htsjdk.variant.variantcontext.Allele.extend(Allele.java:253) at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.createAlleleMapping(GATKVariantContextUtils.java:1303) at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.resolveIncompatibleAlleles(GATKVariantContextUtils.java:1268) at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:1008) at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:339) at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:136) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

##### ERROR ------------------------------------------------------------------------------------------

Created 2015-09-22 16:01:09 | Updated | Tags: bam gatk error markduplicates merged-bams

Hello,

I am relatively new to GATK and stuck on this problem.

After downloading the bam file that I wish to analyze, I am cutting portions of the file because I only wish to analyze variants at specific genes. Once I have the smaller files for each gene, I am using samtools merge to merge all of the smaller files back into one full sliced bam file. Next, I am realigning this file using bwa aln and sampe. After this step, I am attempting to use GATK best practices to mark the variants in these genes.

First, I am using GATK AddOrReplaceReadGroups to modify read group information as necessary. Then, I am using Picard's MarkDuplicates to mark duplicates in the re-aligned bam file. However, I get the following error.

Exception in thread "main" net.sf.picard.PicardException: Value was put into PairInfoMap more than once. 156: TLL:HWI-ST1222:5:2308:12532:76745#0 at net.sf.picard.sam.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:124) at net.sf.picard.sam.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:78) at net.sf.picard.sam.DiskReadEndsMap.remove(DiskReadEndsMap.java:61) at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:418) at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:161) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:145)

Looking at Picard's FAQ page (https://broadinstitute.github.io/picard/faq.html), it suggests that I try using Picard's MergeBamAlignment which also fails giving an error suggesting that the error is in the record ID of the bam file.

Exception in thread "main" net.sf.picard.PicardException: Program Record ID already in use in unmapped BAM file. at net.sf.picard.sam.SamAlignmentMerger.<init>(SamAlignmentMerger.java:131) at net.sf.picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:226) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MergeBamAlignment.main(MergeBamAlignment.java:205)

Continuing on, trying to change the options on samtools merge to change the record ID's when merging together the single files into the full sliced file also failed.

Some more work into these issues found that using samtools fixmate and then Picard SortSam before calling MarkDuplicates resolves the issue, but the total amount of variants called by GATK is different when I run the sliced file through against a single gene file (for the same locations).

Are there any other options to explore to resolve this error?

Created 2015-09-17 21:56:03 | Updated | Tags: bqsr best-practices bam gatk

Hi,

I recently re-generated my realigned bam files (realigned.bam) from the GATK Best Practices, but I encountered an error when I validated the bam files using the validation tool (ValidateSamFile.jar) in Picard:

ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate alignment does not match alignment start of mate ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate negative strand flag does not match read negative strand flag of mate

etc.

What exactly is going on? Where in the pipeline could this have occurred? I do not have the intermediary files so I cannot diagnose this accurately. Before I started the GATK Best Practices, my bam files had no problems; when I validated them I did not get any error messages. Does this mean that I have to regenerate the files again and start from the very beginning?

I tried to use the Picard tool FixMateInformation.jar, but then I got an error message regarding the index files. I tried deleting the bam index for one bam file and then recreating it to see if there was a problem with the indexes, but doing so did not resolve the issue. So it seems that something went wrong with the bam files at some earlier step.

Has this error been encountered before? Not sure how to proceed next, except restart everything.

Best, Alva

Created 2015-09-07 12:14:38 | Updated | Tags: developer gatk

Hi,

I followed the instructions from your guide on how to make GATK from source with IntelliJ IDEA. My only problem is that GenomeAnalysisTK.jar is not generated, and I don't know which class should I give to the debugger as the main class?

Created 2015-07-31 22:11:45 | Updated | Tags: commandlinegatk haplotypecaller gatk error

Hello,

I am receiving the following error. I am working with SAM files that were exported from CLC, then edited with Picard-tools to addReadGroups. I am not sure if I need to add an additional step to solve this problem, I cannot find any documentation regarding this error.

Please let me know what I need to do to correct this issue.

Thank you!

# GenotypeGVCFs (generate vcf files for each chr)

##### ERROR ------------------------------------------------------------------------------------------

Created 2015-02-20 21:43:46 | Updated | Tags: vcf gatk calculategenotypeposteriors

Hi,

I used CalculateGenotypePosteriors with the supporting file called ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf, obtained from 1000 Genomes. It contains both indels and SNPs, and I was able to use the file for the first step of the Genotype Refinement Workflow. My question is: is it an issue that I didn't remove the indels from the supporting file? I would presume not since first of all, both the indels and the SNPs from Phase 3 of 1000G should be high confidence, and second, my recalibrated vcf file includes both indels and snps, so it should be in my interest to have as much information as possible, actually, so I should consider indels as well.

Just want to check whether my reasoning is correct.

Thanks, Alva

Created 2015-02-17 15:58:44 | Updated | Tags: variantrecalibrator gatk

I'm sorry I keep posting these questions, but this error has me baffled!

##### ERROR ------------------------------------------------------------------------------------------

Here is the input I use to get it:

java -d64 -Xmx8g -jar $GATKDIR/GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -input k_family_out.vcf \ -R$REF \ -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 $HAPMAP \ -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0$OMNI \ -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0 $DBSNP \ -resource:mills,VCF,known=true,training=true,truth=true,prior=12.0$MILLS \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum \ -mode BOTH \ -U ALL \ -recalFile k_family.recal \ -tranchesFile k_family.tranches \ -rscriptFile k_family.R \ -nt 6 --TStranche 100.0 --TStranche 99.9 --TStranche 99.5 --TStranche 99.0 \ --TStranche 98.0 --TStranche 97.0 --TStranche 96.0 --TStranche 95.0 --TStranche 94.0 \ --TStranche 93.0 --TStranche 92.0 --TStranche 91.0 --TStranche 90.0

I thought this was similar to a multithreading error, so I reduced my pipeline to execute with only a single node, and the problem persists.

Any help will be appreciated.

Created 2015-02-13 11:14:24 | Updated | Tags: bam gatk error

Hi,

with GATK 3.3-0 I am confronted with an error that was present in a much older version, but seemed resolved about a year ago:

ERROR MESSAGE: There is no space left on the device, so writing failed

There is 8TB left on the drive, no user limit. Sometimes re-running the exact same job works, sometimes not. Some jobs keep failing despite asking for an insane amount of memory on the cluster, given these are RNAseq bam files, the largest one being less than 7GB.

For example:

qsub -b y -cwd -N step3_145 -o step3_145.o -e step3_145.e -V -l h_vmem=40G /share/apps/java/oracle/1.8.0_11/bin/java -Xmx35G -jar /data/home/hhx037/GATK-3.3.0/GenomeAnalysisTK.jar -T SplitNCigarReads -R Homo_sapiens.GRCh37.75.dna.1-22XYMT.fa -I Analyses/file_dedup.bam -o Analyses/file_splittedcigar.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

Here is the log:

INFO 10:50:51,568 HelpFormatter - Executing as hhx037@panos1 on Linux 2.6.32-431.1.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_11-b12. INFO 10:50:51,571 HelpFormatter - Date/Time: 2015/02/13 10:50:51 INFO 10:50:51,571 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:51,576 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:52,503 GenomeAnalysisEngine - Strictness is SILENT INFO 10:50:52,827 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 10:50:52,861 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 10:50:52,876 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 10:50:53,021 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 10:50:53,027 GenomeAnalysisEngine - Done preparing for traversal INFO 10:50:53,030 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:50:53,030 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:50:53,030 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime INFO 10:50:53,047 ReadShardBalancer$1 - Loading BAM index data INFO 10:50:53,050 ReadShardBalancer$1 - Done loading BAM index data INFO 10:51:23,404 ProgressMeter - 1:1477348 702953.0 30.0 s 43.0 s 0.0% 17.5 h 17.5 h INFO 10:52:32,660 ProgressMeter - 1:16909108 1202983.0 99.0 s 82.0 s 0.5% 5.0 h 5.0 h INFO 10:53:09,769 ProgressMeter - 1:21069702 1302985.0 2.3 m 104.0 s 0.7% 5.6 h 5.5 h INFO 10:53:49,083 ProgressMeter - 1:27951393 1803181.0 2.9 m 97.0 s 0.9% 5.4 h 5.4 h INFO 10:54:29,275 ProgressMeter - 1:32739969 2103299.0 3.6 m 102.0 s 1.1% 5.7 h 5.6 h INFO 10:55:09,177 ProgressMeter - 1:36643589 2203300.0 4.3 m 116.0 s 1.2% 6.0 h 5.9 h INFO 10:55:45,643 ProgressMeter - 1:39854010 2303302.0 4.9 m 2.1 m 1.3% 6.3 h 6.2 h INFO 10:56:25,147 ProgressMeter - 1:40542516 2403303.0 5.5 m 2.3 m 1.3% 7.0 h 6.9 h INFO 10:57:10,934 ProgressMeter - 1:40654849 2503322.0 6.3 m 2.5 m 1.3% 8.0 h 7.9 h INFO 10:57:54,084 ProgressMeter - 1:43162895 2503322.0 7.0 m 2.8 m 1.4% 8.4 h 8.3 h INFO 10:58:24,149 ProgressMeter - 1:45244391 2703426.0 7.5 m 2.8 m 1.5% 8.6 h 8.4 h INFO 10:58:56,749 ProgressMeter - 1:53716450 2803427.0 8.1 m 2.9 m 1.7% 7.7 h 7.6 h INFO 10:59:38,928 ProgressMeter - 1:86821106 3103432.0 8.8 m 2.8 m 2.8% 5.2 h 5.1 h INFO 11:00:11,337 ProgressMeter - 1:93301870 3303437.0 9.3 m 2.8 m 3.0% 5.1 h 5.0 h INFO 11:01:13,113 ProgressMeter - 1:115252321 3803590.0 10.3 m 2.7 m 3.7% 4.6 h 4.5 h INFO 11:02:02,172 ProgressMeter - 1:145441389 4303778.0 11.2 m 2.6 m 4.7% 4.0 h 3.8 h INFO 11:02:38,237 ProgressMeter - 1:150547232 4703871.0 11.8 m 2.5 m 4.9% 4.0 h 3.8 h INFO 11:03:09,693 ProgressMeter - 1:153362937 5003904.0 12.3 m 2.5 m 5.0% 4.1 h 3.9 h INFO 11:03:39,934 ProgressMeter - 1:155984762 5403968.0 12.8 m 2.4 m 5.0% 4.2 h 4.0 h INFO 11:04:05,477 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

I understand temporary files may be large, but not that large. Are the temporary files written in the working directory (as I believe should be the case), or are they written in GATK installation directory?

Also, note I never run into this problem with the previous version.

Any idea?

Cheers,

Stephane

Created 2015-02-04 05:14:44 | Updated | Tags: variantrecalibrator vqsr vcf gatk

Hi,

I have generated vcf files using GenotypeGVCFs; each file contains variants corresponding to a different chromosome. I would like to use VQSR to perform the recalibration on all these data combined (for maximum power), but it seems that VQSR only takes a single vcf file, so I would have to combine my vcf files using CombineVariants. Looking at the documentation for CombineVariants, it seems that this tool always produces a union of vcfs. Since each vcf file is chromosome-specific, there are no identical sites across files. My questions are: Is CombineVariants indeed the appropriate tool for me to merge chromosome-specific vcf files, and is there any additional information that I should specify in the command-line when doing this? Do I need to run VariantAnnotator afterwards (I would assume not, since these vcfs were generated using GenotypeGVCFs and the best practices workflow more generally)? I just want to be completely sure that I am proceeding correctly.

Thank you very much in advance, Alva

Created 2015-01-27 21:59:14 | Updated 2015-01-27 21:59:46 | Tags: best-practices snp gatk combinegvcfs gvcf

Hi,

I am combining gcvf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing.

For example, at position 16050036, this is what comes up in the final gvcf file: 22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0

But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site: 22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0

The command I'm using to generate these files is:

java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L ${numchr} where numchr is a variable previously defined (indicating the chromosome number). It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem? Thanks, Alva Created 2015-01-27 20:25:42 | Updated | Tags: variantrecalibrator vqsr vcf gatk Hi, I ran VariantRecalibrator and ApplyRecalibration, and everything seems to have worked fine. I just have one question: if there are no reference alleles besides "N" in my recalibrate_SNP.recal and recalibrate_INDEL.recal files, and in the "alt" field simply displays , does that mean that none of my variants were recalibrated? Just wanted to be completely sure. My original file (after running GenotypeGVCFs) has the same number of variants as the recalibrated vcf's. Thanks, Alva Created 2015-01-23 16:55:57 | Updated | Tags: vqsr haplotypecaller bam gatk genotypegvcfs variant-calling Hi, I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately. Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of? Thanks very much in advance, Alva Created 2014-12-18 08:52:46 | Updated | Tags: genotypeconcordance gatk Hi, I didn't find the answer anywhere, so I'm asking here. I'm using GATK 2.6-4. I used GenotypeConcordance to compare two datasets, and it was great, exactly what I needed. But then, I was interesting in the discordant sites, and I saw that printInterestingSites should be a good solution. But I can't make it work. Just adding in my command line "-sites filename" doesn't work, and I've tried to be creative in changing this option. Then I thought, that maybe this option isn't working in my GATK version. Have I just done something wrong? Created 2014-12-17 18:04:27 | Updated | Tags: haplotypecaller gatk error rnaseq genotyping genotyping-mode Hi, I'm currently trying to use GATK to call variants from Human RNA seq data So far, I've managed to do variant calling in all my samples following the GATK best practice guidelines. (using HaplotypeCaller in DISCOVERY mode on each sample separately) But I'd like to go further and try to get the genotype in every sample, of each variant found in at least one sample. This, to differentiate for each variant, samples where that variant is absent (homozygous for reference allele) from samples where it is not covered (and therefore note genotyped). To do so, I've first used CombineVariants to merge variants from all my samples and to create the list of variants to be genotype${ALLELES}.vcf

I then try to regenotype my samples with HaplotypeCaller using the GENOTYPE_GIVEN_ALLELES mode and the same settings as before: my command is the following:

**java -jar ${GATKPATH}/GenomeAnalysisTK.jar -T HaplotypeCaller -R${GENOMEFILE}.fa -I ${BAMFILE_CALIB}.bam --genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles${ALLELES}.vcf -out_mode EMIT_ALL_SITES -dontUseSoftClippedBases -stand_emit_conf 20 -stand_call_conf 20
-o ${SAMPLE}_genotypes_all_variants.vcf -mbq 25 -L${CDNA_BED}.bed --dbsnp ${DBSNP}.vc**f In doing so I invariably get the same error after calling 0.2% of the genome. ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IndexOutOfBoundsException: Index: 3, Size: 3 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Index: 3, Size: 3 ##### ERROR ------------------------------------------------------------------------------------------ because the problem seemed to originate from getAlternateAllele, I tried to play with --max_alternate_alleles by setting it to 2 or 10, without success. I also checked my${ALLELES}.vcf file to look for malformed Alternate alleles in the region where the GATK crashes (Chr 1, somewhere after 78Mb) , but I couldn't identify any... (I searched for Alternate alles that would not match the following extended regexpr '[ATGC,]+')

Created 2014-12-05 21:10:20 | Updated | Tags: bqsr best-practices gatk

Hi,

I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former document as my guide since it is more recent, but I am curious as to why these two are different. The first step is the same for the 2 docs, but then for the second step, the output is a recal.bam file in the brussels document instead of a post_recal.table file in the (howto) document. Then, I am confused about the (howto) document because it seems that the 4th step is exactly the same as the second step, except we generate a recal.bam file. Is the brussels document presenting steps that are more efficient? (since there is only 1 use of -BQSR to generate the recalibrated bam file we need, as opposed to 2 uses of the option)

Thanks, Alva

Created 2014-12-05 20:16:37 | Updated | Tags: bqsr gatk

Hi,

I encounter the "RScript exited with 1. Run with -l DEBUG for more info" error when trying to make recalibration plots using Rscript. However, I do have all the required packages installed on my computer, including gsalib, which seems to be the problem. When I run with -l DEBUG, I get the following information:

Error in library(gsalib) : there is no package called ‘gsalib’

The thing is that I am using my computer but working in a cluster, so not directly on my computer. Should I have the packages installed in the cluster or is it fine that I've installed the packages on my computer? I'm a bit new to this, so I apologize if this question is a bit basic.

Thanks, Alva

Created 2014-12-05 15:27:21 | Updated 2014-12-05 18:08:01 | Tags: haplotypecaller best-practices vcf gatk gvcf

Hi,

I used HaplotypeCaller in GVCF mode to generate a single sample GVCF, but when I checked my vcf file I see that the reference allele is not showing up:

22  1   .   N   <NON_REF>   .   .   END=16050022    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
22 16050023    .   C   <NON_REF>   .   .   END=16050023    GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,37
22  16050024    .   A   <NON_REF>   .   .   END=16050026    GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,73
22  16050027    .   A   <NON_REF>   .   .   END=16050035    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,110
22  16050037    .   G   <NON_REF>   .   .   END=16050037    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,109
22  16050038    .   A   <NON_REF>   .   .   END=16050039    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,153

I am not sure where to start troubleshooting for this, since all the steps prior to using HaplotypeCaller did not generate any obvious errors.

The basic command that I used was: java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Have you encountered this problem before? Where should I start troubleshooting?

Thanks very much in advance, Alva

Created 2014-12-02 16:32:32 | Updated | Tags: haplotypecaller bam gatk

Hi,

I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees?

The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend?

And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)?

Thanks, Alva

Created 2014-11-25 14:27:16 | Updated | Tags: best-practices gatk

Chaps,

I am a newbie to GATK. I have started the best practices pipline but stuck in bwa. From where can I get the reference.fa file? The GATK bundle has .fasta .bam .fai but no .fa. Anyone please!

Created 2014-11-20 23:51:40 | Updated | Tags: gatk

I would like to obtain the coverage distribution on all exome intervals (human reference genome) using the BaseCoverageDistribution function in GATK. The problem I'm having is finding an appropriate reference file to associate with the -L argument. Specifically, the standard UCSC exome coordinate file doesn't have a format that GATK will accept, e.g. (header and first row given below, as opposed to chr1:start-stop) etc

# name chrom strand cdsStart cdsEnd exonCount exonStarts exonEnds name2 exonFrames

NM_017582 chr1 - 154522913 154531029 13 154521050,154523413,154523872,154524246,154524396,1545 24568,154524879,154525211,154525507,154527210,154527903,154528335,154530702, 154522945,1545234 80,154523968,154524295,154524455,154524659,154524940,154525296,154525648,154527261,154528008,1 54528440,154531120, UBE2Q1 1,0,0,2,0,2,1,0,0,0,0,0,0,

Created 2014-11-20 11:41:42 | Updated | Tags: best-practices gatk

Where can I get the fasta and ref files for experimenting on GATK?

Created 2014-09-23 15:08:19 | Updated | Tags: selectvariants gatk tumor normal

I am wondering if it would be okay to use the "SelectVariants" tool to select mutants specific to the tumor sample in a pair of tumor and normal samples. I tried the following command, and it seems working. But I am suspecting whether it differentiates the genotype differences (e.g. 0/1 in tumor sample and 0/0 in normal sample) or coverage differences (e.g. 0/1 in tumor sample and ./. in normal sample).

Could someone help me understand "SelectVariants" for my question? Thanks a lot!

java -Xmx10g -Djava.io.tmpdir=$tempSpillFolder -jar$CLASSPATH/GenomeAnalysisTK.jar \
-R $GenomeReference \ -T SelectVariants \ --variant$file \
-o $OutputFolder/$filename.selected.vcf \
-select 'set=="tumor"'

Created 2014-09-19 23:06:10 | Updated | Tags: variantrecalibrator applyrecalibration gatk

Hi! I am attempting to go through the process of applying a recalibration to an input vcf. I tried following the steps outlines in the tutorial Howto: Recalibrate Variant Quality Scores .

Step 2 runs great, I used all the tranches and resources that the howto mentions. Unfortunately when I attempt to apply the recalibration (step 3) I receive the following error

Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC input @ 1:14907 Q523.77 of type=SNP alleles=[A*, G] attr={AC=1, AF=0.500, AN=2, BaseQRankSum=-1.863, ClippingRankSum=-0.420, DP=86, FS=0.984, MLEAC=1, MLEAF=0.500, MQ=36.93, MQ0=0, MQRankSum=0.338, QD=6.09, ReadPosRankSum=2.064} GT=GT:AD:DP:GQ:PL 0/1:57,29:86:99:552,0,1482

When I check the input file I notice that this is the first variant call in my VCF. I tried this process with other VCF files and each time my ApplyRecalibration gets stuck on the first variant call.

Here is the command I used to generate the recalibration:

 java -Djava.io. -Xmx4g -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \
-T VariantRecalibrator -R /human_g1k_v37.fasta \
-L file.intervals \
-input input_file.vcf \

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 Mills_and_1000G_gold_standard.indels.b37.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \

-an DP \
-an QD \
-an FS \
-an MQRankSum \

-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile recal_file.recal \
-tranchesFile tranches_file.tranches \
-rscriptFile rscipt_file.plots.R\

And here is my command for applying the recalibration:

java -Xmx4g -jar GenomeAnalysisTK.jar
-T ApplyRecalibration
-R human_g1k_v37.fasta
-input input_file.vcf
-mode SNP
--ts_filter_level 99.0
-recalFile  recal_file.recal
-tranchesFile tranches_files.tranches
-o recalibrated_output.vcf 

Any suggestions to help me troubleshooot the issue is very welcome. I should also mention that I am using GATK 3.2.2

Created 2014-09-15 21:28:59 | Updated | Tags: indelrealigner unifiedgenotyper indels gatk variant-calling

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.

Created 2014-09-09 20:23:51 | Updated | Tags: gatk module

Hello,

I'm currently working on creating a module file on my cluster for GATK. I set the prepend path to the directory in which I saved my .jar file, but when I attempt to use said file (command: java -jar GenomeAnalysisTK.jar -h), java cannot find it. It does, however, work when I am in the directory in which it is saved, but not when I add it as a module file. Any ideas?

Thanks

Created 2014-09-02 12:38:46 | Updated 2014-09-02 12:48:19 | Tags: vqsr bqsr gatk

Hi there, I have been using GATK to identify variants recently. I saw that BQSR is highly recommended. But I don’t know whether it is still needed for de novo mutation calling. For example, I want to identify de novo mutations generated in the progenies by single seed descent METHODS in plants. For example, in the paper of “The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana”, these spontaneous arising mutations may not included in the known sites of variants. Based on documentation posted in GATK websites, they assume that all reference mismatches we see are errors and indicative of poor base quality. Under this assumption, these de novo mutations may be missed in the step of variant calling. So in this situation, what should I do? Or should I skip the BQSR step? Also what should I do when I reach to step- VQSR? Hope some GATK developers can help me on this. Thanks.

Created 2014-08-26 19:14:10 | Updated | Tags: baserecalibrator gatk

Hi,

I have performed BaseRecailbrator using GATK-3.2-2 . The size of the bam file is double the size of realigned_fixmate.bam. I read in one of the posts in this forum that it could be because of retaining the original and reclaibrated quality scores. And mentioned that the latest version will by default exclude original scores and retain reclaibrated scores.

Im using the latest version and still gets double the size of the original bam. Could someone comment on this. Thanks

Created 2014-08-22 10:43:41 | Updated | Tags: variantfiltration gatk filterexpression

Hi all,

I tried to apply the following command to my raw vcf file to filter it using the filtering expression specified in the command:

java -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4 -jar GenomeAnalysisTK.jar -T VariantFiltration -R human_g1k_v37.fa --variant human_g1k_v37.CHL124.vcf_snps.vcf -o CHL124.vcf_snps.vcf_filter_marked.vcf --filterExpression "QD < 2.0 && MQ < 40.0 && FS > 60.0 && HaplotypeScore > 13.0 && MQRankSum < -12.5 && ReadPosRankSum < -8.0" --filterName "very_small_SNPs_default_filter"

After that, I check my result file which is CHL124.vcf_snps.vcf_filter_marked.vcf, I found that, all reads are marked as "PASS" whether its QD is > 2.0 or < 2.0. Obviously, the command doesn't work, but I cannot find why everything seems goes well, no error reported.

bless~ XL

Created 2014-08-19 14:40:22 | Updated | Tags: gatk error illegalargumentexception

Hi, I suppose I had a problem with my bam file, but I don't know what I should look for.

Any help appreciated.

##### ERROR stack trace

cf attached file for complete stack.

Created 2014-07-23 03:56:29 | Updated | Tags: gatk personal-genome merge-bam

Hi all, I have been successfully dealing with GATK tools for variant calling for exome sequences, but now I have to do it for a personal genome. Since the genome has been sequenced in two runs , using 7 lanes per each run , now I have 28 fastq files.( paired end reads (2) 7 lanes 2 runs). I haven't deal with such a large number of files at once before. My suggested approach is to,

1) Align, Dedup, Realign and Recalibrate per lane. (So I get 14 aligned,deduped,realigned and reacalibrated bam files) 2) Merge the bam files inorder to produce a single bam file 3) Call variants using the single bam file using Haplotype Caller.

Do you think my approach is feasible? Or do you have any alternative approaches? Furthermore, what is the best tool to merge the bam files?

Created 2014-07-21 10:04:39 | Updated | Tags: printreads gatk

Hi, I am trying to do a base quality score recalibration of my BAM files. I am working with individual based RADseq data where the reads were cleaned and aligned against a reference genome using bowtie2. The reference genome comes from a relatively closely related taxon (~25 Mio years of divergence), however the species in this taxonomic family underwent a recent genome duplication, hence I am running bowtie2 with the -k option (which retains multiple alignments rather than picking randomly one if multiple matches of equal quality occur) and only retain the uniquely aligned reads for the downstream analyses. The downside to this is, that the MAPQ values become artificial.

After generating a recalibration data file, I run GATK PrintReads (version 3.1-1-g07a4bf8) as follow:

java -Xmx16G -jar GenomeAnalysisTK.jar -T PrintReads -R Ref_genome.fasta -I Individual.bam -BQSR Library_recaldata.grp -o Individual.rc.bam

Which yields the following error after running through 80% of the file: # ERROR MESSAGE: Exception when processing alignment for BAM index HWI-ST1206:32:C3VGTACXX:5:1104:14976:28487 84b aligned read.

Here the read in question (the one in the middle):

HWI-ST1206:32:C3VGTACXX:5:2313:18281:8513   16  chrUn   536921741   0   84M *   0   0   CCCAGAGTGCAAAGAACCTTGGCAGGACACTAGACAACACCCTGTTGTTTTCTGCAAACATCAAAGCAGTGACTCTGTCCTGCA    DDDDEEEEEEFFFD@DHFGHHJHIIIIIGGJIJJIGGHHJJJJJJJJIJJJJJJIJJJJJIIJJJJJJJJJJJJJJJHHHHHFF    AS:i:-29    XS:i:-35    XN:i:0  XM:i:5  XO:i:0  XG:i:0  NM:i:5  MD:Z:11T18G1A14G28C7    YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:2314:6453:46729   16  chrUn   536921741   0   84M *   0   0   CCCAGAGTGCAAAGAACCTTGGCAGGACACTAGACAACACCCTGTTGTTTTCTGCAAACATCAAAGCAGTGACTCTGTCCTGCA DDDDEEEEEEFFFEDFHHHHHJJIJJIJIIIIIGJJIFFJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIHJJJJJHHHHHFF   AS:i:-29    XS:i:-35    XN:i:0  XM:i:5  XO:i:0  XG:i:0  NM:i:5  MD:Z:11T18G1A14G28C7    YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1104:14976:28487  16  chrUn   536961523   255 84M *   0   0   ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTACTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA</del>  DDDDDDDDDDDDDDDDDDDDDCEEEEEFFFFFHFFJJJJJJJJJIIHD8JJJJIHIJJJJJJIJJJJJJIJJIJJJJHHHHHFF    AS:i:-9 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:20G27T35   YT:Z:UU RG:Z:70130.satrFamilyX**
HWI-ST1206:32:C3VGTACXX:5:1104:10760:63491  16  chrUn   536961523   255 84M *   0   0   ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTTCTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA    DDDDCCCDDDDDDBCDCDCDC@DDEEECEEBFEFEHJJJJJJIHGJJIJJJJIHFJJJJIIJIJJIJJJIJJJJJJJHHHHHFF    AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:20G63  YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1104:16496:63907  16  chrUn   536961523   255 84M *   0   0   ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTTCTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA    DDDDDDDDDDDDDDDDDDCCCCDDEEECEEFFFFFHJJJIJJJJJJJJJJJJJIHJJJJJJJJJJIJJJJJJJJJIJHHHHHFF    AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:20G63  YT:Z:UU RG:Z:70130.satrFamilyX

I subsequently verified my BAM files with Picard, which yields the following error # "bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned" for 84554 reads that are almost consecutive in the same genomic region (chrUn - which is composed of sequences for which no chromosome anchoring information exist). However, the reads flagged by Picard occur about 2 million base pairs before the region in question for GATK and PrintReads seemed to have no problem with them.

For the sake of completeness, here the beginning and end of the section that was flagged by Picard. It starts with the second of the following reads:

HWI-ST1206:32:C3VGTACXX:5:2315:1142:72568   256 chrUn   997157147   2   84M *   0   0   TGCAGGACCACAGTCTTTTCACCAGCTCCTCAAAATGGTTGAAGAAGTAGTATTTTGGGGGCTCGTATAAACTCGTATGCCGTC    FFHHHHHJJIJJJDFHHGIHIGIJJIJJJIIJJJIIJGIJIGIFHDHCGHCGGHIJIIIIEFDDBDDCDEDAAC?ABCDCCBB@    AS:i:-50    XS:i:-50    XN:i:0  XM:i:10 XO:i:0  XG:i:0  NM:i:10 MD:Z:2T41T8C7A7G0G1A2G3A2G1 YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1102:4043:34769   16  chrUn   997184926   255 84M *   0   0   TCGCCTACAAACAAAACCCCACCCATGCACACCTGATTGGTGCACGTAAAACTGGGAAAGAAGTTACCTCCCTGGAAGCCTGCA    DDDDDEDDDDDDDB@BDDEECHHIIIJHHIHJIHJIGJJJJJJIIJIJJJJIJJJIJJJJJJJIGCJJIJJJJJIJJHHHHHFF    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:41T15G26   YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1105:2829:22242   16  chrUn   997184926   255 84M *   0   0   TCGCCTACAAACAAAACCCCACCCATGCACACCTGATTGGTGCACGTAAAACTGGGAAAGAAGTTACCTCCCTGGAAGCCTGCA    DDDDDEDDDDDDB@?;DDEE?HHIIGIIHHGJIIJJJJJJJJIIJJJJJJJJJJJJJJJIJIIHE?JIGJJJJJJJJHHHHHFF    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:41T15G26   YT:Z:UU RG:Z:70130.satrFamilyX

And stops at the following read (note the second read here is ok):

HWI-ST1206:32:C3VGTACXX:5:2314:10564:44125  16  chrUn   1110931683  255 84M *   0   0   GCTAGCTGACTACTGTAGCCCCAGCCCTTATTAGCCTGCACTGTTATAAGGTGTTATGATTCACACACATTGAAATGGCCTGCA    DCDDDCECCCCCBDFFFHHHHGCIHHHFGF@FGHFIIGGGHJIJJJJJJJIJJJJJIJJJIHFIIJIJJJJJJJJJJHHHGHFF    AS:i:-10    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:37A39A6    YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1101:4103:100272  256 chrUn   1110931763  0   84M *   0   0   TGCAGGACAATGCCAGTGAATGGAGCCTAATTACCCTCTGTAATGAAATGTCACATTTTTGTTTCTGGGTCCTGAATAGATGTG    FFHHHHHJJJJJJJJJGIIJJJJIJJJJJJIJJJJJJJJJJJJJJJJJJIJJJJIIJJJJIJJJJJJIHHHFHFFFFFFEEEEE    AS:i:-35    XS:i:-35    XN:i:0  XM:i:6  XO:i:0  XG:i:0  NM:i:6  MD:Z:13A34A5G4C1C11A10  YT:Z:UU RG:Z:70130.satrFamilyX

I personally cannot find something specific that would explain why PrintReads does not work. The pipeline that I use works perfectly when I use an assembly based on my RADseq data, yet not with the reference genome. Thus, I would like to ask, if you could perhaps help me. Thanks!

Created 2014-06-14 02:55:54 | Updated | Tags: java gatk

Uh..Does the GATK read the reference such as hg19 from the disk each time it runs some tools? As the GATK will run many tools such as RealignerTargetCreator, IndelRealigner, BaseRecalibrator in a variant detection pipeline, if it read hg19 from the disk every time, it will be time-consuming...

Created 2014-06-10 20:57:36 | Updated | Tags: variantannotator best-practices vcf developer performance walkers java gatk gatk-walkers

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

Created 2014-04-25 06:48:55 | Updated | Tags: install gatk newbie

Hi, I have just used Ubuntu. I'm want install GATK for my work. But i can't search anything about how to install GATK by Ubuntu Software Center or Teminal. Please help me. Thanks.

Created 2014-04-16 14:11:37 | Updated | Tags: unifiedgenotyper haplotypecaller bug gatk variant-calling

HI, I'd like to report a weird result from HaplotypeCaller.

We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.

However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected)

There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.

Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype.

Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter.

Many Thanks

Betty

Created 2014-03-31 10:41:27 | Updated | Tags: unifiedgenotyper gatk

Hi

I am hitting with the following error when I try to use UnifiedGenotyper. Wondering if its a bug or something wrong with my bam file? Thanks for your help in advance.

##### ERROR stack trace

java.lang.IllegalArgumentException: Illegal base [K] seen in the allele at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:205) at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:213) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.fillQualsFromPileup(RankSumTest.java:159) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.annotate(RankSumTest.java:113) at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateGenotypes(UnifiedGenotyperEngine.java:560) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:234) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

##### ERROR ------------------------------------------------------------------------------------------

Cheers Scott..

Created 2014-03-28 13:58:31 | Updated | Tags: unifiedgenotyper multithreading multi-sample gatk capacity

Hi,

How many samples and of what file/genome size can the Unified Genotyper process simultaneously ? I will have about 250 samples, genome size 180Mb, bam sizes 3-5GB, reduced bam sizes ~1GB.

Also, how long would this take can it be multi-threaded ? Cheers,

Blue

Created 2014-03-12 08:28:47 | Updated | Tags: unifiedgenotyper gatk variant-calling

Hi, we observe an unexpected deletion call (GATK UnifiedGenotyper 3.0 and 2.5) in a setup with three samples called together. In the file call.txt you find the call. From the pileup (pileup.txt, position 19:57325555) we would expect, that the indel would only be called for the first sample. Is there another source of information, that makes GATK believe, that the deletion also occurs in the second and third sample?

Created 2014-02-24 22:22:31 | Updated | Tags: vqsr filter gatk

In my PiCard/GATK pipeline, I already include the 1000G_gold_standard and dbsnp files in my VQSR step, I am wondering if I should further filter the final vcf files. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, downloaded from the GATK resource bundle.

I recently came across the NHLBI exome seq data http://evs.gs.washington.edu/EVS/#tabs-7, and the more complete 1000G variants ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/

These made me wonder if I should use these available VCFs to further filter my VCF files to remove the common SNPs. If so, can I use the "--mask" parameter in VariantFiltration of GATK to do the filtration? Examples below copied from documentation page:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T VariantFiltration \
-o output.vcf \
--variant input.vcf \
--filterExpression "AB < 0.2 || MQ0 > 50" \
--filterName "Nov09filters" \
--maskName InDel

Created 2014-02-24 22:21:58 | Updated | Tags: vqsr filter gatk

In my PiCard/GATK pipeline, I already include the 1000G_gold_standard and dbsnp files in my VQSR step, I am wondering if I should further filter the final vcf files. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, downloaded from the GATK resource bundle.

I recently came across the NHLBI exome seq data http://evs.gs.washington.edu/EVS/#tabs-7, and the more complete 1000G variants ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/

These made me wonder if I should use these available VCFs to further filter my VCF files to remove the common SNPs. If so, can I use the "--mask" parameter in VariantFiltration of GATK to do the filtration? Examples below copied from documentation page:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T VariantFiltration \
-o output.vcf \
--variant input.vcf \
--filterExpression "AB < 0.2 || MQ0 > 50" \
--filterName "Nov09filters" \
--maskName InDel

Created 2014-02-22 20:15:08 | Updated | Tags: gatk

Hello, I'm a little confused as to the format for multi sample snp calling suing Unified Genotyper. The example says: -I sample1.bam [-I sample2.bam ...] \ but the presence of the dots is confusing.

Is the right format as in example a) or b) ? a) -I sample1.bam [-I sample2.bam] [-I sample3.bam] [-I sample4.bam] b) -I sample1.bam [-I sample2.bam sample3.bam sample4.bam] I tried both ways and got error both times actually - "invalid argument "[-I"

Thanks

Created 2014-01-23 16:32:26 | Updated | Tags: merge gatk

I have in a database 11 vcf and bam files for individuals we've sequenced. I have been trying to merge the 11 individual vcf files into one combined vcf file using CombineVariants in GATK. While it does combine the vcf files, it does something odd that I'm sure has been solved by other users and I am looking for input on.

A singleton SNP in individual 1 will be given "./." in all other 10 individuals instead of "0/0". Is there a way to fix this--the genotypes are not missing, they are reference.That said, some of them will be missing and are rightly called "./.", but I don't know how to incorporate this information into a merged VCF file.

Your help is most appreciated and apologies if this has been asked before--I couldn't find this exact topic.

Created 2014-01-10 22:01:49 | Updated | Tags: unifiedgenotyper vcf gatk

One of my samples has this entry "1/1:0,1:1:3:36,3,0" in the information field, and from my understanding, the genotype is homo variant, because it has 1/1. However, I do not understand why. Since it has 0 REF reads and has only 1 ALT read, how does GATK tell this variant is a homo variant??

Created 2013-11-22 09:36:51 | Updated | Tags: depthofcoverage gatk

I am using this command to calculate the depth of coverage with three different formats for the EXOME_interval.list : java -Xmx2048m -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I /home/vsinha/software/bam/group1.READS.bam.list -L /home/vsinha/software/EXOME.interval_list -R /home/vsinha/software/human_g1k_v37.fasta -dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable --minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 --includeRefNSites --countType COUNT_FRAGMENTS -o /home/vsinha/software/group1.DATA

1st format : 1 69114 69332 1 69383 69577 1 69644 69940 1 69951 70028 1 566170 566275 1 801895 801983 1 802332 802416 1 861272 861420

Error : ERROR MESSAGE: Badly formed genome loc: Contig '1 69114 69332' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

2nd format: 1:69114-69332 1:69383-69577 1:69644-69940 1:69951-70028 1:566170-566275 1:801895-801983 1:802332-802416 1:861272-861420

Error: Badly formed genome loc: Contig '1' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

3rd Format: chr1:69114-69332 chr1:69383-69577 chr1:69644-69940 chr1:69951-70028 chr1:566170-566275 chr1:801895-801983 chr1:802332-802416 chr1:861272-861420

Error: Contig chr1 not present in sequence dictionary for merged BAM header: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]

Can you please let me know what is the problem in my file.

Regards Vishal

Created 2013-11-14 19:55:16 | Updated | Tags: license gatk

I have been asked to reach out to you to clarify the GATK license. We are hoping to package our CNV/SNP calling and annotation pipeline that includes GATK and make it available on an Amazon Machine Image (AMI) for use on the Amazon EC2 cloud for anyone who wants to replicate our protocol using their own data. Of all the tools that we use, only GATK contains license restrictions. It is our intent that our AMI be used by academic/non-commercial users, and it is not intended to be sold. We can list the AMI in the AWS Marketplace as a Bring Your Own License (BYOL) instance. However, there are no checks provided by Amazon that users do in fact have a license. Is this permissible?

Created 2013-11-12 19:05:01 | Updated | Tags: gatk

Hi,

I would like to attend the workshops by Broad on GATK either online or some seminar in New York.Let me know if there is anything planned for the future.

Created 2013-10-13 06:06:13 | Updated | Tags: reducereads gatk

I recently upgraded from GATK 2.5 to the latest 2.74 stable release, but the Reduce Reads throws the following error when I try to run it with a Bam file that was produced by "PrintReads" (3 samples merged in one Bam file).

MESSAGE: Bad input: Reduce Reads is not meant to be run for more than 1 sample at a time except for the specific case of tumor/normal >pairs in cancer analysis

java -Xmx6g -Djava.awt.headless=true -jar \$CLASSPATH/GenomeAnalysisTK.jar \ -T ReduceReads \ -R ../GATK_ref/hg19.fasta \ -I ../GATK/BQSR/all3Samples2.recal.bam \ -o ../GATK/BQSR/all3Samples.recal.compressed.bam

It used to work with the old version of GATK, but it does not work now. Where could it be wrong?

Created 2013-08-22 18:55:31 | Updated | Tags: unifiedgenotyper gatk

Dear GATK Team,

I am receiving the following error while running GATK 1.6. Unfortunately, for project consistency I cannot update to a more recent version of GATK and would at least wish to understand the source of the error. I ran ValidateSamFile on the input bam files and they appear to be OK.

Any insight or advice would be greatly appreciated:

##### ERROR ------------------------------------------------------------------------------------------

##### ERROR ------------------------------------------------------------------------------------------

Abbreviated commandline used:

GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -et NO_ET \ -R "Saccharomyces_cerevisiae/UCSC/sacCer2/Sequence/WholeGenomeFasta/genome.fa" \ -dcov 5000 -I "someFile.bam" --output_mode EMIT_ALL_SITES -gvcf -l OFF \ -stand_call_conf 1 -L chrIV:1-1531919

Created 2013-08-13 21:38:39 | Updated | Tags: vcf gatk picard

I finally got the filtered VCF file from PWA + PiCard + GATK pipeline, and have 11 exome-seq data files which were processed as a list of input to GATK. In the process of getting VCF, I did not see an option of separating the 11 samples. Now, I've got two VCF files (one for SNPs and the other for indels) that each has 11 samples. My question is how to proceed from here?

Should I separate the 11 files before annotation? or annotation first then split them 11 samples to individual files? Big question here is how to split the samples from vcf files? thanks

Created 2013-08-11 15:39:42 | Updated | Tags: depthofcoverage gatk

I'm a bit confused about some of the output of GATK's DepthOfCoverage script. In the *.GATK.covdepth.sample_summary, what does the "total" column signify? Is it the total number of base pairs in the data? Because my raw read data has ~37.2 billion bp. However, GATK reports over 92 billion bp. Does the 'total' mean something else, or will I have to rerun the DepthofCoverage script?

Created 2013-08-08 16:17:33 | Updated | Tags: gatk slow

I have 15 exome-seq samples, and have been using BWA-PiCard-GATK pipeline to do the variant calling. I did not realize GATK is so slow until I have to analyze this large number of samples. In this HaplotypeCaller step, each sample seems to take at least 2 days (48+ hours). Is this normal is there's something I did wrong? Below is my command, is there anything wrong or GATK-HaplotypeCaller is known this slooooow?

java -Xmx10g -Djava.awt.headless=true -jar /Library/Java/Extensions/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -minPruning 3 \ -dcov 10 \ -R ./GATK_ref/hg19.fasta \ -I ./GATK/BQSR/sample1_realign.recal.compressed.bam \ -o ./GATK/VQSR/sample1_realign.raw.snps_indels.vcf

Is there anything I did wrong with this command? or anything I can change to make it run faster? I am using GATK 2.5 by the way. thanks!

Created 2013-04-25 11:21:53 | Updated | Tags: gatk

Hi,

I am using GATK v2.4.9 and i am trying to liftover from one build to another. Below are the arguments for LiftoverVariants:

Arguments for LiftoverVariants: -V,--variant Input VCF file -o,--out File to which variants should be written -chain,--chain Chain file -dict,--newSequenceDictionary Sequence .dict file for the new build -recordOriginalLocation,--recordOriginalLocation Should we record what the original location was in the INFO field?

Here is the command i have used, java -jar GenomeAnalysisTK.jar -T LiftoverVariants -chain canFam2ToCanFam3.over.chain -dict canFam3_dog_genome.dict -V canFam2_snp131.vcf -o canFam3_snp131.vcf

It showed an error message: ERROR MESSAGE: Walker requires a reference but none was provided.

The argument reference was not present in the documentation and the tool needs a reference argument. If it is required, what should be the reference file, new build or old build?

Thanks.

Created 2013-04-25 11:11:55 | Updated | Tags: gatk

Hi,

I am using unified genotyper utility of GATK. I don't exactly know how I should make the dbsnp file necessary. I thank you very much for any help.

Regards, Homa

Created 2012-12-09 02:19:39 | Updated | Tags: gatk

I have read this on GATK's documents:

Human sequence

If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error.

that said the reference order must be: chr1, chr2, chr3, ... chr22, chrX, chrY, chrM. but after I download all the bundle in GATK's ftp, I check's reference, it's with a order of :

>chrM
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr20
>chr21
>chr22
>chrX
>chrY
...