# Tagged with #gatk 1 documentation article | 0 announcements | 87 forum discussions

Created 2015-07-06 18:30:35 | Updated 2015-07-07 15:32:56 | Tags: queue gatk build

TL;DR: mvn -Ddisable.shadepackage verify

### Background

In addition to Queue's GATK-wrapper codegen, relatively slow scala compilation, etc. there's still a lot of legacy compatibility from our ant days in the Maven scripts. Our mvn verify behaves more like when one runs ant, and builds everything needed to bundle the GATK.

As of GATK 3.4, by default the build for the "protected" code generates jar files that contains every class needed for running, one for the GATK and one for Queue. This is done by the Maven shade plugin, and are each called the "package jar". But, there's a way to generate a jar file that only contains META-INF/MANIFEST.MF pointers to the dependency jar files, instead of zipping/shading them up. These are each the "executable jar", and FYI are always generated as it takes seconds, not minutes.

### Instructions for fast compilation

While developing and recompiling Queue, disable the shaded jar with -Ddisable.shadepackage. Then run java -jar target/executable/Queue.jar ... If you need to transfer this jar to another machine / directory, you can't copy (or rsync) just the jar, you'll need the entire executable directory.

# Total expected time, on a local disk, with Queue:
#   ~5.0 min from clean
#   ~1.5 min per recompile

# always available
java -jar target/executable/Queue.jar --help

java -jar target/package/Queue.jar --help

If one is only developing for the GATK, skip Queue by adding -P\!queue also.

mvn -Ddisable.shadepackage -P\!queue verify

# always available
java -jar target/executable/GenomeAnalysisTK.jar --help

java -jar target/executable/Queue.jar --help
No articles to display.

Created 2016-05-25 22:44:45 | Updated 2016-05-25 22:46:24 | Tags: gatk error splitncigarreads

I know this question has been asked before. I have read the answers of previous post and did fix by the instructions, but this problem keeping occur. I have checked my storage space, there is enough space indeed. Here is my script:

java -d64 -Xmx32g -Djava.io.tmpdif=/home/RNAseq_SCZ/tmp -jar GATK -T SplitNCigarReads -R /home/hg19seq/hg19.fa -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIG AR_READS The output of error message: ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Writing failed because there is no space left on the disk or hard drive. Please make some space or specify a different location for writing output files. ##### ERROR ------------------------------------------------------------------------------------------ Created 2016-05-24 21:40:42 | Updated | Tags: unifiedgenotyper haplotypecaller gatk snps Hi, I am currently working on two different projects and interested in finding common and rare variants. Project1 - Organism : Influenza virus Number of samples : 18 Sequencing type: Exome sequencing Alignment tool : BWA Analysis ready BAM files : • BAM files are generated using BWA, • then sorted the bam • deduplicated bam file using picard (markduplicates) GATK variant caller : Unifiedgenotyper or HaplotypeCaller (Which one to be used?) Project2 : Organism : Human Number of samples : 24 Sequencing type: DNAseq (paired-end) Analysis ready BAM files : • BAM files are generated using BWA, • then sorted the bam • and finally deduplicated and recalibrated_reads.bam GATK variant caller : Unifiedgenotyper or HaplotypeCaller (Which one to be used?) What is the criteria to select the variant caller? Created 2016-05-20 22:32:20 | Updated | Tags: indelrealigner realignertargetcreator queue performance gatk lsf memory Hi, I'm using a QScript to run the GATK best practices on LSF. I'm trying to run the data processing steps now. RealignerTargetCreator and IndelRealigner are using so much memory that my jobs are being killed by LSF. Usually they don't even make it through RealignerTargetCreator, but the ones that do make it have a memory explosion in IndelRealigner and get killed at that step. I'm testing the pipeline with a very small dataset: 3 bam files of about 10 million reads each. Even when I downsample them to 1 million reads each, the memory still explodes. I've tried requesting up to 32g of memory. Higher memory requests allow the jobs to run longer, but they eventually outgrow even 32g. A few notes: • I'm using GATK/Queue 3.5-0. • The "-memLimit" option is working correctly so that my jobs are submitted with the correct memory request. LSF is reporting that the actual memory usage is indeed extremely high. • I've validated my bam files with the Picard validation tool. • I'm using 1000G_phase1.indels.hg19.vcf and dbsnp_137.hg19.vcf from the GATK resource bundle downloaded in early 2013. Do you have any idea what could be happening? Here is an example of the output from LSF: Sender: LSF System <hpcadmin@c27> Subject: Job 57251: <RealignerTargetCreator: B11.sorted.rg.downsample.interval_list> Exited Job <RealignerTargetCreator: B11.sorted.rg.downsample.interval_list> was submitted from host <c15> by user <russellp> in cluster <compbio_cluster1>. Job was executed on host(s) <c27>, in queue <shared>, as user <russellp> in cluster <compbio_cluster1>. </home/russellp> was used as the home directory. </home/russellp/Projects/IPF_resequencing/GATK_testing> was used as the working directory. Started at Fri May 20 16:10:00 2016 Results reported at Fri May 20 16:11:03 2016 Your job looked like: ------------------------------------------------------------ # LSBATCH: User input sh /home/russellp/Projects/IPF_resequencing/GATK_testing/.queue/tmp/.exec5219091543152698331 ------------------------------------------------------------ TERM_MEMLIMIT: job killed after reaching LSF memory usage limit. Exited with exit code 130. Resource usage summary: CPU time : 79.19 sec. Max Memory : 7951 MB Max Swap : 34733 MB Max Processes : 4 Max Threads : 29 Here is my QScript: package qscripts import org.broadinstitute.gatk.queue.QScript //import org.broadinstitute.gatk.queue.extensions.picard.MarkDuplicates import org.broadinstitute.gatk.queue.extensions.gatk._ /** * Created by prussell on 3/3/16. * Implements GATK best practices */ class GatkBestPractices extends QScript { /** * ********************************************************************* * INPUTS * ********************************************************************* */ /** * Reference genome fasta */ @Input(doc="Reference genome for the bam files", shortName="R", fullName="REF_FASTA", required=true) var referenceFile: File = null /** * Bam files */ @Input(doc="One or more bam files", shortName="I", fullName="INPUT_BAM", required=true) var bamFiles: List[File] = Nil /** * VCF files */ @Input(doc="VCF file(s) with known indels", fullName="KNOWN_INDELS", required=true) var knownIndels: List[File] = Nil @Input(doc="Database of known variants e.g. dbSNP", fullName="KNOWN_VARIANTS", required=true) var knownPolymorphicSites: List[File] = Nil /** * Output directory */ @Input(doc="Output directory", shortName="od", fullName="OUT_DIR", required=true) var outputDirectory : File = null /** * Output file prefix not including directory */ @Input(doc="Output prefix not including directory", shortName="op", fullName="OUT_PREFIX", required=true) var outputPrefix : File = null /** * Common arguments */ trait CommonArguments extends CommandLineGATK { this.reference_sequence = referenceFile // TODO other common arguments? } def script() = { /** * Container for the processed bam files */ var processedFiles = Seq.empty[File] /** * Data processing */ for(bam <- bamFiles) { /** * ********************************************************************* * LOCAL REALIGNMENT AROUND INDELS * https://www.broadinstitute.org/gatk/guide/article?id=38 * https://www.broadinstitute.org/gatk/guide/article?id=2800 * ********************************************************************* */ /** * Local realignment around indels * Step 1 of 2: RealignerTargetCreator: Define intervals to target for local realignment * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php */ val realignerTargetCreator = new RealignerTargetCreator with CommonArguments realignerTargetCreator.input_file +:= bam realignerTargetCreator.known = knownIndels realignerTargetCreator.out = swapExt(bam, "bam", "interval_list") realignerTargetCreator.maxIntervalSize = int2intOption(500) // Default 500 realignerTargetCreator.minReadsAtLocus = int2intOption(4) // Default 4 realignerTargetCreator.mismatchFraction = double2doubleOption(0.0) // Default 0.0 realignerTargetCreator.windowSize = int2intOption(10) // Default 10 add(realignerTargetCreator) /** * Local realignment around indels * Step 2 of 2: IndelRealigner: Perform local realignment of reads around indels * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php */ val indelRealigner = new IndelRealigner with CommonArguments indelRealigner.targetIntervals = realignerTargetCreator.out indelRealigner.input_file +:= bam indelRealigner.knownAlleles = knownIndels indelRealigner.out = swapExt(bam, "bam", "realign.bam") indelRealigner.consensusDeterminationModel = null indelRealigner.LODThresholdForCleaning = double2doubleOption(5.0) // Default 5.0 indelRealigner.nWayOut = null indelRealigner.entropyThreshold = double2doubleOption(0.15) // Default 0.15 indelRealigner.maxConsensuses = int2intOption(30) // Default 30 indelRealigner.maxIsizeForMovement = int2intOption(3000) // Default 3000 indelRealigner.maxPositionalMoveAllowed = int2intOption(200) // Default 200 indelRealigner.maxReadsForConsensuses = int2intOption(120) // Default 120 indelRealigner.maxReadsForRealignment = int2intOption(20000) // Default 20000 indelRealigner.maxReadsInMemory = int2intOption(150000) // Default 150000 indelRealigner.noOriginalAlignmentTags = false add(indelRealigner) /** * ********************************************************************* * BASE QUALITY SCORE RECALIBRATION * https://www.broadinstitute.org/gatk/guide/article?id=44 * https://www.broadinstitute.org/gatk/guide/article?id=2801 * ********************************************************************* */ /** * Base quality score recalibration * Step 1 of 3: BaseRecalibrator: Generate base recalibration table to compensate for systematic errors in basecalling confidences * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php */ // Generate the first pass recalibration table file val baseRecalibratorBefore = new BaseRecalibrator with CommonArguments baseRecalibratorBefore.input_file +:= indelRealigner.out baseRecalibratorBefore.out = swapExt(indelRealigner.out, "realign.bam", "base_recalibrator_first_pass.out") baseRecalibratorBefore.knownSites = knownPolymorphicSites baseRecalibratorBefore.indels_context_size = int2intOption(3) // Default 3 baseRecalibratorBefore.maximum_cycle_value = int2intOption(500) // Default 500 baseRecalibratorBefore.mismatches_context_size = int2intOption(2) // Default 2 baseRecalibratorBefore.solid_nocall_strategy = null baseRecalibratorBefore.solid_recal_mode = null baseRecalibratorBefore.list = false baseRecalibratorBefore.lowMemoryMode = false baseRecalibratorBefore.no_standard_covs = false baseRecalibratorBefore.sort_by_all_columns = false baseRecalibratorBefore.binary_tag_name = null baseRecalibratorBefore.bqsrBAQGapOpenPenalty = double2doubleOption(40.0) // Default 40.0 baseRecalibratorBefore.deletions_default_quality = int2byteOption(45) // Default 45 baseRecalibratorBefore.insertions_default_quality = int2byteOption(45) // Default 45 baseRecalibratorBefore.low_quality_tail = int2byteOption(2) // Default 2 baseRecalibratorBefore.mismatches_default_quality = int2byteOption(-1) // Default -1 baseRecalibratorBefore.quantizing_levels = int2intOption(16) // Default 16 baseRecalibratorBefore.run_without_dbsnp_potentially_ruining_quality = false add(baseRecalibratorBefore) // Generate the second pass recalibration table file val baseRecalibratorAfter = new BaseRecalibrator with CommonArguments baseRecalibratorAfter.BQSR = baseRecalibratorBefore.out baseRecalibratorAfter.input_file +:= indelRealigner.out baseRecalibratorAfter.out = swapExt(indelRealigner.out, "realign.bam", "base_recalibrator_second_pass.out") baseRecalibratorAfter.knownSites = knownPolymorphicSites baseRecalibratorAfter.indels_context_size = int2intOption(3) // Default 3 baseRecalibratorAfter.maximum_cycle_value = int2intOption(500) // Default 500 baseRecalibratorAfter.mismatches_context_size = int2intOption(2) // Default 2 baseRecalibratorAfter.solid_nocall_strategy = null baseRecalibratorAfter.solid_recal_mode = null baseRecalibratorAfter.list = false baseRecalibratorAfter.lowMemoryMode = false baseRecalibratorAfter.no_standard_covs = false baseRecalibratorAfter.sort_by_all_columns = false baseRecalibratorAfter.binary_tag_name = null baseRecalibratorAfter.bqsrBAQGapOpenPenalty = double2doubleOption(40.0) // Default 40.0 baseRecalibratorAfter.deletions_default_quality = int2byteOption(45) // Default 45 baseRecalibratorAfter.insertions_default_quality = int2byteOption(45) // Default 45 baseRecalibratorAfter.low_quality_tail = int2byteOption(2) // Default 2 baseRecalibratorAfter.mismatches_default_quality = int2byteOption(-1) // Default -1 baseRecalibratorAfter.quantizing_levels = int2intOption(16) // Default 16 baseRecalibratorAfter.run_without_dbsnp_potentially_ruining_quality = false add(baseRecalibratorAfter) /** * Base quality score recalibration * Step 2 of 3: AnalyzeCovariates: Create plots to visualize base recalibration results * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_AnalyzeCovariates.php */ val analyzeCovariates = new AnalyzeCovariates with CommonArguments analyzeCovariates.beforeReportFile = baseRecalibratorBefore.out analyzeCovariates.afterReportFile = baseRecalibratorAfter.out analyzeCovariates.plotsReportFile = new File(outputDirectory.getAbsolutePath + "/" + outputPrefix + "_BQSR.pdf") analyzeCovariates.intermediateCsvFile = new File(outputDirectory.getAbsolutePath + "/" + outputPrefix + "_BQSR.csv") analyzeCovariates.ignoreLMT = false add(analyzeCovariates) /** * Base quality score recalibration * Step 3 of 3: PrintReads: Write out sequence read data (for filtering, merging, subsetting etc) * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php */ val printReads = new PrintReads with CommonArguments printReads.input_file +:= bam printReads.BQSR = baseRecalibratorAfter.out printReads.out = swapExt(bam, "bam", "recalibrated.bam") processedFiles +:= printReads.out } /** * ********************************************************************* * VARIANT DISCOVERY * https://www.broadinstitute.org/gatk/guide/bp_step.php?p=2 * ********************************************************************* */ /** * Variant discovery * Step 1 of 6: HaplotypeCaller: Call germline SNPs and indels via local re-assembly of haplotypes * https://www.broadinstitute.org/gatk/guide/article?id=2803 * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php * https://www.broadinstitute.org/gatk/guide/article?id=3893 */ // TODO /** * Variant discovery * Step 2 of 6: CombineGVCFs: Combine per-sample gVCF files produced by HaplotypeCaller into a multi-sample gVCF file * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineGVCFs.php */ // TODO /** * Variant discovery * Step 3 of 6: GenotypeGVCFs: Perform joint genotyping on gVCF files produced by HaplotypeCaller * https://www.broadinstitute.org/gatk/guide/article?id=3893 * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php */ // TODO /** * Variant discovery * Step 4 of 6: VariantFiltration: Filter variant calls based on INFO and FORMAT annotations * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_filters_VariantFiltration.php */ // TODO /** * Variant discovery * Step 5 of 6: VariantRecalibrator: Build a recalibration model to score variant quality for filtering purposes * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php * https://www.broadinstitute.org/gatk/guide/article?id=39 * https://www.broadinstitute.org/gatk/guide/article?id=2805 * */ // TODO /** * Variant discovery * Step 6 of 6: ApplyRecalibration: Apply a score cutoff to filter variants based on a recalibration table * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_ApplyRecalibration.php * https://www.broadinstitute.org/gatk/guide/article?id=2806 */ // TODO /** * ********************************************************************* * CALLSET REFINEMENT * https://www.broadinstitute.org/gatk/guide/bp_step.php?p=3 * ********************************************************************* */ /** * Callset refinement * Step 1 of 8: CalculateGenotypePosteriors: Calculate genotype posterior likelihoods given panel data * https://www.broadinstitute.org/gatk/guide/article?id=4727 * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CalculateGenotypePosteriors.php */ // TODO /** * Callset refinement * Step 2 of 8: VariantFiltration: Filter variant calls based on INFO and FORMAT annotations * https://www.broadinstitute.org/gatk/guide/article?id=4727 * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_filters_VariantFiltration.php */ // TODO /** * Callset refinement * Step 3 of 8: VariantAnnotator: Annotate variant calls with context information * https://www.broadinstitute.org/gatk/guide/article?id=4727 * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php */ // TODO /** * Callset refinement * Step 4 of 8: SelectVariants: Select a subset of variants from a larger callset * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php */ // TODO /** * Callset refinement * Step 5 of 8: CombineVariants: Combine variant records from different sources * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php */ // TODO /** * Callset refinement * Step 6 of 8: VariantEval: General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv ratios, and a lot more) * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_varianteval_VariantEval.php */ // TODO /** * Callset refinement * Step 7 of 8: VariantsToTable: Extract specific fields from a VCF file to a tab-delimited table * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php */ // TODO /** * Callset refinement * Step 8 of 8: GenotypeConcordance (Picard version): Genotype concordance * https://broadinstitute.github.io/picard/command-line-overview.html#GenotypeConcordance */ // TODO } } Created 2016-05-13 20:39:58 | Updated | Tags: realignertargetcreator bug gatk error runtime-error index-file I have been trying to use multiple GATK tools after processing my sample bam files with multiple Picard tools (AddOrReplaceReadGroups, SortSam, MarkDuplicates, and ReorderSam in that order) and creating a new index file with each step. However, I keep getting the same error: My input: java -jarGATK -T RealignerTargetCreator -R human_wgfa -I 000ordered.bam -o 000realignertargetcreator.intervals Output: INFO 15:22:51,165 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:22:51,166 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 INFO 15:22:51,166 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:22:51,166 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:22:51,169 HelpFormatter - Program Args: -T RealignerTargetCreator -R /Users/sybouts/Genomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa -I 000ordered.bam -o 000realignertargetcreator.intervals INFO 15:22:51,173 HelpFormatter - Executing as sybouts@m2600145308.v41.uthscsa.edu on Mac OS X 10.11.4 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14. INFO 15:22:51,173 HelpFormatter - Date/Time: 2016/05/13 15:22:51 INFO 15:22:51,174 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:22:51,174 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:22:51,435 GenomeAnalysisEngine - Strictness is SILENT INFO 15:22:51,507 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:22:51,511 SAMDataSourceSAMReaders - Initializing SAMRecords in serial WARNING: BAM index file /Users/sybouts/SA_BLM_family/000ordered.bai is older than BAM /Users/sybouts/SA_BLM_family/000ordered.bam INFO 15:22:51,529 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 INFO 15:22:52,424 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to open index file (mark/reset not supported)000ordered.bai at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.openIndexFile(GATKBAMIndex.java:319) at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.(GATKBAMIndex.java:97) at org.broadinstitute.gatk.engine.datasources.reads.SAMDataSource.(SAMDataSource.java:388) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.createReadsDataSource(GenomeAnalysisEngine.java:931) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:822) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:286) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106) Caused by: java.io.IOException: mark/reset not supported at java.io.InputStream.reset(InputStream.java:348) at htsjdk.samtools.SamIndexes.asBaiSeekableStreamOrNull(SamIndexes.java:80) at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.openIndexFile(GATKBAMIndex.java:314) ... 9 more ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Unable to open index file (mark/reset not supported)000ordered.bai ##### ERROR ------------------------------------------------------------------------------------------ I am really frustrated and done everything I can to figure out what is going on, but the same thing keeps happening. Please let me know if there is any other information that may be helpful! Thanks! Created 2016-05-13 15:31:53 | Updated | Tags: gatk small-rna-seq Hi, I would like to know if it is possible to use GATK on small RNA seq using the RNA seq processing. Thank you. Riccardo Created 2016-04-27 12:04:17 | Updated | Tags: gatk genotypegvcfs combinegvcfs concatvariants Hi there, I'm attempting to use GenotypeGVCFs with 223 gzipped gvcf files containing WGS data. The files were gzipped by GATK programs (I don't have the space to deal with uncompressed files). As has been discussed in other threads, GenotypeGVCFs crashes with the gzipped files. I believe this is because HaplotypeCaller doesn't index the files, and GATK programs automatically use tabix to index any unindexed gzipped files they receive, yes? As such, I'm looking for the best workaround. So far, I have used HaplotypeCaller to call sites on intervals in each genome, then used ConcatVariants to concatenate each set of intervals. I am now attempting to run CombineGVCFs on the concatenated GVCF files and split the files into 10 groups to run on GenotypeGVCFs. This will take approx 48 hours with no guarantee that GenotypeGVCFs will like the output. Will it make any difference to runtime or output (and, indeed, does it make sense) to run CombineGVCFs on the groups of intervals before giving them to ConcatVariants then GenotypeGVCFs? Apologies if this is a daft question but I've already lost a fair amount of time on this so I just want to make sure what I'm doing makes sense :) Thanks, Ian Created 2016-02-27 23:55:23 | Updated | Tags: haplotypecaller gatkdocs variantfiltration vcf developer gatk variant-calling In a discussion about using ERC, you provide some example VCF output lines like: 20 10000442 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,168,2095 20 10000443 . A <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,169,2089 20 10000444 . A <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,168,2093  and say: 1. "For each reference base not part of a variant call we emit a VCF record with the special symbolic allele <NON_REF> indicating the call is between the reference base and any possible non-reference allele that might be segregating at this site," 2. and 3. "Note that there's no site-level QUAL field value. We discussed this internally and since the QUAL is the probability that the site is polymorphic, all of the QUAL field values should be 0 here, so we decided to drop it." While the formal Variant Call Format 4.2 Specification says: 1. "ALT - alternate base(s): Comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on breakends," 2. and 3. "QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. −10log10 prob(call in ALT is wrong). If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10 prob(no variant). If unknown, the missing value should be speciﬁed." The issue is subtle, but introduces problems with downstream processing of HaplotypeCaller generated VCF files containing reference calls. The use of the symbol "<NON_REF>" instead of "." for reference calls is a little confusing, but I also see the logic of that. More seriously: According to the VCFv4.2 specs, QUAL is NOT always a measure of "the probability that the site is polymorphic". Perhaps when a variant is called, but not when a site is called as non-variant. All those QUAL field values should NOT be 0 there. It is about the quality and correctness of the call itself. It is defined as the PHRED score of the probability that the assertion made in ALT is wrong. So if ALT asserts "<NON_REF>" or "." CONFIDENTLY (meaning a low probability that the assertion is wrong), then the QUAL PHRED score should be HIGH, not ZERO. A confident call should have a high QUAL score, whether variant or monomorphic. That is the intention of the specification. If otherwise: filtering out low quality records from a non-compliant VCF output file by removing those with low QUAL scores, for instance, would also filter out all the high quality reference calls. I have previously been in touch with the writers/keepers of the VCF Specs (now over at samtools) and they confirm this interpretation of QUAL scores as the correct one for VCFv4.2 (and other versions) compliant output. It's unfortunately couched in mathematical double negatives, but look at it carefully and you will see the correctness of this reading. As one who uses tools from many sources, I hope you will address this discrepancy. It complicates interoperability. -- Fred P. Created 2016-02-26 05:18:49 | Updated | Tags: gatk resource-bundle I have been experimenting with GATK for SNP calling/filtering against the human genome using the hg36 bundle. However, a few weeks back, I believe GATK changed the default bundle to hg38. Now, the sequence of computational steps I have programmed fails because of differences in the naming convention within the hg38 bundle. It looks like the problem involves dpsnp_144.vcf, which has the chromosome locations named "1" or "X", whereas all the other files and reference included names the chromosomes "chr1" and "chrX". Can anyone else please confirm if this is personal user error or a genuine error in bundle naming convention? Thanks! Created 2016-02-26 01:07:05 | Updated | Tags: depthofcoverage bam gatk Very new to bioinformatics tools here. I'm running DepthOfCoverage with a list of 30 bam files 1 to 2 GB each on a (rat) reference genome ~ 1.8 GB. My terminal window is taunting me with : INFO 16:29:24,523 ProgressMeter - 1:74214637 7.4203136E7 2.9 h 2.3 m 2.6% 4.6 d 4.5 d INFO 16:30:24,525 ProgressMeter - 1:74636521 7.462912E7 2.9 h 2.3 m 2.6% 4.6 d 4.5 d INFO 16:31:24,526 ProgressMeter - 1:75031737 7.5022336E7 2.9 h 2.3 m 2.6% 4.6 d 4.5 d INFO 16:32:24,528 ProgressMeter - 1:75443037 7.5431936E7 2.9 h 2.3 m 2.6% 4.6 d 4.5 d INFO 16:33:24,530 ProgressMeter - 1:75854037 7.5841536E7 2.9 h 2.3 m 2.6% 4.6 d 4.5 d INFO 16:34:24,532 ProgressMeter - 1:76272821 7.626752E7 3.0 h 2.3 m 2.7% 4.6 d 4.5 d INFO 16:35:24,533 ProgressMeter - 1:76708805 7.6693504E7 3.0 h 2.3 m 2.7% 4.6 d 4.5 d This is how I ran it: java -jar$GATK \
  -T DepthOfCoverage \
  -R REF_GENOME \  -o depth.txt \  -I input.list \  --outputFormat csv Is that just how long it should take? Created 2016-02-18 19:19:47 | Updated | Tags: gatk Hello GATK team! I'm new to the field of bioinformatics, and trying to get all the software required to match a pipeline that was made almost half a decade ago. The only one I'm missing is GATK 1.0.5083 and can't seem to find the source files (to build it myself) or the jar file available anywhere. I was wondering if there was any way for me to get access to the needed files for me to get a working copy of the GATK 1.0.5083. I understand that it's not recommended or supported, but I need it to duplicate some data. Thanks, Cavin Created 2016-02-12 20:57:50 | Updated | Tags: java gatk maven Hi, I am using the class GATKVariantContextUtils in some of my project and I want to add the non-fat jar for GATK for this as a maven dependency. What is the best way to do this? Thanks! Created 2016-02-11 22:33:35 | Updated | Tags: haplotypecaller bug gatk heterozygous Hello! I'm pretty new to this so pardon me if this is formatted incorrectly or I'm missing something obvious. It looks like I have the same problem as: http://gatkforums.broadinstitute.org/gatk/discussion/2319/haplotypecaller-incorrectly-making-heterozygous-calls-again but their problems seemed to be fixed simply by updating, and I'm using the current version of GATK (3.5.0). Any idea what I can do to fix this? When processing this, I mostly used the GATK Best Practices. However, I did use joint calling instead of the mixed method, since I figured with only 93 samples, for a handful of genes, it wouldn't benefit too heavily from more efficient computation, and plugging the whole BAM in at once seemed easier. Here's a shot from IGV of a portion of one of my samples. The circled SNPs are the ones in question. For all 3, all of the reads in the BAM are for the alternate, but the VCF produced by HaplotypeCaller has them as heterozygous (and passing filters). It seems to be pretty consistent in which SNPs it messes up. On the other hand, some samples are fine, and some are not, but I don't see a pattern as to why some work and some don't. I'd be happy to provide any other information that would help resolve this. Thank you for any help you can provide, Jessica Patnode Created 2016-02-02 19:26:18 | Updated | Tags: gatk gvcf merging Hi Team, I need to know something and hopefully is simple to implement. I have to run GATK HaplotypeCaller on a large BAM file, thus I have to run this in batches of 4 hours top. I've done some test to determine the appropriate size for the subsampled BAM. So let's say I need to run 10 jobs. Each one of those will output a gVCG file, which all belong to the same individual in this experiment. I have read that there are several methods to parse and merge VCF files into a single one like CatVariants, CombineGVCFs and CombineVariants. The question is that I'm inclined to use CombineGVCFs since is the output I have, but I also have the understanding that this is for merging different individuals which is not what I have to do. So which approach should I use? Thanks for the inputs!!! Alejandro Created 2016-01-29 21:19:34 | Updated | Tags: depthofcoverage gatk Hi, I was comparing the output from GATK DepthOfCoverage and samtools mpileup and noticed that for chromosome 1, for instance, the depth values differ per individual: GATK DepthOfCoverage sample (4th column is for individual of interest): 1:10001 1056 39.11 43 1:10002 1973 73.07 91 1:10003 2728 101.04 120 samtools mpileup (the last field is for the individual of interest): 1 10001 . T 0 . DP=298;I16=246,10,0,0,8031,252873,0,0,266,4956,0,0,216,3484,0,0;QS=8,0;MQSB=0.882047;MQ0F=0.855705 PL:DP:DV:SP:DPR 0,120,16:40:0:0:40,0 1 10002 . A 0 . DP=605;I16=492,13,0,0,15628,487360,0,0,583,11687,0,0,480,4056,0,0;QS=8,0;MQSB=0.82856;MQ0F=0.86281 PL:DP:DV:SP:DPR 0,244,26:81:0:0:81,0 1 10003 . A C, 0 . DP=862;I16=727,19,1,1,22640,694394,40,808,1252,27124,0,0,1102,6936,4,8;QS=7.98193,0.018075,0;VDB=0.02;SGB=-2.77779;RPB=0.485255;MQB=0.873995;MQSB=0.836835;BQB=0.0589812;MQ0F=0.845708 PL:DP:DV:SP:DPR 0,255,48,255,51,48:112:1:0:111,1,0 Any thoughts as to what could be happening? Thanks, Alva Created 2016-01-29 02:39:02 | Updated | Tags: gatk mutect2 gatk3-5 Using the standard mutect2 commands from the docs I get the error seen below. My reference file dictionaries were all made automatically from mutect version 1.4. Is there a workaround for this md5 problem. I tried an awk command to take out the m5 column from the fast.dict file, but that didnt work. Is there a flag that I'm not seeing to avoid this MD5 column check because for my purposes if the length is the same its likely that they are the same contig? Any help would be greatly appreciated. ##### ERROR MESSAGE: Input files reads and reference have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more information. Error details: Found contigs with the same name but different lengths or MD5s: ##### ERROR contig reads is named 3 with length 198022430 and MD5 fdfd811849cc2fadebc929bb925902e5 ##### ERROR contig reference is named 3 with length 198022430 and MD5 641e4338fa8d52a5b781bd2a2c08d3c3. Created 2016-01-28 08:48:49 | Updated 2016-01-28 09:12:56 | Tags: fastareference dbsnp gatk mutect2 I have a cancer rna-seq bam file that I would like to use to call variants using MuTect2, but I ran into a couple of errors that I would appreciate some assistance with. Thank you. 1. The chr notations in my bam file is absent in the reference fasta (Homo_sapiens_assembly19.fasta). I'd rather not change my bam file header using sed and samtools reheader and rather append chr notation into the reference file. Will this mess up the corresponding snp (00-All.vcf)? 2. Is there a karytypically sorted Ensembl reference (and its corresponding dbSNP vcf) that I can use for GATK workflow? 3. The biggest issue is "Error: Discordant contig lengths: read MT LN=16571, ref MT LN=16569" when I use picard's reordersam function. This is because the read was aligned against Ensembl reference. Is there a quick fix to resolve this? EDIT: I should mention that realigning my seq against GATK's reference is not an option because I don't have the original fastq files available. Created 2016-01-23 05:08:59 | Updated | Tags: gatkdocs gatk error Hi, I tried to look up the documentation for certain tools, such as CalculateGenotypePosteriors, but when I try to access it, I just get a blank page. This is the link I am trying to access: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CalculateGenotypePosteriors.php If this is not the correct link, please direct me to the right one. I noticed this was also happening to some other tools. Thanks, Alva Created 2016-01-23 05:03:33 | Updated | Tags: variantfiltration snp gatk Hi, I am doing an analysis on chimp variants, and do not have enough variants to train the model in order to perform VQSR. I am using hard filters at this point, as per the recommendation. I keep reading that the recommended cutoff values are only recommendations and that I should expect to tweak the parameters further. However, how do I do this in an effective manner? I'm just not sure how to approach exploring the parameter space. Should I try out different values around the recommended cutoffs and, down the line, analyze how many mutations I get/calculate certain values for quality control and check that they agree with what is expected (CpG transition rate, and so on)? Is there a way to inspect the data to see whether it is behaving well, or do I have to wait until down the line to see how things are going? Any insight into this would help tremendously. Thanks! Alva Created 2016-01-14 19:02:12 | Updated | Tags: commandlinegatk haplotypecaller multi-sample gatk error Hi, I would like to force call a list of variants across my cohort using HaplotypeCaller to get more accurate QC metrics for each variant. I am using the following command: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -et NO_ET -K my.key -I my.cohort.list --alleles my.vcf -L my.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -stand_call_conf 30.0 -stand_emit_conf 0.0 -dt NONE -o final_my.vcf Here is a link to the input VCF file: VCF File Unfortunately, I keep running into the following error (I've tried GATK ver3.3 and ver3.5): INFO 18:49:21,288 ProgressMeter - chr1:11177077 21138.0 49.5 m 39.0 h 69.4% 71.3 m 21.8 m ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IndexOutOfBoundsException: Index: 3, Size: 3 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegionsTraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegionsTraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3.0-mssm-0-gaa95802): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Index: 3, Size: 3 ##### ERROR ------------------------------------------------------------------------------------------ Would appreciate your help in solving this issue. Created 2016-01-07 06:21:51 | Updated | Tags: unifiedgenotyper java gatk Hi , I am using the UnifiedGenotyper command but got the following error. Can some one put a light what it could mean!! INFO 17:17:23,650 HelpFormatter - ----------------------------------------------------------------------------------- INFO 17:17:23,653 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.0-6121-g8a78414, Compiled 2011/07/11 16:05:10 INFO 17:17:23,654 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:17:23,654 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki INFO 17:17:23,655 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa INFO 17:17:23,655 HelpFormatter - Program Args: -T UnifiedGenotyper -R /projects/u71/nandan/Sven_Uthicke_populationGenomics/analysis/SNP_SFG/seaurchin_topHit_Ids.fasta -I merged_realigned_woUnsafe_1.bam -stand_call_conf 20.0 -stand_emit_conf 20.0 -o raw_snps_indels_Q20_all.vcf -nt 12 INFO 17:17:23,655 HelpFormatter - Date/Time: 2016/01/07 17:17:23 INFO 17:17:23,656 HelpFormatter - ----------------------------------------------------------------------------------- INFO 17:17:23,656 HelpFormatter - ----------------------------------------------------------------------------------- INFO 17:17:23,699 GenomeAnalysisEngine - Strictness is SILENT INFO 17:17:26,575 MicroScheduler - Running the GATK in parallel mode with 12 concurrent threads WARN 17:17:29,044 RestStorageService - Error Response: PUT '/GATK_Run_Reports/Ar4GP2YBVmBmWH6kz3hZjIAP6ikSNlHP.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 1993, Content-MD5: ykLRiVk7ELWpIyvrZz9dyg==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: ca42d189593b10b5a9232beb673f5dca, Date: Thu, 07 Jan 2016 06:17:27 GMT, Authorization: AWS AKIAJXU7VIHBPDW4TDSQ:Ul22+ARecieW4VAUerP/AzNX3qQ=, User-Agent: JetS3t/0.8.0 (Linux/3.0.76-0.11-default; amd64; en; JVM 1.6.0_30), Host: s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 6FAAE2E0B5A437DF, x-amz-id-2: oJ9rYHwvwv7LZ2xpsni+V3YQF+6emAiSko8wGj5d09v7CX9nG7E+Duz0KEXFKj2u7jJNNFPOJ4Y=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Thu, 07 Jan 2016 06:17:28 GMT, Connection: close, Server: AmazonS3] ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:478) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:336) at org.broadinstitute.sting.gatk.datasources.reads.GATKBAMIndex.readInteger(GATKBAMIndex.java:299) at org.broadinstitute.sting.gatk.datasources.reads.GATKBAMIndex.readReferenceSequence(GATKBAMIndex.java:125) at org.broadinstitute.sting.gatk.datasources.reads.BAMSchedule.(BAMSchedule.java:107) at org.broadinstitute.sting.gatk.datasources.reads.BAMScheduler.getNextOverlappingBAMScheduleEntry(BAMScheduler.java:205) at org.broadinstitute.sting.gatk.datasources.reads.BAMScheduler.advance(BAMScheduler.java:108) at org.broadinstitute.sting.gatk.datasources.reads.BAMScheduler.next(BAMScheduler.java:79) at org.broadinstitute.sting.gatk.datasources.reads.BAMScheduler.next(BAMScheduler.java:47) at net.sf.picard.util.PeekableIterator.advance(PeekableIterator.java:71) at net.sf.picard.util.PeekableIterator.next(PeekableIterator.java:57) at org.broadinstitute.sting.gatk.datasources.reads.LowMemoryIntervalSharder.next(LowMemoryIntervalSharder.java:61) at org.broadinstitute.sting.gatk.datasources.reads.LowMemoryIntervalSharder.next(LowMemoryIntervalSharder.java:36) at org.broadinstitute.sting.gatk.datasources.reads.LocusShardStrategy.next(LocusShardStrategy.java:142) at org.broadinstitute.sting.gatk.datasources.reads.LocusShardStrategy.next(LocusShardStrategy.java:42) at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:104) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:234) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:221) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:87) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 1.0-6121-g8a78414): ##### ERROR ##### ERROR Please visit the wiki to see if this is a known problem ##### ERROR If not, please post the error, with stack trace, to the GATK forum ##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki ##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa ##### ERROR ##### ERROR MESSAGE: Code exception (see stack trace for error itself) ##### ERROR ------------------------------------------------------------------------------------------ Regards, Nandan Created 2016-01-06 13:01:18 | Updated | Tags: combinevariants haplotypecaller best-practices dbsnp gatk combinegvcfs gvcf Hi guys, I have recently jointly called 27 full genome data using GenotypeGVCFs approach. While i was trying to extract some chromosomes from the final file, i got this error The provided VCF file is malformed at approximately line number 16076: Unparsable vcf record with allele *. I look into the file and I found some of the multi-allellic sites having * as seen in the attached picture. I feel the problem could be that the program realised that more than one allele is present at that position but could not ascertain which allele. I may be wrong but please what do you think I can do to solve this problem. LAWAL Created 2015-12-21 15:10:30 | Updated | Tags: realignertargetcreator dbsnp realignment gatk random-chromosomes I'm re-aligning some bam files using RealignerTargetCreator. I downloaded the hg38 dbSNP file and hg38 reference genome. however, the dbSNP file contains the canonical chromosomes [1,2...X,Y,MT] while the reference (and my bam file) contain also the random chromosomes. here's where I get the error: MESSAGE: Input files and reference have incompatible contigs: No overlapping contigs found. 142_hg38.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT] reference contigs = [chr1, chr1_GL383518v1_alt, chr1_GL383519v1_alt, chr1_GL383520v2_alt ... [etc etc etc.....] ... chrX_KI270881v1_alt, chrX_KI270913v1_alt, chrY, chrY_KI270740v1_random, chrM] so...I would like to keep the random chromosomes in the reference/bam file, so how can I have this run without tossing an error? or is this not possible and I need to keep ONLY the chrs in the dbSNP file? thanks Seb Created 2015-12-17 19:38:46 | Updated | Tags: indelrealigner unifiedgenotyper haplotypecaller gatk Does anyone know of an effective way to determine haplotypes or phasing data for SNPs and STRs? I understand that STRs are inherently difficult for aligners; however, I'm trying to determine haplotypes for a large number of STRs (including the flanking region information...SNPs) on a large number of samples. So, manual verification is not really an option. We've developed an in-house perl script that calls STRs accurately; however, it currently does not include flanking region information. Any help is greatly appreciated. Created 2015-12-04 06:45:03 | Updated | Tags: vqsr gatk wgs Hi. I ran GATK(version GATK3.4-46) using whole genome sequencing data of 60 samples. Using bwa, I aligned my WGS data to human reference genome (GRCh 37) including autosome, X, Y, MT, and GL**. I ran HaplotypeCaller, and then using GenotypeGVCFs, VCF was extracted for chr1 - chr22, chrX, chrY. I wonder what do mean WGS that GATK' Bestpractices recommend. when GATK (VQSR, HC et al) was used, to get the best result**, should WGS be consisted of autosome, X, Y, MT and GL*****, ? Or could it contain only the autosome excluding X ,Y? Created 2015-12-02 23:31:39 | Updated | Tags: unifiedgenotyper ploidy gatk variant-calling I am using the Gatk v3.1 UnifiedGenotyper for variant calling of some GBS type data in the plant genus Boechera. I have diploids as well as triploids, where I have the ploidy from a prior microsat run. The diploids work out well (I get some 150k variable loci), but I can't seem to get variants for the triploids. Here's my initial call for the triploids: Program Args: -T UnifiedGenotyper -R /home/A01768064/assembly/Bstricta_278_v1.fa -I mytripbam.list -o boechera_triploids.vcf -nt 22 -glm SNP -hets 0.001 -mbq 20 -ploidy 3 -pnrm EXACT_GENERAL_PLOIDY -stand_call_conf 50 -maxAltAlleles 2 I was able to call the triploids with the diploid option, but - obviously - this is not what I want. I additionally ran the following relaxed stringency options. hets 0.01 mbq 10 hets 0.1 mbq 10 None of them return any variants. The read group for one example file: @RG ID:CR1308 LB:CR1308 SM:CR1308 PL:ILLUMINA (with unique values for each sample; note that in the output vcf, the samples are appearing in the header line) Additionally, I attached the slurm log file for the last mentioned run. I am at a complete loss, what am I missing here? Created 2015-11-25 21:38:23 | Updated 2015-11-25 21:40:23 | Tags: selectvariants vcf gatk Hi, I have multi-sample vcf file and an example variant is shown below: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 03-071 04-051 04-071 06-044 07-085 10-009 chr1 6526093 . T C 197.77 . AC1=1;AC=1;AF1=0.5 GT:GQ:DP:PL:AD 0/1 1/1 0/0 1/1 1/1 0/1 For each variant i would need to retrieve the sample names based on genotype. If the genotype is "0/1" could it be possible to select only the samples for which the genotype is "0/1" and similarly for the genotype "1/1"? SelectVariants filters variants based on the different quality values provided. Here i would need to subset the vcf by sample for each variant based on the genotype. Are there any tools which can do this? Created 2015-10-21 10:08:51 | Updated | Tags: haplotypecaller gatk Dear Sir/Madam, I am running HaplotypeCaller (GATK 3.2.2) and gets the error message below for some of the samples. I have seen this error posted before at the Forum and the recommendation has often been to change to a newer version of GATK. The problem is we have been using the version 3.2.2 for over 2000 samples in the same project and those will be analysed together so I am afraid to switch version for just a few samples. I would greatly appreciate any help! Best, Lina ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.ArrayIndexOutOfBoundsException: 125 at org.broadinstitute.gatk.utils.sam.AlignmentUtils.calcNumHighQualitySoftClips(AlignmentUtils.java:437) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceModel.calcGenotypeLikelihoodsOfRefVsAny(ReferenceConfidenceModel.java:291) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isActive(HaplotypeCaller.java:839) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.addIsActiveResult(TraverseActiveRegions.java:618) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.access800(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$ActiveRegionIterator.hasNext(TraverseActiveRegions.java:378) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: 125 ##### ERROR ------------------------------------------------------------------------------------------ Created 2015-10-19 13:43:07 | Updated 2015-10-19 13:44:06 | Tags: gatk broadies Hi, I'm just about to try to get started using GATK for the first time, and I haven't been able to find a guide or any information on running GATK on the Broad server. I'm assuming there's a general use GATK version (akin to /seq/picard) and copies of gold standard variant files etc. already on the Broad, and I'd rather use these than re-invent the wheel. Have I simply missed a post? Are my assumptions incorrect (and everyone separately installs GATK and downloads gold standard vcfs)? Any leads would be very much appreciated. Avery p.s. I tried searching the forum, guide, and BITS wiki - what am I missing? Created 2015-10-09 20:43:20 | Updated | Tags: haplotypecaller bug gatk variant-calling This SNP has 30 reads supporting it with most of them being mapq60 and good fred scores. using version 3.4-46-gbc02625 and my command was java -jar ~/GenomeAnalysisTK.jar -R /mnt/opt/refdata/fasta/hg19/hg19.fa -I /mnt/park/gemcode/haynes/rfa/RFA_phaser5/_SNPINDEL_PHASER_BAM/ATTACH_SERAFIM/fork0/files/output.bam -T HaplotypeCaller --genotyping_mode DISCOVERY -L chr3:33100000-33300000 -stand_emit_conf 10 -stand_call_conf 30 Let me know if you want more info or want me to submit a detailed bug report with relevant files. Thanks, Haynes Created 2015-10-06 20:13:14 | Updated | Tags: vcf indels gatk Have a long standing stable pipeline, which I need to tweak as can be explained and justified. What is the difference between the old file of known indels (from Mills paper(s)) - Homo_sapiens_assembly19.kown_indels.vcf And the current version of known indels with 1000G added - Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz I know the newer reference is smaller - just sites. It also has significantly fewer records/line counts. What else was removed or added from the old version until now? Thanks in advance. Created 2015-09-25 23:52:49 | Updated | Tags: best-practices bam gatk Hi, So far, I have only tried running Best Practices on bam files where each bam file corresponds to a certain chromosome. How do I go about partitioning the bam files further and running Best Practices on parts of bam files? I want to make my pipeline even more efficient. Thanks for your help, Alva Created 2015-09-22 18:10:21 | Updated | Tags: combinevariants gatk error Hi there, I ran the following command java -jar$gatk_jar -T CombineVariants -R ${hs37d5} \ --variant${name1} --variant ${name2} \ -o combined.vcf \ -genotypeMergeOptions UNIQUIFY I followed the best practices and added 30 bam samples from the 1000g to my 4 samples. I tried also the b37 fasta but the following error persist. I've seen other treads with a similar error but not for CombineVariants. Could it be a resilient bug? Thanks Horacio ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalArgumentException: Unexpected base in allele bases '*T' at htsjdk.variant.variantcontext.Allele.(Allele.java:162) at htsjdk.variant.variantcontext.Allele.create(Allele.java:234) at htsjdk.variant.variantcontext.Allele.extend(Allele.java:253) at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.createAlleleMapping(GATKVariantContextUtils.java:1303) at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.resolveIncompatibleAlleles(GATKVariantContextUtils.java:1268) at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:1008) at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:339) at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:136) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNanoTraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Unexpected base in allele bases '*T' ##### ERROR ------------------------------------------------------------------------------------------ Created 2015-09-22 16:01:09 | Updated | Tags: bam gatk error markduplicates merged-bams Hello, I am relatively new to GATK and stuck on this problem. After downloading the bam file that I wish to analyze, I am cutting portions of the file because I only wish to analyze variants at specific genes. Once I have the smaller files for each gene, I am using samtools merge to merge all of the smaller files back into one full sliced bam file. Next, I am realigning this file using bwa aln and sampe. After this step, I am attempting to use GATK best practices to mark the variants in these genes. First, I am using GATK AddOrReplaceReadGroups to modify read group information as necessary. Then, I am using Picard's MarkDuplicates to mark duplicates in the re-aligned bam file. However, I get the following error. Exception in thread "main" net.sf.picard.PicardException: Value was put into PairInfoMap more than once. 156: TLL:HWI-ST1222:5:2308:12532:76745#0 at net.sf.picard.sam.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:124) at net.sf.picard.sam.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:78) at net.sf.picard.sam.DiskReadEndsMap.remove(DiskReadEndsMap.java:61) at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:418) at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:161) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:145) Looking at Picard's FAQ page (https://broadinstitute.github.io/picard/faq.html), it suggests that I try using Picard's MergeBamAlignment which also fails giving an error suggesting that the error is in the record ID of the bam file. Exception in thread "main" net.sf.picard.PicardException: Program Record ID already in use in unmapped BAM file. at net.sf.picard.sam.SamAlignmentMerger.<init>(SamAlignmentMerger.java:131) at net.sf.picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:226) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MergeBamAlignment.main(MergeBamAlignment.java:205) Continuing on, trying to change the options on samtools merge to change the record ID's when merging together the single files into the full sliced file also failed. Some more work into these issues found that using samtools fixmate and then Picard SortSam before calling MarkDuplicates resolves the issue, but the total amount of variants called by GATK is different when I run the sliced file through against a single gene file (for the same locations). Are there any other options to explore to resolve this error? Created 2015-09-17 21:56:03 | Updated | Tags: bqsr best-practices bam gatk Hi, I recently re-generated my realigned bam files (realigned.bam) from the GATK Best Practices, but I encountered an error when I validated the bam files using the validation tool (ValidateSamFile.jar) in Picard: ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate alignment does not match alignment start of mate ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate negative strand flag does not match read negative strand flag of mate etc. What exactly is going on? Where in the pipeline could this have occurred? I do not have the intermediary files so I cannot diagnose this accurately. Before I started the GATK Best Practices, my bam files had no problems; when I validated them I did not get any error messages. Does this mean that I have to regenerate the files again and start from the very beginning? I tried to use the Picard tool FixMateInformation.jar, but then I got an error message regarding the index files. I tried deleting the bam index for one bam file and then recreating it to see if there was a problem with the indexes, but doing so did not resolve the issue. So it seems that something went wrong with the bam files at some earlier step. Has this error been encountered before? Not sure how to proceed next, except restart everything. Best, Alva Created 2015-09-07 12:14:38 | Updated | Tags: developer gatk Hi, I followed the instructions from your guide on how to make GATK from source with IntelliJ IDEA. My only problem is that GenomeAnalysisTK.jar is not generated, and I don't know which class should I give to the debugger as the main class? Created 2015-07-31 22:11:45 | Updated | Tags: commandlinegatk haplotypecaller gatk error Hello, I am receiving the following error. I am working with SAM files that were exported from CLC, then edited with Picard-tools to addReadGroups. I am not sure if I need to add an additional step to solve this problem, I cannot find any documentation regarding this error. Please let me know what I need to do to correct this issue. Thank you! gatk -T HaplotypeCaller -R spinach_assembly-repeatdetect_PACBIO_V1.3_formated_60.fa -I .sam.list -drf DuplicateRead --alleles Unfiltered_Spinach_PacBio_Reseq_12_Geno_Assay_SNP.fixed.noblanks.vcf --genotyping_mode GENOTYPE_GIVEN_ALLELES --output_mode EMIT_ALL_SITES -o output_raw_unfiltered_spinach_snps_gbs.vcf INFO 14:48:44,450 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:48:44,453 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 INFO 14:48:44,454 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 14:48:44,454 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 14:48:44,458 HelpFormatter - Program Args: -T HaplotypeCaller -R spinach_assembly-repeatdetect_PACBIO_V1.3_formated_60.fa -I .sam.list -drf DuplicateRead --alleles Unfiltered_Spinach_PacBio_Reseq_12_Geno_Assay_SNP.fixed.noblanks.vcf --genotyping_mode GENOTYPE_GIVEN_ALLELES --output_mode EMIT_ALL_SITES -o output_raw_unfiltered_spinach_snps_gbs.vcf INFO 14:48:44,468 HelpFormatter - Executing as ahulse@jalapeno.genomecenter.ucdavis.edu on Linux 2.6.18-348.12.1.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_05-b13. INFO 14:48:44,469 HelpFormatter - Date/Time: 2015/07/31 14:48:44 INFO 14:48:44,469 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:48:44,470 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:48:45,102 GenomeAnalysisEngine - Strictness is SILENT INFO 14:48:45,385 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 14:48:45,394 SAMDataSourceSAMReaders - Initializing SAMRecords in serial INFO 14:48:48,432 SAMDataSource$SAMReaders - Init 50 BAMs in last 3.04 s, 50 of 80 in 3.04 s / 0.05 m (16.46 tasks/s). 30 remaining with est. completion in 1.82 s / 0.03 m INFO 14:48:50,052 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 4.66 INFO 14:48:50,164 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 14:48:54,742 RMDTrackBuilder - Writing Tribble index to disk for file /local/scratch/scratch/Amanda/Spinach_GBS/Unfiltered_Spinach_PacBio_Reseq_12_Geno_Assay_SNP.fixed.noblanks.vcf.idx INFO 14:48:58,784 GenomeAnalysisEngine - Preparing for traversal over 80 BAM files INFO 14:49:00,054 GATKRunReport - Uploaded run statistics report to AWS S3

 ERROR ------------------------------------------------------------------------------------------ ERROR A BAM ERROR has occurred (version 3.4-46-gbc02625): ERROR ERROR This means that there is something wrong with the BAM file(s) you provided. ERROR The error message below tells you what is the problem. ERROR ERROR Visit our website and forum for extensive documentation and answers to ERROR commonly asked questions http://www.broadinstitute.org/gatk ERROR ERROR Please do NOT post this error to the GATK forum until you have followed these instructions: ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details) ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index' ERROR ERROR MESSAGE: Cannot retrieve file pointers within SAM text files. 
##### ERROR ------------------------------------------------------------------------------------------

Created 2015-07-24 06:38:11 | Updated 2015-07-24 07:03:39 | Tags: haplotypecaller gatk rice

I am working on Rice to call variants using GATK3.3. I have used BWA (bwa mem -M options) to map reads to reference genome. Followed by variant calling (HaplotypeCaller) by following Best Practices of GATK3.3. I have compared the results of 20 genotypes and in all genotypes INDELs are more (2x) as compared to SNPs. Please suggest me on this

Thanks Mahesh HB

Created 2015-07-22 11:46:00 | Updated | Tags: combinevariants haplotypecaller best-practices vcf gatk genotypegvcfs combinegvcfs gvcf

I was trying to do combine sets of vcf files for all my samples so that I have one single vcf output using this command option below java -d64 -Xmx48g -jar ${GATK}/GenomeAnalysisTK.jar \ -R${REF} \ -T GenotypeGVCFs \ --variant A.g.vcf \ --variant B.g.vcf \ --variant C.g.vcf \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -o genotype.vcf

but I got this error message “The following invalid GT allele index was encountered in the file: END=21994810”. I have tried to locate where the problem could be coming from but I do not understand this. Could you please advise me.

Created 2015-07-22 05:07:12 | Updated | Tags: bqsr baserecalibrator best-practices gatk

Hi,

I am running GATK Best Practices again and just wanted to assure myself that I am proceeding correctly. I remember reading somewhere on here that BQSR performs best when given the whole genome to work with (similar to VQSR in that respect). Better to just run using the whole genome rather than giving it single chromosomes. I couldn't really find this comment/recommendation when I tried to look for it again, so wanted to see what the final word was on this. How should one proceed?

Thanks, Alva

Created 2015-06-27 16:56:26 | Updated | Tags: bam gatk

Meaning does it matter if my intervals file goes chr1, chr10, instead of chr1, chr2 ? Additionally, will a command line error occur if bam files are not sorted in coordinate order with respect to the reference? Thank you -Best Steve

Created 2015-05-23 05:08:22 | Updated 2015-05-23 05:12:25 | Tags: haplotypecaller bug gatk

Hi All,

I cannot perform joint genotype calling using gVCFs using the GenotypeGVCF command on my dataset containing 352 mtDNA samples.

Does any one has run into this problem before, or has suggestions on how to get this working.

I have attached the standard output for a hanging run.

I can generate VCF files individually no problems.

INFO  17:06:45,732 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:06:45,867 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:06:47,721 MicroScheduler - Running the GATK in parallel mode with 4 total threads, 1 CPU thread(s) for each of 4 data thread(s), of 20 processors available on this machine
INFO  17:06:47,791 GenomeAnalysisEngine - Preparing for traversal
INFO  17:06:47,793 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:06:47,794 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:06:47,794 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  17:06:47,795 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  17:06:48,436 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
INFO  17:07:17,800 ProgressMeter - gi|251831106|ref|NC_012920.1|:1         0.0    30.0 s      49.6 w        0.0%   15250.3 w   15250.3 w
Waiting for data... (interrupt to abort)

GATK VERSION: The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 java version "1.7.0_51"

Thanks for your time. James Boocock

Created 2015-05-05 00:40:58 | Updated | Tags: haplotypecaller queue gatk

Question

Hey folks,

We've run in to a situation where we have hundreds of queue managed GATK HaplotypeCaller.scala jobs which we'd like to run. This has led to hundreds of instances of Queue.jar in their post scatter watching state holding on to cores in our cluster. I'm curious about plans for Queue and whether or not a job dependency tree model has been considered for future versions, or whether or not this is already available.

By that, I mean something along the lines of the scatter kicks off the child jobs and a check/resubmit_faileds/gather job is queued on the cluster with a qsub -W depend=after:child1jobid[:child2jobid: ...:childNjobid]. The initial scatter job would end, freeing up a core, and resources would be available to the sub jobs. Then, when the child jobs finish up, the dependencies would be met and the next job would execute (when resources are available) and pick up the successful, resubmit the failed, lather, rinse , repeat.

Thanks in advance for your patience with me in the phrasing of my question, as I am approaching this as a cluster admin, not as the developer who has implemented the queue.

Cheers, Scott

Created 2015-05-01 05:07:41 | Updated | Tags: selectvariants snp gatk fastaalternatereferencemaker

I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.

Now I wish to generate some genotype specific FASTA sequences using these files and a reference sequence; i.e. a sequence for each genotype which is the same as the reference sequence but with the SNPs specific to each genotype in place of their counterparts in the reference sequence.

Now I know that there is variation in the genotypes. Here is a picture visualizing three exemplar genotypes that I generated by loading the VCF file into Geneious.

http://imgur.com/LRZHIcw

I then try to create individual VCF files for each genotype using this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T SelectVariants --variant ~/Path/to/complete\vcf/example.vcf -o ~/Path/to/individual/genotype.vcf -sn genotype

While I can't be sure this had the desired effect as it is difficult to assess a whole VCF file I can say that the header now only contains the relevant genotype so I assume this is the case.

I then try and use this individual VCF file for each genotype like this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T FastaAlternateReferenceMaker --variant~/Path/to/individual/genotype.vcf -L chrX:XX,XXX,XXX-XX,XXX,XXX -o ~/Path/to/individual/genotype.fasta

Here the Xs represent the location on the reference sequence of the regions of interest.

I did this in a loop and got identical sequences for every genotypes. I then implemented it individually for the 3 exemplar genotypes in the picture above and in both cases I get identical sequences for every genotype. Interestingly they are not the reference sequence.

What am I doing wrong?

I have also posted this on the Biostars forum..

Created 2015-04-20 23:41:13 | Updated | Tags: vqsr vcf gatk

Hi,

After using VQSR, I have a vcf output that contains sites labeled "." in the FILTER field. When I look at the vcf documentation (1000 genomes), it says that those are sites where filters have not been applied. Is this correct? I would like to know more about what these sites mean, exactly.

An example of such a site in my data is:

1 10439 . AC A 4816.02 . AC=10;AF=0.185;AN=54;BaseQRankSum=-4.200e-01;ClippingRankSum=-2.700e-01;DP=1690;FS=6.585;GQ_MEAN=111.04;GQ_STDDEV=147.63;InbreedingCoeff=-0.4596;MLEAC=17;MLEAF=0.315;MQ=36.85;MQ0=0;MQRankSum=-8.340e-01;NCC=0;QD=11.39;ReadPosRankSum=-8.690e-01;SOR=1.226 GT:AD:DP:GQ:PGT:PID:PL 0/1:22,14:36:99:0|1:10439_AC_A:200,0,825 0/0:49,0:49:0:.:.:0,0,533 0/0:92,0:92:0:.:.:0,0,2037 0/1:20,29:49:99:.:.:634,0,340 0/0:11,0:16:32:.:.:0,32,379 0/1:21,17:38:99:.:.:273,0,616 0/0:57,0:57:0:.:.:0,0,1028 0/0:58,0:58:0:.:.:0,0,1204 0/0:52,0:52:0:.:.:0,0,474 0/0:86,0:86:27:.:.:0,27,2537 0/1:13,24:37:99:.:.:596,0,220 0/1:14,34:48:99:.:.:814,0,263 0/0:86,0:86:0:.:.:0,0,865 0/0:61,0:61:0:.:.:0,0,973 0/0:50,0:50:0:.:.:0,0,648 0/0:40,0:40:0:.:.:0,0,666 0/0:79,0:79:0:.:.:0,0,935 0/0:84,0:84:0:.:.:0,0,1252 0/1:22,27:49:99:.:.:618,0,453 0/0:39,0:39:0:.:.:0,0,749 0/0:74,0:74:0:.:.:0,0,1312 0/1:13,18:31:99:.:.:402,0,281 0/0:41,0:44:99:.:.:0,115,1412 0/1:30,9:39:99:.:.:176,0,475 0/1:26,23:49:99:.:.:433,0,550 0/1:13,34:47:99:.:.:736,0,185 0/0:44,0:44:0:.:.:0,0,966

Thanks, Alva

Created 2015-04-02 13:05:07 | Updated | Tags: phasebytransmission pedigree gatk

Hello

I am using PhaseByTransmission to phase variants called from a trio (mother, father and son) contained in a VCF file. It's my first time using this tool. Do I need to follow any convention when naming my samples? Do the sample names in the VCF file need to match the Family ID, the individual ID, or a combination of the two in the PED file? Do they need to be in the same order?

Created 2015-03-29 18:00:14 | Updated | Tags: best-practices snp vcf gatk error variant-calling

Hi,

I have discovered some unusual calls in my VCF file after using HaplotypeCaller. I am using version 3.3 of GATK. I applied VQSR as well as the genotype refinement workflow (CalculateGenotypePosteriors, etc.) to refine the calls, but the unusual calls did not get removed. I also calculated the number of Mendelian error just in the biallelic SNPs in my final VCF file (using PLINK) and found unusually high percentage for each of the 3 families I am studying: 0.153%, 0.167%, and 0.25%. The percentage of triallelic SNPs is also very high: 0.111%. Why are the error rates so high?

I used the following commands to call the variants and generate the initial VCF file:

# CatVariants (generate 1 vcf file with all inds and all chrs)

java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted

# VQSR

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches

# Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches

# Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf

# Genotype Refinement Workflow

java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf

Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS)

The entire example is below:

1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0

Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high?

Thank you very much in advance, Alva

Created 2015-03-20 05:19:22 | Updated | Tags: gatk variant-calling

Hi,

If I get it right, anomalous read pairs refer to those where only one read is mapped. I wonder whether GATK would ever use these reads to do the calling.

Thank you!

Created 2015-02-24 10:54:14 | Updated | Tags: gatk

Hi One minor change that would make GATK much easier to use with large data sets would be if GATK could provide slightly more detailed error messages, specifically for the cases where the error is related to a specific file.

For example the following Error message tells me one of my GVCF files has been damaged/corrupted at some point but doesn't tell me which one it is which is rather important information as as a result I now have to work my way through ~30files to work out which one is damaged and replace it with a good copy. While if GATK had provided the file name along with the error I would have been able to directly go to the file and replace it.

This has been a problem with the GVCF, VCF, & BAM files, and it would be greatly appreciated if it could be changed!

java.lang.RuntimeException: java.io.IOException: Unexpected compressed block length: 1 at htsjdk.tribble.readers.TabixIteratorLineReader.readLine(TabixIteratorLineReader.java:45) at htsjdk.tribble.TabixFeatureReader$FeatureIterator.readNextRecord(TabixFeatureReader.java:161) at htsjdk.tribble.TabixFeatureReader$FeatureIterator.(TabixFeatureReader.java:149) at htsjdk.tribble.TabixFeatureReader.query(TabixFeatureReader.java:124) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrack.query(RMDTrack.java:119) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createIteratorFromResource(ReferenceOrderedDataSource.java:241) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.createIteratorFromResource(ReferenceOrderedDataSource.java:185) at org.broadinstitute.gatk.engine.datasources.rmd.ResourcePool.iterator(ResourcePool.java:93) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.seek(ReferenceOrderedDataSource.java:168) at org.broadinstitute.gatk.engine.datasources.providers.RodLocusView.(RodLocusView.java:83) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.getLocusView(TraverseLociNano.java:129) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:80) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.io.IOException: Unexpected compressed block length: 1 at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:377) at htsjdk.samtools.util.BlockCompressedInputStream.seek(BlockCompressedInputStream.java:292) at htsjdk.tribble.readers.TabixReader$IteratorImpl.next(TabixReader.java:382) at htsjdk.tribble.readers.TabixIteratorLineReader.readLine(TabixIteratorLineReader.java:43) ... 18 more ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: java.io.IOException: Unexpected compressed block length: 1 ##### ERROR ------------------------------------------------------------------------------------------ Created 2015-02-20 21:43:46 | Updated | Tags: vcf gatk calculategenotypeposteriors Hi, I used CalculateGenotypePosteriors with the supporting file called ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf, obtained from 1000 Genomes. It contains both indels and SNPs, and I was able to use the file for the first step of the Genotype Refinement Workflow. My question is: is it an issue that I didn't remove the indels from the supporting file? I would presume not since first of all, both the indels and the SNPs from Phase 3 of 1000G should be high confidence, and second, my recalibrated vcf file includes both indels and snps, so it should be in my interest to have as much information as possible, actually, so I should consider indels as well. Just want to check whether my reasoning is correct. Thanks, Alva Created 2015-02-17 15:58:44 | Updated | Tags: variantrecalibrator gatk I'm sorry I keep posting these questions, but this error has me baffled! ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to retrieve result at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) Caused by: java.lang.IllegalArgumentException: No data found. at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:399) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:143) at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226) at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183) ... 5 more ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Unable to retrieve result ##### ERROR ------------------------------------------------------------------------------------------ Here is the input I use to get it: java -d64 -Xmx8g -jar$GATKDIR/GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -input k_family_out.vcf \ -R $REF \ -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0$HAPMAP \ -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 $OMNI \ -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0$DBSNP \ -resource:mills,VCF,known=true,training=true,truth=true,prior=12.0 $MILLS \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum \ -mode BOTH \ -U ALL \ -recalFile k_family.recal \ -tranchesFile k_family.tranches \ -rscriptFile k_family.R \ -nt 6 --TStranche 100.0 --TStranche 99.9 --TStranche 99.5 --TStranche 99.0 \ --TStranche 98.0 --TStranche 97.0 --TStranche 96.0 --TStranche 95.0 --TStranche 94.0 \ --TStranche 93.0 --TStranche 92.0 --TStranche 91.0 --TStranche 90.0 I thought this was similar to a multithreading error, so I reduced my pipeline to execute with only a single node, and the problem persists. Any help will be appreciated. Created 2015-02-13 11:14:24 | Updated | Tags: bam gatk error Hi, with GATK 3.3-0 I am confronted with an error that was present in a much older version, but seemed resolved about a year ago: ERROR MESSAGE: There is no space left on the device, so writing failed There is 8TB left on the drive, no user limit. Sometimes re-running the exact same job works, sometimes not. Some jobs keep failing despite asking for an insane amount of memory on the cluster, given these are RNAseq bam files, the largest one being less than 7GB. For example: qsub -b y -cwd -N step3_145 -o step3_145.o -e step3_145.e -V -l h_vmem=40G /share/apps/java/oracle/1.8.0_11/bin/java -Xmx35G -jar /data/home/hhx037/GATK-3.3.0/GenomeAnalysisTK.jar -T SplitNCigarReads -R Homo_sapiens.GRCh37.75.dna.1-22XYMT.fa -I Analyses/file_dedup.bam -o Analyses/file_splittedcigar.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS Here is the log: INFO 10:50:51,568 HelpFormatter - Executing as hhx037@panos1 on Linux 2.6.32-431.1.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_11-b12. INFO 10:50:51,571 HelpFormatter - Date/Time: 2015/02/13 10:50:51 INFO 10:50:51,571 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:51,576 HelpFormatter - -------------------------------------------------------------------------------- INFO 10:50:52,503 GenomeAnalysisEngine - Strictness is SILENT INFO 10:50:52,827 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 10:50:52,861 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 10:50:52,876 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 10:50:53,021 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 10:50:53,027 GenomeAnalysisEngine - Done preparing for traversal INFO 10:50:53,030 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 10:50:53,030 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 10:50:53,030 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime INFO 10:50:53,047 ReadShardBalancer$1 - Loading BAM index data INFO 10:50:53,050 ReadShardBalancer$1 - Done loading BAM index data INFO 10:51:23,404 ProgressMeter - 1:1477348 702953.0 30.0 s 43.0 s 0.0% 17.5 h 17.5 h INFO 10:52:32,660 ProgressMeter - 1:16909108 1202983.0 99.0 s 82.0 s 0.5% 5.0 h 5.0 h INFO 10:53:09,769 ProgressMeter - 1:21069702 1302985.0 2.3 m 104.0 s 0.7% 5.6 h 5.5 h INFO 10:53:49,083 ProgressMeter - 1:27951393 1803181.0 2.9 m 97.0 s 0.9% 5.4 h 5.4 h INFO 10:54:29,275 ProgressMeter - 1:32739969 2103299.0 3.6 m 102.0 s 1.1% 5.7 h 5.6 h INFO 10:55:09,177 ProgressMeter - 1:36643589 2203300.0 4.3 m 116.0 s 1.2% 6.0 h 5.9 h INFO 10:55:45,643 ProgressMeter - 1:39854010 2303302.0 4.9 m 2.1 m 1.3% 6.3 h 6.2 h INFO 10:56:25,147 ProgressMeter - 1:40542516 2403303.0 5.5 m 2.3 m 1.3% 7.0 h 6.9 h INFO 10:57:10,934 ProgressMeter - 1:40654849 2503322.0 6.3 m 2.5 m 1.3% 8.0 h 7.9 h INFO 10:57:54,084 ProgressMeter - 1:43162895 2503322.0 7.0 m 2.8 m 1.4% 8.4 h 8.3 h INFO 10:58:24,149 ProgressMeter - 1:45244391 2703426.0 7.5 m 2.8 m 1.5% 8.6 h 8.4 h INFO 10:58:56,749 ProgressMeter - 1:53716450 2803427.0 8.1 m 2.9 m 1.7% 7.7 h 7.6 h INFO 10:59:38,928 ProgressMeter - 1:86821106 3103432.0 8.8 m 2.8 m 2.8% 5.2 h 5.1 h INFO 11:00:11,337 ProgressMeter - 1:93301870 3303437.0 9.3 m 2.8 m 3.0% 5.1 h 5.0 h INFO 11:01:13,113 ProgressMeter - 1:115252321 3803590.0 10.3 m 2.7 m 3.7% 4.6 h 4.5 h INFO 11:02:02,172 ProgressMeter - 1:145441389 4303778.0 11.2 m 2.6 m 4.7% 4.0 h 3.8 h INFO 11:02:38,237 ProgressMeter - 1:150547232 4703871.0 11.8 m 2.5 m 4.9% 4.0 h 3.8 h INFO 11:03:09,693 ProgressMeter - 1:153362937 5003904.0 12.3 m 2.5 m 5.0% 4.1 h 3.9 h INFO 11:03:39,934 ProgressMeter - 1:155984762 5403968.0 12.8 m 2.4 m 5.0% 4.2 h 4.0 h INFO 11:04:05,477 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: There is no space left on the device, so writing failed ##### ERROR ------------------------------------------------------------------------------------------ I understand temporary files may be large, but not that large. Are the temporary files written in the working directory (as I believe should be the case), or are they written in GATK installation directory? Also, note I never run into this problem with the previous version. Any idea? Cheers, Stephane Created 2015-02-04 05:14:44 | Updated | Tags: variantrecalibrator vqsr vcf gatk Hi, I have generated vcf files using GenotypeGVCFs; each file contains variants corresponding to a different chromosome. I would like to use VQSR to perform the recalibration on all these data combined (for maximum power), but it seems that VQSR only takes a single vcf file, so I would have to combine my vcf files using CombineVariants. Looking at the documentation for CombineVariants, it seems that this tool always produces a union of vcfs. Since each vcf file is chromosome-specific, there are no identical sites across files. My questions are: Is CombineVariants indeed the appropriate tool for me to merge chromosome-specific vcf files, and is there any additional information that I should specify in the command-line when doing this? Do I need to run VariantAnnotator afterwards (I would assume not, since these vcfs were generated using GenotypeGVCFs and the best practices workflow more generally)? I just want to be completely sure that I am proceeding correctly. Thank you very much in advance, Alva Created 2015-01-27 21:59:14 | Updated 2015-01-27 21:59:46 | Tags: best-practices snp gatk combinegvcfs gvcf Hi, I am combining gcvf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing. For example, at position 16050036, this is what comes up in the final gvcf file: 22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0 But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site: 22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0 The command I'm using to generate these files is: java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L${numchr} where numchr is a variable previously defined (indicating the chromosome number).

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?

Thanks, Alva

Created 2015-01-27 20:25:42 | Updated | Tags: variantrecalibrator vqsr vcf gatk

Hi,

I ran VariantRecalibrator and ApplyRecalibration, and everything seems to have worked fine. I just have one question: if there are no reference alleles besides "N" in my recalibrate_SNP.recal and recalibrate_INDEL.recal files, and in the "alt" field simply displays , does that mean that none of my variants were recalibrated? Just wanted to be completely sure. My original file (after running GenotypeGVCFs) has the same number of variants as the recalibrated vcf's.

Thanks, Alva

Created 2015-01-23 16:55:57 | Updated | Tags: vqsr haplotypecaller bam gatk genotypegvcfs variant-calling

Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains information for one chromosome for one individual. I was wondering if it is best for me to pass all the files for a single individual together when running HaplotypeCaller, if it will increase the accuracy of the calling, or if I can just run HaplotypeCaller on each individual bam file separately.

Also, I was wondering at which step I should be using CalculateGenotypePosteriors, and if it will clean up the calls substantially. VQSR already filters the calls, but I was reading that CalculateGenotypePosteriors actually takes pedigree files, which would be useful in my case. Should I try to use CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools that I should be aware of?

Alva

Created 2014-12-18 08:52:46 | Updated | Tags: genotypeconcordance gatk

Hi, I didn't find the answer anywhere, so I'm asking here. I'm using GATK 2.6-4. I used GenotypeConcordance to compare two datasets, and it was great, exactly what I needed. But then, I was interesting in the discordant sites, and I saw that printInterestingSites should be a good solution. But I can't make it work. Just adding in my command line "-sites filename" doesn't work, and I've tried to be creative in changing this option. Then I thought, that maybe this option isn't working in my GATK version. Have I just done something wrong?

Created 2014-12-17 18:04:27 | Updated | Tags: haplotypecaller gatk error rnaseq genotyping genotyping-mode

Hi, I'm currently trying to use GATK to call variants from Human RNA seq data

So far, I've managed to do variant calling in all my samples following the GATK best practice guidelines. (using HaplotypeCaller in DISCOVERY mode on each sample separately)

But I'd like to go further and try to get the genotype in every sample, of each variant found in at least one sample. This, to differentiate for each variant, samples where that variant is absent (homozygous for reference allele) from samples where it is not covered (and therefore note genotyped).

To do so, I've first used CombineVariants to merge variants from all my samples and to create the list of variants to be genotype ${ALLELES}.vcf I then try to regenotype my samples with HaplotypeCaller using the GENOTYPE_GIVEN_ALLELES mode and the same settings as before: my command is the following: **java -jar${GATKPATH}/GenomeAnalysisTK.jar -T HaplotypeCaller -R ${GENOMEFILE}.fa -I${BAMFILE_CALIB}.bam --genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles ${ALLELES}.vcf -out_mode EMIT_ALL_SITES -dontUseSoftClippedBases -stand_emit_conf 20 -stand_call_conf 20 -o${SAMPLE}_genotypes_all_variants.vcf -mbq 25 -L ${CDNA_BED}.bed --dbsnp${DBSNP}.vc**f

In doing so I invariably get the same error after calling 0.2% of the genome.

##### ERROR stack trace

java.lang.IndexOutOfBoundsException: Index: 3, Size: 3 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

because the problem seemed to originate from getAlternateAllele, I tried to play with --max_alternate_alleles by setting it to 2 or 10, without success. I also checked my ${ALLELES}.vcf file to look for malformed Alternate alleles in the region where the GATK crashes (Chr 1, somewhere after 78Mb) , but I couldn't identify any... (I searched for Alternate alles that would not match the following extended regexpr '[ATGC,]+') I would be grateful for any help you can provide. Thanks. Created 2014-12-05 21:10:20 | Updated | Tags: bqsr best-practices gatk Hi, I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former document as my guide since it is more recent, but I am curious as to why these two are different. The first step is the same for the 2 docs, but then for the second step, the output is a recal.bam file in the brussels document instead of a post_recal.table file in the (howto) document. Then, I am confused about the (howto) document because it seems that the 4th step is exactly the same as the second step, except we generate a recal.bam file. Is the brussels document presenting steps that are more efficient? (since there is only 1 use of -BQSR to generate the recalibrated bam file we need, as opposed to 2 uses of the option) Thanks, Alva Created 2014-12-05 20:16:37 | Updated | Tags: bqsr gatk Hi, I encounter the "RScript exited with 1. Run with -l DEBUG for more info" error when trying to make recalibration plots using Rscript. However, I do have all the required packages installed on my computer, including gsalib, which seems to be the problem. When I run with -l DEBUG, I get the following information: Error in library(gsalib) : there is no package called ‘gsalib’ The thing is that I am using my computer but working in a cluster, so not directly on my computer. Should I have the packages installed in the cluster or is it fine that I've installed the packages on my computer? I'm a bit new to this, so I apologize if this question is a bit basic. Thanks, Alva Created 2014-12-05 15:27:21 | Updated 2014-12-05 18:08:01 | Tags: haplotypecaller best-practices vcf gatk gvcf Hi, I used HaplotypeCaller in GVCF mode to generate a single sample GVCF, but when I checked my vcf file I see that the reference allele is not showing up: 22 1 . N <NON_REF> . . END=16050022 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0 22 16050023 . C <NON_REF> . . END=16050023 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,37 22 16050024 . A <NON_REF> . . END=16050026 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,73 22 16050027 . A <NON_REF> . . END=16050035 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,110 22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0 22 16050037 . G <NON_REF> . . END=16050037 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,109 22 16050038 . A <NON_REF> . . END=16050039 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,153 I am not sure where to start troubleshooting for this, since all the steps prior to using HaplotypeCaller did not generate any obvious errors. The basic command that I used was: java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 Have you encountered this problem before? Where should I start troubleshooting? Thanks very much in advance, Alva Created 2014-12-02 16:32:32 | Updated | Tags: haplotypecaller bam gatk Hi, I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees? The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but not sure how to do that either with just -ped. What would you recommend? And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)? Thanks, Alva Created 2014-11-25 14:27:16 | Updated | Tags: best-practices gatk Chaps, I am a newbie to GATK. I have started the best practices pipline but stuck in bwa. From where can I get the reference.fa file? The GATK bundle has .fasta .bam .fai but no .fa. Anyone please! Created 2014-11-20 23:51:40 | Updated | Tags: gatk I would like to obtain the coverage distribution on all exome intervals (human reference genome) using the BaseCoverageDistribution function in GATK. The problem I'm having is finding an appropriate reference file to associate with the -L argument. Specifically, the standard UCSC exome coordinate file doesn't have a format that GATK will accept, e.g. (header and first row given below, as opposed to chr1:start-stop) etc # name chrom strand cdsStart cdsEnd exonCount exonStarts exonEnds name2 exonFrames NM_017582 chr1 - 154522913 154531029 13 154521050,154523413,154523872,154524246,154524396,1545 24568,154524879,154525211,154525507,154527210,154527903,154528335,154530702, 154522945,1545234 80,154523968,154524295,154524455,154524659,154524940,154525296,154525648,154527261,154528008,1 54528440,154531120, UBE2Q1 1,0,0,2,0,2,1,0,0,0,0,0,0, Created 2014-11-20 11:41:42 | Updated | Tags: best-practices gatk Where can I get the fasta and ref files for experimenting on GATK? Created 2014-09-23 15:08:19 | Updated | Tags: selectvariants gatk tumor normal I am wondering if it would be okay to use the "SelectVariants" tool to select mutants specific to the tumor sample in a pair of tumor and normal samples. I tried the following command, and it seems working. But I am suspecting whether it differentiates the genotype differences (e.g. 0/1 in tumor sample and 0/0 in normal sample) or coverage differences (e.g. 0/1 in tumor sample and ./. in normal sample). Could someone help me understand "SelectVariants" for my question? Thanks a lot! java -Xmx10g -Djava.io.tmpdir=$tempSpillFolder -jar $CLASSPATH/GenomeAnalysisTK.jar \ -R$GenomeReference \
-T SelectVariants \
--variant $file \ -o$OutputFolder/\$filename.selected.vcf \
-select 'set=="tumor"'

Created 2014-09-19 23:06:10 | Updated | Tags: variantrecalibrator applyrecalibration gatk

Hi! I am attempting to go through the process of applying a recalibration to an input vcf. I tried following the steps outlines in the tutorial Howto: Recalibrate Variant Quality Scores .

Step 2 runs great, I used all the tranches and resources that the howto mentions. Unfortunately when I attempt to apply the recalibration (step 3) I receive the following error

Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC input @ 1:14907 Q523.77 of type=SNP alleles=[A*, G] attr={AC=1, AF=0.500, AN=2, BaseQRankSum=-1.863, ClippingRankSum=-0.420, DP=86, FS=0.984, MLEAC=1, MLEAF=0.500, MQ=36.93, MQ0=0, MQRankSum=0.338, QD=6.09, ReadPosRankSum=2.064} GT=GT:AD:DP:GQ:PL 0/1:57,29:86:99:552,0,1482

When I check the input file I notice that this is the first variant call in my VCF. I tried this process with other VCF files and each time my ApplyRecalibration gets stuck on the first variant call.

Here is the command I used to generate the recalibration:

 java -Djava.io. -Xmx4g -jar GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \
-T VariantRecalibrator -R /human_g1k_v37.fasta \
-L file.intervals \
-input input_file.vcf \

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 Mills_and_1000G_gold_standard.indels.b37.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \

-an DP \
-an QD \
-an FS \
-an MQRankSum \

-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile recal_file.recal \
-tranchesFile tranches_file.tranches \
-rscriptFile rscipt_file.plots.R\

And here is my command for applying the recalibration:

java -Xmx4g -jar GenomeAnalysisTK.jar
-T ApplyRecalibration
-R human_g1k_v37.fasta
-input input_file.vcf
-mode SNP
--ts_filter_level 99.0
-recalFile  recal_file.recal
-tranchesFile tranches_files.tranches
-o recalibrated_output.vcf 

Any suggestions to help me troubleshooot the issue is very welcome. I should also mention that I am using GATK 3.2.2

Created 2014-09-15 21:28:59 | Updated | Tags: indelrealigner unifiedgenotyper indels gatk variant-calling

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.

Created 2014-09-09 20:23:51 | Updated | Tags: gatk module

Hello,

I'm currently working on creating a module file on my cluster for GATK. I set the prepend path to the directory in which I saved my .jar file, but when I attempt to use said file (command: java -jar GenomeAnalysisTK.jar -h), java cannot find it. It does, however, work when I am in the directory in which it is saved, but not when I add it as a module file. Any ideas?

Thanks

Created 2014-09-02 12:38:46 | Updated 2014-09-02 12:48:19 | Tags: vqsr bqsr gatk

Hi there, I have been using GATK to identify variants recently. I saw that BQSR is highly recommended. But I don’t know whether it is still needed for de novo mutation calling. For example, I want to identify de novo mutations generated in the progenies by single seed descent METHODS in plants. For example, in the paper of “The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana”, these spontaneous arising mutations may not included in the known sites of variants. Based on documentation posted in GATK websites, they assume that all reference mismatches we see are errors and indicative of poor base quality. Under this assumption, these de novo mutations may be missed in the step of variant calling. So in this situation, what should I do? Or should I skip the BQSR step? Also what should I do when I reach to step- VQSR? Hope some GATK developers can help me on this. Thanks.

Created 2014-08-26 19:14:10 | Updated | Tags: baserecalibrator gatk

Hi,

I have performed BaseRecailbrator using GATK-3.2-2 . The size of the bam file is double the size of realigned_fixmate.bam. I read in one of the posts in this forum that it could be because of retaining the original and reclaibrated quality scores. And mentioned that the latest version will by default exclude original scores and retain reclaibrated scores.

Im using the latest version and still gets double the size of the original bam. Could someone comment on this. Thanks

Created 2014-08-22 10:43:41 | Updated | Tags: variantfiltration gatk filterexpression

Hi all,

I tried to apply the following command to my raw vcf file to filter it using the filtering expression specified in the command:

java -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4 -jar GenomeAnalysisTK.jar -T VariantFiltration -R human_g1k_v37.fa --variant human_g1k_v37.CHL124.vcf_snps.vcf -o CHL124.vcf_snps.vcf_filter_marked.vcf --filterExpression "QD < 2.0 && MQ < 40.0 && FS > 60.0 && HaplotypeScore > 13.0 && MQRankSum < -12.5 && ReadPosRankSum < -8.0" --filterName "very_small_SNPs_default_filter"

After that, I check my result file which is CHL124.vcf_snps.vcf_filter_marked.vcf, I found that, all reads are marked as "PASS" whether its QD is > 2.0 or < 2.0. Obviously, the command doesn't work, but I cannot find why everything seems goes well, no error reported.

bless~ XL

Created 2014-08-19 14:40:22 | Updated | Tags: gatk error illegalargumentexception

Hi, I suppose I had a problem with my bam file, but I don't know what I should look for.

Any help appreciated.

##### ERROR stack trace

cf attached file for complete stack.

Created 2014-07-23 03:56:29 | Updated | Tags: gatk personal-genome merge-bam

Hi all, I have been successfully dealing with GATK tools for variant calling for exome sequences, but now I have to do it for a personal genome. Since the genome has been sequenced in two runs , using 7 lanes per each run , now I have 28 fastq files.( paired end reads (2) 7 lanes 2 runs). I haven't deal with such a large number of files at once before. My suggested approach is to,

1) Align, Dedup, Realign and Recalibrate per lane. (So I get 14 aligned,deduped,realigned and reacalibrated bam files) 2) Merge the bam files inorder to produce a single bam file 3) Call variants using the single bam file using Haplotype Caller.

Do you think my approach is feasible? Or do you have any alternative approaches? Furthermore, what is the best tool to merge the bam files?

Created 2014-07-21 10:04:39 | Updated | Tags: printreads gatk

Hi, I am trying to do a base quality score recalibration of my BAM files. I am working with individual based RADseq data where the reads were cleaned and aligned against a reference genome using bowtie2. The reference genome comes from a relatively closely related taxon (~25 Mio years of divergence), however the species in this taxonomic family underwent a recent genome duplication, hence I am running bowtie2 with the -k option (which retains multiple alignments rather than picking randomly one if multiple matches of equal quality occur) and only retain the uniquely aligned reads for the downstream analyses. The downside to this is, that the MAPQ values become artificial.

After generating a recalibration data file, I run GATK PrintReads (version 3.1-1-g07a4bf8) as follow:

java -Xmx16G -jar GenomeAnalysisTK.jar -T PrintReads -R Ref_genome.fasta -I Individual.bam -BQSR Library_recaldata.grp -o Individual.rc.bam

Which yields the following error after running through 80% of the file: # ERROR MESSAGE: Exception when processing alignment for BAM index HWI-ST1206:32:C3VGTACXX:5:1104:14976:28487 84b aligned read.

Here the read in question (the one in the middle):

HWI-ST1206:32:C3VGTACXX:5:2313:18281:8513   16  chrUn   536921741   0   84M *   0   0   CCCAGAGTGCAAAGAACCTTGGCAGGACACTAGACAACACCCTGTTGTTTTCTGCAAACATCAAAGCAGTGACTCTGTCCTGCA    DDDDEEEEEEFFFD@DHFGHHJHIIIIIGGJIJJIGGHHJJJJJJJJIJJJJJJIJJJJJIIJJJJJJJJJJJJJJJHHHHHFF    AS:i:-29    XS:i:-35    XN:i:0  XM:i:5  XO:i:0  XG:i:0  NM:i:5  MD:Z:11T18G1A14G28C7    YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:2314:6453:46729   16  chrUn   536921741   0   84M *   0   0   CCCAGAGTGCAAAGAACCTTGGCAGGACACTAGACAACACCCTGTTGTTTTCTGCAAACATCAAAGCAGTGACTCTGTCCTGCA DDDDEEEEEEFFFEDFHHHHHJJIJJIJIIIIIGJJIFFJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIHJJJJJHHHHHFF   AS:i:-29    XS:i:-35    XN:i:0  XM:i:5  XO:i:0  XG:i:0  NM:i:5  MD:Z:11T18G1A14G28C7    YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1104:14976:28487  16  chrUn   536961523   255 84M *   0   0   ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTACTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA</del>  DDDDDDDDDDDDDDDDDDDDDCEEEEEFFFFFHFFJJJJJJJJJIIHD8JJJJIHIJJJJJJIJJJJJJIJJIJJJJHHHHHFF    AS:i:-9 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:20G27T35   YT:Z:UU RG:Z:70130.satrFamilyX**
HWI-ST1206:32:C3VGTACXX:5:1104:10760:63491  16  chrUn   536961523   255 84M *   0   0   ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTTCTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA    DDDDCCCDDDDDDBCDCDCDC@DDEEECEEBFEFEHJJJJJJIHGJJIJJJJIHFJJJJIIJIJJIJJJIJJJJJJJHHHHHFF    AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:20G63  YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1104:16496:63907  16  chrUn   536961523   255 84M *   0   0   ATTGGATTGCAGTGCAGGTACACTTTGTATGGATCGTTGTGTGTGTTTTCTGTGACTTCTCTTAAGAAGACGCGTGTCCCTGCA    DDDDDDDDDDDDDDDDDDCCCCDDEEECEEFFFFFHJJJIJJJJJJJJJJJJJIHJJJJJJJJJJIJJJJJJJJJIJHHHHHFF    AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:20G63  YT:Z:UU RG:Z:70130.satrFamilyX

I subsequently verified my BAM files with Picard, which yields the following error # "bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned" for 84554 reads that are almost consecutive in the same genomic region (chrUn - which is composed of sequences for which no chromosome anchoring information exist). However, the reads flagged by Picard occur about 2 million base pairs before the region in question for GATK and PrintReads seemed to have no problem with them.

For the sake of completeness, here the beginning and end of the section that was flagged by Picard. It starts with the second of the following reads:

HWI-ST1206:32:C3VGTACXX:5:2315:1142:72568   256 chrUn   997157147   2   84M *   0   0   TGCAGGACCACAGTCTTTTCACCAGCTCCTCAAAATGGTTGAAGAAGTAGTATTTTGGGGGCTCGTATAAACTCGTATGCCGTC    FFHHHHHJJIJJJDFHHGIHIGIJJIJJJIIJJJIIJGIJIGIFHDHCGHCGGHIJIIIIEFDDBDDCDEDAAC?ABCDCCBB@    AS:i:-50    XS:i:-50    XN:i:0  XM:i:10 XO:i:0  XG:i:0  NM:i:10 MD:Z:2T41T8C7A7G0G1A2G3A2G1 YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1102:4043:34769   16  chrUn   997184926   255 84M *   0   0   TCGCCTACAAACAAAACCCCACCCATGCACACCTGATTGGTGCACGTAAAACTGGGAAAGAAGTTACCTCCCTGGAAGCCTGCA    DDDDDEDDDDDDDB@BDDEECHHIIIJHHIHJIHJIGJJJJJJIIJIJJJJIJJJIJJJJJJJIGCJJIJJJJJIJJHHHHHFF    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:41T15G26   YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1105:2829:22242   16  chrUn   997184926   255 84M *   0   0   TCGCCTACAAACAAAACCCCACCCATGCACACCTGATTGGTGCACGTAAAACTGGGAAAGAAGTTACCTCCCTGGAAGCCTGCA    DDDDDEDDDDDDB@?;DDEE?HHIIGIIHHGJIIJJJJJJJJIIJJJJJJJJJJJJJJJIJIIHE?JIGJJJJJJJJHHHHHFF    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:41T15G26   YT:Z:UU RG:Z:70130.satrFamilyX

And stops at the following read (note the second read here is ok):

HWI-ST1206:32:C3VGTACXX:5:2314:10564:44125  16  chrUn   1110931683  255 84M *   0   0   GCTAGCTGACTACTGTAGCCCCAGCCCTTATTAGCCTGCACTGTTATAAGGTGTTATGATTCACACACATTGAAATGGCCTGCA    DCDDDCECCCCCBDFFFHHHHGCIHHHFGF@FGHFIIGGGHJIJJJJJJJIJJJJJIJJJIHFIIJIJJJJJJJJJJHHHGHFF    AS:i:-10    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:37A39A6    YT:Z:UU RG:Z:70130.satrFamilyX
HWI-ST1206:32:C3VGTACXX:5:1101:4103:100272  256 chrUn   1110931763  0   84M *   0   0   TGCAGGACAATGCCAGTGAATGGAGCCTAATTACCCTCTGTAATGAAATGTCACATTTTTGTTTCTGGGTCCTGAATAGATGTG    FFHHHHHJJJJJJJJJGIIJJJJIJJJJJJIJJJJJJJJJJJJJJJJJJIJJJJIIJJJJIJJJJJJIHHHFHFFFFFFEEEEE    AS:i:-35    XS:i:-35    XN:i:0  XM:i:6  XO:i:0  XG:i:0  NM:i:6  MD:Z:13A34A5G4C1C11A10  YT:Z:UU RG:Z:70130.satrFamilyX

I personally cannot find something specific that would explain why PrintReads does not work. The pipeline that I use works perfectly when I use an assembly based on my RADseq data, yet not with the reference genome. Thus, I would like to ask, if you could perhaps help me. Thanks!

Created 2014-06-14 02:55:54 | Updated | Tags: java gatk

Uh..Does the GATK read the reference such as hg19 from the disk each time it runs some tools? As the GATK will run many tools such as RealignerTargetCreator, IndelRealigner, BaseRecalibrator in a variant detection pipeline, if it read hg19 from the disk every time, it will be time-consuming...

Created 2014-06-10 20:57:36 | Updated | Tags: variantannotator best-practices vcf developer performance walkers java gatk gatk-walkers

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

Created 2014-04-25 06:48:55 | Updated | Tags: install gatk newbie

Hi, I have just used Ubuntu. I'm want install GATK for my work. But i can't search anything about how to install GATK by Ubuntu Software Center or Teminal. Please help me. Thanks.

Created 2014-04-16 14:11:37 | Updated | Tags: unifiedgenotyper haplotypecaller bug gatk variant-calling

HI, I'd like to report a weird result from HaplotypeCaller.

We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.

However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected)

There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.

Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype.

Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter.

Many Thanks

Betty

Created 2013-11-22 09:36:51 | Updated | Tags: depthofcoverage gatk

I am using this command to calculate the depth of coverage with three different formats for the EXOME_interval.list : java -Xmx2048m -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I /home/vsinha/software/bam/group1.READS.bam.list -L /home/vsinha/software/EXOME.interval_list -R /home/vsinha/software/human_g1k_v37.fasta -dt BY_SAMPLE -dcov 5000 -l INFO --omitDepthOutputAtEachBase --omitLocusTable --minBaseQuality 0 --minMappingQuality 20 --start 1 --stop 5000 --nBins 200 --includeRefNSites --countType COUNT_FRAGMENTS -o /home/vsinha/software/group1.DATA

1st format : 1 69114 69332 1 69383 69577 1 69644 69940 1 69951 70028 1 566170 566275 1 801895 801983 1 802332 802416 1 861272 861420

Error : ERROR MESSAGE: Badly formed genome loc: Contig '1 69114 69332' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

2nd format: 1:69114-69332 1:69383-69577 1:69644-69940 1:69951-70028 1:566170-566275 1:801895-801983 1:802332-802416 1:861272-861420

Error: Badly formed genome loc: Contig '1' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

3rd Format: chr1:69114-69332 chr1:69383-69577 chr1:69644-69940 chr1:69951-70028 chr1:566170-566275 chr1:801895-801983 chr1:802332-802416 chr1:861272-861420

Error: Contig chr1 not present in sequence dictionary for merged BAM header: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]

Can you please let me know what is the problem in my file.

Regards Vishal

Created 2013-11-14 19:55:16 | Updated | Tags: license gatk

I have been asked to reach out to you to clarify the GATK license. We are hoping to package our CNV/SNP calling and annotation pipeline that includes GATK and make it available on an Amazon Machine Image (AMI) for use on the Amazon EC2 cloud for anyone who wants to replicate our protocol using their own data. Of all the tools that we use, only GATK contains license restrictions. It is our intent that our AMI be used by academic/non-commercial users, and it is not intended to be sold. We can list the AMI in the AWS Marketplace as a Bring Your Own License (BYOL) instance. However, there are no checks provided by Amazon that users do in fact have a license. Is this permissible?

Created 2013-08-22 18:55:31 | Updated | Tags: unifiedgenotyper gatk

Dear GATK Team,

I am receiving the following error while running GATK 1.6. Unfortunately, for project consistency I cannot update to a more recent version of GATK and would at least wish to understand the source of the error. I ran ValidateSamFile on the input bam files and they appear to be OK.

Any insight or advice would be greatly appreciated:

##### ERROR ------------------------------------------------------------------------------------------

##### ERROR ------------------------------------------------------------------------------------------

Abbreviated commandline used:

GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -et NO_ET \ -R "Saccharomyces_cerevisiae/UCSC/sacCer2/Sequence/WholeGenomeFasta/genome.fa" \ -dcov 5000 -I "someFile.bam" --output_mode EMIT_ALL_SITES -gvcf -l OFF \ -stand_call_conf 1 -L chrIV:1-1531919

Created 2013-08-13 21:38:39 | Updated | Tags: vcf gatk picard

I finally got the filtered VCF file from PWA + PiCard + GATK pipeline, and have 11 exome-seq data files which were processed as a list of input to GATK. In the process of getting VCF, I did not see an option of separating the 11 samples. Now, I've got two VCF files (one for SNPs and the other for indels) that each has 11 samples. My question is how to proceed from here?

Should I separate the 11 files before annotation? or annotation first then split them 11 samples to individual files? Big question here is how to split the samples from vcf files? thanks

Created 2013-08-11 15:39:42 | Updated | Tags: depthofcoverage gatk

I'm a bit confused about some of the output of GATK's DepthOfCoverage script. In the *.GATK.covdepth.sample_summary, what does the "total" column signify? Is it the total number of base pairs in the data? Because my raw read data has ~37.2 billion bp. However, GATK reports over 92 billion bp. Does the 'total' mean something else, or will I have to rerun the DepthofCoverage script?

Created 2013-08-08 16:17:33 | Updated | Tags: gatk slow

I have 15 exome-seq samples, and have been using BWA-PiCard-GATK pipeline to do the variant calling. I did not realize GATK is so slow until I have to analyze this large number of samples. In this HaplotypeCaller step, each sample seems to take at least 2 days (48+ hours). Is this normal is there's something I did wrong? Below is my command, is there anything wrong or GATK-HaplotypeCaller is known this slooooow?

java -Xmx10g -Djava.awt.headless=true -jar /Library/Java/Extensions/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -minPruning 3 \ -dcov 10 \ -R ./GATK_ref/hg19.fasta \ -I ./GATK/BQSR/sample1_realign.recal.compressed.bam \ -o ./GATK/VQSR/sample1_realign.raw.snps_indels.vcf

Is there anything I did wrong with this command? or anything I can change to make it run faster? I am using GATK 2.5 by the way. thanks!

Created 2013-04-25 11:21:53 | Updated | Tags: gatk

Hi,

I am using GATK v2.4.9 and i am trying to liftover from one build to another. Below are the arguments for LiftoverVariants:

Arguments for LiftoverVariants: -V,--variant Input VCF file -o,--out File to which variants should be written -chain,--chain Chain file -dict,--newSequenceDictionary Sequence .dict file for the new build -recordOriginalLocation,--recordOriginalLocation Should we record what the original location was in the INFO field?

Here is the command i have used, java -jar GenomeAnalysisTK.jar -T LiftoverVariants -chain canFam2ToCanFam3.over.chain -dict canFam3_dog_genome.dict -V canFam2_snp131.vcf -o canFam3_snp131.vcf

It showed an error message: ERROR MESSAGE: Walker requires a reference but none was provided.

The argument reference was not present in the documentation and the tool needs a reference argument. If it is required, what should be the reference file, new build or old build?

Thanks.

Created 2013-04-25 11:11:55 | Updated | Tags: gatk

Hi,

I am using unified genotyper utility of GATK. I don't exactly know how I should make the dbsnp file necessary. I thank you very much for any help.

Regards, Homa

Created 2012-12-09 02:19:39 | Updated | Tags: gatk

I have read this on GATK's documents:

Human sequence

If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error.

that said the reference order must be: chr1, chr2, chr3, ... chr22, chrX, chrY, chrM. but after I download all the bundle in GATK's ftp, I check's reference, it's with a order of :

>chrM
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr20
>chr21
>chr22
>chrX
>chrY
...