Known site file format for Indel Realignment and BQSR (mouse/mm10)
Posted in Ask the GATK team | Last updated on


Comments (5)

Hello,

I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.

I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:

INFO 15:12:09,747 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02 INFO 15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list INFO 15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09 INFO 15:12:09,758 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,759 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF INFO 15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT INFO 15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 INFO 15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file INFO 15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation INFO 15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files INFO 15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy INFO 15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] I've checked the indel interval lists for my samples and they do all appear to contain different intervals.

However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:

`##### ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
ERROR ------------------------------------------------------------------------------------------`

Which means that the SNP vcf (which has the same format as the indel vcf) is not used by BQSR.

My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.

Thanks very much!


Return to top Comment on this article in the forum