I've run the IndelRealigner on my mouse WGS *bam files with known site data from the Sanger MGP, and now I'm trying to figure out how "well" it worked.
The list created by RealignerTargetCreator contains 6547185 intervals
Parsing the output realigned.bam file for reads that had an "OC" tag added (as suggested in http://www.broadinstitute.org/gatk/events/3391/GATKw1310-BP-2-Realignment.pdf) shows that 1648299 reads were actually realigned.
I used the default settings, which means that
1) -model was USE_READS - and from what I've read, this is the correct option to use, given that Smith-Waterman modelling doesn't give greatly improved results;
2) -LOD was 5.0 - but for my data, which is mouse whole-genome sequence at average 10x coverage, this may be too high and I might be losing true positives.
I've tried randomly picking out candidate intervals from the intervals and OC-tagged reads from the realigned.bam file to check, but I was wondering if there's a more empirical way of checking how good the realignment was (I realise there's "no formal measure" as per the presentation but I'm finding it hard to make a judgement call!).
My feeling from looking at the intervals or realigned reads is that the low coverage is a major issue in terms of identifying "true" indels, so preferably we'd go for specificity over sensitivity.
Thanks for any advice/suggestions in advance!
More fun with mouse known site data! I'm using the Sanger MGP v3 known indel/known SNP sites for the IndelRealigner and BQSR steps.
I'm working with whole-genome sequence; however, the known sites have been filtered for the following contigs (example from the SNP vcf):
##fileformat=VCFv4.1 ##samtoolsVersion=0.1.18-r572 ##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa ##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ) ##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus) ##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2) ##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5) ##contig=<ID=1,length=195471971> ##contig=<ID=10,length=130694993> ##contig=<ID=11,length=122082543> ##contig=<ID=12,length=120129022> ##contig=<ID=13,length=120421639> ##contig=<ID=14,length=124902244> ##contig=<ID=15,length=104043685> ##contig=<ID=16,length=98207768> ##contig=<ID=17,length=94987271> ##contig=<ID=18,length=90702639> ##contig=<ID=19,length=61431566> ##contig=<ID=2,length=182113224> ##contig=<ID=3,length=160039680> ##contig=<ID=4,length=156508116> ##contig=<ID=5,length=151834684> ##contig=<ID=6,length=149736546> ##contig=<ID=7,length=145441459> ##contig=<ID=8,length=129401213> ##contig=<ID=9,length=124595110> ##contig=<ID=X,length=171031299> ##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4) "> ##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]"> ##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps "> ##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality) "> ##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4) "> ##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4) "> ##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4) "> ##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4) "> ##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ) "> ##FILTER=<ID=Qual,Description="Minimum value of the QUAL field "> ##FILTER=<ID=RefN,Description="Reference base is N "> ##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered "> ##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]"> ##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB) ">
When I was trying to use these known sites at the VariantRecalibration step, I got a lot of walker errors saying that (I paraphrase) "it's dangerous to use this known site data on your VCF because the contigs of your references do not match".
However, if you look at the GRCm38_68.fai it DOES include the smaller scaffolds which are present in my data.
So, my question is: how should I filter my bam files for the IndelRealigner and downstream steps? I feel like the best option is to filter on the contigs present in the known site vcfs, but obviously that would throw out a proportion of my data.
Thanks very much!
I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.
I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:
INFO 15:12:09,747 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
INFO 15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list
INFO 15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09
INFO 15:12:09,758 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:12:09,759 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF
INFO 15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO 15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file
INFO 15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy
INFO 15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
I've checked the indel interval lists for my samples and they do all appear to contain different intervals.
However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:
`##### ERROR ------------------------------------------------------------------------------------------
Which means that the SNP vcf (which has the same format as the indel vcf) is not used by BQSR.
My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.
Thanks very much!
Hello, I am just trying VariantRecalibrator on my 4 samples:
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R gatk.ucsc.mm10.fa -input UnifiedGenotyper.output.snps.raw.vcf -nt 14 -recalFile file_for_ApplyRecalibration.recal -tranchesFile file_for_ApplyRecalibration.tranches -resource:sanger,known=false,training=true,truth=true mgp.v2.snps.annot.reformat.vcf -resource:dbnsp,known=true,training=false,truth=false,prior=6.0 mm10_dbsnp.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mG 4 -percentBad 0.05
which starts running then gives me this error: INFO 08:26:57,741 GATKRunReport - Uploaded run statistics report to AWS S3
java.lang.NumberFormatException: For input string: "." at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:481) at java.lang.Integer.valueOf(Integer.java:582) at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec.decodeInts(AbstractVCFCodec.java:680) at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:641) at org.broadinstitute.sting.utils.codecs.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:92) at org.broadinstitute.sting.utils.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:130) at org.broadinstitute.sting.utils.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:120) at org.broadinstitute.sting.utils.variantcontext.GenotypesContext.iterator(GenotypesContext.java:461) at org.broadinstitute.sting.utils.variantcontext.VariantContext.getCalledChrCount(VariantContext.java:922) at org.broadinstitute.sting.utils.variantcontext.VariantContext.getCalledChrCount(VariantContext.java:908) at org.broadinstitute.sting.utils.variantcontext.VariantContext.isMonomorphicInSamples(VariantContext.java:937) at org.broadinstitute.sting.utils.variantcontext.VariantContext.isPolymorphicInSamples(VariantContext.java:948) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantDataManager.isValidVariant(VariantDataManager.java:278) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantDataManager.parseTrainingSets(VariantDataManager.java:263) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.map(VariantRecalibrator.java:259) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.map(VariantRecalibrator.java:107) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:243) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:231) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:120) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:67) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:23) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:73) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)
I've used all three VCFs in other GATK tools without issues. Any help greatly appreciated!, many thanks, Lavinia.
I was wondering if anyone has used VQSR for a mouse related genome project. I am working with mm10 dbsnp and DNA-seq short insert data for multiple homozygous mouse samples. I have obtained decent results so far using the mm10 dbsnp as the training set, but was curious to see if anyone had any recommendations as to what settings to use. Any input is appreciated. I also have a lot of RNA-seq data, but that will come at a much later point in time. Thanks!
I was calling SNP from Mouse samples using GATK and was in the step of "Variant quality score recalibration". The VariantRecalibrator walker asked me to provide training sets for mouse SNPs. The only SNP data (for the USCS mm9 assembly) I can find right now is the dbSNP. So I tried the run VariantRecalibrator like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Refseq.fa -input snps.raw.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 snp128.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mode BOTH -recalFile output.recal -tranchesFile output.tranches -rscriptFile output.plots.R
However, the program asked for more:
I was wondering if I can change the parameters by setting both the training/truth to true:
or should I ignore the "-resource" option at the cost of being less accurate?
Any suggestions would be greatly appreciated.