The Best Practices guide recommends calling variants across all samples simultaneously. Besides the ease of working with one multi-sample VCF, what advantages are there to calling the variants at the same time? Does GATK leverage information across all samples when making calls? If so, what assumptions is the UnifiedGenotyper making about the relationship of these samples to each other, and what are the effects on the variant calls?
I have 15 affected samples. 2 are whole exome and 13 are whole genome. They have already been realigned on a single-sample level and had BQSR performed. I am contemplating running UnifiedGenotyper on all 15 samples together because we would like to compare the calls across the samples (especially in the coding regions). I am aware that there would be a large number of variant calls in the whole genome samples that would have little to no coverage in the exome samples. I haven't been able to find any posts that say you should or shouldn't run whole genome and exome samples through UnifiedGenotyper together. Are there any reasons why this should be discouraged?
Also, assuming I do perform multi-sample calling across all 15 samples, would it be ok to run that multi-sample VCF file through VQSR?
I'm using several tools (GATK is one of them) to call variants on the same sample, and I'm merging the results with:
gatk -T CombineVariants -R ucsc.hg19.fasta -V:GATK GATK.vcf -V:OTHER OTHER.vcf -o combined.vcf -genotypeMergeOptions PRIORITIZE --rod_priority_list GATK,OTHER
But when there is a discrepancy between the results (that is, same chr and pos but a different call), I only get GATK's result in my output file. Is there any way to keep both calls when they disagree? I don't want to use -genotypeMergeOptions UNIQUIFY.
GATK.vcf:
chr17 7917905 . C T 360.16 . [...] GT 0/1
chr17 7918012 . C T 896.16 . [...] GT 0/1
OTHER.vcf:
chr17 7917905 . C T 360.16 . [...] GT 0/1
chr17 7918012 . C T 896.16 . [...] GT 1/1
combined.vcf:
chr17 7917905 . C T 360.16 . [...];set=Intersection GT 0/1
chr17 7918012 . C T 896.16 . [...];set=Intersection GT 0/1
What I want to obtain:
combined.vcf:
chr17 7917905 . C T 360.16 . [...];set=Intersection GT 0/1
chr17 7918012 . C T 896.16 . [...];set=GATK GT 0/1
chr17 7918012 . C T 896.16 . [...];set=OTHER GT 1/1
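To make the desired behaviour concrete, here is a minimal Python sketch of the merge rule described above, written outside GATK. It is not how CombineVariants works internally; the function name `merge_calls` and the tuple layout are made up for illustration, and only the two sites from the example are used as input. A site is collapsed into one `Intersection` record only when every caller reports it with the same genotype; otherwise one record per caller is kept.

```python
def merge_calls(callsets):
    """Merge per-caller calls; keep discordant genotypes as separate records.

    callsets: dict mapping caller name -> list of (chrom, pos, ref, alt, gt).
    Returns a list of (chrom, pos, ref, alt, set_label, gt) tuples.
    """
    by_site = {}
    for caller, records in callsets.items():
        for chrom, pos, ref, alt, gt in records:
            by_site.setdefault((chrom, pos, ref, alt), []).append((caller, gt))

    merged = []
    for (chrom, pos, ref, alt), calls in by_site.items():
        genotypes = {gt for _, gt in calls}
        if len(calls) == len(callsets) and len(genotypes) == 1:
            # Every caller saw the site and agreed on the genotype.
            merged.append((chrom, pos, ref, alt, "Intersection", calls[0][1]))
        else:
            # Disagreement (or site missing from a caller): one record per caller.
            for caller, gt in calls:
                merged.append((chrom, pos, ref, alt, caller, gt))
    return merged


# The two sites from the example above.
gatk = [("chr17", 7917905, "C", "T", "0/1"), ("chr17", 7918012, "C", "T", "0/1")]
other = [("chr17", 7917905, "C", "T", "0/1"), ("chr17", 7918012, "C", "T", "1/1")]

merged = merge_calls({"GATK": gatk, "OTHER": other})
for record in merged:
    print(*record, sep="\t")
```

With these inputs the first site comes out once with the label Intersection, and the second site comes out twice, once per caller, which is the layout shown under "What I want to obtain".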
How does the HaplotypeCaller handle multiple samples? Does it do a local de novo assembly for every sample and then compare those, or does it do one local de novo assembly using the reads of all the samples?
Can I run the HaplotypeCaller on a set of 9 SOLiD samples (50 bp fragment and 50 x 35 PE) together with 1 Illumina sample (100 x 100 PE)?
Hi, I was wondering if you have any advice on the effects of including samples of different ancestry in multi-sample variant calling. We run PCA after calling variants to identify any potential outliers. If we identify one, do we have to redo the SNP and indel calling step?
Any help most appreciated.
I'm working with a non-model species, specifically samples from the genus Citrus. What we do is basically search for target SNVs that could be responsible for most of the phenotypic differences between citrus varieties/species.
Because a reference assembly is not available for every species, we have implemented our genotyping pipeline by mapping all citrus species against the same reference genome (the clementine genome). We are aware that this approach can introduce a bias that is proportional to the distance between each sample and the reference species, but we know that the species in the genus are relatively close, and it's quite easy to find many conserved regions between them.
My question is about multi-sample calling. We are confident in performing multi-sample calling when we compare intra-species samples, but we are not so sure about following the same methodology when we compare distant samples that don't share a considerable proportion of variants.
What do you recommend?
We see two alternatives:
1 - Perform multi-sample calling, on the understanding that the variants will still be detected despite the genomic heterogeneity.
2 - Perform independent calling and combine the results afterwards (using the CombineVariants tool).
Thanks in advance, Jose
I have performed multi-sample SNP calling and also single-sample SNP calling for 8 individuals. Some of the variants detected by multi-sample SNP calling were absent from the variants detected by single-sample SNP calling.
For example, the multi-sample SNP calling output:
chr1 389 . T G 178.50 . AC=6;AF=0.375;AN=16;BaseQRankSum=-0.775;DP=202;Dels=0.00;FS=26.344;HaplotypeScore=0.7725;MLEAC=6;MLEAF=0.375;MQ=12.37;MQ0=75;MQRankSum=-4.220;QD=1.08;ReadPosRankSum=0.073 GT:AD:DP:GQ:PL 0/1:21,21:42:35:35,0,116 0/0:11,7:18:9:0,9,68 0/1:13,5:18:18:18,0,58 0/1:8,14:21:39:39,0,59 0/1:16,16:32:59:66,0,59 0/1:14,7:20:43:43,0,85 0/1:10,20:29:19:21,0,19 0/0:9,9:18:6:0,6,44
Output from individual SNP calling, merged into a single VCF file:
chr1 389 . T G 37.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.189;DP=33;Dels=0.00;FS=3.979;HaplotypeScore=2.9800;MLEAC=1;MLEAF=0.500;MQ0=12;MQ=12.44;MQRankSum=-1.474;QD=1.14;ReadPosRankSum=-0.794;SF=5 GT:GQ:DP:PL:AD . . . . 0/1:59:32:66,0,59:16,16 . . .
Here we can see that single-sample SNP calling reports the variant in only 1 individual, whereas multi-sample SNP calling reports it in six of the eight individuals (AC=6).
Could someone comment on which of these results is more reliable?
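For anyone who wants to check the genotypes in the multi-sample record column by column, here is a small Python sketch that splits the eight sample columns and tallies the alternate alleles. One assumption to flag loudly: the FORMAT key order `GT:AD:DP:GQ:PL` is inferred from the five colon-separated values in each sample column, so treat it as a guess rather than something taken from the VCF header. The helper name `parse_sample` is made up for illustration.

```python
# Assumed FORMAT key order (inferred from the five values per sample column).
FORMAT = "GT:AD:DP:GQ:PL"

# Sample columns copied from the multi-sample record at chr1:389.
SAMPLE_COLUMNS = (
    "0/1:21,21:42:35:35,0,116 0/0:11,7:18:9:0,9,68 0/1:13,5:18:18:18,0,58 "
    "0/1:8,14:21:39:39,0,59 0/1:16,16:32:59:66,0,59 0/1:14,7:20:43:43,0,85 "
    "0/1:10,20:29:19:21,0,19 0/0:9,9:18:6:0,6,44"
).split()


def parse_sample(fmt, column):
    """Pair each FORMAT key with the matching value of one sample column."""
    return dict(zip(fmt.split(":"), column.split(":")))


parsed = [parse_sample(FORMAT, col) for col in SAMPLE_COLUMNS]

# Count ALT alleles over all genotypes and list the samples carrying one.
alt_allele_count = sum(p["GT"].split("/").count("1") for p in parsed)
carriers = [i for i, p in enumerate(parsed, start=1) if "1" in p["GT"].split("/")]

for i, p in enumerate(parsed, start=1):
    print(f"sample {i}: GT={p['GT']} AD={p['AD']} GQ={p['GQ']}")
print("ALT alleles:", alt_allele_count, "carriers:", carriers)
```

Under that assumption the tally agrees with AC=6 in the INFO field, and sample 5 (AD 16,16, PL 66,0,59) is the same individual that carries the call in the single-sample output.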
Is there a way to do single-sample variant calling but have the output directly in a multi-sample VCF? We currently do single-sample calling and then combine the VCF files into a multi-sample VCF file. This creates really large intermediate files (with all reference positions) and takes much longer than multi-sample calling.
We have some NGS data from a non-standard experiment with non-standard samples, where we are looking at small differences between the samples. We chose to do single-sample variant calling because we don't want the data from one sample to influence the calls in the others. We are also not sure what the assumptions for multi-sample calling are, or whether they apply to our experiment.
We are running tests trying to get UG to produce 1 VCF per sample when inputting BAMs from multiple subjects. Our situation is complicated slightly by the fact that each sample has 3 BAMs. When we input all 6 BAMs into UG, hoping to output 2 VCFs (1 per sample), we instead get a single VCF. We found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.
Details: 1) We are inputting 6 BAMs for our test, 3 per sample for 2 samples. 2) The BAMs were generated using Bioscope from targeted capture reads sequenced on a SOLiD 4. 3) As recommended in the post above, we checked the @RG lines in the BAM headers using samtools; the lines for the 6 BAMs are as follows:
@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1
@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1
@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1
@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1
@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1
@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1
Based on the former post, I would have expected each of these BAMs to generate a separate VCF, as it appears the IDs are all different (which would not have been desirable either, as we are hoping to generate 2 VCFs in this test). Thus, it is not clear whether or how we should use the Picard tool AddOrReplaceReadGroups to modify the @RG headers.
Does that make sense? Any advice?
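One thing that may help when reading these headers: GATK assigns reads to samples by the SM tag of the read group, not by the ID. The Python sketch below simply groups the six @RG lines quoted above by their SM value; the function name `read_groups_by_sample` is made up for illustration, and the lines are split on whitespace because that is how they appear in this post (a real header is tab-separated).

```python
from collections import defaultdict

# The six @RG lines quoted above, one per BAM.
RG_LINES = [
    "@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1",
    "@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1",
    "@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1",
    "@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1",
    "@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1",
    "@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1",
]


def read_groups_by_sample(rg_lines):
    """Map each SM (sample) value to the list of read-group IDs that carry it."""
    groups = defaultdict(list)
    for line in rg_lines:
        # Skip the leading "@RG" token; split each TAG:value on the first colon
        # only, because the DT value itself contains colons.
        tags = dict(field.split(":", 1) for field in line.split()[1:])
        groups[tags["SM"]].append(tags["ID"])
    return dict(groups)


sample_to_ids = read_groups_by_sample(RG_LINES)
for sample, ids in sample_to_ids.items():
    print(sample, len(ids), "read groups")
```

All six read groups land under the single sample name S1, which would be consistent with the six BAMs being treated as one sample. If they really come from two subjects, the SM values would presumably need to differ between the two sets of three BAMs.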
I've been going through the VQSR documentation/guide and haven't been able to pin down how it behaves on a multi-sample VCF (generated by multi-sample calling with UG). Should VQSR be run on this, or on each sample separately, given that coverage and the other statistics used to determine the variant confidence score aren't the same for each sample and so could lead to conflicting determinations for different samples?
What is the best way to go about this?
I have a VCF containing multiple samples. I would also like to give the BAM files as input to the VariantAnnotator, but how does the VariantAnnotator know which BAM belongs to which column in the VCF? Does the order of the BAM file arguments need to correspond to the order of the sample columns in the VCF?
I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.
The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?
Is there any difference or reason to choose one of the following two workflows over the other?
Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller and then with my multi-sample VCF perform the variant recalibration procedure and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Take 30+ individual vcf files, merge them together, and then perform variant recalibration on the merged vcf and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Or is there some third option I'm missing?
Any help is appreciated.
I want to know what's the best way to use VariantEval to get statistics for each sample in a multisample VCF file. If I call it like this:
java -jar GenomeAnalysisTK.jar \
-R ucsc.hg19.fasta \
-T VariantEval \
-o multisample.eval.gatkreport \
--eval annotated.combined.vcf.gz \
where annotated.combined.vcf.gz is a VCF file that contains ~1 million variants for ~800 samples, I get statistics for all samples combined, e.g.
#:GATKTable:CompOverlap:The overlap between eval and comp sites
CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants ...
CompOverlap dbsnp eval none all 471704 191147
CompOverlap dbsnp eval none known 280557 0
CompOverlap dbsnp eval none novel 191147 191147
But I would like to get one such entry per sample. Is there an easy way to do this?
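To illustrate what "one entry per sample" could mean at its simplest, here is a Python sketch that counts, for each sample column, the sites where that sample carries at least one non-reference allele. It is not a replacement for VariantEval and computes none of its metrics; the function name `per_sample_variant_counts` and the three-sample toy VCF are hypothetical and exist only so the example runs on its own.

```python
import io


def per_sample_variant_counts(handle):
    """Count sites with a non-reference genotype for each sample in a VCF."""
    names, counts = [], {}
    for line in handle:
        if line.startswith("##"):
            continue  # meta-information lines
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            names = cols[9:]  # sample names follow the FORMAT column
            counts = {name: 0 for name in names}
            continue
        for name, column in zip(names, cols[9:]):
            # GT is the first FORMAT field when it is present.
            alleles = column.split(":")[0].replace("|", "/").split("/")
            if any(a not in ("0", ".") for a in alleles):
                counts[name] += 1
    return counts


# Hypothetical toy input: three samples, three sites.
TOY_VCF = "\n".join(
    "\t".join(row)
    for row in [
        ["##fileformat=VCFv4.1"],
        ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2", "S3"],
        ["chr1", "100", ".", "A", "G", "50", ".", ".", "GT", "0/1", "0/0", "1/1"],
        ["chr1", "200", ".", "C", "T", "60", ".", ".", "GT", "0/0", "0/0", "0/1"],
        ["chr1", "300", ".", "G", "A", "70", ".", ".", "GT", "./.", "0/1", "0/1"],
    ]
) + "\n"

counts = per_sample_variant_counts(io.StringIO(TOY_VCF))
print(counts)
```

For a compressed file such as annotated.combined.vcf.gz, the same function could be fed a handle opened with `gzip.open(path, "rt")`.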
I am trying to run GATK on a sample of 119 exomes. I followed the GATK guidelines to process the fastq files. I used the following parameters to call the UnifiedGenotyper and VQSR [for SNPs]:
-T UnifiedGenotyper --output_mode EMIT_VARIANTS_ONLY --min_base_quality_score 30 --max_alternate_alleles 5 -glm SNP
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /media/transcription/cipn/5.pt/ref/hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 /media/transcription/cipn/5.pt/ref/1000G_omni2.5.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /media/transcription/cipn/5.pt/ref/dbsnp_135.hg19.vcf.gz -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mode SNP
I get a tranche plot that does not look right. The "Number of Novel Variants [1000s]" axis goes from -400 to 800 and the Ti/Tv ratio varies from 0.633 to 0.782 (the attach-file link is not working for me, so I am unable to upload the plot). Any suggestions for fixing this would be very helpful!
Hi :), I'm new to NGS and have a question regarding filtering. I have 13 individuals, 3 of them with a coverage of 11x-15x, the rest with a coverage of 5x-7x. I did multi-sample SNP calling and hard filtering, the latter because there are no known SNPs so far. Now I am not sure how to set the minimum SNP quality (QUAL). The Best Practices suggest using at least 30 for individuals with a coverage of at least 30, but 4 if the coverage is below 10. So what is the best way to set the QUAL filter?
Many thanks in advance.
Can anyone tell me the differences between GATK's single-sample and multi-sample calling methods? What extra information can the multi-sample calling methods give? I want to know whether more samples means more accurate results.