Tagged with #multi-sample
0 documentation articles | 0 announcements | 23 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.
Comments (1)


How many samples and of what file/genome size can the Unified Genotyper process simultaneously ? I will have about 250 samples, genome size 180Mb, bam sizes 3-5GB, reduced bam sizes ~1GB.

Also, how long would this take can it be multi-threaded ? Cheers,


Comments (2)

our group is calling variants using 5.5mb of targeted capture across ~2400 bams (for 830 subjects). i have successfully executed using unifiedgenotyper, but we would also like to run using haplotypecaller to compare. my basic approach has been to scatter the capture region into 25 equal sized chunks, run parallel and then CatVariants to gather afterward. this worked nicely for UG, but for HC estimated run-times for even these small intervals are >7 days and the jobs inevitably crash due to memory problems.

so, my question is how should i approach this large cohort problem in HC? i have read HC works best in cohorts <100, so should i divide up my cohort into chunks of 100 bams and combine afterward? seems this would undermine the benefits of joint calling, no? i have also read that GATK 3.0 is going to address this issue with HC -- have these improvements been made with the current version (which I am not currently using)? if so, is there a best practice or example available anywhere with cmd specs for situations similar to mine? finally, is it ok to split my capture region and run in parallel as i have done? this doesn't undermine the benefits of de novo local assembly by HC? thanks in advance for any input!

Comments (1)

Dear Team,

I encounter a situation when ;

If I run Haplotype caller on one .bam file - Haplotype caller called the mutation (sanger validated)

java GenomeAnalysisTK.jar -T HaplotypeCaller -nct 15 -minPruning 2 \ -R genome.fa \ -I sample_1.bam \ -L TP53.bed \ -U ALLOW_N_CIGAR_READS \ -o sample_1.vcf

But, If i run Haplotype caller with multiple bam as input - Haplotype caller failed to call the mutation in the sample mentioned above

java GenomeAnalysisTK.jar -T HaplotypeCaller -nct 15 -minPruning 2 \ -R genome.fa \ -I sample_1.bam -I sample_2.bam -I sample_3 .bam ..... -I sample_19.bam \ -L TP53.bed \ -U ALLOW_N_CIGAR_READS \ -o multisample.vcf

I used the -bamOutput option to view the .bam file HC would expect to see when calling the mutation. Please find below IGV screenshot

java GenomeAnalysisTK.jar -T HaplotypeCaller -nct 15 -minPruning 2 \ -R genome.fa \ -I sample_1.bam -I sample_2.bam -I sample_3 .bam ..... -I sample_19.bam \ -L TP53.bed \ -U ALLOW_N_CIGAR_READS \ --bamOutput multiple_all.bam --bamWriterType ALL_POSSIBLE_HAPLOTYPES \ -o multisample.vcf

1st row is HC run only on sample_1.bam
2nd row is HC run on multiple.bam input - using the --bamOutput option alone
3rd row is HC run on multiple.bam input - using the --bamOutput option + --bamWriterType ALL_POSSIBLE_HAPLOTYPES

The coverage on the 1st row (IGV) of sample1 is 262 (Ref(C):3, Mut(A):259)
The coverage on the 3rd row (IGV) file at the position of interest is 3627(Ref(C):3411, Mut(A):215, G:1 , DEL:13)

Could the reason why HC unable to call the mutation (when multiple.bam are used as input) is because the postion of interest is has too much noise ~3 base upstream? (shown in screen shot below)

Any help understanding why mulisample bam input would fail to call the variant would me most helpful.

Thanks in advance

Comments (14)

Hi, I have run GATK Unified genotyper with multisample and I am getting unexpected output. (Hard filters used)

Here are examples of the output format I get in the vcf for a sample: 0/1:43,4:PASS:24:24,0,4368, ./.:.:PASS or 0/0:2,0:2:DPFilter:6:0,6,47 as well as 0/0:12,0:37:0,37,830. The latter is the format I would expect. What does this mean? Also I get the filter PASS even when all but two samples have ./.:.:PASS and the other two have 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0 and 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0.

Also I ran HalpotypeCaller multisample and get similar output all though more of the 0/0:12,0:37:0,37,830 format....

Comments (1)

Hi there,

We are sequencing a set of regions that covers about 1.5 megabases in total. We're running into problems with VQSR -- VariantRecalibrator says there are too few variants to do recalibration. To give a sense of numbers, in one sample we have about 3000 SNVs and 600 indels.

We seem to have too few indels to do VQSR on them and have a couple of questions:

  1. Can we combine multiple samples to increase the number of variants, or does VariantRecalibrator need to work on each sample individually?

  2. If we do not use VQSR for indels, should we also avoid VQSR for the SNPs?

  3. The other question is whether joint or batch variant calling across several samples would help us in this case?

Thanks in advance!

Comments (1)

Hi, using unified genotyper with multiple samples (different lines of Medicago truncatula showing polymorphism), I would like to get a file giving me, for each sample, the alleles at each SNP position (SNP positions identified for all samples). Is it possible? I didn't find any option for this, and it would be dificult to make SNP calling for each sample and then create a new file with all of them, because SNPs positions and number are different according to the samples. Thanks in advance,


Comments (3)

Dear Support Team,

Thanks for making the GATK 2013 workshop available, those are great videos and answered most of my doubts regarding the use of your pipeline. I have started analysing my data, but I have some doubts regarding my samples. At the moment I am running a pilot analysis, and I only have 2 samples (two different strains of the same organism). I got the files already demultiplexed from my Illumina sequencer and I have mapped them separately using BWA-MEM. I was planning to analyse them separately, probably using hard filtering. Should I join the samples at some point? Or should I rather get the multiplexed file (I mean the file before demultiplexing) and then rely on PICARD tools to group the reads separately?

Thanks for your kind attention, Max

Comments (1)


I have performed multisample SNP calling and also single sample SNP calling for 8 individuals. Some of the variants detected by multi-sample SNP calling were absent in variants detected by single sample snp calling.

For example, Multisample SNP calling output:


chr1 389 . T G 178.50 . AC=6;AF=0.375;AN=16;BaseQRankSum=-0.775;DP=202;Dels=0.00;FS=26.344;HaplotypeScore=0.7725;MLEAC=6;MLEAF=0.375;MQ=12.37;MQ0=75;MQRankSum=-4.220;QD=1.08;ReadPosRankSum=0.073 GT:ADP:GQ:PL 0/1:21,21:42:35:35,0,116 0/0:11,7:18:9:0,9,68 0/1:13,5:18:18:18,0,58 0/1:8,14:21:39:39,0,59 0/1:16,16:32:59:66,0,59 0/1:14,7:20:43:43,0,85 0/1:10,20:29:19:21,0,19 0/0:9,9:18:6:0,6,44

Output from Individual SNP calling but merged into a single VCF file:


chr1 389 . T G 37.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.189;DP=33;Dels=0.00;FS=3.979;HaplotypeScore=2.9800;MLEAC=1;MLEAF=0.500;MQ0=12;MQ=12.44;MQRankSum=-1.474;QD=1.14;ReadPosRankSum=-0.794;SF=5 GT:GQP:PL:AD . . . . 0/1:59:32:66,0,59:16,16 . . .

Here we can observe that the variant is present in only 1 individual from single sample SNP calling, whereas it is present in all the individuals with multi-sample SNP calling.

Could someone comment, which of these would be reliable results?

Comments (5)


I'm working with different tools to call variants (GATK is one of them) for the same sample and I'm merging these results with

gatk -T CombineVariants -R ucsc.hg19.fasta -V:GATK GATK.vcf -V:OTHER OTHER.vcf -o combined.vcf -genotypeMergeOptions PRIORITIZE --rod_priority_list GATK,OTHER

But when I have a discrepancy between results (I mean, same chr and pos, but different call), I just get GATK's result in my result file, Is there any way to have both calls when I have discrepancies? I don't want to use -genotypeMergeOptions UNIQUIFY.


chr17   7917905 .       C       T       360.16  .       [...]   GT     0/1
chr17   7918012 .       C       T       896.16  .       [...]  GT     0/1

chr17   7917905 .       C       T       360.16  .       [...]   GT     0/1
chr17   7918012 .       C       T       896.16  .       [...]  GT     1/1

chr17   7917905 .       C       T       360.16  .       [...];set=Intersection   GT     0/1
chr17   7918012 .       C       T       896.16  .       [...];set=Intersection  GT     0/1

What I want to obtain:

chr17   7917905 .       C       T       360.16  .       [...];set=Intersection   GT     0/1
chr17   7918012 .       C       T       896.16  .       [...];set=GATK  GT     0/1
chr17   7918012 .       C       T       896.16  .       [...];set=OTHER  GT     1/1

Comments (1)


Is there a way do single sample variant calling but have the output directly in a mutli-sample vcf? We now do single sample calling and after that combine the vcf file to a multi-sample vcf file. The creates really large intermediary files (with all reference positions) and takes much longer than multi-sample calling.

We have some NGS data from a non standard experiment with non standard samples, were we are looking at small differences between the samples. We chose to do single sample variant calling because we don't want the the data from one sample to influence the calls in the other. And we are not sure what the assumptions for multi-sample calling are and if they apply for our experiment.

Comments (4)

Hi :), I'am new to NGS and have a questions reagarding the filtering. I have 13 individuals, 3 of them with a coverage of 11x-15x, the rest with a coverage of 5x-7x. I did multi-sample SNPcalling and hard filtering, the latter as there are no known SNPs so far. Now I am not sure how to set the minimal SNP quality (QUAL). On the best practise it is suggested to juse at least 30 for individuals with a coverage of at least 30, but 4 if the coverage is below 10. So what is the best way to set the QUAL filter?

many thanks in advance

Comments (5)

we are running tests trying to get UG to produce 1 vcf per sample when inputting bams from multiple subjects. our situation is complicated slightly by the fact that each sample has 3 bams. when we input all 6 bams into UG, hoping to output 2 vcfs (1 per sample) we instead get a single vcf. we found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.

details include: 1) we are inputting 6 bams for our test, 3 per sample for 2 samples. 2) bams were generated using Bioscope from targeted capture reads sequenced on a Solid 4. 3) as recommended in the post above we checked out the @RG statements in the bam headers using Samtools -- lines for the 6 bams are as follows:

sample 1:

@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1

@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1

@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1

sample 2:

@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1

@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1

@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1

Based on the former post, I would have expected each of these bams to generate a separate vcf as it appears the ids are all different (which would not have been desirable either, as we are hoping to generate 2 vcfs in this test). Thus, it is not clear if/how we should use Picard tool AddOrReplaceReadGroups to modify the @RG headers?

Does that make sense? Any advice?

Comments (3)

How does the haplotype caller handle multiple samples? Does it do a local denovo assembly for every sample and then compare those or does it do 1 local denovo assembly using the reads of all the samples?

Can I run the haplotype caller on a set of 9 solid (50bp fragment and 50 x 35 PE) together with 1 ilumina sample (100 x 100 PE) ?

Comments (2)


I have a vcf containing multiple samples. I would like to put the bam files also as input for the Variant Annotator but how does the variant annotator know which bam is for wich column in the vcf? Does the order of the args of the bam files need to correspond to the order of the samples columns in the vcf?



Comments (4)


I'm working with a non-model species, concretely with citrus genus samples. What we do is basically to search target SNVs which could be responsible for the most of the phenotypic differences between citrus varieties/species.

Due to the absence of a reference assembly for every available species we have implement our genotyping pipeline by mapping all citrus species against the same reference genome (concretely, clementine genome). We are aware that this approach can produce unequal bias that are proportional to the sample-to-reference species distance, but at the end, we know that citrus genus species are relatively close, and its quite easy to find many conserved regions between them.

My question is about multi-sample calling. We are confident on performing multi-sample calling when we compare intra-species samples, but we are not so sure to follow the same methodology when we compare distant samples that don't share a considerable proportion of variants.

What do you recommend us?

We assume two alternatives

1 -Perform multisample-calling, understanding that despite of genomic heterogeneity the variants will be still detected. 2- Perform independent callings, and combine them after (by using CombineVariants tool)

Thanks in advance Jose

Comments (6)


I've been going through the VQSR documentation/guide and haven't been able to pin down an answer to how it behaves on multi-sample VCF (generated by multi-sample calling with UG). Should VQSR be run on this? Or on each sample separately, given that coverage and other statistics used to determine the variant confidence score aren't the same for each sample and so can lead to conflicting determinations on different samples.

What is the best way to go about this?

Many thanks.

Comments (1)

The best practice guide states to call variants across all samples simultaneously. Besides the ease of working with one multi-sample VCF, what advantages are there to calling the variants at the same time? Does GATK leverage information across all samples when making calls? If so, what assumptions is the UnifiedGenotyper making about the relationship of these samples to each other, and what are the effects on the variant calls?

thanks, Justin

Comments (4)


I want to know what's the best way to use VariantEval to get statistics for each sample in a multisample VCF file. If I call it like this: java -jar GenomeAnalysisTK.jar \ -R ucsc.hg19.fasta \ -T VariantEval \ -o multisample.eval.gatkreport \ --eval annotated.combined.vcf.gz \ --dbsnp dbsnp_137.hg19.vcf where annotated.combined.vcf.gz is a VCF file that contains ~1Mio variants for ~800 samples I get statistics for all samples combined, e.g.

#:GATKReport.v1.1:8 #:GATKTable:11:3:%s:%s:%s:%s:%s:%d:%d:%d:%.2f:%d:%.2f:; #:GATKTable:CompOverlap:The overlap between eval and comp sites CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants ... CompOverlap dbsnp eval none all 471704 191147
CompOverlap dbsnp eval none known 280557 0 CompOverlap dbsnp eval none novel 191147 191147

But I would like to get one such entry per sample. Is there an easy way to do this?

Thanks, Thomas

Comments (1)

Hi, I was wondering if you have any advice on the effects of including samples of different ancestry in the multiple sample variant calling. We run PCA after calling variants to identify any potential outliers. In case of identifying one, do we have to re-do the SNP and Indel calling step?

Any help most appreciated.


Comments (1)


I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.

The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?

Is there any difference or reason to choose one of the following two workflows over the other?

  1. Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller and then with my multi-sample VCF perform the variant recalibration procedure and then split the individual call sets out of the multi-sample vcf with SelectVariants?

  2. Take 30+ individual vcf files, merge them together, and then perform variant recalibration on the merged vcf and then split the individual call sets out of the multi-sample vcf with SelectVariants?

  3. Or some third option I'm missing

Any help is appreciated.


Comments (3)


I have 15 affected samples. 2 are whole exome and 13 are whole genome. They have already been realigned on a single-sample level and had BQSR performed. I am contemplating running UnifiedGenotyper on all 15 samples together because we would like to compare the calls across the samples (especially in the coding regions). I am aware that there would be a large number of variant calls in the whole genome samples that would have little to no coverage in the exome samples. I haven't been able to find any posts that say you should or shouldn't run whole genome and exome samples through UnifiedGenotyper together. Are there any reasons why this should be discouraged?

Also, assuming I do perform multi-sample calling across all 15 samples, would it be ok to run that multi-sample VCF file through VQSR?

Thanks! Jared

Comments (3)


I am trying to run GATK on a sample of 119 exomes. I followed the GATK guidelines to process the fastq files. I used the following parameters to call the UnifiedGenotyper and VQSR [for SNPs]:


-T UnifiedGenotyper 
--output_mode EMIT_VARIANTS_ONLY 
--min_base_quality_score 30 
--max_alternate_alleles 5 
-glm SNP 


-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /media/transcription/cipn/5.pt/ref/hapmap_3.3.hg19.sites.vcf 
-resource:omni,known=false,training=true,truth=false,prior=12.0 /media/transcription/cipn/5.pt/ref/1000G_omni2.5.hg19.sites.vcf 
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /media/transcription/cipn/5.pt/ref/dbsnp_135.hg19.vcf.gz 
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff 
-mode SNP 

I get a tranche plot, which does not look OK. The "Number of Novel Variants [1000s]" goes from -400 to 800 and the Ti/Tv ratio varies from 0.633 to 0.782 [the attach file link is not working for me and am unable to upload the plot]. Any suggestion to rectify this would be very helpful !

cheers, Rahul

Comments (1)

Can anyone tell me the differences between GATK's single sample and muliti sample calling methods? What extra information does the muliti sample calling methods can give? I want to know if "the more samples, the more accuracy result"?