We recommend performing variant discovery in a way that enables joint analysis of multiple samples, as laid out in our Best Practices workflow. That workflow includes a joint analysis step that empowers variant discovery by providing the ability to leverage population-wide information from a cohort of multiple sample, allowing us to detect variants with great sensitivity and genotype samples as accurately as possible. Our workflow recommendations provide a way to do this in a way that is scalable and allows incremental processing of the sequencing data.
The key point is that you don’t actually have to call variants on all your samples together to perform a joint analysis. We have developed a workflow that allows us to decouple the initial identification of potential variant sites (ie variant calling) from the genotyping step, which is the only part that really needs to be done jointly. Since GATK 3.0, you can use the HaplotypeCaller to call variants individually per-sample in
-ERC GVCF mode, followed by a joint genotyping step on all samples in the cohort, as described in this method article. This achieves what we call incremental joint discovery, providing you with all the benefits of classic joint calling (as described below) without the drawbacks.
Why "almost always"? Because some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method. For most studies, this is an acceptable tradeoff (which is reduced by the availability of high quality sequencing data), but if you are very specifically looking for singletons, you may need to do some careful evaluation before committing to this method.
Until recently, three strategies were available for variant discovery in multiple samples:
- single sample calling: sample BAMs are analyzed individually, and individual call sets are combined in a downstream processing step;
- batch calling: sample BAMs are analyzed in separate batches, and batch call sets are merged in a downstream processing step;
- joint calling: variants are called simultaneously across all sample BAMs, generating a single call set for the entire cohort.
The best of these, from the point of view of variant discovery, was joint calling, because it provided the following benefits:
Batch-calling does not output a genotype call at sites where no member in the batch has evidence for a variant; it is thus impossible to distinguish such sites from locations missing data. In contrast, joint calling emits genotype calls at every site where any individual in the call set has evidence for variation.
By sharing information across all samples, joint calling makes it possible to “rescue” genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location. However this does not apply to singletons, which are unique to a single sample. To minimize the chance of missing singletons, we increase the cohort size -- so that singletons themselves have less chance of happening in the first place.
The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies above, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.
Figure 1: Power of joint calling in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable. (right) Importance of joint calling to square off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. However, merging the results from single sample calling will incorrectly treat both of these samples identically as being non-informative.
There are two major problems with the joint calling strategy.
- Scaling & infrastructure
Joint calling scales very badly -- the calculations involved in variant calling (especially by methods like the HaplotypeCaller’s) become exponentially more computationally costly as you add samples to the cohort. If you don't have a lot of compute available, you run into limitations pretty quickly. Even here at Broad where we have fairly ridiculous amounts of compute available, we can't brute-force our way through the numbers for the larger cohort sizes that we're called on to handle.
- The N+1 problem
When you’re getting a large-ish number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time, and you analyze each batch as it comes in (whether it’s because the analysis is time-sensitive or your PI is breathing down your back). But that’s not joint calling, that’s batch calling, and it doesn’t give you the same significant gains that joint calling can give you. Unfortunately the joint calling approach doesn’t allow for incremental analysis -- every time you get even one new sample sequence, you have to re-call all samples from scratch.
Hi, I'm using GATK ver 3.4 for SNP calling and I have some question about it. My data set has 500 samples, and I used genome data as reference for bowtie/GATK
1) I called SNP by sample (gvcf) with haplotype and then combined gvcf, however, the combination takes a long time, the GATK wants to recreate gvcf.idx files (4 of my gatk mission stuck at this step), one gatk combination finished after about 20 days calculation. I also try to use '-nct' to improve this, but it still stuck at preparing idx files.
2) For that finished gatk combination data set, I also used Unifiedgenotype with Gr.sorted.bam as input to call SNPs. The result is output with Gr.sorted.bam has 5 times more SNPs number than gvcf combination, and most missing SNPs could be found in individual gvcf files but missing in final result.
Could you help me with these? Thank you!
Dear Professor, I met a question, when I use VariantsToTable to generate a new table . My vcf format as the below:
I wanna output the specified Sample information column, but I try -F parameters , it's not available!
eg . -F Sample_B
I think maybe this tool can not provide this function ??
I would like to force call a list of variants across my cohort using HaplotypeCaller to get more accurate QC metrics for each variant. I am using the following command:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -et NO_ET -K my.key -I my.cohort.list --alleles my.vcf -L my.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -stand_call_conf 30.0 -stand_emit_conf 0.0 -dt NONE -o final_my.vcf
Here is a link to the input VCF file: VCF File
Unfortunately, I keep running into the following error (I've tried GATK ver3.3 and ver3.5):
INFO 18:49:21,288 ProgressMeter - chr1:11177077 21138.0 49.5 m 39.0 h 69.4% 71.3 m 21.8 m
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3.0-mssm-0-gaa95802):
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR MESSAGE: Index: 3, Size: 3
##### ERROR ------------------------------------------------------------------------------------------
Would appreciate your help in solving this issue.
I am trying to use GATK HP ( v3.5 ) to call SNPs in amplicon seq data of a small genome of around 600bp. As shown in the attachment, the variants are not called between location 50 and 60, despite high coverage across many samples. ( There are total 96 samples ).
The base qualities are above 30 and mapping quality is also 60 ( bwa mem ). I also did not remove duplicates ( not marked as well ) as its amplicon seq data.
The command I used was ( multisample SNP calling ) :
java -Xmx50g -jar GenomeAnalysisTK.jar -nct 2 -R
Its the same case even with base quality 20.
Thanks in advance.
Hi, I have generated 9 .g.vcf files from my .bam files, and want to do joint genotype calling with all these gvcf files using GenotypeGVCFs tool. I have the following command:
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fasta -V:VCF gvcflist.list -o output.vcf
In my output, I expected to find 9 sample columns for the genotype (ie 0/1:195,32:250:99:534,0,6705), however I can only find one sample column in the vcf file, and it seems that all 9 samples were pooled together. May I know if I have done something wrong? I would like to know individual genotype for each sample.
I have tried to use:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I sam1.bam -I sam2.bam -I sam3.bam ....
and also only obtain one column instead of multiple-sample column for the genotype.
Hi everybody! I've recently started working with GATK and after reading documents, tutorial and forum discussions, I set a pipeline for my experiment. I'm dealing with multi-sample RNAseq for which GATK tools are less improved and verified than DNAseq, thus I 'd like to have your suggestions. Briefly, this is the experiment: 2 phenotypes of Sparus aurata, 8 libary per phenotype, each library consists on a pool (not barcoded) of 3 animals. Thus I have a total of 16 samples. My goal is to find the total number of variant sites and compare the allele frequencies between the two phenotypes. I lack genome and SNPs database.
Step by step: 1) I used STAR (not 2-pass) in order to map reads against my de novo assembly. STAR --runThreadN 16 --genomeDir ./GenomeIndex --readFilesIn XXX.fastq --alignIntronMax 19 --outSAMtype BAM SortedByCoordinate --outSAMmapqUnique 60 --outFilterMultimapNmax 5 --outFilterMismatchNmax 4 2) I used the picard-tools to Mark duplicates 3) I used the picard-tools to AddOrReplaceReadGroups 4) I used the picard-tools to BuildBamIndex 5) I called the haplotype for the 16 samples with the following command: GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample.bam -dontUseSoftClippedBases -ploidy 6 -ERC GVCF -o output.g.vcf 6) I used the GenotypeGVCFs to merge the samples from the same population in an unique vcf file as follow GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta -stand_call_conf 20 -stand_emit_conf 20 -ploidy 6 --variant sample1.g.vcf --variant sample2.g.vcf --variant sample3.g.vcf (8 samples) -o output_HC.vcf
Finally I'm going to filter the results with the Variant Filtration: GenomeAnalysisTK.jar -T VariantFiltration -R reference.fasta -V output.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o utputFiltered.vcf
What do you think?
Now I'd like to compare the two populations, but how??? manually in excel files? vcf tools seem not to handle ploidy higher than 2. Does anyone deal with these issues and can kindly give some tips?
I've got an issue using the UnifiedGenotyper, it does not output all possible variants sharing the same start coordinate.
My command line: GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ReferenceGenome.fasta -I list_of_Samples.list -ploidy 1 -glm BOTH --heterozygosity 0.001 --indel_heterozygosity 0.005 -gt_mode DISCOVERY -dcov 1000 -o output_raw_variants.vcf
Using "Batch 1" , about 50 Plasmodium falciparum samples, I identified the following insertion in Sample1, which I know is real :
chrom14 2261798 . T TTA 1576.91 1:2,42:59:99:1:1.00:1609,0
However when I run the same command line with Batch1 + Batch2, total of 100 samples, I get the following result for Sample1:
chrom14 2261798 . T TTATATA 23812.75 0:39,5:59:99:0:0.00:0,775
Some samples from Batch2 have a longer insertion starting at the same coordinate, and now the original insertion in Sample1 does not appear in the VCF anymore... Why UnifiedGenotyper did not output multiple lines in this example? (for some other positions it did output multiple lines sharing the same coordinate) I did not change the parameter --max_alternate_alleles 6.
Thanks a lot in advance for your help,
We perform a GenotypeGVCFs+VQSR analysis following best practices.
We saw the option -stand_call_conf, but this option concern the variant. We would like to know how a genotype is set to missing on a called variant ?
We suppose it involes DP and PL, but could you give details ?
Will VariantsToTable work for multi-sample VCF files? I've tried the following command, but it only outputs the column headers:
java -jar ~/tools/GenomeAnalysisTK.jar -R chr1.fa -T VariantsToTable -V cg_100.vcf.gz -F CHROM -F POS -F ID -F REF -F ALT -F F-101-009 -o test.table
Hi, I have with me 1000 samples that were pooled together in 5 different libraries and sequenced for genomic regions. So first, for each library, I independantly performed alignment, marked duplicates,realigned the bam files and also did recalibration and separated the bam file for each individual. Now I have 1000 individual bam files and I wanted to check the target coverage and perform variant calling. I wanted to know if it makes sense to merge the bam files according to cases:controls and perform the analysis with 2 bam files or is it better to carry on with single individual bam files ?
I used the variant annotator on a multi sample vcf and I am happy to report that it worked great (especially being tree reducible). I had a question that I could not find in the documentation. For each variant call it the annotator would either declare a sample to be a "hiConfDeNovo" or a "loConfDeNovo" or there would be no assessment whatsover (indicating I suspect no De Novos for that variant call).
My question: What defines a "hiConfDeNovo" and what defines a "loConfDeNovo" as it is used by the annotator. If there was some documentation I had overlooked please let me know, otherwise any insight is much appreciated.
Thank you as always
we are working on 454 RNaseq data......we have RNAseq data of four different tissues from six different individuals.... we want to call variants from all these data in a one job....but as per your recommendation we have to run each tissue data from each individual separately...so my query is can we join them as per the DNAseq guidelines or we have to find out some another way to do this kind of analysis.... Please help us if you have any suggestions to do Multisample Variant calling from RNAseq data.....
I am working with a set of samples at the moment which have the following problems:
1) The organism is aneuploid 2) Ploidy for individual chromosomes varies between samples (we have a script that estimates this from the coverage so we can use this information to improve variant calling) 3) Coverage is thin (sequencing was done on a student's budget)
I am using the UnifiedGenotyper for variant discovery as it can account for aneuploidy.
I initially tried calling variants for each sample, grouping chromosomes by ploidy (i.e, for sample 1 I call all the diploid chromosomes, then all the triploid chromosomes etc). I also tried doing multi-sample variant calling across all the samples, setting the ploidy to 2 (The modal chromosome number is 2). Comparison of these two analyses shows that each is missing good SNPs that are called by the other. Multi-sample analyses is missing or mis-calling SNPs on aneuploid chromosomes, and individual analysis is missing SNPs in individual samples due to thin coverage (or more accurately, they are being called, but failing QD or quality filters due to thin coverage - these SNPs pass these filters in the multi-sample analysis).
I am thinking of trying to combine these approaches. The idea is to analyse each chromosome separately. For each chromosome, I would do a multi-sample call on all the samples which are diploid and a multi-sample call on all the samples which are triploid etc etc. I am hoping that this will give me the best of the two approaches above.
So my questions are:
1) Does this strategy makes sense? Is there any reason not to do it this way?
2) How should I proceed downstream? I know I will have to hard-filter regardless. Can I merge all my vcf files before filtering and annotation, or is there a reason to keep them separate?
Any input from the GATK team or the wider community is appreciated!
I'm running the HaplotypeCaller on a series of samples using a while loop in a bash script and for some samples the HaplotypeCaller is stopping part way through the file. My command was:
java -Xmx18g -jar $Gpath/GenomeAnalysisTK.jar \ -nct 8 \ -l INFO \ -R $ref \ -log $log/$plate.$prefix.HaplotypeCaller.log \ -T HaplotypeCaller \ -I $bam/$prefix.realign.bam \ --emitRefConfidence GVCF \ -variant_index_type LINEAR \ -variant_index_parameter 128000 \ -o $gvcf/$prefix.GATK.gvcf.vcf
Most of the samples completed and the output looks good, but for some I only have a truncated gvcf file with no index. When I look at the log it looks like this:
INFO 17:25:15,289 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:25:15,291 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21 INFO 17:25:15,291 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:25:15,291 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 17:25:15,294 HelpFormatter - Program Args: -nct 8 -l INFO -R /home/owens/ref/Gasterosteus_aculeatus.BROADS1.73.dna.toplevel.fa -log /home/owens/SB/C31KCACXX05.log/C31KCACXX05.sb1Pax102L-S2013.Hap INFO 17:25:15,296 HelpFormatter - Executing as owens@GObox on Linux 3.2.0-63-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_17-b02. INFO 17:25:15,296 HelpFormatter - Date/Time: 2014/06/10 17:25:15 INFO 17:25:15,296 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:25:15,296 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:25:15,722 GenomeAnalysisEngine - Strictness is SILENT INFO 17:25:15,892 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 17:25:15,898 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 17:25:15,942 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 INFO 17:25:15,948 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 17:25:15,993 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 12 processors available on this machine INFO 17:25:16,097 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 17:25:16,114 GenomeAnalysisEngine - Done preparing for traversal INFO 17:25:16,114 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:25:16,114 ProgressMeter - Location processed.active regions runtime per.1M.active regions completed total.runtime remaining INFO 17:25:16,114 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output INFO 17:25:16,116 HaplotypeCaller - All sites annotated with PLs force to true for reference-model confidence output INFO 17:25:16,278 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units INFO 17:25:46,116 ProgressMeter - scaffold_1722:1180 1.49e+05 30.0 s 3.3 m 0.0% 25.6 h 25.6 h INFO 17:26:46,117 ProgressMeter - scaffold_279:39930 1.37e+07 90.0 s 6.0 s 3.0% 50.5 m 49.0 m INFO 17:27:16,118 ProgressMeter - scaffold_139:222911 2.89e+07 120.0 s 4.0 s 6.3% 31.7 m 29.7 m INFO 17:27:46,119 ProgressMeter - scaffold_94:517387 3.89e+07 2.5 m 3.0 s 8.5% 29.2 m 26.7 m INFO 17:28:16,121 ProgressMeter - scaffold_80:591236 4.06e+07 3.0 m 4.0 s 8.9% 33.6 m 30.6 m INFO 17:28:46,123 ProgressMeter - groupXXI:447665 6.07e+07 3.5 m 3.0 s 13.3% 26.4 m 22.9 m INFO 17:29:16,395 ProgressMeter - groupV:8824013 7.25e+07 4.0 m 3.0 s 17.6% 22.7 m 18.7 m INFO 17:29:46,396 ProgressMeter - groupXIV:11551262 9.93e+07 4.5 m 2.0 s 24.0% 18.7 m 14.2 m WARN 17:29:52,732 ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at groupX:1516679 has 8 alternate alleles so only the top alleles INFO 17:30:19,324 ProgressMeter - groupX:14278234 1.15e+08 5.1 m 2.0 s 27.9% 18.1 m 13.0 m INFO 17:30:49,414 ProgressMeter - groupXVIII:5967453 1.46e+08 5.6 m 2.0 s 33.0% 16.8 m 11.3 m INFO 17:31:19,821 ProgressMeter - groupXI:15030145 1.63e+08 6.1 m 2.0 s 38.5% 15.7 m 9.7 m INFO 17:31:50,192 ProgressMeter - groupVI:5779653 1.96e+08 6.6 m 2.0 s 43.8% 15.0 m 8.4 m INFO 17:32:20,334 ProgressMeter - groupXVI:18115788 2.13e+08 7.1 m 1.0 s 50.1% 14.1 m 7.0 m INFO 17:32:50,335 ProgressMeter - groupVIII:4300439 2.50e+08 7.6 m 1.0 s 55.1% 13.7 m 6.2 m INFO 17:33:30,336 ProgressMeter - groupXIII:2378126 2.89e+08 8.2 m 1.0 s 63.1% 13.0 m 4.8 m INFO 17:34:02,099 GATKRunReport - Uploaded run statistics report to AWS S3
It seems like it got half way through and stopped. I think it's a memory issue because when I increased the available ram to java, the problem happens less, although I can't figure out why some samples work and others don't (there isn't anything else running on the machine using ram and the biggest bam files aren't failing). It's also strange to me that there doesn't seem to be an error message. Any insight into why this is happening and how to avoid it would be appreciated.
I'm doing a variant analysis of genomic DNA from 2 related samples. I followed the up-to-date Best practices using HaplotypeCaller in GVCF mode for both samples followed by GenotypeGVCF to compute a common vcf of variant loci. I'm looking at variants that would be sample2-specific (present in sample2 but not in sample1)
Here is a line of this file:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 chrIII 91124 . A AATAAGAGGAATTAGGCT 1132.42 . AC=2;AF=0.500;AN=4;DP=47;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=58.85;MQ0=0;QD=7.99 GT:AD:DP:GQ:PL 1/1:0,25:25:55:1167,55,0 0/0:.:22:33:0,33,495
In the Genotype Field, sample2.AD is a . (dot) meaning that no reads passed the Quality filters. However, sample2.DP=22 meaning that 22 reads covered this position. This line suggest that this variation is specific to sample1 (genotype HomVar 1/1) and is not present in sample2 (HomRef 0/0). But given the biological relationship between sample1 and 2 (the way they were generated), I doubt that this variation is true: it is very likely to be present in sample2 as well. It's a false
I have 416 loci like this. For the vast majority of them, sample1 and 2 likely share the same variation. But since it is not impossible that a very few of them are really sample1=HomVar and sample2=HomRef, could you suggest me a way to detect those guys? What about comparing sample1.PL(1/1) and sample2.PL(0/0) ? For example could you suggest a rule of thumb to determine their ratio ?
How many samples and of what file/genome size can the Unified Genotyper process simultaneously ? I will have about 250 samples, genome size 180Mb, bam sizes 3-5GB, reduced bam sizes ~1GB.
Also, how long would this take can it be multi-threaded ? Cheers,
our group is calling variants using 5.5mb of targeted capture across ~2400 bams (for 830 subjects). i have successfully executed using unifiedgenotyper, but we would also like to run using haplotypecaller to compare. my basic approach has been to scatter the capture region into 25 equal sized chunks, run parallel and then CatVariants to gather afterward. this worked nicely for UG, but for HC estimated run-times for even these small intervals are >7 days and the jobs inevitably crash due to memory problems.
so, my question is how should i approach this large cohort problem in HC? i have read HC works best in cohorts <100, so should i divide up my cohort into chunks of 100 bams and combine afterward? seems this would undermine the benefits of joint calling, no? i have also read that GATK 3.0 is going to address this issue with HC -- have these improvements been made with the current version (which I am not currently using)? if so, is there a best practice or example available anywhere with cmd specs for situations similar to mine? finally, is it ok to split my capture region and run in parallel as i have done? this doesn't undermine the benefits of de novo local assembly by HC? thanks in advance for any input!
I encounter a situation when ;
If I run Haplotype caller on one .bam file - Haplotype caller called the mutation (sanger validated)
java GenomeAnalysisTK.jar -T HaplotypeCaller -nct 15 -minPruning 2 \ -R genome.fa \ -I sample_1.bam \ -L TP53.bed \ -U ALLOW_N_CIGAR_READS \ -o sample_1.vcf
But, If i run Haplotype caller with multiple bam as input - Haplotype caller failed to call the mutation in the sample mentioned above
java GenomeAnalysisTK.jar -T HaplotypeCaller -nct 15 -minPruning 2 \ -R genome.fa \ -I sample_1.bam -I sample_2.bam -I sample_3 .bam ..... -I sample_19.bam \ -L TP53.bed \ -U ALLOW_N_CIGAR_READS \ -o multisample.vcf
I used the -bamOutput option to view the .bam file HC would expect to see when calling the mutation. Please find below IGV screenshot
java GenomeAnalysisTK.jar -T HaplotypeCaller -nct 15 -minPruning 2 \ -R genome.fa \ -I sample_1.bam -I sample_2.bam -I sample_3 .bam ..... -I sample_19.bam \ -L TP53.bed \ -U ALLOW_N_CIGAR_READS \ --bamOutput multiple_all.bam --bamWriterType ALL_POSSIBLE_HAPLOTYPES \ -o multisample.vcf
1st row is HC run only on sample_1.bam
2nd row is HC run on multiple.bam input - using the --bamOutput option alone
3rd row is HC run on multiple.bam input - using the --bamOutput option + --bamWriterType ALL_POSSIBLE_HAPLOTYPES
The coverage on the 1st row (IGV) of sample1 is 262 (Ref(C):3, Mut(A):259)
The coverage on the 3rd row (IGV) file at the position of interest is 3627(Ref(C):3411, Mut(A):215, G:1 , DEL:13)
Could the reason why HC unable to call the mutation (when multiple.bam are used as input) is because the postion of interest is has too much noise ~3 base upstream? (shown in screen shot below)
Any help understanding why mulisample bam input would fail to call the variant would me most helpful.
Thanks in advance
Hi, I have run GATK Unified genotyper with multisample and I am getting unexpected output. (Hard filters used)
Here are examples of the output format I get in the vcf for a sample: 0/1:43,4:PASS:24:24,0,4368, ./.:.:PASS or 0/0:2,0:2:DPFilter:6:0,6,47 as well as 0/0:12,0:37:0,37,830. The latter is the format I would expect. What does this mean? Also I get the filter PASS even when all but two samples have ./.:.:PASS and the other two have 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0 and 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0.
Also I ran HalpotypeCaller multisample and get similar output all though more of the 0/0:12,0:37:0,37,830 format....
We are sequencing a set of regions that covers about 1.5 megabases in total. We're running into problems with VQSR -- VariantRecalibrator says there are too few variants to do recalibration. To give a sense of numbers, in one sample we have about 3000 SNVs and 600 indels.
We seem to have too few indels to do VQSR on them and have a couple of questions:
Can we combine multiple samples to increase the number of variants, or does VariantRecalibrator need to work on each sample individually?
If we do not use VQSR for indels, should we also avoid VQSR for the SNPs?
Thanks in advance!
Hi, using unified genotyper with multiple samples (different lines of Medicago truncatula showing polymorphism), I would like to get a file giving me, for each sample, the alleles at each SNP position (SNP positions identified for all samples). Is it possible? I didn't find any option for this, and it would be dificult to make SNP calling for each sample and then create a new file with all of them, because SNPs positions and number are different according to the samples. Thanks in advance,
Dear Support Team,
Thanks for making the GATK 2013 workshop available, those are great videos and answered most of my doubts regarding the use of your pipeline. I have started analysing my data, but I have some doubts regarding my samples. At the moment I am running a pilot analysis, and I only have 2 samples (two different strains of the same organism). I got the files already demultiplexed from my Illumina sequencer and I have mapped them separately using BWA-MEM. I was planning to analyse them separately, probably using hard filtering. Should I join the samples at some point? Or should I rather get the multiplexed file (I mean the file before demultiplexing) and then rely on PICARD tools to group the reads separately?
Thanks for your kind attention, Max
I have performed multisample SNP calling and also single sample SNP calling for 8 individuals. Some of the variants detected by multi-sample SNP calling were absent in variants detected by single sample snp calling.
For example, Multisample SNP calling output:
chr1 389 . T G 178.50 . AC=6;AF=0.375;AN=16;BaseQRankSum=-0.775;DP=202;Dels=0.00;FS=26.344;HaplotypeScore=0.7725;MLEAC=6;MLEAF=0.375;MQ=12.37;MQ0=75;MQRankSum=-4.220;QD=1.08;ReadPosRankSum=0.073 GT:ADP:GQ:PL 0/1:21,21:42:35:35,0,116 0/0:11,7:18:9:0,9,68 0/1:13,5:18:18:18,0,58 0/1:8,14:21:39:39,0,59 0/1:16,16:32:59:66,0,59 0/1:14,7:20:43:43,0,85 0/1:10,20:29:19:21,0,19 0/0:9,9:18:6:0,6,44
Output from Individual SNP calling but merged into a single VCF file:
chr1 389 . T G 37.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.189;DP=33;Dels=0.00;FS=3.979;HaplotypeScore=2.9800;MLEAC=1;MLEAF=0.500;MQ0=12;MQ=12.44;MQRankSum=-1.474;QD=1.14;ReadPosRankSum=-0.794;SF=5 GT:GQP:PL:AD . . . . 0/1:59:32:66,0,59:16,16 . . .
Here we can observe that the variant is present in only 1 individual from single sample SNP calling, whereas it is present in all the individuals with multi-sample SNP calling.
Could someone comment, which of these would be reliable results?
I'm working with different tools to call variants (GATK is one of them) for the same sample and I'm merging these results with
gatk -T CombineVariants -R ucsc.hg19.fasta -V:GATK GATK.vcf -V:OTHER OTHER.vcf -o combined.vcf -genotypeMergeOptions PRIORITIZE --rod_priority_list GATK,OTHER
But when I have a discrepancy between results (I mean, same chr and pos, but different call), I just get GATK's result in my result file, Is there any way to have both calls when I have discrepancies? I don't want to use -genotypeMergeOptions UNIQUIFY.
GATK.vcf: chr17 7917905 . C T 360.16 . [...] GT 0/1 chr17 7918012 . C T 896.16 . [...] GT 0/1
OTHER.vcf: chr17 7917905 . C T 360.16 . [...] GT 0/1 chr17 7918012 . C T 896.16 . [...] GT 1/1
combined.vcf: chr17 7917905 . C T 360.16 . [...];set=Intersection GT 0/1 chr17 7918012 . C T 896.16 . [...];set=Intersection GT 0/1
What I want to obtain:
combined.vcf: chr17 7917905 . C T 360.16 . [...];set=Intersection GT 0/1 chr17 7918012 . C T 896.16 . [...];set=GATK GT 0/1 chr17 7918012 . C T 896.16 . [...];set=OTHER GT 1/1
Is there a way do single sample variant calling but have the output directly in a mutli-sample vcf? We now do single sample calling and after that combine the vcf file to a multi-sample vcf file. The creates really large intermediary files (with all reference positions) and takes much longer than multi-sample calling.
We have some NGS data from a non standard experiment with non standard samples, were we are looking at small differences between the samples. We chose to do single sample variant calling because we don't want the the data from one sample to influence the calls in the other. And we are not sure what the assumptions for multi-sample calling are and if they apply for our experiment.
Hi :), I'am new to NGS and have a questions reagarding the filtering. I have 13 individuals, 3 of them with a coverage of 11x-15x, the rest with a coverage of 5x-7x. I did multi-sample SNPcalling and hard filtering, the latter as there are no known SNPs so far. Now I am not sure how to set the minimal SNP quality (QUAL). On the best practise it is suggested to juse at least 30 for individuals with a coverage of at least 30, but 4 if the coverage is below 10. So what is the best way to set the QUAL filter?
many thanks in advance
we are running tests trying to get UG to produce 1 vcf per sample when inputting bams from multiple subjects. our situation is complicated slightly by the fact that each sample has 3 bams. when we input all 6 bams into UG, hoping to output 2 vcfs (1 per sample) we instead get a single vcf. we found some relevant advice in this post: http://gatkforums.broadinstitute.org/discussion/2262/why-unifiedgenotyper-treat-multiple-bam-input-as-one-sample but still haven't solved the issue.
details include: 1) we are inputting 6 bams for our test, 3 per sample for 2 samples. 2) bams were generated using Bioscope from targeted capture reads sequenced on a Solid 4. 3) as recommended in the post above we checked out the @RG statements in the bam headers using Samtools -- lines for the 6 bams are as follows:
@RG ID:20130610202026358 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-10T16:20:26-0400 SM:S1
@RG ID:20130611214013844 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:148 DT:2013-06-11T17:40:13-0400 SM:S1
@RG ID:20130613002511879 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:147 DT:2013-06-12T20:25:11-0400 SM:S1
@RG ID:20130611021848236 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-10T22:18:48-0400 SM:S1
@RG ID:20130612014345277 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:151 DT:2013-06-11T21:43:45-0400 SM:S1
@RG ID:20130613085411753 PL:SOLiD PU:bioscope-pairing LB:75x35RR PI:150 DT:2013-06-13T04:54:11-0400 SM:S1
Based on the former post, I would have expected each of these bams to generate a separate vcf as it appears the ids are all different (which would not have been desirable either, as we are hoping to generate 2 vcfs in this test). Thus, it is not clear if/how we should use Picard tool AddOrReplaceReadGroups to modify the @RG headers?
Does that make sense? Any advice?
How does the haplotype caller handle multiple samples? Does it do a local denovo assembly for every sample and then compare those or does it do 1 local denovo assembly using the reads of all the samples?
Can I run the haplotype caller on a set of 9 solid (50bp fragment and 50 x 35 PE) together with 1 ilumina sample (100 x 100 PE) ?
I have a vcf containing multiple samples. I would like to put the bam files also as input for the Variant Annotator but how does the variant annotator know which bam is for wich column in the vcf? Does the order of the args of the bam files need to correspond to the order of the samples columns in the vcf?
I'm working with a non-model species, concretely with citrus genus samples. What we do is basically to search target SNVs which could be responsible for the most of the phenotypic differences between citrus varieties/species.
Due to the absence of a reference assembly for every available species we have implement our genotyping pipeline by mapping all citrus species against the same reference genome (concretely, clementine genome). We are aware that this approach can produce unequal bias that are proportional to the sample-to-reference species distance, but at the end, we know that citrus genus species are relatively close, and its quite easy to find many conserved regions between them.
My question is about multi-sample calling. We are confident on performing multi-sample calling when we compare intra-species samples, but we are not so sure to follow the same methodology when we compare distant samples that don't share a considerable proportion of variants.
What do you recommend us?
We assume two alternatives
1 -Perform multisample-calling, understanding that despite of genomic heterogeneity the variants will be still detected. 2- Perform independent callings, and combine them after (by using CombineVariants tool)
Thanks in advance Jose
I've been going through the VQSR documentation/guide and haven't been able to pin down an answer to how it behaves on multi-sample VCF (generated by multi-sample calling with UG). Should VQSR be run on this? Or on each sample separately, given that coverage and other statistics used to determine the variant confidence score aren't the same for each sample and so can lead to conflicting determinations on different samples.
What is the best way to go about this?
The best practice guide states to call variants across all samples simultaneously. Besides the ease of working with one multi-sample VCF, what advantages are there to calling the variants at the same time? Does GATK leverage information across all samples when making calls? If so, what assumptions is the UnifiedGenotyper making about the relationship of these samples to each other, and what are the effects on the variant calls?
I want to know what's the best way to use VariantEval to get statistics for each sample in a multisample VCF file. If I call it like this:
java -jar GenomeAnalysisTK.jar \
-R ucsc.hg19.fasta \
-T VariantEval \
-o multisample.eval.gatkreport \
--eval annotated.combined.vcf.gz \
where annotated.combined.vcf.gz is a VCF file that contains ~1Mio variants for ~800 samples I get statistics for all samples combined, e.g.
#:GATKTable:CompOverlap:The overlap between eval and comp sites
CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants ...
CompOverlap dbsnp eval none all 471704 191147
CompOverlap dbsnp eval none known 280557 0 CompOverlap dbsnp eval none novel 191147 191147
But I would like to get one such entry per sample. Is there an easy way to do this?
Hi, I was wondering if you have any advice on the effects of including samples of different ancestry in the multiple sample variant calling. We run PCA after calling variants to identify any potential outliers. In case of identifying one, do we have to re-do the SNP and Indel calling step?
Any help most appreciated.
I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.
The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?
Is there any difference or reason to choose one of the following two workflows over the other?
Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller and then with my multi-sample VCF perform the variant recalibration procedure and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Take 30+ individual vcf files, merge them together, and then perform variant recalibration on the merged vcf and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Any help is appreciated.
I have 15 affected samples. 2 are whole exome and 13 are whole genome. They have already been realigned on a single-sample level and had BQSR performed. I am contemplating running UnifiedGenotyper on all 15 samples together because we would like to compare the calls across the samples (especially in the coding regions). I am aware that there would be a large number of variant calls in the whole genome samples that would have little to no coverage in the exome samples. I haven't been able to find any posts that say you should or shouldn't run whole genome and exome samples through UnifiedGenotyper together. Are there any reasons why this should be discouraged?
Also, assuming I do perform multi-sample calling across all 15 samples, would it be ok to run that multi-sample VCF file through VQSR?
I am trying to run GATK on a sample of 119 exomes. I followed the GATK guidelines to process the fastq files. I used the following parameters to call the UnifiedGenotyper and VQSR [for SNPs]:
-T UnifiedGenotyper --output_mode EMIT_VARIANTS_ONLY --min_base_quality_score 30 --max_alternate_alleles 5 -glm SNP
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /media/transcription/cipn/5.pt/ref/hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 /media/transcription/cipn/5.pt/ref/1000G_omni2.5.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /media/transcription/cipn/5.pt/ref/dbsnp_135.hg19.vcf.gz -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -mode SNP
I get a tranche plot, which does not look OK. The "Number of Novel Variants [1000s]" goes from -400 to 800 and the Ti/Tv ratio varies from 0.633 to 0.782 [the attach file link is not working for me and am unable to upload the plot]. Any suggestion to rectify this would be very helpful !
Can anyone tell me the differences between GATK's single sample and muliti sample calling methods? What extra information does the muliti sample calling methods can give? I want to know if "the more samples, the more accuracy result"?