Create a master set of sites from your N batch VCFs that you want to genotype in all samples. At this stage you need to determine how you want to resolve disagreements among the VCFs. This is your master sites VCF.
Take the master sites VCF and genotype each sample BAM file at these sites
(Optionally) Merge the single sample VCFs into a master VCF file
The first step of batch merging is to create a master set of sites that you want to genotype in all samples. To make this problem concrete, suppose I have two VCF files:
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 20 9999996 . A ATC . PASS . GT:GQ 0/1:30 20 10000000 . T G . PASS . GT:GQ 0/1:30 20 10000117 . C T . FAIL . GT:GQ 0/1:30 20 10000211 . C T . PASS . GT:GQ 0/1:30 20 10001436 . A AGG . PASS . GT:GQ 1/1:30
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 20 9999996 . A ATC . PASS . GT:GQ 0/1:30 20 10000117 . C T . FAIL . GT:GQ 0/1:30 20 10000211 . C T . FAIL . GT:GQ 0/1:30 20 10000598 . T A . PASS . GT:GQ 1/1:30 20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30
In order to merge these batches, I need to make a variety of bookkeeping and filtering decisions, as outlined in the merged VCF below:
20 9999996 . A ATC . PASS . GT:GQ 0/1:30 [pass in both] 20 10000000 . T G . PASS . GT:GQ 0/1:30 [only in batch 1] 20 10000117 . C T . FAIL . GT:GQ 0/1:30 [fail in both] 20 10000211 . C T . FAIL . GT:GQ 0/1:30 [pass in 1, fail in 2, choice in unclear] 20 10000598 . T A . PASS . GT:GQ 1/1:30 [only in batch 2] 20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30 [A/AGG in batch 1, A/AGGCT in batch 2, including this site may be problematic]
These issues fall into the following categories:
There are two difficult situations that must be addressed by the needs of the project merging batches:
Unfortunately, we cannot determine which is actually the correct choice, especially given the goals of the project. We leave it up the project bioinformatician to handle these cases when creating the master VCF. We are hopeful that at some point in the future we'll have a consensus approach to handle such merging, but until then this will be a manual process.
The GATK tool CombineVariants can be used to merge multiple VCF files, and parameter choices will allow you to handle some of the above issues. With tools like SelectVariants one can slice-and-dice the merged VCFs to handle these complexities as appropriate for your project's needs. For example, the above master merge can be produced with the following CombineVariants:
java -jar dist/GenomeAnalysisTK.jar \ -T CombineVariants \ -R human_g1k_v37.fasta \ -V:one,VCF combine.1.vcf -V:two,VCF combine.2.vcf \ --sites_only \ -minimalVCF \ -o master.vcf
producing the following VCF:
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO 20 9999996 . A ACT . PASS set=Intersection 20 10000000 . T G . PASS set=one 20 10000117 . C T . FAIL set=FilteredInAll 20 10000211 . C T . PASS set=filterIntwo-one 20 10000598 . T A . PASS set=two 20 10001436 . A AGG,AGGCT . PASS set=Intersection
Having created the master set of sites to genotype, along with their alleles, as in the previous section, you now use the UnifiedGenotyper to genotype each sample independently at the master set of sites. This GENOTYPE_GIVEN_ALLELES mode of the UnifiedGenotyper will jump into the sample BAM file, and calculate the genotype and genotype likelihoods of the sample at the site for each of the genotypes available for the REF and ALT alleles. For example, for site 10000211, the UnifiedGenotyper would evaluate the likelihoods of the CC, CT, and TT genotypes for the sample at this site, choose the most likely configuration, and generate a VCF record containing the genotype call and the likelihoods for the three genotype configurations.
As a concrete example command line, you can genotype the master.vcf file using in the bundle sample NA12878 with the following command:
java -Xmx2g -jar dist/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R bundle/b37/human_g1k_v37.fasta \ -I bundle/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam \ -alleles master.vcf \ -L master.vcf \ -gt_mode GENOTYPE_GIVEN_ALLELES \ -out_mode EMIT_ALL_SITES \ -stand_call_conf 0.0 \ -glm BOTH \ -G none \
-L master.vcf argument tells the UG to only genotype the sites in the master file. If you don't specify this, the UG will genotype the master sites in GGA mode, but it will also genotype all other sites in the genome in regular mode.
The last item,-G ` prevents the UG from computing annotations you don't need. This command produces something like the following output:
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 20 9999996 . A ACT 4576.19 . . GT:DP:GQ:PL 1/1:76:99:4576,229,0 20 10000000 . T G 0 . . GT:DP:GQ:PL 0/0:79:99:0,238,3093 20 10000211 . C T 857.79 . . GT:AD:DP:GQ:PL 0/1:28,27:55:99:888,0,870 20 10000598 . T A 1800.57 . . GT:AD:DP:GQ:PL 1/1:0,48:48:99:1834,144,0 20 10001436 . A AGG,AGGCT 1921.12 . . GT:DP:GQ:PL 0/2:49:84.06:1960,2065,0,2695,222,84
Several things should be noted here:
This genotyping command can be performed independently per sample, and so can be parallelized easily on a farm with one job per sample, as in the following:
foreach sample in samples: run UnifiedGenotyper command above with -I $sample.bam -o $sample.vcf end
You can use a similar command for CombineVariants above to merge back together all of your single sample genotyping runs. Suppose all of my UnifiedGenotyper jobs have completed, and I have VCF files named sample1.vcf, sample2.vcf, to sampleN.vcf. The single command:
java -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V:sample1 sample1.vcf -V:sample2 sample2.vcf [repeat until] -V:sampleN sampleN.vcf -o combined.vcf
Hi, I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.
I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when haplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.
When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?
Alternatively, can BaseRecallibrator and/or HaplotypeCaller simultaneously analyze multiple bam files?
Is it common practice to combine bam files for discovering sequence variants?
Hi, I have a a really deep (150x coverage) data for which I need to perform variant detection. Which of the two options is more effective to speed up the variant detection: 1. I run the whole data in one go and use -nt and -nct options wherever possible. 2. Or, I split up the genome bam files into 3 or 4 sets of chromosomes and then run them in parallel (with lower number of -nt and -nct).
If I go with option 2, can I merge the vcf files from all parallel runs (from different chromosomes) right after running HaplotypeCaller? Is that what is recommended to make sure that I dont have too small of a variant set necessary for recalibration (which is the issue I am facing right now)?
I have in a database 11 vcf and bam files for individuals we've sequenced. I have been trying to merge the 11 individual vcf files into one combined vcf file using CombineVariants in GATK. While it does combine the vcf files, it does something odd that I'm sure has been solved by other users and I am looking for input on.
A singleton SNP in individual 1 will be given "./." in all other 10 individuals instead of "0/0". Is there a way to fix this--the genotypes are not missing, they are reference.That said, some of them will be missing and are rightly called "./.", but I don't know how to incorporate this information into a merged VCF file.
Your help is most appreciated and apologies if this has been asked before--I couldn't find this exact topic.
I would appreciate your thoughts on the following pipeline:
I'm currently working on a number of WGS of non-human vertebrates. My approach for calling variants is to maximize the sensitivity of the calls by using two callers (GATK's UnifiedGenotyper + samtools' mpileup) per chromosome regardless of / ingnoring all filters. Next, I would like to merge (not intersect) the two vcf files (GATK+samtools) per each chromosome, then merge (not intersect) all the vcf files pertaining to all chromosomes in order to retrieve a final vcf dataset per individual:
For merging the GATK and samtools:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant:GATK chr#.GATK.vcf --variant:samtools chr#.samtools.vcf -o chr#.GATK_samtools.union.vcf -genotypeMergeOptions PRIORITIZE -priority GATK,samtools --filteredrecordsmergetype KEEP_UNCONDITIONAL
For merging all chromosomes per individual:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant:chr1 chr1.GATK_samtools.union.vcf --variant:chr2 chr2.GATK_samtools.union.vcf --variant:chr3 chr3.GATK_samtools.union.vcf -o Individual1.union.vcf -genotypeMergeOptions PRIORITIZE -priority chr1,chr2,chr3 --filteredrecordsmergetype KEEP_UNCONDITIONAL
Finally I would like to intersect between two individuals and keep only the variants that are common to both individuals:
Uniting / merging two individuals:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant:individual1 Individual1.union.vcf --variant:Individual2 Individual2.union.vcf -o Individual1_2.union.vcf -genotypeMergeOptions PRIORITIZE -priority Indiviual1,Individual2 --filteredrecordsmergetype KEEP_UNCONDITIONAL
Intersecting the two indiviuals in order to keep only common variants:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta --variant Individual1_2.union.vcf -select 'set == "Intersection";' -o Intersected.vcf
Am I doing this right? I'm afraid I may be losing variants or something else along this pipeline. Remember that I want to keep only the common variants while ignoring the filters in order to increase sensitivity as much as possible.
Hi to all
I have just started using GATK and I have few question about some tools and about the general workflow.
I have 3 exome-seq data from a trio and I have to detect rare or private variants that segregate with the disease.
From the 3 aligned bam file I procedeed with the GATK pipeline (ADDgroupInfo, MarkDup, Realign, BQSR, Unified Genotyper and variant filtration) and I generated 3 VCF file.
As now I have to use the PhaseByTrasmission tool, should I merge the 3 VCF file?
Or it was better to merge the BAM file after adding the group info and proceed with the other analysis?
And should I create my .ped file,(I visited http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped, but I couln't understand how ped file is generated) based on the read group that I have assigned?
Hi. I want to merge two VCF files. Initially I was selected only indels(by select variant option). Now I want to merge these two VCF file which contains only INDELS. But When I run the command, I am getting the same error:
ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.NumberFormatException: For input string: "."
I run this command:
java -jar -Xmx2g GenomeAnalysisTK.jar -R hg19_5.fasta -T CombineVariants -V indelsample1.vcf -V indelsample3.vcf -o indels1s3.vcf -genotypeMergeOptions UNIQUIFY
Could you please tell me what is the reason behind this? and how to merge two VCF file having INDELS?
Thanks in advance.
Dear All, I am very new to the analysis of NGS data.
I would like to merge the information of sample 1029 from HGDP (http://cdna.eva.mpg.de/denisova/VCF/human/HGDP01029.hg19_1000g.12.mod.vcf.gz) to SAN sample in Schuster et al 2010 ftp://ftp.bx.psu.edu/data/bushman/hg18/bam/KB1illumChr12.bam)
If I well understood, I should call the variants from the bam file and then merge with the vcf. Is it correct? Could you gently suggest me the best way to do it in your opinion? When should i convert my files to the same reference sequence?
In addition I am looking at http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0, and I am trying to do Variant Detection on the example file NA12878. I have some doubt, Where I can find MarkDuplicates tool? Should I invoke it just with -T argument? Or Do I need to install it?
I am really sorry, I am trying to understand GATK, but it is not rally intuitive, so of you have any tips or recommendation please let me know it.
Dear team, I am new to GATK and I am having a hard time simply trying to merge vcf files. I have tried to solve the problem by referring to the guide and to previous posts, but nothing woked. Actually I found several discussions about the very same error message I receive, but it seems that no clear answere was provided. Here is this message:
I have tried three different MS Dos commands to do the task (see belbow), but the message didn't change:
1. java -jar GenomeAnalysisTK.jar -T CombineVariants -R E:\RessourcesGATK\ucsc.hg19.fasta -V:sample1 E:\TestGATK\sample1.vcf -V:sample2 E:\TestGATK\sample2.vcf -o combined.vcf 2. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants --variant E:\TestGATK\sample1.vcf --variant E:\TestGATK\sample2.vcf -o output.vcf -genotypeMergeOptions UNIQUIFY 3.java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants --variant E:\TestGATK\sample1.vcf --variant E:\TestGATK\sample2.vcf -o output.vcf -genotypeMergeOptions PRIORITIZE -priority foo,bar
I have also tried to use the reference human_g1k_v37.fasta as a resource, but it was the same. I have suppressed the # before CHROM in the header line, tested vcf generated by Samtools or by GATK, but it did not change anything. Is this a problem of environment? I haven't read anything mentioning that GATK could not work with MS Dos.
Thank you very much for your help. S.