It is fairly common to have multiple read groups for a sample, either from sequencing multiple libraries or from spreading a library across multiple lanes. It seems this causes a lot of confusion, and people often tell us they're not sure how to organize the data for the pre-processing steps or how to feed the data into HaplotypeCaller.
There are several options for organizing the processing. We have a fairly detailed FAQ article that describes our preferred workflow for pre-processing data from multiplexed sequencing and multi-library designs. In this article we describe, at a simpler level, the two main options, which differ in how you want to provide the analysis-ready BAM files to the variant caller.
The simplest thing to do is to input all the BAM files that belong to the sample together, at the MarkDuplicates step, the Indel Realignment step, or the BQSR step. The choice depends mostly on how deep the coverage is. High depth means a lot of data to process at the same time, which slows down Indel Realignment, because Indel Realignment ignores all read group information and simply processes all reads together. BQSR doesn't suffer from that problem because it processes read groups separately. In any case, when you input all of a sample's BAM files together, the BAM file that gets written out with the processed data will include all the libraries / read groups in one handy per-sample file.
Note: We do not require the PU field in the read group (RG); however, when it is present, BQSR gives the PU field precedence over all other fields.
Another option is to keep all the BAM files separate until variant calling, and then input them to HaplotypeCaller together. You can do this by simply running Indel Realignment and BQSR on each of the BAM files separately, and then inputting all of them into HaplotypeCaller at once. This works even if you want to run HaplotypeCaller in GVCF mode, which can only be done on a single sample at a time. As long as the SM tags are identical, HaplotypeCaller will recognize that it's a single-sample run. This is because the GATK engine merges the data before presenting it to the HaplotypeCaller tool, so HaplotypeCaller neither knows nor cares whether the data came from many files or one file.
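As a rough illustration of the SM-tag condition, here is a minimal Python sketch that collects the SM values from the @RG header lines of several BAM files and only assembles a HaplotypeCaller argument list if they all name the same sample. The header strings, file names and sample name are hypothetical; in practice the header text could come from something like `samtools view -H`. The argument list uses GATK 3-style flags and is an assumption about your setup, not a recommended command line.

```python
def sample_names(header_text):
    """Collect the SM values found on the @RG lines of a SAM/BAM header."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                names.add(field[3:])
    return names


def is_single_sample(headers):
    """True if all read groups across all headers share exactly one SM value."""
    all_names = set()
    for header_text in headers:
        all_names |= sample_names(header_text)
    return len(all_names) == 1


def haplotypecaller_args(reference, bams, output):
    """Build a GATK 3-style argument list with one -I per input BAM (assumed flags)."""
    args = ["-T", "HaplotypeCaller", "-R", reference]
    for bam in bams:
        args += ["-I", bam]
    args += ["--emitRefConfidence", "GVCF", "-o", output]
    return args


# Hypothetical headers for two lanes of the same sample.
lane1 = "@HD\tVN:1.4\tSO:coordinate\n@RG\tID:lane1\tSM:sampleA\tLB:lib1\tPL:illumina"
lane2 = "@HD\tVN:1.4\tSO:coordinate\n@RG\tID:lane2\tSM:sampleA\tLB:lib1\tPL:illumina"

if is_single_sample([lane1, lane2]):
    print(" ".join(haplotypecaller_args(
        "ref.fasta", ["lane1.bam", "lane2.bam"], "sampleA.g.vcf")))
```

If the SM values differ between files, the check fails and no command is printed, which is the situation where a GVCF-mode run would no longer be single-sample.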
Note: If you input many bam files into Indel Realigner, the default output is one bam file. However, you can output one bam file for each input bam file by using
Hi, I'm trying to merge VCF files produced by GATK (HaplotypeCaller,Version=3.4-46-gbc02625) using either bcftools or vcftools and I get this error: Incorrect number of PL fields (6) at 1:934345, cannot merge. There are actually more than 3 values in PL. Is there any fix for that?
Hello, and thanks for making all the GATK tools! I have recently started to try my hand at variant calling of my RNA-seq data, following the GATK Best Practices more or less verbatim, only excluding indel realignment (because I am only interested in SNPs at this point) and the BQSR (partly because I have very high quality data, but mostly because I couldn't get it to work in the workflow).
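For reference on the PL count in the error above: in the VCF format, PL holds one value per possible genotype, so the expected number of values depends on the number of alleles at the site and on ploidy. The sketch below computes that count; it only illustrates the arithmetic and does not diagnose which tool is at fault in this particular merge. The function name is made up for illustration.

```python
from math import comb


def expected_pl_count(num_alt_alleles, ploidy=2):
    """Number of genotypes (and hence PL values) for REF plus the given ALT alleles."""
    num_alleles = num_alt_alleles + 1  # count the reference allele too
    # Unordered genotypes = multisets of size `ploidy` drawn from the alleles.
    return comb(num_alleles + ploidy - 1, ploidy)


print(expected_pl_count(1))  # biallelic diploid site
print(expected_pl_count(2))  # diploid site with two ALT alleles
```

Under this formula a diploid site with one ALT allele has 3 PL values and a site with two ALT alleles has 6, so a PL with 6 values is consistent with a record that lists two ALT alleles.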
I have three replicates for each of my samples, and my question is where, if at all, I should merge the data from them. I am not sure if I can (or even should!) merge the FASTQ files before the alignment step, or merge the aligned BAM files, or something else. I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow using STAR... What would be the "correct" way to do it, if at all? How would merging (at some level) affect the speed of the workflow, and can I optimise that somehow?
If it's a bad idea to do merging, how would I determine the "true" variant from my three resulting VCF-files at the end, for cases where they differ?
Hi, I have multiple fastq files for 1 individual (you can think of those files as replicates). I was wondering at which point in the GATK variant calling pipeline I should combine them to get the best results. I am looking for 1 set of variants present in that individual (1 VCF is desired). Should I combine them at the HC step?
Hi, I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.
I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when haplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.
When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?
Alternatively, can BaseRecalibrator and/or HaplotypeCaller simultaneously analyze multiple bam files?
Is it common practice to combine bam files for discovering sequence variants?
Hi, I have really deep (150x coverage) data on which I need to perform variant detection. Which of the two options is more effective to speed up the variant detection:
If I go with option 2, can I merge the VCF files from all parallel runs (from different chromosomes) right after running HaplotypeCaller? Is that what is recommended to make sure that the variant set is not too small for recalibration (which is the issue I am facing right now)?
I have in a database 11 vcf and bam files for individuals we've sequenced. I have been trying to merge the 11 individual vcf files into one combined vcf file using CombineVariants in GATK. While it does combine the vcf files, it does something odd that I'm sure has been solved by other users and I am looking for input on.
A singleton SNP in individual 1 will be given "./." in all other 10 individuals instead of "0/0". Is there a way to fix this? The genotypes are not missing, they are reference. That said, some of them will be missing and are rightly called "./.", but I don't know how to incorporate this information into a merged VCF file.
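To illustrate why this happens, here is a minimal Python sketch of merging per-sample call sets that contain variant sites only. It is an illustration of the general logic, not CombineVariants' actual implementation; the sample names and sites are hypothetical. A sample with no record at a site gives the merge nothing to go on, so the only safe value is "./.": the input cannot distinguish "homozygous reference" from "not covered".

```python
def merge_genotypes(per_sample_calls):
    """Merge {sample: {site: genotype}} into {site: {sample: genotype}}.

    A sample without a record at a site gets "./." because a variant-only
    call set carries no evidence about that site for that sample.
    """
    sites = sorted({site for calls in per_sample_calls.values() for site in calls})
    return {
        site: {sample: calls.get(site, "./.")
               for sample, calls in per_sample_calls.items()}
        for site in sites
    }


# Hypothetical variant-only call sets for two individuals.
calls = {
    "ind1": {("1", 1000): "0/1", ("1", 2000): "1/1"},
    "ind2": {("1", 2000): "0/1"},
}

merged = merge_genotypes(calls)
print(merged[("1", 1000)])  # ind2 has no record here, so it is reported as ./.
```

Recovering "0/0" requires per-sample information about reference sites, which a VCF containing only variant records does not hold.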
Your help is most appreciated and apologies if this has been asked before--I couldn't find this exact topic.
I would appreciate your thoughts on the following pipeline:
I'm currently working on a number of WGS datasets from non-human vertebrates. My approach for calling variants is to maximize the sensitivity of the calls by using two callers (GATK's UnifiedGenotyper + samtools' mpileup) per chromosome, ignoring all filters. Next, I would like to merge (not intersect) the two VCF files (GATK + samtools) for each chromosome, then merge (not intersect) the VCF files for all chromosomes in order to retrieve a final VCF dataset per individual:
For merging the GATK and samtools:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant:GATK chr#.GATK.vcf --variant:samtools chr#.samtools.vcf -o chr#.GATK_samtools.union.vcf -genotypeMergeOptions PRIORITIZE -priority GATK,samtools --filteredrecordsmergetype KEEP_UNCONDITIONAL
For merging all chromosomes per individual:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant:chr1 chr1.GATK_samtools.union.vcf --variant:chr2 chr2.GATK_samtools.union.vcf --variant:chr3 chr3.GATK_samtools.union.vcf -o Individual1.union.vcf -genotypeMergeOptions PRIORITIZE -priority chr1,chr2,chr3 --filteredrecordsmergetype KEEP_UNCONDITIONAL
Finally I would like to intersect between two individuals and keep only the variants that are common to both individuals:
Uniting / merging two individuals:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta --variant:Individual1 Individual1.union.vcf --variant:Individual2 Individual2.union.vcf -o Individual1_2.union.vcf -genotypeMergeOptions PRIORITIZE -priority Individual1,Individual2 --filteredrecordsmergetype KEEP_UNCONDITIONAL
Intersecting the two individuals in order to keep only common variants:
$ java -Xmx10g -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta --variant Individual1_2.union.vcf -select 'set == "Intersection";' -o Intersected.vcf
Am I doing this right? I'm afraid I may be losing variants or something else along this pipeline. Remember that I want to keep only the common variants while ignoring the filters in order to increase sensitivity as much as possible.
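One way to check whether variants are being lost along such a pipeline is to count the union and intersection independently of GATK. The Python sketch below does this on VCF record lines, treating two records as the same variant only if CHROM, POS, REF and ALT match exactly. That matching rule is a simplifying assumption (it does not split multi-allelic records or normalize indel representation), so the counts may differ somewhat from what CombineVariants reports. The example records are hypothetical.

```python
def variant_keys(vcf_lines):
    """Return the set of (CHROM, POS, REF, ALT) keys from VCF record lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


# Hypothetical records for two individuals (only the first 5 columns matter here).
individual1 = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
]
individual2 = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t250\t.\tC\tT",
    "chr2\t75\t.\tG\tGA",
]

keys1 = variant_keys(individual1)
keys2 = variant_keys(individual2)
union = keys1 | keys2    # variants seen in either individual
common = keys1 & keys2   # variants seen in both individuals
print(len(union), len(common))
```

Comparing these counts with the number of records in the merged and intersected VCF files gives a quick signal if a step is dropping records unexpectedly.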
Hi to all
I have just started using GATK and I have a few questions about some tools and about the general workflow.
I have exome-seq data for the 3 members of a trio and I have to detect rare or private variants that segregate with the disease.
From the 3 aligned BAM files I proceeded with the GATK pipeline (ADDgroupInfo, MarkDup, Realign, BQSR, Unified Genotyper and variant filtration) and I generated 3 VCF files.
As I now have to use the PhaseByTransmission tool, should I merge the 3 VCF files?
Or would it have been better to merge the BAM files after adding the group info and then proceed with the rest of the analysis?
And should I create my .ped file based on the read groups that I have assigned? (I visited http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped, but I couldn't understand how a ped file is generated.)
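On the PED question: in the PLINK format the first six columns of a PED file are family ID, individual ID, paternal ID, maternal ID, sex (1 = male, 2 = female) and phenotype (1 = unaffected, 2 = affected, 0 = missing), with 0 used for an unknown parent. The Python sketch below writes those six columns for a trio. The family and individual names are hypothetical; as far as I know the individual IDs need to match the sample names in the VCF (the SM tag of the read groups), but treat that as an assumption to verify against the tool documentation.

```python
def ped_line(family, individual, father="0", mother="0", sex=0, phenotype=0):
    """Format one PED record: family, individual, father, mother, sex, phenotype."""
    return "\t".join([family, individual, father, mother, str(sex), str(phenotype)])


# Hypothetical trio: unaffected parents and an affected daughter.
trio = [
    ped_line("FAM1", "father", sex=1, phenotype=1),
    ped_line("FAM1", "mother", sex=2, phenotype=1),
    ped_line("FAM1", "child", father="father", mother="mother", sex=2, phenotype=2),
]

ped_text = "\n".join(trio) + "\n"
print(ped_text, end="")
```

The resulting text can be saved as a plain text file (for example `trio.ped`, a made-up name) with any editor; no special tool is needed to generate it.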
Hi. I want to merge two VCF files. Initially I selected only indels (with the SelectVariants option). Now I want to merge these two VCF files, which contain only indels. But when I run the command, I always get the same error:
ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NumberFormatException: For input string: "."
I run this command:
java -jar -Xmx2g GenomeAnalysisTK.jar -R hg19_5.fasta -T CombineVariants -V indelsample1.vcf -V indelsample3.vcf -o indels1s3.vcf -genotypeMergeOptions UNIQUIFY
Could you please tell me the reason for this, and how to merge two VCF files containing indels?
Thanks in advance.
Dear All, I am very new to the analysis of NGS data.
I would like to merge the information of sample 1029 from HGDP (http://cdna.eva.mpg.de/denisova/VCF/human/HGDP01029.hg19_1000g.12.mod.vcf.gz) with the SAN sample in Schuster et al 2010 (ftp://ftp.bx.psu.edu/data/bushman/hg18/bam/KB1illumChr12.bam).
If I understood correctly, I should call the variants from the BAM file and then merge them with the VCF. Is that correct? Could you kindly suggest the best way to do it, in your opinion? At what point should I convert my files to the same reference sequence?
In addition I am looking at http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0, and I am trying to do variant detection on the example file NA12878. I have some questions: where can I find the MarkDuplicates tool? Should I invoke it just with the -T argument, or do I need to install it?
I am really sorry, I am trying to understand GATK, but it is not really intuitive, so if you have any tips or recommendations please let me know.
Dear team, I am new to GATK and I am having a hard time simply trying to merge VCF files. I have tried to solve the problem by referring to the guide and to previous posts, but nothing worked. I actually found several discussions about the very same error message I receive, but it seems that no clear answer was provided. Here is this message:
I have tried three different MS Dos commands to do the task (see below), but the message didn't change:
1. java -jar GenomeAnalysisTK.jar -T CombineVariants -R E:\RessourcesGATK\ucsc.hg19.fasta -V:sample1 E:\TestGATK\sample1.vcf -V:sample2 E:\TestGATK\sample2.vcf -o combined.vcf
2. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants --variant E:\TestGATK\sample1.vcf --variant E:\TestGATK\sample2.vcf -o output.vcf -genotypeMergeOptions UNIQUIFY
3. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants --variant E:\TestGATK\sample1.vcf --variant E:\TestGATK\sample2.vcf -o output.vcf -genotypeMergeOptions PRIORITIZE -priority foo,bar
I have also tried to use the reference human_g1k_v37.fasta as a resource, but the result was the same. I have removed the # before CHROM in the header line and tested VCF files generated by Samtools or by GATK, but it did not change anything. Is this a problem with my environment? I haven't read anything mentioning that GATK could not work with MS Dos.
Thank you very much for your help. S.