Tagged with #merge
1 documentation article | 0 announcements | 11 forum discussions

Created 2015-08-26 20:08:46 | Updated 2016-04-30 02:21:28 | Tags: bqsr merge

Comments (1)

It is fairly common to have multiple read groups for a sample, either from sequencing multiple libraries or from spreading a library across multiple lanes. It seems this causes a lot of confusion, and people often tell us they're not sure how to organize the data for the pre-processing steps or how to feed the data into HaplotypeCaller.

There are several options for organizing the processing. We have a fairly detailed FAQ article that describes our preferred workflow for pre-processing data from multiplexed sequencing and multi-library designs. In this article, we describe at a simpler level the two main options, which depend on how you want to provide the analysis-ready BAM files to the variant caller.

To produce a combined per-sample bam file to feed to HaplotypeCaller (most common)

The simplest thing to do is to input all the bam files that belong to that sample together, at the MarkDuplicates step, the Indel Realignment step, or the BQSR step. The choice depends mostly on how deep the coverage is. High depth means a lot of data to process at the same time, which slows down Indel Realignment, because Indel Realignment ignores all read group information and simply processes all reads together. BQSR doesn't suffer from that problem because it processes read groups separately. In either case, when you input all of the sample's bam files together, the bam that gets written out with the processed data will include all the libraries / read groups in one handy per-sample file.

Note: We do not require the PU field in the RG; however, if it is present, BQSR will consider the PU field over all other fields.
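
The precedence described in the note above can be pictured with a small sketch. This is a toy model, not GATK's code: the function name `recal_group_key` and the read-group values are hypothetical, and the fallback to ID when PU is absent is an assumption made for illustration.

```python
def recal_group_key(read_group):
    """Return the key used to bucket reads for recalibration.

    Assumption for illustration: PU is preferred when present,
    otherwise the read group ID is used.
    """
    return read_group.get("PU") or read_group["ID"]


# Hypothetical read groups for one sample spread across lanes.
read_groups = [
    {"ID": "rg1", "SM": "sampleA", "PU": "FLOWCELL1.1"},
    {"ID": "rg2", "SM": "sampleA", "PU": "FLOWCELL1.2"},
    {"ID": "rg3", "SM": "sampleA"},  # no PU: falls back to ID
]

keys = [recal_group_key(rg) for rg in read_groups]
print(keys)
```

Each distinct key stands for one group of reads that would be recalibrated separately, even though all three read groups end up in the same per-sample file.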

To produce a separate bam file for each read group (less common)

Another option is to keep all the bam files separate until variant calling. You can do this by simply running Indel Realignment and BQSR on each of the bams separately, and then inputting all of the bams into HaplotypeCaller at once. This works even if you want to run HaplotypeCaller in GVCF mode, which can only be done on a single sample at a time. As long as the SM tags are identical, HaplotypeCaller will recognize that it's a single-sample run. This is because the GATK engine merges the data before presenting it to the HaplotypeCaller tool, so HaplotypeCaller neither knows nor cares whether the data came from many files or one file.
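
Before passing several bams to HaplotypeCaller as a single sample, it may help to confirm that their read groups really share one SM value. The sketch below does this on header text (for example, as printed by `samtools view -H`). The function name `sample_names` and the header lines are hypothetical, and the check is only an illustration of the SM rule described above.

```python
def sample_names(header_lines):
    """Collect the SM values from the @RG lines of SAM/BAM header text."""
    samples = set()
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:value pairs.
        for field in line.rstrip("\n").split("\t")[1:]:
            tag, _, value = field.partition(":")
            if tag == "SM":
                samples.add(value)
    return samples


# Hypothetical @RG lines taken from two per-read-group bam files.
headers = [
    "@RG\tID:lane1\tSM:sampleA\tLB:lib1\tPL:ILLUMINA",
    "@RG\tID:lane2\tSM:sampleA\tLB:lib1\tPL:ILLUMINA",
]

single_sample = len(sample_names(headers)) == 1
print(single_sample)
```

If more than one SM value turns up, the files would be treated as different samples rather than as one.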

Note: If you input many bam files into Indel Realigner, the default output is one bam file. However, you can output one bam file for each input bam file by using -nWayOut.


Created 2016-03-09 13:37:26 | Updated | Tags: merge pl

Comments (1)

Hi, I'm trying to merge vcf files produced by GATK (HaplotypeCaller, Version=3.4-46-gbc02625) using either bcftools or vcftools, and I get this error: Incorrect number of PL fields (6) at 1:934345, cannot merge. There are actually more than 3 values in PL. Is there any fix for that?
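
For context on the number in that error message: under the VCF specification, a diploid genotype at a site with n alleles (the reference plus all ALT alleles) has n(n+1)/2 likelihoods. A PL field with 6 values would therefore be consistent with a site that has two ALT alleles, although that is only one possible reading of the error. The function name `expected_pl_count` is hypothetical.

```python
from math import comb


def expected_pl_count(n_alt_alleles, ploidy=2):
    """Number of genotype likelihoods (PL values) expected at a site.

    Counts the unordered genotypes that can be formed from the reference
    allele plus n_alt_alleles ALT alleles at the given ploidy.
    """
    n_alleles = n_alt_alleles + 1
    return comb(n_alleles + ploidy - 1, ploidy)


for n_alt in (1, 2, 3):
    print(n_alt, expected_pl_count(n_alt))
```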


Created 2016-02-05 09:25:43 | Updated | Tags: haplotypecaller best-practices merge rna-seq

Comments (11)

Hello, and thanks for making all the GATK tools! I have recently started to try my hand at variant calling of my RNA-seq data, following the GATK Best Practices more or less verbatim, only excluding indel realignment (because I am only interested in SNPs at this point) and the BQSR (partly because I have very high quality data, but mostly because I couldn't get it to work in the workflow).

I have three replicates for each of my samples, and my question is where, if at all, I should merge the data from them. I am not sure if I can (or even should!) merge the FASTQ files before the alignment step, or merge the aligned BAM files, or something else. I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow uses STAR... What would be the "correct" way to do it, if at all? How would merging (at some level) affect the speed of the workflow, and can I optimise that somehow?

If it's a bad idea to do merging, how would I determine the "true" variant from my three resulting VCF-files at the end, for cases where they differ?

Created 2016-01-25 20:27:33 | Updated | Tags: merge multiple-inputs

Comments (1)

Hi, I have multiple fastq files for 1 individual (you can think of those files as replicates). I was wondering at which point in the GATK variant calling pipeline should I combine them to get the best results. I am looking for 1 set of variants present in that individual (1 vcf is desired). Should I combine them in the HC step?

Thanks! nb

Created 2015-01-21 19:53:26 | Updated | Tags: baserecalibrator haplotypecaller vcf bam merge rnaseq

Comments (3)

Hi, I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.

I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when HaplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.

When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?

Alternatively, can BaseRecalibrator and/or HaplotypeCaller simultaneously analyze multiple bam files?

Is it common practice to combine bam files for discovering sequence variants?

Created 2014-10-09 23:20:12 | Updated | Tags: parallelism merge

Comments (3)

Hi, I have really deep (150x coverage) data on which I need to perform variant detection. Which of the two options is more effective for speeding up the variant detection?

  1. I run the whole data in one go and use -nt and -nct options wherever possible.
  2. Or, I split up the genome bam files into 3 or 4 sets of chromosomes and then run them in parallel (with lower number of -nt and -nct).

If I go with option 2, can I merge the vcf files from all parallel runs (from different chromosomes) right after running HaplotypeCaller? Is that what is recommended to make sure that the variant set is not too small for recalibration (which is the issue I am facing right now)?


Created 2014-01-23 16:32:26 | Updated | Tags: merge gatk

Comments (2)

I have in a database 11 vcf and bam files for individuals we've sequenced. I have been trying to merge the 11 individual vcf files into one combined vcf file using CombineVariants in GATK. While it does combine the vcf files, it does something odd that I'm sure has been solved by other users and I am looking for input on.

A singleton SNP in individual 1 will be given "./." in all other 10 individuals instead of "0/0". Is there a way to fix this--the genotypes are not missing, they are reference. That said, some of them will be missing and are rightly called "./.", but I don't know how to incorporate this information into a merged VCF file.

Your help is most appreciated and apologies if this has been asked before--I couldn't find this exact topic.
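
The behaviour described in this post can be reproduced with a toy model: when the inputs list only variant sites, a merge has no way to tell whether a sample absent at a site was homozygous reference or simply not covered, so the genotype comes out as missing. This is a hedged sketch, not the CombineVariants implementation. The function name `merge_genotypes` and the data are hypothetical.

```python
def merge_genotypes(callsets):
    """Toy merge of per-sample call sets.

    callsets maps sample name -> {site: genotype}. A site absent from a
    sample's call set is written as "./." because a variants-only input
    carries no evidence about that sample at that site.
    """
    sites = sorted({site for calls in callsets.values() for site in calls})
    return {
        site: {sample: calls.get(site, "./.") for sample, calls in callsets.items()}
        for site in sites
    }


# Hypothetical data: a singleton SNP called only in ind1.
callsets = {
    "ind1": {("1", 1000): "0/1"},
    "ind2": {},
}

merged = merge_genotypes(callsets)
print(merged[("1", 1000)])
```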

Created 2013-03-04 12:53:20 | Updated 2013-03-04 13:58:54 | Tags: merge callset

Comments (4)

Hi all,

I would appreciate your thoughts on the following pipeline:
I'm currently working on WGS data from a number of non-human vertebrates. My approach for calling variants is to maximize the sensitivity of the calls by using two callers (GATK's UnifiedGenotyper + samtools' mpileup) per chromosome, ignoring all filters. Next, I would like to merge (not intersect) the two vcf files (GATK+samtools) for each chromosome, then merge (not intersect) the vcf files for all chromosomes in order to retrieve a final vcf dataset per individual:

For merging the GATK and samtools:

$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta \
--variant:GATK chr#.GATK.vcf --variant:samtools chr#.samtools.vcf \
-o chr#.GATK_samtools.union.vcf \
-genotypeMergeOptions PRIORITIZE -priority GATK,samtools --filteredrecordsmergetype KEEP_UNCONDITIONAL

For merging all chromosomes per individual:

$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta \
--variant:chr1 chr1.GATK_samtools.union.vcf --variant:chr2 chr2.GATK_samtools.union.vcf --variant:chr3 chr3.GATK_samtools.union.vcf \
-o Individual1.union.vcf \
-genotypeMergeOptions PRIORITIZE -priority chr1,chr2,chr3 --filteredrecordsmergetype KEEP_UNCONDITIONAL

Finally I would like to intersect between two individuals and keep only the variants that are common to both individuals:

Uniting / merging two individuals:

$ java -Xmx10g -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta \
--variant:Individual1 Individual1.union.vcf --variant:Individual2 Individual2.union.vcf -o Individual1_2.union.vcf \
-genotypeMergeOptions PRIORITIZE -priority Individual1,Individual2 --filteredrecordsmergetype KEEP_UNCONDITIONAL

Intersecting the two individuals in order to keep only common variants:

$ java -Xmx10g -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta \
--variant Individual1_2.union.vcf -select 'set == "Intersection";' \
-o Intersected.vcf

Am I doing this right? I'm afraid I may be losing variants or something else along this pipeline. Remember that I want to keep only the common variants while ignoring the filters in order to increase sensitivity as much as possible.
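
The difference between the "merge" (union) and "intersect" steps in this pipeline can be shown on plain sets of variant keys. This is only a conceptual sketch: the record layout, the function names, and the example variants are hypothetical, and it says nothing about how CombineVariants or SelectVariants handle genotypes, filters, or annotations.

```python
def variant_key(record):
    """Identify a variant by chromosome, position, and alleles."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])


def union_and_intersection(callset_a, callset_b):
    """Return (union, intersection) of two call sets as sets of keys."""
    keys_a = {variant_key(r) for r in callset_a}
    keys_b = {variant_key(r) for r in callset_b}
    return keys_a | keys_b, keys_a & keys_b


# Hypothetical call sets from two callers (or two individuals).
callset_a = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "chr1", "pos": 200, "ref": "C", "alt": "T"},
]
callset_b = [
    {"chrom": "chr1", "pos": 200, "ref": "C", "alt": "T"},
    {"chrom": "chr1", "pos": 300, "ref": "G", "alt": "A"},
]

union, common = union_and_intersection(callset_a, callset_b)
print(len(union), sorted(common))
```

The union keeps every variant seen in either input, which favours sensitivity. The intersection keeps only the variants present in both.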



Created 2013-02-26 17:31:08 | Updated | Tags: pedigree vcf pipeline merge

Comments (3)

Hi to all

I have just started using GATK and I have a few questions about some tools and about the general workflow.

I have exome-seq data for the 3 members of a trio, and I have to detect rare or private variants that segregate with the disease.

From the 3 aligned bam files I proceeded with the GATK pipeline (ADDgroupInfo, MarkDup, Realign, BQSR, Unified Genotyper and variant filtration) and I generated 3 VCF files.

Since I now have to use the PhaseByTransmission tool, should I merge the 3 VCF files?

Or would it have been better to merge the BAM files after adding the group info and then proceed with the rest of the analysis?

And should I create my .ped file (I visited http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped, but I couldn't understand how a ped file is generated) based on the read groups that I have assigned?


Created 2013-01-14 11:07:56 | Updated 2013-01-14 14:20:02 | Tags: combinevariants merge indels

Comments (4)

Hi. I want to merge two VCF files. I first selected only the indels from each file (using the select variants option), and now I want to merge the two resulting VCF files, which contain only indels. But when I run the command, I keep getting the same error:

ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
java.lang.NumberFormatException: For input string: "."

I run this command:

java -jar -Xmx2g GenomeAnalysisTK.jar -R hg19_5.fasta -T CombineVariants -V indelsample1.vcf -V indelsample3.vcf -o indels1s3.vcf -genotypeMergeOptions UNIQUIFY

Could you please tell me what the reason behind this is, and how to merge two VCF files containing indels?

Thanks in advance.

Created 2012-11-16 18:29:54 | Updated 2013-01-07 20:01:32 | Tags: vcf bam merge

Comments (2)

Dear All, I am very new to the analysis of NGS data.

I would like to merge the information of sample 1029 from HGDP (http://cdna.eva.mpg.de/denisova/VCF/human/HGDP01029.hg19_1000g.12.mod.vcf.gz) with the SAN sample in Schuster et al 2010 (ftp://ftp.bx.psu.edu/data/bushman/hg18/bam/KB1illumChr12.bam).

If I understood correctly, I should call the variants from the bam file and then merge them with the vcf. Is that correct? Could you kindly suggest the best way to do it, in your opinion? When should I convert my files to the same reference sequence?

In addition, I am looking at http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0, and I am trying to do variant detection on the example file NA12878. I have some doubts: where can I find the MarkDuplicates tool? Should I invoke it just with the -T argument, or do I need to install it?

I am really sorry, I am trying to understand GATK, but it is not really intuitive, so if you have any tips or recommendations, please let me know.

Created 2012-10-12 09:44:57 | Updated 2012-10-18 00:59:49 | Tags: combinevariants vcf merge

Comments (4)

Dear team, I am new to GATK and I am having a hard time simply trying to merge vcf files. I have tried to solve the problem by referring to the guide and to previous posts, but nothing worked. I actually found several discussions about the very same error message I receive, but it seems that no clear answer was provided. Here is the message:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.1-12-ga99c19d):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
ERROR ------------------------------------------------------------------------------------------

I have tried three different MS Dos commands to do the task (see below), but the message didn't change:

1. java -jar GenomeAnalysisTK.jar -T CombineVariants -R E:\RessourcesGATK\ucsc.hg19.fasta -V:sample1 E:\TestGATK\sample1.vcf -V:sample2 E:\TestGATK\sample2.vcf -o combined.vcf

2. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions UNIQUIFY

3. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta  -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions PRIORITIZE  -priority foo,bar

I have also tried to use the reference human_g1k_v37.fasta as a resource, but it was the same. I have suppressed the # before CHROM in the header line, tested vcf generated by Samtools or by GATK, but it did not change anything. Is this a problem of environment? I haven't read anything mentioning that GATK could not work with MS Dos.

Thank you very much for your help. S.
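
The error message in this post says the parser never saw the CHROM header line starting with one #. A quick way to check an input file for that line is sketched below. This is not GATK's parser: the function name `find_chrom_header` and the example lines are hypothetical, and the check only looks at the leading characters of each line.

```python
def find_chrom_header(lines):
    """Return the index of the '#CHROM' column header line, or None.

    Lines starting with '##' are meta-information and are skipped. The
    first line after them must start with '#CHROM' to count as the header.
    """
    for i, line in enumerate(lines):
        if line.startswith("##"):
            continue
        if line.startswith("#CHROM"):
            return i
        return None  # first non-meta line is not the column header
    return None


# Hypothetical file contents: one valid, one with the '#' removed.
good = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
]
bad = [
    "##fileformat=VCFv4.1",
    "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
]

print(find_chrom_header(good), find_chrom_header(bad))
```

In this sketch, a header line whose leading # has been removed is reported as missing, the same condition the error message describes.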