We have not yet validated the joint genotyping methods (HaplotypeCaller in
-ERC GVCF mode per-sample then GenotypeGVCFs per-cohort) on RNAseq data. Our standard recommendation is to process RNAseq samples individually as laid out in the RNAseq-specific documentation.
However, we know that a lot of people have been trying out the joint genotyping workflow on RNAseq data, and there do not seem to be any major technical problems. You are welcome to try it on your own data, with the caveat that we cannot guarantee correctness of results, and may not be able to help you if something goes wrong. Please be sure to examine your results carefully and critically.
If you do pursue this, you will need to pre-process your samples according to our RNA-specific documentation, then switch to the GVCF workflow at the HaplotypeCaller stage. For filtering, it will be up to you to determine whether the hard filtering or VQSR filtering method produce best results. We have not tested any of this so we cannot provide a recommendation. Be prepared to do a lot of analysis to validate the quality of your results.
We recommend performing variant discovery in a way that enables joint analysis of multiple samples, as laid out in our Best Practices workflow. That workflow includes a joint analysis step that empowers variant discovery by providing the ability to leverage population-wide information from a cohort of multiple sample, allowing us to detect variants with great sensitivity and genotype samples as accurately as possible. Our workflow recommendations provide a way to do this in a way that is scalable and allows incremental processing of the sequencing data.
The key point is that you don’t actually have to call variants on all your samples together to perform a joint analysis. We have developed a workflow that allows us to decouple the initial identification of potential variant sites (ie variant calling) from the genotyping step, which is the only part that really needs to be done jointly. Since GATK 3.0, you can use the HaplotypeCaller to call variants individually per-sample in
-ERC GVCF mode, followed by a joint genotyping step on all samples in the cohort, as described in this method article. This achieves what we call incremental joint discovery, providing you with all the benefits of classic joint calling (as described below) without the drawbacks.
Why "almost always"? Because some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method. For most studies, this is an acceptable tradeoff (which is reduced by the availability of high quality sequencing data), but if you are very specifically looking for singletons, you may need to do some careful evaluation before committing to this method.
Until recently, three strategies were available for variant discovery in multiple samples:
- single sample calling: sample BAMs are analyzed individually, and individual call sets are combined in a downstream processing step;
- batch calling: sample BAMs are analyzed in separate batches, and batch call sets are merged in a downstream processing step;
- joint calling: variants are called simultaneously across all sample BAMs, generating a single call set for the entire cohort.
The best of these, from the point of view of variant discovery, was joint calling, because it provided the following benefits:
Batch-calling does not output a genotype call at sites where no member in the batch has evidence for a variant; it is thus impossible to distinguish such sites from locations missing data. In contrast, joint calling emits genotype calls at every site where any individual in the call set has evidence for variation.
By sharing information across all samples, joint calling makes it possible to “rescue” genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location. However this does not apply to singletons, which are unique to a single sample. To minimize the chance of missing singletons, we increase the cohort size -- so that singletons themselves have less chance of happening in the first place.
The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies above, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.
Figure 1: Power of joint calling in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable. (right) Importance of joint calling to square off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. However, merging the results from single sample calling will incorrectly treat both of these samples identically as being non-informative.
There are two major problems with the joint calling strategy.
- Scaling & infrastructure
Joint calling scales very badly -- the calculations involved in variant calling (especially by methods like the HaplotypeCaller’s) become exponentially more computationally costly as you add samples to the cohort. If you don't have a lot of compute available, you run into limitations pretty quickly. Even here at Broad where we have fairly ridiculous amounts of compute available, we can't brute-force our way through the numbers for the larger cohort sizes that we're called on to handle.
- The N+1 problem
When you’re getting a large-ish number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time, and you analyze each batch as it comes in (whether it’s because the analysis is time-sensitive or your PI is breathing down your back). But that’s not joint calling, that’s batch calling, and it doesn’t give you the same significant gains that joint calling can give you. Unfortunately the joint calling approach doesn’t allow for incremental analysis -- every time you get even one new sample sequence, you have to re-call all samples from scratch.
This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.
This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.
In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in
-ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).
In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort. This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.
This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".
This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.
Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option
-emitRefConfidence GVCF, and using the
.g.vcf extension for the output file.
Note that versions older than 3.4 require passing the options
--variant_index_type LINEAR --variant_index_parameter 128000 to set the correct index strategy for the output gVCF.
If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.
Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.
Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.
That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.
As always, we look forward to comments and observations from the research community!
Okay, we realize the name's a bit of a mouthful, and we're willing to tweak it if anyone has any good ideas. But never mind that. It's difficult to overstate the importance of this new approach to joint variant discovery (but I'll give it a shot) so we're really stoked to finally be able to share the details of how it's is going to work in practice.
You're probably going to be surprised at how simple it is in practice (not that it was particularly easy to actually build, mind you). The gory details are in the new document here, but here's an overview of how it looks within the Best Practices workflow you all know and (hopefully) love:
The first surprise is that instead of calling variants on multiple samples, you now just run HaplotypeCaller on each sample individually. "Oh no," I hear you cry, "but the results were so much better when I called multiple samples together!". Well yeah, but it took forever. Bear with me for a minute.
The key here is that you run HaplotypeCaller in gVCF mode. This outputs a so-called genomic VCF, which contains a record of the genotype likelihoods and annotations for every single site in the genome (or exome), whether or not there is evidence of variation. This essentially boils down all the useful information that can be gleaned from the BAM files, and makes it unnecessary to refer back to the BAM in later steps.
So you repeat that for all your samples (which goes reasonably fast since per-sample calling is pretty tractable nowadays). Optionally, you can add in a step to combine gVCF files if you're working on a really large cohort. Then in the next step, you just run a separate genotyping tool on all the gVCFs (or combined gVCFs) together, which gives you the same output (raw SNPs and indel calls) that you would have got from one-step multisample calling.
See, that's the beauty of the new workflow. A lot less work (for the computer) for equivalent results. And the ability to process samples incrementally and perform joint discovery on cohort sizes that would have previously got you hauled off to the funny farm.
Let us know what you think!
Yep, you read that right, the next release of GATK is going to be the big Three-Oh!
You may have noticed that the 2.8 release was really slim. We explained in the release notes, perhaps a tad defensively, that it was because we’d been working on some ambitious new features that just weren’t ready for prime time. And that was true. Now we’ve got a couple of shiny new toys to show for it that we think you’re really going to like.
But GATK 3.0 is not really about the new features (otherwise we’d just call it 2.9). It’s about a shift in the way we approach the problems that we want to solve -- and to some extent, a shift in the scope of problems we choose to tackle.
We’ll explain what this entails in much more detail in a series of blog posts over the next few days, but let me reassure you right now on one very important point: there is nothing in the upcoming release that will disrupt your existing workflows. What it will do is offer you new paths for discovery that we believe will empower research on a scale that has previously not been possible.
And lest you think this is all just vaporware, here’s a sample of what we have in hand right now: variant calling on RNA-Seq, and a multisample variant discovery workflow liberated from the shackles of time and scaling issues.
Stay tuned for details!
In case we do not have a database of known sites for a non model organism, what is the best strategy to do the base recalibration and variant calling with joint genotyping when we have several samples ?
According to the GATK best practices, when we do not have a data base of known sites, it is recommended to run HaplotypeCaller and hard filter the SNPs then use the resultings SNPs for the base recalibration until convergence. As I have many samples (31), I have to do this on all the samples. However, as, after the base recalibration, I want to do the final snp calling with joint genotyping, I think this approach is going to be really time consuming...
So, I'm wondering whether I can execute an initial snp calling (with HaplotypeCaller) on each sample, then joint genotyping them to obtain a global snp data that I will manually filter. This joint and filtered snp database could then be used for the base recalibration and a second joint genotyping. I think that it will be faster this way. However, I'm not sure whether there are caveats in this approach and I would like to have your advices.
Thank you very much in advance.
I used HC on many individuals and have two groups. Lets call them A and B. A has 4 sets of CombinedGVCFs (~400 individuals). B has 1 set of CombinedGVCFs (~100 individuals).
I have JointGT of all A and of all B. A gives 507248 variants B gives 1030483 variants
Combining these with CombineVariants gives 1207973 variants
When doing JointGT on A and B together, I get 1196703 variants (11270 less than combined)
Looking at the disconcordance give these numbers: Present in Combined, but not in Joint: 7335 variants with QualityScore Mean 71, Median 34
Present in Joint, but not in Combined: 978 variants with QualityScore Mean 641, Median 38
I am wondering: Why does that happen? I was assuming, that when using CombineVariants, I don't loose information. And when JointGenotyping, I should actually get equal or more variants than when calling every individual on its own (based on a comment of @Geraldine_VdAuwera in this forum), because the other individuals context helps to call 'complicated to call' variants. Are these assumptions correct? And if yes, why do I get these results?
If someone checked the numbers: The concordance variants don't add up to the difference of the sets: 7335 + 978 = 11270 - 2957 What are these 2957 variants??? The disconcordances from both sides (switching -V and -disc) should add up to the difference of two sets, right? And the two concordances (from both sides, switching -V and -conc) should be the same. A ∩ B + A \ B + B \ A = A ∪ B A \ B + B \ A = A ∪ B - A ∩ B So.. ???
Apart from that: based on the Quality scores and the circumstances: Could one consider the disconcordance variants as very likely to be FP?
Thanks for clearing things up! Alexander