I have exome data run on two lanes per library is it better to combine the lanes into one or to run each lane independently through GATK? What are the pros and cons? Many thanks
I have seen the definition of strand bias on this site (below) but I need a little clarification. Does the FS filter (a) highlight instances where reads are only present on a single strand and contain a variant (as may occur toward the end of exome capture regions) or does it (b) specifically look for instances where there are reads on both strands but the variant allele is disproportionately represented on one strand (as might be indicative of a false positive), or does it (c) do both?
I had thought it did (b) but have encountered some disagreement.
** How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls.
I have exome data for a few sets of parent-offspring trios, in which offspring have phenotypically related but probably genetically different diseases. Their parents are unaffected so I am particularly interested in identifying de novo mutations. My plan was to analyse each individual separately up to the variant calling phase and then to input three (mother, father, child) analysis-ready BAMs into the UnifiedGenotyper along with a ped file. My questions are:
1) Can you tell me whether the UnifiedGenotyper uses pedigree information in the ped file to call genotypes more accurately? In other words, is this better than calling variants jointly without supplying the ped file? If not, does GATK recommend any external tools for doing this step?
2) It is better to call variants jointly using all the trios (even though they are not related and probably don't share the same disease-causing mutations)?
I am currently working on a Exome sequencing projekt with older single-end SOLiD exomes and new paired-end exomes. In a first attempt (GATK 1.7 and best practices v3 back then) i tried calling and recalibrating all exomes together (at that time 120) without selecting for paired/single-end. As I already had validatet many variants I could check the quality of the calls and got very bad results, especially for InDels (previously called, true positive variants missing). My idee is that the UnifiedGenotyper has Problems mixing paired-end with single-end exomes.
Is there any official recommendation for this problem? My solution right now is to group the exomes in batches (40-50 Exomes) which ran on the same technology.
Also a second Problem/Question: For some individuals exomes where sequenced twice, and for some of these the first run was single-end and the second one was paired. The best practices mentions one should se all available reads for a individual for calling. Do you have any experience on how to handle these cases?
Any help is greatly appreciated!
I'm curious about the experience of the community at large with VQSR, and specifically with which sets of annotations people have found to work well. The GATK team's recommendations are valuable, but my impression is that they have fairly homogenous data types - I'd like to know if anyone has found it useful to deviate from their recommendations.
For instance, I no longer include InbreedingCoefficient with my exome runs. This was spurred by a case where previously validated variants were getting discarded by VQSR. It turned out that these particular variants were homozygous alternate in the diseased samples and homozygous reference in the controls, yielding an InbreedingCoefficient very close to 1. We decided that the all-homozygous case was far more likely to be genuinely interesting than a sequencing/variant calling artifact, so we removed the annotation from VQSR. In order to catch the all-heterozygous case (which is more likely to be an error), we add a VariantFiltration pass for 'InbreedingCoefficient < -0.8' following ApplyRecalibration.
In my case, I think InbreedingCoefficient isn't as useful because my UG/VQSR cohorts tend to be smaller and less diverse than what the GATK team typically runs (and to be honest, I'm still not sure we're doing the best thing). Has anyone else found it useful to modify these annotations? It would be helpful if we could build a more complete picture of these metrics in a diverse set of experiments.
I found the materials of the BroadE Workshop very helpful, especially the slide on analyzing variant calls using VariantEval, because there is not much documentation for it on GATK site. As an example 62 whole genome sequencing samples from north Europe were evaluated together with 1000G FIN samples, and also the polymorphic and monomorphic sites on the 1000G genotype chip were used as comparator. I would like very much to do the same for our whole exome data, the question is: is there good quality whole exome data that I can use to evaluate our own exome data?
I have checked the NHLBI ESP project Exome Variant Server site, the vcf files can be downloaded doesn't have the genotype data.
Thanks in advance!
Hi all, I have a 4 exome experiment (high coverage). According to best practices I can't apply VQSR, so I must filter on SNP properties. One thing is not clear: should all the option apply? In other words should I select using boolean "and" (i.e. QD < 2.0 & MQ < 40.0 & FS > 60.0 & HaplotypeScore > 13.0 & MQRankSum < -12.5 & ReadPosRankSum < -8.0 for SNP) or "or" (i.e. QD < 2.0 | MQ < 40.0 | FS > 60.0 | HaplotypeScore > 13.0 | MQRankSum < -12.5 | ReadPosRankSum < -8.0 for SNP)?