I am trying to run UG across just over 2,200 individuals (exome sequencing). I have successfully done this on our computing cluster with just over 1,000 samples without issues (apart from having to get the limit on no. of open files (ulimit) increased).
I got another increase in ulimit to allow me to run UG on the larger set. However, our IO is being pushed over the edge with the 2,200 input samples. I have two questions:
Would appreciate any advice you would have on getting this to run on this size of data. Thanks!
We are running GATK HaplotypeCaller on ~50 whole exome samples. We are interested in rare variants - so we ran GATK in single sample mode instead of multi sample as you recommend, however we would like to take advantage of VQSR. What would you recommend? Can we run VQSR on the output from GATK single sample?
Additionally, we are likely to run extra batches of new exome samples. Should we wait until we have them all before running them through the GATK pipeline?
Many thanks in advance.
Sorry if this is the wrong forum for this question - I just thought someone might have an idea/opinion...
Should I trim/filter exome sequencing reads prior to mapping with BWA and variant calling using GATK? I am currently filtering out reads in which <80% of bases have quality>=Q30 but I lose >20% of my reads this way. Does GATK take quality into account therefore rendering pre-mapping filtering unecessary?
Thanks in advance, Kath
Hi I am trying to run the DepthofCoverage walker from GATK for the first time and I am running into stumbling blocks. I am interested in getting coverage info for genes and I downloaded the Refseq genes list and am filtering and sorting according to specifications. I am using the SortByRef.pl script provided in the website. However, I am getting an error saying:
Can not open temporary file 1045: Too many open files at SortByRef.pl line 68, <$INPUT> line 1021.
I also checked the soft/hard limits for filesystem and it shows that my current system's limit is the following:
ulimit -u :386246 ulimit -S -u: 386246
I am stuck at this point and don't know how to proceed.
Any help is much appreciated.
Hi, I'm working with trios and small-pedigrees (up to six individuals). The VQSR section of the 'best practice' document states that 'in order to achieve the best exome results one needs an exome callset with at least 30 samples', and suggests to add additional samples such as 1000 genomes BAMs.
I' a little confused about two aspects:
1) the addition of 1000G BAMs being suggested in the VQSR section. If we need the 1000G call sets, we'd have to run these through the HaplotypeCaller or UnifiedGenotyper stages? Please forgive the question - I'm not trying to find fault in your perfect document, but please confirm as it would dramatically increase compute time (though only once), and overlaps with my next point of confusion:
2) I can understand how increasing the number of individuals from a consistent cohort, or maybe even from very similar experimental platforms, improves the outcome of the VQSR stage. However, the workshop video comments that the variant call properties are highly dependent on individual experiments (design, coverage, technical, etc). So I can't understand how the overall result is improved when I add variant calls from 30 1000G exomes (with their own typical variant quality distributions) to my trio's sample variant calls (also with their own, but very different to the 1000G's, quality distribution).
Hopefully I'm missing an important point somewhere? Many thanks in advance, K
I have exome data for a few sets of parent-offspring trios, in which offspring have phenotypically related but probably genetically different diseases. Their parents are unaffected so I am particularly interested in identifying de novo mutations. My plan was to analyse each individual separately up to the variant calling phase and then to input three (mother, father, child) analysis-ready BAMs into the UnifiedGenotyper along with a ped file. My questions are:
1) Can you tell me whether the UnifiedGenotyper uses pedigree information in the ped file to call genotypes more accurately? In other words, is this better than calling variants jointly without supplying the ped file? If not, does GATK recommend any external tools for doing this step?
2) It is better to call variants jointly using all the trios (even though they are not related and probably don't share the same disease-causing mutations)?
I found the materials of the BroadE Workshop very helpful, especially the slide on analyzing variant calls using VariantEval, because there is not much documentation for it on GATK site. As an example 62 whole genome sequencing samples from north Europe were evaluated together with 1000G FIN samples, and also the polymorphic and monomorphic sites on the 1000G genotype chip were used as comparator. I would like very much to do the same for our whole exome data, the question is: is there good quality whole exome data that I can use to evaluate our own exome data?
I have checked the NHLBI ESP project Exome Variant Server site, the vcf files can be downloaded doesn't have the genotype data.
Thanks in advance!
I have seen the definition of strand bias on this site (below) but I need a little clarification. Does the FS filter (a) highlight instances where reads are only present on a single strand and contain a variant (as may occur toward the end of exome capture regions) or does it (b) specifically look for instances where there are reads on both strands but the variant allele is disproportionately represented on one strand (as might be indicative of a false positive), or does it (c) do both?
I had thought it did (b) but have encountered some disagreement.
** How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls.
Hi all, I have a 4 exome experiment (high coverage). According to best practices I can't apply VQSR, so I must filter on SNP properties. One thing is not clear: should all the option apply? In other words should I select using boolean "and" (i.e. QD < 2.0 & MQ < 40.0 & FS > 60.0 & HaplotypeScore > 13.0 & MQRankSum < -12.5 & ReadPosRankSum < -8.0 for SNP) or "or" (i.e. QD < 2.0 | MQ < 40.0 | FS > 60.0 | HaplotypeScore > 13.0 | MQRankSum < -12.5 | ReadPosRankSum < -8.0 for SNP)?
I am currently working on a Exome sequencing projekt with older single-end SOLiD exomes and new paired-end exomes. In a first attempt (GATK 1.7 and best practices v3 back then) i tried calling and recalibrating all exomes together (at that time 120) without selecting for paired/single-end. As I already had validatet many variants I could check the quality of the calls and got very bad results, especially for InDels (previously called, true positive variants missing). My idee is that the UnifiedGenotyper has Problems mixing paired-end with single-end exomes.
Is there any official recommendation for this problem? My solution right now is to group the exomes in batches (40-50 Exomes) which ran on the same technology.
Also a second Problem/Question: For some individuals exomes where sequenced twice, and for some of these the first run was single-end and the second one was paired. The best practices mentions one should se all available reads for a individual for calling. Do you have any experience on how to handle these cases?
Any help is greatly appreciated!
I'm curious about the experience of the community at large with VQSR, and specifically with which sets of annotations people have found to work well. The GATK team's recommendations are valuable, but my impression is that they have fairly homogenous data types - I'd like to know if anyone has found it useful to deviate from their recommendations.
For instance, I no longer include InbreedingCoefficient with my exome runs. This was spurred by a case where previously validated variants were getting discarded by VQSR. It turned out that these particular variants were homozygous alternate in the diseased samples and homozygous reference in the controls, yielding an InbreedingCoefficient very close to 1. We decided that the all-homozygous case was far more likely to be genuinely interesting than a sequencing/variant calling artifact, so we removed the annotation from VQSR. In order to catch the all-heterozygous case (which is more likely to be an error), we add a VariantFiltration pass for 'InbreedingCoefficient < -0.8' following ApplyRecalibration.
In my case, I think InbreedingCoefficient isn't as useful because my UG/VQSR cohorts tend to be smaller and less diverse than what the GATK team typically runs (and to be honest, I'm still not sure we're doing the best thing). Has anyone else found it useful to modify these annotations? It would be helpful if we could build a more complete picture of these metrics in a diverse set of experiments.