We're trying to put together some recommendations for folks who want to use GATK tools on non-human genomes. But we really don't have much experience with non-human genomes, so we're hoping that those of you in the GATK community who do will chime in and help your fellow scientists find the answers for a few common problems.
The most common problem seems to be finding sets of known sites for organisms like Drosophila, dogs, and various plants. If you know of such resources, please share your knowledge by commenting in this thread. You could earn upvotes and warm fuzzy feelings!
We've realized that having separate Team / Community Q&A sections is confusing and unnecessary, so we are going to merge the two. From now on there will only be one Q&A section (Ask the Team). We invite you to ask any questions related to the GATK and its application in that category. We welcome answers and comments from everyone in the community in these discussions; if you know something, say something! And of course you are also always welcome to comment on tool and method documentation articles.
Hello GATK users,
As you know, we have been trying for the past few months to beef up support and improve documentation. This is a long game and although we'd like to think we've made some progress, there still remains a lot to be done.
One thing we're doing right now is developing teaching materials for the upcoming workshop. Even though that will only serve a portion of our user community directly, the materials will be a useful addition to the online documentation. We've also got some new documents in preparation that should help with parallelization options, as well as some frequently asked questions on a range of subjects, illustrations to clarify workflows, and a lot of updates and link fixes for the older doc articles.
All of this takes time to develop, and unfortunately, responding to questions on the forum takes time away from that.
So we're appealing to every one of you to help us by helping others. There's a whole subset of "beginner" questions that any intermediate user could easily answer. It would be a big help if you folks could jump in and answer those questions for us whenever you can. Some of you are already doing it, and we're really thankful, because it takes some of our support burden away and frees up time for us to work on the materials that can benefit many people at once. So we'd love to see more people help in this way, and to be able to move more questions to the "Ask the Community" section (which currently houses mostly the "weird datasets" questions that we honestly don't know how to answer).
We'd be happy to consider an incentive system, by the way. We can't really offer money or anything like that, but if there's any intangible form of reward that would motivate you (leaderboards? gold stars? "expert support" coupons? big smiles and karma points?), let us know!
To anyone who's worried about the quality of community-sourced answers: we're always going to monitor every discussion, so rest assured that there won't be any decrease in quality. Who knows, there may even be an increase! ;-)
Hello, We are using some custom-made and predefined Haloplex kits. I was wondering how the best practices for variant detection should be "adapted". One of the biggest challenges we are facing is that we cannot run VQSR because of the low number of variants detected. So we have to use a hard filtering step, but here again the nature of the reads, all produced by restriction enzyme digestion, makes some filters inappropriate, such as ReadPosRankSumTest, since the read positions are not random. I was wondering if the community has any experience with this kind of data and how the hard filtering should be done. Thanks for your help. yvan
I have a question that might be more appropriate for a BWA or seqanswers audience, but I noticed something in the GATK's "Data Processing Pipeline" under "Methods and Workflows" that made me wonder. The pipeline is described here, along with a nice flowchart: http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows#41
The process starts from a BAM of reads that are either not aligned or aligned by some process you don't want to use, so the first step seems to be Picard's RevertSam followed by a realignment with BWA. I'm wondering why the process described by this GATK document splits the data into per-lane BAM files. There doesn't seem to be anything done at the per-lane level other than BWA alignment. I have two guesses. The first is that it allows more parallelization at that step.
But my second guess is that perhaps BWA doesn't play nice with read groups when reading reads from BAM input files. If that is true, that would explain why I'm having trouble with BAM (a single sample, multiple lanes, merged into one file) -> BWA -> realigned-BAM -> GATK (either UnifiedGenotyper or RealignerTargetCreator, etc)--somewhere along the way, read groups are getting lost. So my guess is that the above-described pipeline splits it per-lane so it can manually respecify read groups all over again to BWA?
Is that other people's experience as well?
Also, the Methods and Workflows page describes a Queue script, but there's no link or anything to the actual Queue script. Anyone know where to find it?
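On the read-group question above, one way to narrow down where read groups get lost is to inspect the file after each step. Below is a minimal sketch in Python, not part of the GATK or BWA, that reads SAM text and reports the read group IDs declared in the `@RG` header lines, how many reads carry each `RG:Z:` tag, and how many reads have no tag at all. The function name `read_group_summary` is hypothetical.

```python
def read_group_summary(sam_lines):
    """Summarize read group usage in SAM-formatted text lines.

    Returns (header_ids, tag_counts, missing):
      header_ids - set of IDs declared in @RG header lines
      tag_counts - dict mapping RG tag value to number of reads carrying it
      missing    - number of reads without any RG:Z: tag
    """
    header_ids = set()
    tag_counts = {}
    missing = 0
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header line; only @RG lines declare read groups.
            if line.startswith("@RG"):
                for field in line.split("\t")[1:]:
                    if field.startswith("ID:"):
                        header_ids.add(field[3:])
            continue
        # Alignment line: optional tags follow the 11 mandatory columns.
        tags = line.split("\t")[11:]
        rg = next((t[5:] for t in tags if t.startswith("RG:Z:")), None)
        if rg is None:
            missing += 1
        else:
            tag_counts[rg] = tag_counts.get(rg, 0) + 1
    return header_ids, tag_counts, missing
```

The input could be the lines of `samtools view -h` output for the BAM before and after the BWA step. If the header IDs survive but `missing` jumps after realignment, the tags are being dropped at that step rather than earlier.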
I'd like to know if someone has tested the concordance from output of PhaseByTransmission with SNP array data.
I calculated the genotype concordance against SNP array data for a family trio, taking the most likely GT combination (based on the GL values) from the VCF produced by UnifiedGenotyper. I then did the same for the genotypes obtained after running PhaseByTransmission, and I'm seeing a drop in concordance.
Is this to be expected?
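For anyone who wants to reproduce this kind of comparison, here is a minimal sketch in Python of the two calculations described above: picking the most likely genotype from three likelihood values at a biallelic site, and computing concordance against array genotypes. All function names are hypothetical, and the sketch assumes log10-scaled GL values (higher is more likely). It also normalizes phased genotypes such as `1|0` before comparing, on the assumption that a plain string comparison of phased output against unphased array calls could itself lower the apparent concordance.

```python
# Genotype order follows the VCF convention for a biallelic site:
# index 0 = 0/0, index 1 = 0/1, index 2 = 1/1.
GENOTYPES = ("0/0", "0/1", "1/1")


def most_likely_genotype(likelihoods):
    """Return the genotype with the highest likelihood at a biallelic site.

    `likelihoods` holds three log10-scaled GL values (higher is more likely).
    For phred-scaled PL values (lower is more likely), negate them first.
    """
    if len(likelihoods) != 3:
        raise ValueError("expected three likelihoods for a biallelic site")
    best = max(range(3), key=lambda i: likelihoods[i])
    return GENOTYPES[best]


def normalize(gt):
    """Ignore phasing and allele order, so '1|0' and '0/1' compare equal."""
    alleles = gt.replace("|", "/").split("/")
    return "/".join(sorted(alleles))


def concordance(calls, array):
    """Fraction of shared, fully called sites where the genotypes agree.

    `calls` and `array` map a site key, e.g. (chrom, pos), to a genotype
    string. Sites absent from either set, or containing a no-call ('.'),
    are skipped.
    """
    shared = [
        site for site in calls
        if site in array and "." not in calls[site] and "." not in array[site]
    ]
    if not shared:
        return float("nan")
    agree = sum(normalize(calls[s]) == normalize(array[s]) for s in shared)
    return agree / len(shared)
```

Running `concordance` on the genotypes before and after PhaseByTransmission, with the same set of sites, makes it easier to tell whether the drop comes from changed genotypes or from how the genotypes are compared.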
I have recently had a number of questions about analysing data where small genomic regions are PCR amplified and sequenced. Does anyone have a workflow or suggestions for parameters to help deal with this kind of data?
I just wrote a walker to look for particular types of low frequency mutations, and I wanted to verify that the methods were working. I was hoping to simulate some illumina data with the variants and then run the methods against this data.
However, I don't know what a realistic error model is for common Illumina data, so I am not sure how realistic my simulations are (proportion of gaps, A->C versus A->G, etc.). Does the GATK include a read simulator? I saw one walker in the documentation, but it seemed to rely on settings that I didn't know about, and it looked a bit out of date.
Any help appreciated.
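To make the question concrete, here is a toy sketch in Python of the kind of substitution-only error model described above. It is not a GATK tool and not a realistic Illumina model: the weights in `SUBSTITUTION_WEIGHTS` are placeholder values chosen purely for illustration, the function name `simulate_read` is hypothetical, and gaps and quality scores are not modeled at all.

```python
import random

# Placeholder relative weights for each substitution (e.g. A->G twice as
# likely as A->C). These are NOT measured Illumina rates; replace them with
# values estimated from real data.
SUBSTITUTION_WEIGHTS = {
    "A": {"C": 1.0, "G": 2.0, "T": 1.0},
    "C": {"A": 1.0, "G": 1.0, "T": 2.0},
    "G": {"A": 2.0, "C": 1.0, "T": 1.0},
    "T": {"A": 1.0, "C": 2.0, "G": 1.0},
}


def simulate_read(template, error_rate, rng, weights=SUBSTITUTION_WEIGHTS):
    """Copy `template`, substituting each base with probability `error_rate`.

    The replacement base is drawn according to `weights`. Bases not present
    in `weights` (such as N) are copied unchanged.
    """
    bases = []
    for base in template:
        if base in weights and rng.random() < error_rate:
            choices = list(weights[base])
            probs = [weights[base][c] for c in choices]
            base = rng.choices(choices, weights=probs)[0]
        bases.append(base)
    return "".join(bases)
```

Passing a seeded generator, for example `random.Random(42)`, keeps the simulated reads reproducible between runs, which helps when checking a walker against known planted variants.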
Dear GATK team and community members,
I used ProduceBeagleInput to create a genotype likelihoods file, and ran beagle.jar according to the example in http://gatkforums.broadinstitute.org/discussion/43/interface-with-beagle-software. Beagle gave a warning that it is better to use a reference panel for imputing genotypes and phasing. So I downloaded the recommended reference panel (http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes.phase1_release_v3/), but Beagle requires that the alleles be in the same order on both reference and sample files. The tool to do this is check_strands.py (http://faculty.washington.edu/sguy/beagle/strand_switching/README), but it requires both sample and reference files be in .bgl format. This is a little disappointing since not being able to use the reference panel means Beagle's calculations won't be as accurate, although I'm not sure by how much.
I understand that this might be outside the scope of responsibility for the GATK team, but I would greatly appreciate it if someone could suggest a way to phase the GATK's input to Beagle using a reference panel. Or hopefully, the GATK team will write a tool to produce .bgl files?
I'm curious about the experience of the community at large with VQSR, and specifically with which sets of annotations people have found to work well. The GATK team's recommendations are valuable, but my impression is that they have fairly homogeneous data types. I'd like to know if anyone has found it useful to deviate from their recommendations.
For instance, I no longer include InbreedingCoefficient with my exome runs. This was spurred by a case where previously validated variants were getting discarded by VQSR. It turned out that these particular variants were homozygous alternate in the diseased samples and homozygous reference in the controls, yielding an InbreedingCoefficient very close to 1. We decided that the all-homozygous case was far more likely to be genuinely interesting than a sequencing/variant calling artifact, so we removed the annotation from VQSR. In order to catch the all-heterozygous case (which is more likely to be an error), we add a VariantFiltration pass for 'InbreedingCoefficient < -0.8' following ApplyRecalibration.
In my case, I think InbreedingCoefficient isn't as useful because my UG/VQSR cohorts tend to be smaller and less diverse than what the GATK team typically runs (and to be honest, I'm still not sure we're doing the best thing). Has anyone else found it useful to modify these annotations? It would be helpful if we could build a more complete picture of these metrics in a diverse set of experiments.
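To illustrate the post-VQSR filter described above, here is a minimal sketch in Python of the logic behind flagging the all-heterozygous case. It is not the VariantFiltration implementation, just the same threshold applied to a single VCF data line. It assumes the annotation appears in the INFO column under the key `InbreedingCoeff`; the filter name `ExcessHet` and the function names are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=10;InbreedingCoeff=-0.9' into a dict.

    Flag entries without '=' (for example 'DB') map to an empty string.
    """
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value
    return info


def flag_excess_het(vcf_line, threshold=-0.8, filter_name="ExcessHet"):
    """Add `filter_name` to FILTER when InbreedingCoeff is below `threshold`.

    Lines without the annotation, or with a non-numeric value, are returned
    unchanged (apart from a stripped trailing newline).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    value = parse_info(fields[7]).get("InbreedingCoeff")  # INFO is column 8
    try:
        coeff = float(value)
    except (TypeError, ValueError):
        return "\t".join(fields)
    if coeff < threshold:
        current = fields[6]  # FILTER is column 7
        if current in (".", "PASS"):
            fields[6] = filter_name
        else:
            fields[6] = current + ";" + filter_name
    return "\t".join(fields)
```

Variants that were already filtered keep their existing filter and gain the new one, so the reason for each rejection stays visible downstream.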
I found the materials from the BroadE Workshop very helpful, especially the slide on analyzing variant calls using VariantEval, because there is not much documentation for it on the GATK site. As an example, 62 whole genome sequencing samples from northern Europe were evaluated together with 1000G FIN samples, and the polymorphic and monomorphic sites on the 1000G genotype chip were also used as a comparator. I would very much like to do the same for our whole exome data. The question is: is there good-quality whole exome data that I can use to evaluate our own exome data?
I have checked the NHLBI ESP project's Exome Variant Server site, but the VCF files that can be downloaded there don't have genotype data.
Thanks in advance!
Is there any rule of thumb for allocating memory through "bsub" when running DataProcessingPipeline, either per BAM file or per number of reads?