# Tagged with #best-practices 9 documentation articles | 12 announcements | 27 forum discussions

Ami Levy-Moonshine presented this condensed 90-minute workshop given at ASHG 2014 in San Diego, CA on October 21.

This workshop covered all the core steps involved in calling variants with the GATK, using the “Best Practices” developed by the GATK team. The presentation materials outline why each step is essential to the calling process and what are the key operations performed on the data at each step. This includes specific information about variant calling in RNAseq data and efficient analysis of cohorts.

His slide deck is available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

GATK was also featured in another mini-workshop at ASHG which covered the iSeqTools network, focused on cloud-based analysis. The presentation slides will be posted to the iSeqTools website in the near future.

Best Practices for variant discovery in DNA:

Best Practices for variant discovery in RNAseq:

Excerpt from Ami's ASHG poster:

Eric Banks, Sheila Chandran and Geraldine Van der Auwera presented this workshop in Philadelphia, PA, upon invitation from the School of Medicine at UPenn.

This workshop covered all the core steps involved in calling variants with the GATK, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset. This includes specific information about variant calling in RNAseq data and efficient analysis of cohorts.

The material was presented over two days, organized in the following modules:

• Data pre-processing: From FASTQ to analysis-ready BAM
• Variant Discovery: From BAM to analysis-ready VCF

This was complemented by a set of hands-on exercises aiming to teach basic GATK usage to new users.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

This is a placeholder for a document in preparation.

In the meantime, for a quick introduction to the GATK and the purpose of the Best Practices, please see the intro talks from our latest workshop in the Presentations section.

Important notes on context and caveats

These recommendations have been developed by the GATK development team over years of analysis work on many of the Broad Institute's sequencing projects. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.

However, our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately we are not able to provide official recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent). In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the FAQ articles for help dealing with other experimental designs. Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above.

### Overview and motivations

This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".

This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.

In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).

In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.

This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.

### Workflow details

#### 1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the following options:

--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

#### 2. Optional data aggregation step

If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.

#### 3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.

#### 4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.

That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!

### Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

### Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.

### The workflow

#### 1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir STAR --runMode genomeGenerate --genomeDir$genomeDir --genomeFastaFiles hg19.fa\  --runThreadN <n>


2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir cd$runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>  3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass: genomeDir=/path/to/hg19_2pass mkdir$genomeDir
STAR --runMode genomeGenerate --genomeDir genomeDir --genomeFastaFiles hg19.fa \ --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>  4) The resulting index is then used to produce the final alignments as follows: runDir=/path/to/2pass mkdirrunDir
cd $runDir STAR --genomeDir$genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>


The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample

java -jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics


#### 3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Please note that we recently (6/11/14) edited this to fix a documentation error regarding the filter to use. See this announcement for details.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS


#### 4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

#### 5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.

#### 6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we lower the minimum phred-scaled confidence threshold for calling variants on RNAseq data, so we use a default of 20 (instead of 30 in DNA-seq data).

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output.vcf


#### 7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf


Please note that we selected these hard filtering values in attempting to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positives calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

### Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (splitNCigarReads) cleans many false positive calls that are caused by splicing inaccuracies by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, improve true positive and minimize false positive rates, as well as developing statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.

[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop included two modules:

• Best Practices for Variant Calling with the GATK

The core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

• Building Analysis Pipelines with Queue

An introduction to the Queue pipelining system. View the workshop materials to learn about how to use Queue to create analysis pipelines, scatter-gather them and run them locally or in parallel on a computing farm.

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Let's say we have this example data:

• sample1_lane1.fq
• sample1_lane2.fq
• sample2_lane1.fq
• sample2_lane2.fq

#### 1. Run all core steps per-lane once

At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.

The example data becomes:

• sample1_lane1.dedup.realn.recal.bam
• sample1_lane2.dedup.realn.recal.bam
• sample2_lane1.dedup.realn.recal.bam
• sample2_lane2.dedup.realn.recal.bam

#### 2. Merge lanes per sample

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.

The example data becomes:

• sample1.merged.bam
• sample2.merged.bam

#### 3. Per-sample refinement

You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.

The example data becomes:

• sample1.merged.dedup.realn.bam
• sample2.merged.dedup.realn.bam

This gets you two big wins:

• Dedupping per-sample eliminates PCR duplicates across all lanes in addition to optical duplicates (which are by definition only per-lane)
• Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

Registration is now LIVE for our upcoming BroadE Workshop: Best Practices for Variant Calling with the GATK.

WHEN: Thursday, March 19 & Friday, March 20, 2015

10:00 AM - 5:00 PM (Lecture, March 19)
2:00PM - 5:00 PM (Optional Tutorial, March 20)

Auditorium (lecture)/Yellowstone (Tutorial)
415 Main Street
Cambridge, Massachusetts 02142

Registration Schedule:

Registration closes February 27 at 5:00 PM.
Notification of acceptance or wait list status sent by March 4.

REGISTER HERE

This workshop will focus on the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The GATK development team and invited guests will give talks explaining the rationale, theory and real-life applications of the Best Practices. You will learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

An optional hands-on session will be available to select participants. In this session, the GATK team will help beginners work through interactive exercises and tutorials to learn how to use GATK and apply the Best Practices to real data.

Workshop attendees will gain broad insight into the rationale of the GATK Best Practices for variant discovery, as well as a solid understanding of how individual GATK tools work and how to apply them in practice. Novices to the GATK will come out of the workshop knowing enough to identify which questions they can use address using GATK tools, how to get started on designing their experiment and analytical workflow, and how to run the tools on their own computer. Existing GATK users will come out with a deeper understanding of how the GATK works under the hood and how to improve their results further, especially as regards the latest innovations.

The Best Practices recommendations for Variant Quality Score Recalibration have been slightly updated to use the new(ish) StrandOddsRatio (SOR) annotation, which complements FisherStrand (FS) as indicator of strand bias (only available in GATK version 3.3-0 and above).

While we were at it we also reconciled some inconsistencies between the tutorial and the FAQ document. As a reminder, if you ever find differences between parameters given in the VQSR docs, let us know, but FYI that the FAQ is the ultimate source of truth=true. Note also that the command line example given in VariantRecalibrator tool doc tends to be out of date because it can only be updated with the next release (due to a limitation of the tool doc generation system) and, well, we often forget to do it in time -- so it should never be used as a reference for Best Practice parameter values, as indicated in the caveat right underneath it which no one ever reads.

Speaking of caveats, there's no such thing as too much repetition of the fact that whole genomes and exomes have subtle differences that require some tweaks to your command lines. In the case of VQSR, that means dropping Coverage (DP) from your VQSR command lines if you're working with exomes.

Finally, keep in mind that the values we recommend for tranches are really just examples; if there's one setting you should freely experiment with, that's the one. You can specify as many tranche cuts as you want to get really fine resolution.

Here is the official announcement for the upcoming workshop in Philadelphia. Registration is not necessary for the lecture sessions, but it is required for the hands-on sessions (see link further below).

We look forward to seeing you there!

The Center for Genetics and Complex Traits (CGACT) and the Institute for Biomedical Informatics (IBI) of the University of Pennsylvania Perelman School of Medicine announce a Workshop on Variants Discovery in Next Generation Sequence Data on September 18 and 19, 2014.

This workshop will focus on the core steps involved in calling variants with the Broad Institute¹s Genome Analysis Toolkit (GATK), using the "Best Practices" developed by the GATK team, and will be presented by Dr. Geraldine Van der Auwera of the Broad Institute and other instructors from the GATK team. Participants will learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of their dataset.

The workshop will take place over two consecutive days (September 18 and 19, 2014). In the morning lecture sessions, attendees will learn the rationale, theory, and real-life applications of GATK Best Practices for variant discovery in high-throughput DNA sequencing data, including recommendations for additional experimental designs and datatypes such as RNAseq. In the afternoon hands-on sessions, attendees will learn to interact with the GATK tools and apply them effectively through interactive exercises and tutorials.

The morning lecture sessions will take place on Thursday, September 18, from 9:00 am to 12:30 pm, and Friday, September 19, from 9:00 am to 11:30 am, in the Dunlop Auditorium of Stemmler Hall, University of Pennsylvania, 3450 Hamilton Walk, Philadelphia, PA 19104. Both morning sessions are open to all participants and registration is not required.

The afternoon hands-on sessions will take place on Thursday, September 18, from 2:00 pm to 5:30 pm, and Friday, September 19, from 1:00 pm to 4:30 pm. The September 18 hands-on session is aimed mainly at beginners (though familiarity with the command line environment is expected). The September 19 hands-on session is aimed at more advanced users who are already familiar with the basic GATK functions. Attendance to the hands-on sessions is limited to 20 participants each day, and precedence will be given to members of the University of Pennsylvania or its affiliated hospitals and research institutes (HUP, CHOP, Monell, Wistar, etc.).

Registration for the hands-on sessions is mandatory and open through Friday, August 29th at http://ibi.upenn.edu/?p=996.

We discovered today that we made an error in the documentation article that describes the RNAseq Best Practices workflow. The error is not critical but is likely to cause an increased rate of False Positive calls in your dataset.

The error was made in the description of the "Split & Trim" pre-processing step. We originally wrote that you need to reassign mapping qualities to 60 using the ReassignMappingQuality read filter. However, this causes all MAPQs in the file to be reassigned to 60, whereas what you want to do is reassign MAPQs only for good alignments which STAR identifies with MAPQ 255. This is done with a different read filter, called ReassignOneMappingQuality. The correct command is therefore:

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS


In our hands we see a bump in the rate of FP calls from 4% to 8% when the wrong filter is used. We don't see any significant amount of false negatives (lost true positives) with the bad command, although we do see a few more true positives show up in the results of the bad command. So basically the effect is to excessively increase sensitivity, at the expense of specificity, because poorly mapped reads are taken into account with a "good" mapping quality, where they would normally be discarded.

This effect will be stronger in datasets with lower overall quality, so your results may vary. Let us know if you observe any really dramatic effects, but we don't expect that to happen.

To be clear, we do recommend re-processing your data if you can, but if that is not an option, keep in mind how this affects the rate of false positive discovery in your data.

We apologize for this error (which has now been corrected in the documentation) and for the inconvenience it may cause you.

Calling all Belgians! (and immediate neighbors)

In case you didn't hear of this through your local institutions, I'm excited to announce that we are doing a GATK workshop in Belgium in two weeks (June 24-26 to be precise). The workshop, which is open and free to the scientific community, will be held at the Royal Institute of Natural Sciences in Brussels.

This is SUPER EXCITING to me because as a small child I spent many hours drooling in front of the Institute Museum's stunningly beautiful Iguanodons, likely leaving grubby handprints all over the glass cases, to the shame and annoyance of my parents. I also happen to have attended the Lycee Emile Jacqmain which is located in the same park, right next to the Museum (also within a stone's throw of the more recently added European Parliament) so for me this is a real trip into the past. Complete with dinosaurs!

That said, I expect you may find this workshop exciting for very different reasons, such as learning how the GATK can empower your research and hearing about the latest cutting-edge developments that you can expect for version 3.2.

See this website or the attached flyer for practical details (but note that the exact daily program may be slightly different than announced due to the latest changes in GATK) and be sure to register (it's required for admission!) by emailing cvangestel at naturalsciences.be with your name and affiliation.

Please note that the hands-on sessions (to be held on the third day) are already filled to capacity. The tutorial materials will be available on our website in the days following the workshop.

We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.

### Best Practices workflow for RNAseq

This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.

Please see the new document here for full details about how to run this workflow in practice.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.

Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!

The presentation videos for:

• Day 1 (Best Practices for Variant Calling with GATK)
• Day 2 (Building Analysis Pipelines with Queue)

Slides for :

• Day 1 (Best Practices for Variant Calling with GATK)
• Day 2 (Building Analysis Pipelines with Queue)

are available at this link in our DropBox slide archive :

https://www.dropbox.com/sh/ryz7bx4aeqc7mkl/44Q_2jvIyH

Register now for a spot at the upcoming GATK workshop, which will be held in Cambridge, MA on October 21-22.

This workshop will cover the following topics:

• GATK Best Practices for Variant Detection
• Building Analysis Pipelines with Queue

The workshop is scheduled right before ASHG Boston, so if you're going to be in town for the conference, make sure you come a couple of days early and attend the GATK workshop!

We're very pleased to announce that we have finally finished our big rewrite of the Best Practices documentation. We hope that the new format, which you can find here, will prove more user-friendly, searchable and overall more helpful than the previous version.

We have a few more improvements in mind (e.g. a clickable image map of the workflow) and there may be a few bugs here and there to iron out. So please feel free to comment on this announcement and give us feedback, whether flattering or critical, so we can improve it to help you as much as possible.

The slides from the July 2013 Best Practices workshop are available here:

https://www.dropbox.com/sh/jhlk451jntywfdy/vJqbKTZZd_

The videos will be put online once they are processed.

These are videos of the presentations given at the GATK workshop on 4-5 Dec 2012 on "Best Practices for variant calling with the GATK".

I am currently processing ~100 exomes and following the Best Practice recommendations for Pre-processing and Variant Discovery. However, there are a couple of gaps in the documentation, as far as I can tell, regarding exactly how to proceed with VQSR with exome data. I would be grateful for some feedback, particularly regarding VQSR. The issues are similar to those discussed on this thread: http://gatkforums.broadinstitute.org/discussion/4798/vqsr-using-capture-and-padding but my questions aren't fully-addressed there (or elsewhere on the Forum as far as I can see).

Prior Steps: 1) All samples processed with same protocol (~60Mb capture kit) - coverage ~50X-100X 2) Alignment with BWA-MEM (to whole genome) 3) Remove duplicates, indel-realignment, bqsr 4) HC to produce gVCFs (-ERC) 5) Genotype gVCFs

This week I have been investigating VQSR, which has generated some questions.

Q1) Which regions should I use from my data for building the VQSR model?

Here I have tried 3 different input datasets:

a) All my variant positions (11Million positions) b) Variant positions that are in the capture kit (~326k positions) - i.e. used bedtools intersect to only extract variants from (1) c) Variant positions that are in the capture kit with padding of 100nt either side (~568k positions) - as above but bed has +/-100 on regions + uniq to remove duplicate variants that are now in more than one bed region

For each of the above, I have produced "sensitive" and "specific" datasets: "Specific" --ts_filter_level 90.0 \ for both SNPs and INDELs "Sensitive" --ts_filter_level 99.5 \ for SNPs, and --ts_filter_level 99.0 \ for INDELs (as suggested in the definitive FAQ https://www.broadinstitute.org/gatk/guide/article?id=1259 )

I also wanted to see what effect, if any, the "-tranche" argument has - i.e. does it just allow for ease of filtering, or does it affect the mother generated, since it was not clear to me. I applied either 5 tranches or 6:

5-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 \ for both SNPs and INDELs 6-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 \ for both SNPs and INDELs

To compare the results I then used bed intersect to get back to the variants that are within the capture kit (~326k, as before). The output is shown in the spreadsheet image below.

http://i58.tinypic.com/25gc4k7.png

What the table appears to show me, is that at the "sensitive" settings (orange background), the results are largely the same - the difference between "PASS" in the set at the bottom where all variants were being used, and the others is mostly accounted for by variants being pushed into the 99.9-100 tranche.

However, when trying to be specific (blue background), the difference between using all variants, or just the capture region/capture+100 is marked. Also surprising (at least for me) is the huge difference in "PASS" in cells E15 and E16, where the only difference was the number of tranches given to the model (note that there is very little difference in the analogous cells in Rows 5/6 andRows 10/11.

Q2) Can somebody explain why there is such a difference in "PASS" rows between All-SPEC and the Capture(s)-Spec Q3) Can somebody explain why 6 tranches resulted in ~23k more PASSes than 5 tranches for the All-SPEC Q4) What does "PASS" mean in this context - a score =100? Is it an observation of a variant position in my data that has been observed in the "truth" set? It isn't actually described in the header of the VCF, though presumably the following corresponds: FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -38682.8777"> Q5) Similarly, why do no variants fall below my lower tranche threshold of 90? Is it because they are all reliable at least to this level?

Q6) Am I just really confused? :-(

Hi,

I have discovered some unusual calls in my VCF file after using HaplotypeCaller. I am using version 3.3 of GATK. I applied VQSR as well as the genotype refinement workflow (CalculateGenotypePosteriors, etc.) to refine the calls, but the unusual calls did not get removed. I also calculated the number of Mendelian error just in the biallelic SNPs in my final VCF file (using PLINK) and found unusually high percentage for each of the 3 families I am studying: 0.153%, 0.167%, and 0.25%. The percentage of triallelic SNPs is also very high: 0.111%. Why are the error rates so high?

I used the following commands to call the variants and generate the initial VCF file:

# CatVariants (generate 1 vcf file with all inds and all chrs)

java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted

# VQSR

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches

# Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches

# Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf

# Genotype Refinement Workflow

java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf

Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS)

The entire example is below:

1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0

Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high?

Thank you very much in advance, Alva

Hi all - I'm stumped and need your help. I'm following the GATK best practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 others samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but are not listed in the VCF. I discovered that the variant is being filtered out..reason being VQSRTranchesSNP99.00to99.90. The genotype is homozygous variant, DP is 243, Qual is 524742.54 and its known in dbSNP. I suspect this is happening to other variants.

How do I adjust VQSR or how tranches are used and variants get placed in? I supposed I need to fine tune my parameters...but I would think something as obvious as this variant would pass Filtering.

Hi,

I am combining gcvf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing.

For example, at position 16050036, this is what comes up in the final gvcf file: 22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0

But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site: 22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0

The command I'm using to generate these files is:

java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L \${numchr} where numchr is a variable previously defined (indicating the chromosome number).

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?

Thanks, Alva

Excuse me:

After calling the genotype using Haplotyper Caller in gatk, i manually check the reads covering the variant sites. But i found one exception: The genotype of one sample was 0/0，wild type,but the reads for this site were .,,,..,,c.,c.c,.cccc. I think almost all the reads should be . or ,. Is this case normal?

Hi,

I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former document as my guide since it is more recent, but I am curious as to why these two are different. The first step is the same for the 2 docs, but then for the second step, the output is a recal.bam file in the brussels document instead of a post_recal.table file in the (howto) document. Then, I am confused about the (howto) document because it seems that the 4th step is exactly the same as the second step, except we generate a recal.bam file. Is the brussels document presenting steps that are more efficient? (since there is only 1 use of -BQSR to generate the recalibrated bam file we need, as opposed to 2 uses of the option)

Thanks, Alva

Hi,

I used HaplotypeCaller in GVCF mode to generate a single sample GVCF, but when I checked my vcf file I see that the reference allele is not showing up:

22  1   .   N   <NON_REF>   .   .   END=16050022    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
22 16050023    .   C   <NON_REF>   .   .   END=16050023    GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,37
22  16050024    .   A   <NON_REF>   .   .   END=16050026    GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,73
22  16050027    .   A   <NON_REF>   .   .   END=16050035    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,110
22  16050037    .   G   <NON_REF>   .   .   END=16050037    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,109
22  16050038    .   A   <NON_REF>   .   .   END=16050039    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,153


I am not sure where to start troubleshooting for this, since all the steps prior to using HaplotypeCaller did not generate any obvious errors.

The basic command that I used was: java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Have you encountered this problem before? Where should I start troubleshooting?

Thanks very much in advance, Alva

Chaps,

I am a newbie to GATK. I have started the best practices pipline but stuck in bwa. From where can I get the reference.fa file? The GATK bundle has .fasta .bam .fai but no .fa. Anyone please!

Where can I get the fasta and ref files for experimenting on GATK?

Hi, I am following the Best Practices for DNAseq analysis and have 2 quick questions: 1. I wanted to confirm if the VCF files produced after the VariantRecalibrator and ApplyRecalibration steps for SNP and for Indels are completely independent of each other. In other words, the VCF file produced after these two steps for SNPs (mode SNP) is just for SNPs and for Indels (mode INDEL) is just for Indels. 2. What is the source of the ALT alleles in the VCF file - is it the various annotation files that were used during the analysis? Thanks, - Pankaj

Hi,

Thank you for providing guidelines on RNA-Seq variant discovery. For our data, we are currently playing with multiple mapping methods and have noticed that 2-step alignments work "better" than 1-step alignments. By 2-step alignments, I mean using STAR as step1 and then take the unmapped from this and use another aligner (like Bowtie2) for alignment. If I use such a methodology, will there be an issue in variant calling when during splitting cigar strings I ask it convert the 255 MAPQ to another value (like 60 in the best practices example), since bowtie2 gives different MAPQ scores. Sorry if this seems like a stupid question, but I am just a little curious how such a thing might affect the variant calls. Any insights/comment on this will be greatly appreciated.

Thanks!

Hi,

I am wondering how HaplotypeCaller in GATK 3.1 handles multimappers? I thought I read that they are passed over for variant calling but stay in the realigned, recalibrated BAM for 'completions sake', like marking duplicates not removing them but cannot find supporting info on the website or from farther afield,

Presumably it is the NotPrimaryAlignmentFilter but there is no info on that posted yet. I know I can output a BAM from HC with haplotype info in there but can I just get reads used in variant calls? Or should I trim the BAM myself to retain reads I want used? I do this for mark duplicates (removed) but for multimappers I would like to know how you define so I can do the same. The reason is for coverage estimates, using bamtobed or such means I take all realigned, recalibrated which is many more lines including multimappers which skews my results.

Thanks,

Bruce.

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

Hello there! Thanks as always for the lovely tools, I continue to live in them.

• Been wondering how best to interpret my VQSLOD plots/tranches and subsequent VQSLOD scores. Attached are those plots, and a histogram of my VQSLOD scores as they are found across my replicate samples.

Methods Thus Far

We have HiSeq reads of "mutant" and wt fish, three replicates of each. The sequences were captured by size selected digest, so some have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (TiTv independent).

As per my interpretation of the best practices, I did an initial calling of the variants (HaplotypeCaller) and filtered them very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again with that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors of each from the suggested human ones. The rest of my pipeline follows the best practices fairly closely, GATK version was 2.7-2, and my mapping was with BWA MEM.

My semi-educated guess..

The spike in VQSLOD I see for samples found across all six replicates are simply the rediscovery of those in my truth set, and those with amazing coverage, which is probably fine/good. The part that worries me are the plots and tranches. The plots don't ever really show a section where the "known" set clusters with one set of obviously good variants but not with another. Is that OK or does that and my inflated VQSLOD values ring of poor practice?

We have 50 human whole genome data (12x) from the same population and we have already used GATK to call variants and followed the best practice for population level calling. We have then sequenced 20 more individual (12x) from the same population. For the population level variant calling do we have to put all the BAM files (50+20) together and then call the variants(in UnifiedGenotyper) ? Or we can somehow use the VCF file which was created from the previous 50 individual and use them as training set and call the variant for the new 20 individuals?

I am not sure where I should ask this question, but the GATK forum seemed the most appropriate place. I am currently viewing the webinars from the BroadE Workshop 2013 July 9-10 and the "Call Variants" streaming doesn't seem to work:

Would you have a work around to view this file? or is the file inaccessible for some reason?

Hi, I am estimating SNP number for a genome of a speceis.

1. After mapping the illumina reads to a reference seqence (of a related species), I used picard to remove duplicates and used samtools (mpileup) to convert the bam file to a fastq file.
2. I also use GATK to do local reglimant and recalication from the mappnig bam file. Then I used samtools (mpileup) to convert the bam files output by GATK to a fastq file.

I found that the number of SNP in the fastq going through GATK is 10 times more than the first fastq. Interestingly, if I use picard to do duplicats-removomg again to the GATK bam and used samtools to convert the bam to fastq file. The SNP jumps back to 10 time fewer.

What can be the reason that the SNP number can be 10 time different between the two methods? Actually, I expect the GATK output file has fewer SNP given the effect of recaliraiton or relingement. But the result is opposite.

Chih-Ming

Hi:

for the best practice V4, for the step: 2. Raw reads to analysis-ready reads at phase I, I only have lane-level sample data (each sample in one lane, no sample in multi-lanes), so I only did the Fast: lane-level realignment (at known sites only) and lane-level recalibration, which is

for each lane.bam
dedup.bam <- MarkDuplicate(lane.bam)
realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
recal.bam <- recal(realigned.bam)


Since I do not have multi-lane samples, I do not have to run Fast + per-sample processing, Better: per-sample realignment with known indels and recalibration, or Best: multi-sample realignment with known sites and recalibration, right? It seems that all of these schemes, Better, Best etc, are only for situation when some samples run at multiple lanes, is my understanding correct?

at the paragraph for Fast + per-sample processing, it mentioned: cross-lane realignment, not sure what it means exactly. if every sample of mine are in a single lane, it does not need, right?

Thanks

Mike

Hi all, I have a 4 exome experiment (high coverage). According to best practices I can't apply VQSR, so I must filter on SNP properties. One thing is not clear: should all the option apply? In other words should I select using boolean "and" (i.e. QD  < 2.0 & MQ < 40.0 & FS > 60.0 & HaplotypeScore > 13.0 & MQRankSum < -12.5 & ReadPosRankSum < -8.0 for SNP) or "or" (i.e. QD  < 2.0 | MQ < 40.0 | FS > 60.0 | HaplotypeScore > 13.0 | MQRankSum < -12.5 | ReadPosRankSum < -8.0 for SNP)?

Hello,

I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:

I have been running GATK v1.6.2 on several samples. It seems the way I had initially had run GATK for indel-realignment and quality re-calibration steps are reversed. For example, in order of processing, I ran:

• MarkDuplicates
• Count Covariates
• Table Recalibration
• Realigner Target Creator
• Indel Realigner

What are the consequences and effect to SNP and INDEL calling if the GATK steps are ran as above?. I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at BaseQuality scores when making calls.

Any help on this would be appreciated.

Mike

Hi, I am working on SOLiD 5500xl data and used SHRiMP2.2.3 for performing mapping. The library type is paired-end. I have read some discussions regarding SOLiD problem but I still have some doubts regarding some steps in best practices

1. Local-Realignment at In-dels: Since local realignments take place in base space instead of color-space I doubt the accuracy of the alignment
2. Mark/Remove Duplicates: Reads just lying in the same position (end to end) may not necessarily be duplicates. Some of these reads may have putative variants, which otherwise may be filtered out.
3. Base quality score recalibration: I am not sure whether this is applicable for 5500xl as well, since quality values have slightly changed on 5500 from previous SOLiD versions as far as I know.

So after mapping, I simply used GATK UnifiedGenotyper to call SNPs and InDels under default values. I end up getting around 40 million variants. Is there any way I can get a more refined variant calling? Do you still consider me applying the above pre-processing steps or do you recommend me applying some variant filteratiion on the called variants? If yes for the previous, then could you explain how my above concerns are taken care of? I was trying to look at some general recommended filter values on INFO fields in VCF format such as BQ, MQ, MQ0, DP, SB etc. Do you recommend some generally used values of these fields on which I can filter and hence refine my variant data?

I may have posted a subset of the above question, which I am not sure was posted successfully since at that time I just created an account. If you have already answered this question then I apologize for that. Could you then provide me the link where you answered it?

Hello,

I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.

The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?

Is there any difference or reason to choose one of the following two workflows over the other?

1. Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller and then with my multi-sample VCF perform the variant recalibration procedure and then split the individual call sets out of the multi-sample vcf with SelectVariants?

2. Take 30+ individual vcf files, merge them together, and then perform variant recalibration on the merged vcf and then split the individual call sets out of the multi-sample vcf with SelectVariants?

3. Or some third option I'm missing

Any help is appreciated.

Thanks

Hi all!

I'm currently working on high-coverage non-human data (mammals).

After mapping via BWA, soritng and merging, my pipeline goes like this:

1. Remove duplicates via MarkDuplicates.
2. RealignerTargetCreator
3. IndelRealigner using the --consensusDeterminationModel USE_SW flag
4. Fix Mates

I currently want to begin the recalibration step before doing the actual variant calls via UnifiedGenotyper.

Since I'm working on non-human data, there is no thorough database I can trust as an input vcf file for the recalibration step.

What is your recommendation for this for non-human data?

Thank you very much!

Sagi

Hello,

I'm sorry if I'm being dense (I'm new to all this and it is making me feel very dense indeed!), but having read the section on 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear as to whether you mean I should be looking for a QUAL score of at least 30 in a deep coverage data set and should filter out any suggested SNPs that don't meet this, or a GQ score of 30 in each individual sample genotyped at the SNP in question and I only need to filter out individual samples that don't meet this threshold.

I have pasted the bit of text I read below, just to make it clear to which bit I am referring.

Many thanks!

A common question is the confidence score threshold to use for variant detection. We recommend:

Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.

Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.

Hi,

I wonder how well GATK works with Ion Torrent data. Is there any recommended practice to handle Ion Torrent data, especially SNP and indel calling, with GATK?

Thanks, XZ

Hello,

I am new at using GATK (v 2.1-3). I do exome sequencing by sample using the following steps: Alignment with BWA (0.6.2) GATK :Local realignment around INDELs PICARD (1.67): FixMateInformation GATK: Recalibration (BaseRecalibrator + PrintReads -BQSR) Samtools for calling variants

## Samtools seems to run properly but no file (*.vcf and *.bcf) are created and no error message is prompted :

cd Sample_09 + samtools mpileup -BE -ug -q 20 -Q 20 -D -f human_g1k_v37.fasta realigned_fixed_recal.bam -C50 + bcftools view -bvcg - [mpileup] 1 samples in 1 input files Set max per-file depth to 8000 [bcfview] 100000 sites processed. [afs] 0:89274.054 1:6411.053 2:4314.893 [bcfview] 200000 sites processed. [afs] 0:89100.642 1:6125.883 2:4773.474 [bcfview] 300000 sites processed. [afs] 0:87374.996 1:7439.238 2:5185.766 [bcfview] 400000 sites processed. [afs] 0:87890.186 1:7087.628 2:5022.185 [bcfview] 500000 sites processed. [afs] 0:85322.061 1:8454.843 2:6223.096 [bcfview] 600000 sites processed. [afs] 0:85864.368 1:8410.777 2:5724.854 [bcfview] 700000 sites processed. [afs] 0:88813.814 1:6828.001 2:4358.185 [bcfview] 800000 sites processed. [afs] 0:89070.318 1:6302.924 2:4626.758 [bcfview] 900000 sites processed. [afs] 0:88364.380 1:6796.962 2:4838.658 [bcfview] 1000000 sites processed. [afs] 0:86892.531 1:7268.088 2:5839.381 [bcfview] 1100000 sites processed. [afs] 0:87184.845 1:7153.073 2:5662.081 [bcfview] 1200000 sites processed. [afs] 0:86762.756 1:7241.236 2:5996.008 [bcfview] 1300000 sites processed. [afs] 0:89346.143 1:6159.989 2:4493.868 [bcfview] 1400000 sites processed. [afs] 0:88658.355 1:7160.555 2:4181.089 [bcfview] 1500000 sites processed. [afs] 0:85985.305 1:8308.039 2:5706.656 [bcfview] 1600000 sites processed. [afs] 0:87346.636 1:7708.883 2:4944.480 [afs] 0:63097.202 1:3950.127 2:3572.670 + bcftools view .bcf

## + cd ..

I have seen that some groups use after realignment Picard:AddOrReplaceReadGroups and I wonder if I should use before calling the variant with samtools.