# Tagged with #best-practices 14 documentation articles | 14 announcements | 40 forum discussions

Created 2015-08-17 23:35:00 | Updated 2015-11-25 10:02:35 | Tags: best-practices workshop presentations slides

Laura Gauthier, Yossi Farjoun and Geraldine Van der Auwera presented this workshop in Pretoria, South Africa, upon invitation from the University of Pretoria.

This workshop covered the core steps involved in calling germline and somatic variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

Additional considerations were covered, such as calling cohorts efficiently, as well as dealing with non-human data, RNAseq data, whole-genome vs. exome, basic quality control, and performance.

This was complemented by sets of hands-on tutorials aiming to teach basic GATK usage to new users, as well as introduce pipelining concepts using Queue.

The workshop was structured into five modules:

• Introductory materials
• Best Practices Phase 1: Pre-processing
• Best Practices Phase 2A: Calling germline variants
• Best Practices Phase 2B: Calling somatic variants
• Best Practices Phase 3: Preliminary analyses

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

Created 2015-08-14 01:57:37 | Updated 2015-08-14 01:58:03 | Tags: best-practices workshop workflow presentations

Joel Thibault, Valentin Ruano-Rubio and Geraldine Van der Auwera presented this workshop in Edinburgh, Scotland, and Cambridge, England, upon invitation from the Universities of Edinburgh and Cambridge.

This workshop included two modules:

• Best Practices for Variant Calling with the GATK

The core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

• Beyond the Best Practices

Additional considerations such as calling variants in RNAseq data and calling cohorts efficiently, as well as dealing with non-human data, RNAseq data, whole-genome vs. exome, basic quality control, and performance.

This was complemented by a set of hands-on exercises aiming to teach basic GATK usage to new users.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

Created 2015-08-14 01:50:16 | Updated 2015-08-14 01:56:56 | Tags: best-practices workshop workflow presentations

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

Created 2014-10-31 23:11:22 | Updated 2014-11-01 00:01:23 | Tags: best-practices presentations ashg

Ami Levy-Moonshine presented this condensed 90-minute workshop given at ASHG 2014 in San Diego, CA on October 21.

This workshop covered all the core steps involved in calling variants with the GATK, using the “Best Practices” developed by the GATK team. The presentation materials outline why each step is essential to the calling process and what are the key operations performed on the data at each step. This includes specific information about variant calling in RNAseq data and efficient analysis of cohorts.

His slide deck is available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

GATK was also featured in another mini-workshop at ASHG which covered the iSeqTools network, focused on cloud-based analysis. The presentation slides will be posted to the iSeqTools website in the near future.

Best Practices for variant discovery in DNA:

Best Practices for variant discovery in RNAseq:

Excerpt from Ami's ASHG poster:

Created 2014-10-31 22:45:38 | Updated 2014-10-31 23:12:02 | Tags: best-practices workshop presentations

Eric Banks, Sheila Chandran and Geraldine Van der Auwera presented this workshop in Philadelphia, PA, upon invitation from the School of Medicine at UPenn.

This workshop covered all the core steps involved in calling variants with the GATK, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset. This includes specific information about variant calling in RNAseq data and efficient analysis of cohorts.

The material was presented over two days, organized in the following modules:

• Data pre-processing: From FASTQ to analysis-ready BAM
• Variant Discovery: From BAM to analysis-ready VCF

This was complemented by a set of hands-on exercises aiming to teach basic GATK usage to new users.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

Created 2014-04-16 22:09:49 | Updated 2015-05-16 06:58:15 | Tags: best-practices rnaseq

This is our recommended workflow for calling variants in RNAseq data from single samples, in which all steps are performed per-sample. In future we will provide cohort analysis recommendations, but these are not yet available.

The workflow is divided in three main sections that are meant to be performed sequentially:

• Variant discovery: from reads (BAM files) to variants (VCF files)
• Refinement and evaluation: genotype refinement, functional annotation and callset QC

Compared to the DNAseq Best Practices, the key adaptations for calling variants in RNAseq focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller, which are highlighted in the figure below.

### Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make it suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Split'N'Trim; Local Realignment Around Indels (optional); and Base Quality Score Recalibration (BQSR); performed in that order.

#### Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using STAR aligner (2-pass protocol) to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.

#### Split'N'Trim

Then, an RNAseq-specific step is applied: reads with N operators in the CIGAR strings (which denote the presence of a splice junction) are split into component reads and trimmed to remove any overhangs into splice junctions, which reduces the occurrence of artifacts. At this step, we also reassign mapping qualities from 255 (assigned by STAR) to 60 which is more meaningful for GATK tools.

#### Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads. This step is considered optional for RNAseq.

#### Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.

### Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample) and variant filtering (also performed per-sample). The first step is designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

Our current recommendation for RNAseq is to run all these steps per-sample. At the moment, we do not recommend applying the GVCF-based workflow to RNAseq data because although there is no obvious obstacle to doing so, we have not validated that configuration. Therefore, we cannot guarantee the quality of results that this would produce.

#### Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample VCFs containing raw SNP and indel calls.

#### Per-Sample Variant Filtering

For RNAseq, it is not appropriate to apply variant recalibration in its present form. Instead, we provide hard-filtering recommendations to filter variants based on specific annotation value thresholds. This produces a VCF of calls annotated with fiiltering information that can then be used in downstream analyses.

### Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

Created 2014-04-16 21:58:52 | Updated 2015-05-16 06:55:15 | Tags: best-practices gatk3

The "GATK Best Practices" are workflow descriptions that provide step-by-step recommendations for getting the best analysis results possible out of high-throughput sequencing data. At present, we provide the following Best Practice workflows:

These recommendations have been developed by the GATK development team over years of analysis work on many of the Broad Institute's sequencing projects, and are applied in the Broad's production pipelines. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.

#### Important notes on context and caveats

Our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately we are not able to provide official recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent).

In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the Guide for help dealing with other experimental designs.

Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

Created 2014-03-06 08:12:28 | Updated 2015-05-16 07:34:51 | Tags: best-practices joint-discovery reference-model

This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.

### Overview

This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.

In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).

In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort. This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.

This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".

### Workflow details

This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.

#### 1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option -emitRefConfidence GVCF, and using the .g.vcf extension for the output file.

Note that versions older than 3.4 require passing the options --variant_index_type LINEAR --variant_index_parameter 128000 to set the correct index strategy for the output gVCF.

#### 2. Optional data aggregation step

If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.

#### 3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.

#### 4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.

That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!

Created 2014-03-06 07:15:51 | Updated 2014-10-31 23:57:32 | Tags: best-practices rnaseq

### Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

### Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.

### The workflow

#### 1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir STAR --runMode genomeGenerate --genomeDir$genomeDir --genomeFastaFiles hg19.fa\  --runThreadN <n>


2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir cd$runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>  3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass: genomeDir=/path/to/hg19_2pass mkdir$genomeDir
STAR --runMode genomeGenerate --genomeDir genomeDir --genomeFastaFiles hg19.fa \ --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>  4) The resulting index is then used to produce the final alignments as follows: runDir=/path/to/2pass mkdirrunDir
cd $runDir STAR --genomeDir$genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>


The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample

java -jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics


#### 3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Please note that we recently (6/11/14) edited this to fix a documentation error regarding the filter to use. See this announcement for details.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS


#### 4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

#### 5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.

#### 6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we lower the minimum phred-scaled confidence threshold for calling variants on RNAseq data, so we use a default of 20 (instead of 30 in DNA-seq data).

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output.vcf


#### 7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf


Please note that we selected these hard filtering values in attempting to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positives calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

### Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (splitNCigarReads) cleans many false positive calls that are caused by splicing inaccuracies by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, improve true positive and minimize false positive rates, as well as developing statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.

[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013

Created 2013-10-25 16:35:32 | Updated 2014-10-31 23:13:13 | Tags: best-practices workshop presentations videos

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop included two modules:

• Best Practices for Variant Calling with the GATK

The core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

• Building Analysis Pipelines with Queue

An introduction to the Queue pipelining system. View the workshop materials to learn about how to use Queue to create analysis pipelines, scatter-gather them and run them locally or in parallel on a computing farm.

Created 2013-09-13 20:41:24 | Updated 2015-05-16 06:53:50 | Tags: best-practices dnaseq

This is our recommended workflow for calling variants in DNAseq data from cohorts of samples, in which steps from data processing up to variant calling are performed per-sample, and subsequent steps are performed jointly on all the individuals in the cohort.

The workflow is divided in three main sections that are meant to be performed sequentially:

• Variant discovery: from reads (BAM files) to variants (VCF files)
• Refinement and evaluation: genotype refinement, functional annotation and callset QC

### Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make it suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Local Realignment Around Indels; and Base Quality Score Recalibration (BQSR); performed in that order.

#### Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using BWA mem to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.

#### Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads.

#### Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.

### Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample), joint genotyping (performed per-cohort) and variant filtering (also performed per-cohort). The first two steps are designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

#### Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs. If there are more than a few hundred samples, we combine the gVCFs in batches of ~200 gVCFs using a specialized tool, CombineGVCFs. This will make the next step more tractable and reflects that the processing bottleneck lies with the number of input files and not the number of samples in those files.

#### Joint Genotyping

All available samples are then jointly genotyped by taking the gVCFs produced earlier and running GenotypeGVCFs on all of them together to create a set of raw SNP and indel calls. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites.

#### Variant Quality Score Recalibration

Variant recalibration involves using a machine learning method to assign a well-calibrated probability to each variant call in a raw call set. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

### Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

Created 2013-08-13 07:37:07 | Updated 2014-10-31 23:13:24 | Tags: official best-practices workshop

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

Created 2013-08-02 20:23:38 | Updated 2015-08-13 23:14:55 | Tags: best-practices workflow multiplexing pre-processing

Our Best Practices Pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.

However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs. At that point there are several different valid strategies for implementing the pre-processing workflow. At the Broad Institute, we run the entire pre-processing workflow separately on each individual read group, then we merge the data to produce a single BAM file for each sample, then we re-run two steps (Mark Duplicates and Indel Realignment) on the per-sample BAM files. See the worked-out example below and this presentation for more details.

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

### Example

Let's say we have this example data:

• sample1_lane1.fq
• sample1_lane2.fq
• sample2_lane1.fq
• sample2_lane2.fq

#### 1. Run all core steps per-lane once

At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.

The example data becomes:

• sample1_lane1.dedup.realn.recal.bam
• sample1_lane2.dedup.realn.recal.bam
• sample2_lane1.dedup.realn.recal.bam
• sample2_lane2.dedup.realn.recal.bam

#### 2. Merge lanes per sample

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.

The example data becomes:

• sample1.merged.bam
• sample2.merged.bam

#### 3. Per-sample refinement

You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.

The example data becomes:

• sample1.merged.dedup.realn.bam
• sample2.merged.dedup.realn.bam

This gets you two big wins:

• Dedupping per-sample eliminates PCR duplicates across all lanes in addition to optical duplicates (which are by definition only per-lane)
• Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.

Note that BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID).

Created 2013-01-05 01:17:21 | Updated 2014-10-31 23:13:36 | Tags: best-practices workshop presentations videos

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

Created 2015-10-15 14:36:34 | Updated | Tags: best-practices presentations gvcf

Geraldine Van der Auwera presented this talk as part of the Broad Institute's Medical and Population Genetics (MPG) Primers series.

This talk provides a high-level overview of the workflow for performing variant discovery on high-throughput sequencing data, as described in the GATK Best Practices and implemented in the Broad's production pipelines.

The following points emphasized in this presentation are:

• Informational content of data file formats and flow of information throughout the pipeline
• Concepts involved in the data transformations (processing steps and analysis methods)
• Motivation and key mechanics of the GVCF workflow for scalable joint variant discovery
• Relation of the GATK Best Practices to the Broad's production pipeline implementation

The presentation slide deck is available at this link.

Created 2015-04-19 15:47:54 | Updated 2015-04-21 23:52:33 | Tags: best-practices workshop presentations topstory

The presentation slides from the 2015 "GATK in the UK" workshop (April 20-24) are available on DropBox here.

All slides will ultimately be posted in the Presentations section of the Guide.

Created 2015-02-23 21:13:29 | Updated 2015-04-02 07:31:25 | Tags: best-practices workshop

Registration is now LIVE for our upcoming BroadE Workshop: Best Practices for Variant Calling with the GATK.

WHEN: Thursday, March 19 & Friday, March 20, 2015

10:00 AM - 5:00 PM (Lecture, March 19)
2:00PM - 5:00 PM (Optional Tutorial, March 20)

Auditorium (lecture)/Yellowstone (Tutorial)
415 Main Street
Cambridge, Massachusetts 02142

Registration Schedule:

Registration closes February 27 at 5:00 PM.
Notification of acceptance or wait list status sent by March 4.

REGISTER HERE

This workshop will focus on the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The GATK development team and invited guests will give talks explaining the rationale, theory and real-life applications of the Best Practices. You will learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

An optional hands-on session will be available to select participants. In this session, the GATK team will help beginners work through interactive exercises and tutorials to learn how to use GATK and apply the Best Practices to real data.

Workshop attendees will gain broad insight into the rationale of the GATK Best Practices for variant discovery, as well as a solid understanding of how individual GATK tools work and how to apply them in practice. Novices to the GATK will come out of the workshop knowing enough to identify which questions they can use address using GATK tools, how to get started on designing their experiment and analytical workflow, and how to run the tools on their own computer. Existing GATK users will come out with a deeper understanding of how the GATK works under the hood and how to improve their results further, especially as regards the latest innovations.

Created 2014-12-16 21:48:12 | Updated 2014-12-17 17:06:58 | Tags: vqsr best-practices

The Best Practices recommendations for Variant Quality Score Recalibration have been slightly updated to use the new(ish) StrandOddsRatio (SOR) annotation, which complements FisherStrand (FS) as indicator of strand bias (only available in GATK version 3.3-0 and above).

While we were at it we also reconciled some inconsistencies between the tutorial and the FAQ document. As a reminder, if you ever find differences between parameters given in the VQSR docs, let us know, but FYI that the FAQ is the ultimate source of truth=true. Note also that the command line example given in VariantRecalibrator tool doc tends to be out of date because it can only be updated with the next release (due to a limitation of the tool doc generation system) and, well, we often forget to do it in time -- so it should never be used as a reference for Best Practice parameter values, as indicated in the caveat right underneath it which no one ever reads.

Speaking of caveats, there's no such thing as too much repetition of the fact that whole genomes and exomes have subtle differences that require some tweaks to your command lines. In the case of VQSR, that means dropping Coverage (DP) from your VQSR command lines if you're working with exomes.

Finally, keep in mind that the values we recommend for tranches are really just examples; if there's one setting you should freely experiment with, that's the one. You can specify as many tranche cuts as you want to get really fine resolution.

Created 2014-08-28 13:44:35 | Updated 2014-08-28 13:46:08 | Tags: best-practices workshop topstory

Here is the official announcement for the upcoming workshop in Philadelphia. Registration is not necessary for the lecture sessions, but it is required for the hands-on sessions (see link further below).

We look forward to seeing you there!

The Center for Genetics and Complex Traits (CGACT) and the Institute for Biomedical Informatics (IBI) of the University of Pennsylvania Perelman School of Medicine announce a Workshop on Variants Discovery in Next Generation Sequence Data on September 18 and 19, 2014.

This workshop will focus on the core steps involved in calling variants with the Broad Institute¹s Genome Analysis Toolkit (GATK), using the "Best Practices" developed by the GATK team, and will be presented by Dr. Geraldine Van der Auwera of the Broad Institute and other instructors from the GATK team. Participants will learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of their dataset.

The workshop will take place over two consecutive days (September 18 and 19, 2014). In the morning lecture sessions, attendees will learn the rationale, theory, and real-life applications of GATK Best Practices for variant discovery in high-throughput DNA sequencing data, including recommendations for additional experimental designs and datatypes such as RNAseq. In the afternoon hands-on sessions, attendees will learn to interact with the GATK tools and apply them effectively through interactive exercises and tutorials.

The morning lecture sessions will take place on Thursday, September 18, from 9:00 am to 12:30 pm, and Friday, September 19, from 9:00 am to 11:30 am, in the Dunlop Auditorium of Stemmler Hall, University of Pennsylvania, 3450 Hamilton Walk, Philadelphia, PA 19104. Both morning sessions are open to all participants and registration is not required.

The afternoon hands-on sessions will take place on Thursday, September 18, from 2:00 pm to 5:30 pm, and Friday, September 19, from 1:00 pm to 4:30 pm. The September 18 hands-on session is aimed mainly at beginners (though familiarity with the command line environment is expected). The September 19 hands-on session is aimed at more advanced users who are already familiar with the basic GATK functions. Attendance to the hands-on sessions is limited to 20 participants each day, and precedence will be given to members of the University of Pennsylvania or its affiliated hospitals and research institutes (HUP, CHOP, Monell, Wistar, etc.).

Registration for the hands-on sessions is mandatory and open through Friday, August 29th at http://ibi.upenn.edu/?p=996.

Created 2014-06-11 21:20:08 | Updated | Tags: best-practices bug error rnaseq topstory

We discovered today that we made an error in the documentation article that describes the RNAseq Best Practices workflow. The error is not critical but is likely to cause an increased rate of False Positive calls in your dataset.

The error was made in the description of the "Split & Trim" pre-processing step. We originally wrote that you need to reassign mapping qualities to 60 using the ReassignMappingQuality read filter. However, this causes all MAPQs in the file to be reassigned to 60, whereas what you want to do is reassign MAPQs only for good alignments which STAR identifies with MAPQ 255. This is done with a different read filter, called ReassignOneMappingQuality. The correct command is therefore:

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS


In our hands we see a bump in the rate of FP calls from 4% to 8% when the wrong filter is used. We don't see any significant amount of false negatives (lost true positives) with the bad command, although we do see a few more true positives show up in the results of the bad command. So basically the effect is to excessively increase sensitivity, at the expense of specificity, because poorly mapped reads are taken into account with a "good" mapping quality, where they would normally be discarded.

This effect will be stronger in datasets with lower overall quality, so your results may vary. Let us know if you observe any really dramatic effects, but we don't expect that to happen.

To be clear, we do recommend re-processing your data if you can, but if that is not an option, keep in mind how this affects the rate of false positive discovery in your data.

We apologize for this error (which has now been corrected in the documentation) and for the inconvenience it may cause you.

Created 2014-06-11 20:34:05 | Updated 2014-06-11 20:34:51 | Tags: best-practices workshop topstory brussels belgium

Calling all Belgians! (and immediate neighbors)

In case you didn't hear of this through your local institutions, I'm excited to announce that we are doing a GATK workshop in Belgium in two weeks (June 24-26 to be precise). The workshop, which is open and free to the scientific community, will be held at the Royal Institute of Natural Sciences in Brussels.

This is SUPER EXCITING to me because as a small child I spent many hours drooling in front of the Institute Museum's stunningly beautiful Iguanodons, likely leaving grubby handprints all over the glass cases, to the shame and annoyance of my parents. I also happen to have attended the Lycee Emile Jacqmain which is located in the same park, right next to the Museum (also within a stone's throw of the more recently added European Parliament) so for me this is a real trip into the past. Complete with dinosaurs!

That said, I expect you may find this workshop exciting for very different reasons, such as learning how the GATK can empower your research and hearing about the latest cutting-edge developments that you can expect for version 3.2.

See this website or the attached flyer for practical details (but note that the exact daily program may be slightly different than announced due to the latest changes in GATK) and be sure to register (it's required for admission!) by emailing cvangestel at naturalsciences.be with your name and affiliation.

Please note that the hands-on sessions (to be held on the third day) are already filled to capacity. The tutorial materials will be available on our website in the days following the workshop.

Created 2014-03-06 07:24:03 | Updated | Tags: best-practices rnaseq topstory

We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.

### Best Practices workflow for RNAseq

This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.

Please see the new document here for full details about how to run this workflow in practice.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.

Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!

Created 2013-12-20 22:57:53 | Updated 2014-02-07 17:01:39 | Tags: official best-practices workshop queue presentations videos topstory

The presentation videos for:

• Day 1 (Best Practices for Variant Calling with GATK)
• Day 2 (Building Analysis Pipelines with Queue)

Created 2013-10-22 02:40:27 | Updated 2013-10-22 14:49:48 | Tags: best-practices workshop queue presentations slides 2013

Slides for :

• Day 1 (Best Practices for Variant Calling with GATK)
• Day 2 (Building Analysis Pipelines with Queue)

are available at this link in our DropBox slide archive :

https://www.dropbox.com/sh/ryz7bx4aeqc7mkl/44Q_2jvIyH

Created 2013-09-17 18:18:29 | Updated | Tags: official best-practices workshop queue pipeline

Register now for a spot at the upcoming GATK workshop, which will be held in Cambridge, MA on October 21-22.

This workshop will cover the following topics:

• GATK Best Practices for Variant Detection
• Building Analysis Pipelines with Queue

The workshop is scheduled right before ASHG Boston, so if you're going to be in town for the conference, make sure you come a couple of days early and attend the GATK workshop!

Created 2013-09-13 21:22:16 | Updated 2014-02-07 18:45:22 | Tags: official best-practices topstory

We're very pleased to announce that we have finally finished our big rewrite of the Best Practices documentation. We hope that the new format, which you can find here, will prove more user-friendly, searchable and overall more helpful than the previous version.

We have a few more improvements in mind (e.g. a clickable image map of the workflow) and there may be a few bugs here and there to iron out. So please feel free to comment on this announcement and give us feedback, whether flattering or critical, so we can improve it to help you as much as possible.

Created 2013-07-09 15:54:20 | Updated 2013-08-15 16:14:59 | Tags: official best-practices workshop presentations

The slides from the July 2013 Best Practices workshop are available here:

https://www.dropbox.com/sh/jhlk451jntywfdy/vJqbKTZZd_

The videos will be put online once they are processed.

Created 2013-01-30 04:43:29 | Updated | Tags: official best-practices workshop videos

These are videos of the presentations given at the GATK workshop on 4-5 Dec 2012 on "Best Practices for variant calling with the GATK".

Created 2015-11-19 16:02:02 | Updated | Tags: variantfiltrationwalker best-practices sor

Hi,

I'm trying to figure out a good threshold for the SOR for hard-filtering a SNP data set. The best practices documentation suggests using a cutoff threshold of 4, but in reading a bit more about odds-ratios, it seems that taking the natural logarithm of the odds-ratio may be more ideal. Is the SOR reported in the vcf file actually the natural logarithm of the odds ratio? This information is not provided in the page on statistical tests or the documentation for the SOR.

Thanks, Jen

Created 2015-09-25 23:52:49 | Updated | Tags: best-practices bam gatk

Hi,

So far, I have only tried running Best Practices on bam files where each bam file corresponds to a certain chromosome. How do I go about partitioning the bam files further and running Best Practices on parts of bam files? I want to make my pipeline even more efficient.

Created 2015-09-17 21:56:03 | Updated | Tags: bqsr best-practices bam gatk

Hi,

I recently re-generated my realigned bam files (realigned.bam) from the GATK Best Practices, but I encountered an error when I validated the bam files using the validation tool (ValidateSamFile.jar) in Picard:

ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate alignment does not match alignment start of mate ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate negative strand flag does not match read negative strand flag of mate

etc.

What exactly is going on? Where in the pipeline could this have occurred? I do not have the intermediary files so I cannot diagnose this accurately. Before I started the GATK Best Practices, my bam files had no problems; when I validated them I did not get any error messages. Does this mean that I have to regenerate the files again and start from the very beginning?

I tried to use the Picard tool FixMateInformation.jar, but then I got an error message regarding the index files. I tried deleting the bam index for one bam file and then recreating it to see if there was a problem with the indexes, but doing so did not resolve the issue. So it seems that something went wrong with the bam files at some earlier step.

Has this error been encountered before? Not sure how to proceed next, except restart everything.

Best, Alva

Created 2015-09-03 08:51:02 | Updated | Tags: indelrealigner malformedreadfilter best-practices

I recently came across(blog post) a scenario in which a subset of my reads had triggered the MalformedReadFilter during indel realignment. I'm curious to know about the definition of "malformed", as whilst invalid, I wouldn't have described incorrect mate names and alignment positions for unmapped reads as "malformed" and capable of causing job crashes...

Not a bug - I'm just curious to know of any implementation details or reasoning about the MalformedReadFilter.

Created 2015-08-31 11:58:59 | Updated | Tags: baserecalibrator best-practices ploidy pooling

Dear GATK Team,

baserecalibrator is in the newest Best Practices recommendations. Does this tool support or would you recommend it's usage for samples with ploidy higher then 2 (e.g. pooled samples)?

Thanks! Bernt

Created 2015-07-22 11:46:00 | Updated | Tags: combinevariants haplotypecaller best-practices vcf gatk genotypegvcfs combinegvcfs gvcf

I was trying to do combine sets of vcf files for all my samples so that I have one single vcf output using this command option below java -d64 -Xmx48g -jar ${GATK}/GenomeAnalysisTK.jar \ -R${REF} \ -T GenotypeGVCFs \ --variant A.g.vcf \ --variant B.g.vcf \ --variant C.g.vcf \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -o genotype.vcf

but I got this error message “The following invalid GT allele index was encountered in the file: END=21994810”. I have tried to locate where the problem could be coming from but I do not understand this. Could you please advise me.

Created 2015-07-22 05:07:12 | Updated | Tags: bqsr baserecalibrator best-practices gatk

Hi,

I am running GATK Best Practices again and just wanted to assure myself that I am proceeding correctly. I remember reading somewhere on here that BQSR performs best when given the whole genome to work with (similar to VQSR in that respect). Better to just run using the whole genome rather than giving it single chromosomes. I couldn't really find this comment/recommendation when I tried to look for it again, so wanted to see what the final word was on this. How should one proceed?

Thanks, Alva

Created 2015-06-05 09:10:36 | Updated | Tags: bqsr best-practices

We are using a licensed version of GATK here. The version is GATK version -2014.3-3.2.2-7-g f9cba99.

While using the tool for analysis of exome data I had few questions.

1: Are there different parameters used in GATK at different stages for whole and targeted exom sequencing data ?

2: What are the guidelines being followed for targeted exome seq analysis if I use BQSR ?

I was reading the page ("https://www.broadinstitute.org/gatk/guide/tagged?tag=exome") and it was mentioned as " The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets."

Request you to kindly clear the doubts.

Created 2015-06-01 10:54:07 | Updated | Tags: best-practices picard mapping

I am trying to follow the best practices for mapping my (Paired-end Illumina HiSeq) reads to the reference, by following this presentation:

From what I understand, I should use MergeBamAlignment to clean up the output from bwa, and then use this cleaned up output for the rest of the analysis. However, when I run ValidateSamFile after running MergeBamAlignment I get a lot of errors, and running CleanSam on the file does not resolve any of them. What am I doing wrong? I've tried searching the web for more details about MergeBamAlignment but I haven't been able to find much. Please let me know if you require any additional information.

How I ran MergeBamAlignment picard-tools MergeBamAlignment \ UNMAPPED_BAM=unmapped_reads.sam \ ALIGNED_BAM=aligned_reads.sam \ OUTPUT=aligned_reads.merged.bam \ REFERENCE_SEQUENCE=/path/to/reference.fasta \ PAIRED_RUN=true # Why is this needed?

Error report from ValidateSamFile ## HISTOGRAM java.lang.String Error Type Count ERROR:INVALID_CIGAR 5261 ERROR:MATES_ARE_SAME_END 30 ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 30

Created 2015-05-23 21:49:38 | Updated | Tags: bqsr haplotypecaller best-practices dnaseq

Hi, I have ~150 WGS all sequenced in batches on Illumina HiSeq over a number of years, on a number of machines/flowcells.

I want to perform BQSR on my BAM files before calling variants. However for my organism I do not have access to a resource like dbSNP for true variants. So I am following the protocol by doing a first round of variant calling, subsetting to the SNPs with highest confidence and using these as the known variants for BQSR.

My question is, should I carry this out on samples individually, i.e. one BAM per sample, on which I use HaplotypeCaller for initial variant calling, then subset to best SNPs, use these for BaseRecalibrator and apply the calibration to the original sample before carrying on with single sample variant finding and then joint genotyping as per best practices....

OR

As I have multiple samples from the same flowcells and lanes, would I gain more information by combining those samples into a multisample BAM first, before carrying out BQSR? I'm a little unsure of how the covariates used by BQSR and actually calculated and whether I can increase the accuracy of my recalibration in this way? Each sample has between 500M and 5 billion nucleotides sequenced.

Many thanks.

Created 2015-05-21 19:53:49 | Updated 2015-05-21 19:54:06 | Tags: bqsr best-practices

Hi,

I am new to NGS/GATK & have a paired-end targeted (targeted to 1 region on 1 autosome) illumina sequencing project on roughly 470 samples. I was wondering whether Indel Realignment & BQSR were appropriate for my setting, or what considerations should be taken into account when judging whether these steps are appropriate?

I'm curious because I noticed this text here: Small targeted experiments The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

I was unsure whether my project qualified as a 'small dataset' where BQSR should not be run.

Created 2015-05-09 14:48:21 | Updated | Tags: best-practices bam third-party-tools picard

Following GATK's best practices, I have individually realigned/recalibrated my sample-lane bams and merged them by sample:

sample1_lane1.dedup.realn.recal.bam --> sample1_lane2.dedup.realn.recal.bam --> sample1.merged.bam sample2_lane1.dedup.realn.recal.bam --> sample2_lane2.dedup.realn.recal.bam --> sample2.merged.bam ...

I am ready to dedup and realign my sample merged bams, however I am uncertain on the best approach. Is the consensus to convert back to fastq via Picard (MarkDuplicates, SortSam, and SamToFastq) and then run bwa mem? Or is it more expedient/accurate to realign the sample-merged bam using bwa aln followed by bwa sampe?

Created 2015-04-20 19:51:25 | Updated 2015-04-20 19:52:54 | Tags: best-practices combinegvcfs gatk3-3 wes

Dear @Sheila & @Geraldine_VdAuwera,

I hope you had a good weekend! My question, to start this week (hopefully the only thread!), is how to use CombineGVCFs. As you may remember I currently have ~100 WES samples for joint genotyping, and I have been testing GenotypeGVCFs directly and VQSR. However, in the near future I am going to have several hundred more samples in my hand, and hence I understand that making batches using CombineGVCFs is the way forward, as this is the only way to allow for subsequent merging of new gVCFs once sample size passes 200.

So I have been trying to run it this afternoon on a node that has 16 CPUs and 128Gb of RAM. Initial attempt was with -nt 14, but this gave the following message.

ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis CombineGVCFs currently does not support parallel execution with nt. Please run your analysis without the nt option.


So, therefore I started it without any threading, but then it appears it is going to take ~75hrs by it's own estimate after running for almost 1hr.

Is this really how long it should take? I have been trying to decipher the various possible issues looking at old threads on the forum, but I am not sure if there is any way I should be able to speed this up (short of splitting out the chromosomes and running them all individually)?

Also, I have seen mention somewhere that it may be because the files have been zipped in an incompatible manner - however, the input gVCFs are exactly the same that were successfully passed through GenotypeGVCFs, so this seems unlikely. I saw on one old thread that CombineGVCFs was super-slow, but I don't know if this is still the situation? For reference, my command was as shown below:

java -Xmx1g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs -V vcfs.chr${numchr}.new.list -o final_chr${numchr}.vcf -L ${numchr} # CatVariants (generate 1 vcf file with all inds and all chrs) java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted # VQSR java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches # Used excludeFiltered here java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches # Used excludeFiltered here java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf # Genotype Refinement Workflow java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS) The entire example is below: 1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0 Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high? Thank you very much in advance, Alva Created 2015-03-11 20:38:44 | Updated | Tags: vqsr best-practices variantfiltration Hi all - I'm stumped and need your help. I'm following the GATK best practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 others samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but are not listed in the VCF. I discovered that the variant is being filtered out..reason being VQSRTranchesSNP99.00to99.90. The genotype is homozygous variant, DP is 243, Qual is 524742.54 and its known in dbSNP. I suspect this is happening to other variants. How do I adjust VQSR or how tranches are used and variants get placed in? I supposed I need to fine tune my parameters...but I would think something as obvious as this variant would pass Filtering. Created 2015-01-27 21:59:14 | Updated 2015-01-27 21:59:46 | Tags: best-practices snp gatk combinegvcfs gvcf Hi, I am combining gcvf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing. For example, at position 16050036, this is what comes up in the final gvcf file: 22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0 But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site: 22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0 The command I'm using to generate these files is: java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L${numchr} where numchr is a variable previously defined (indicating the chromosome number).

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?

Thanks, Alva

Created 2015-01-16 06:37:45 | Updated | Tags: haplotypecaller best-practices next-generation-sequencing

Excuse me:

After calling the genotype using Haplotyper Caller in gatk, i manually check the reads covering the variant sites. But i found one exception: The genotype of one sample was 0/0，wild type,but the reads for this site were .,,,..,,c.,c.c,.cccc. I think almost all the reads should be . or ,. Is this case normal?

Created 2014-12-05 21:10:20 | Updated | Tags: bqsr best-practices gatk

Hi,

I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former document as my guide since it is more recent, but I am curious as to why these two are different. The first step is the same for the 2 docs, but then for the second step, the output is a recal.bam file in the brussels document instead of a post_recal.table file in the (howto) document. Then, I am confused about the (howto) document because it seems that the 4th step is exactly the same as the second step, except we generate a recal.bam file. Is the brussels document presenting steps that are more efficient? (since there is only 1 use of -BQSR to generate the recalibrated bam file we need, as opposed to 2 uses of the option)

Thanks, Alva

Created 2014-12-05 15:27:21 | Updated 2014-12-05 18:08:01 | Tags: haplotypecaller best-practices vcf gatk gvcf

Hi,

I used HaplotypeCaller in GVCF mode to generate a single sample GVCF, but when I checked my vcf file I see that the reference allele is not showing up:

22  1   .   N   <NON_REF>   .   .   END=16050022    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
22 16050023    .   C   <NON_REF>   .   .   END=16050023    GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,37
22  16050024    .   A   <NON_REF>   .   .   END=16050026    GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,73
22  16050027    .   A   <NON_REF>   .   .   END=16050035    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,110
22  16050037    .   G   <NON_REF>   .   .   END=16050037    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,109
22  16050038    .   A   <NON_REF>   .   .   END=16050039    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,153


I am not sure where to start troubleshooting for this, since all the steps prior to using HaplotypeCaller did not generate any obvious errors.

The basic command that I used was: java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Have you encountered this problem before? Where should I start troubleshooting?

Thanks very much in advance, Alva

Created 2014-11-25 14:27:16 | Updated | Tags: best-practices gatk bwa-and-gatk gatk-unified-genotyper

Chaps,

I am a newbie to GATK. I have started the best practices pipline but stuck in bwa. From where can I get the reference.fa file? The GATK bundle has .fasta .bam .fai but no .fa. Anyone please!

Created 2014-11-20 11:41:42 | Updated | Tags: best-practices gatk

Where can I get the fasta and ref files for experimenting on GATK?

Created 2014-09-30 19:53:52 | Updated | Tags: best-practices vcf

Hi, I am following the Best Practices for DNAseq analysis and have 2 quick questions: 1. I wanted to confirm if the VCF files produced after the VariantRecalibrator and ApplyRecalibration steps for SNP and for Indels are completely independent of each other. In other words, the VCF file produced after these two steps for SNPs (mode SNP) is just for SNPs and for Indels (mode INDEL) is just for Indels. 2. What is the source of the ALT alleles in the VCF file - is it the various annotation files that were used during the analysis? Thanks, - Pankaj

Created 2014-08-14 19:37:20 | Updated | Tags: best-practices rnaseq

Hi,

Thank you for providing guidelines on RNA-Seq variant discovery. For our data, we are currently playing with multiple mapping methods and have noticed that 2-step alignments work "better" than 1-step alignments. By 2-step alignments, I mean using STAR as step1 and then take the unmapped from this and use another aligner (like Bowtie2) for alignment. If I use such a methodology, will there be an issue in variant calling when during splitting cigar strings I ask it convert the 255 MAPQ to another value (like 60 in the best practices example), since bowtie2 gives different MAPQ scores. Sorry if this seems like a stupid question, but I am just a little curious how such a thing might affect the variant calls. Any insights/comment on this will be greatly appreciated.

Thanks!

Created 2014-07-31 14:05:07 | Updated | Tags: haplotypecaller best-practices

Hi,

I am wondering how HaplotypeCaller in GATK 3.1 handles multimappers? I thought I read that they are passed over for variant calling but stay in the realigned, recalibrated BAM for 'completions sake', like marking duplicates not removing them but cannot find supporting info on the website or from farther afield,

Presumably it is the NotPrimaryAlignmentFilter but there is no info on that posted yet. I know I can output a BAM from HC with haplotype info in there but can I just get reads used in variant calls? Or should I trim the BAM myself to retain reads I want used? I do this for mark duplicates (removed) but for multimappers I would like to know how you define so I can do the same. The reason is for coverage estimates, using bamtobed or such means I take all realigned, recalibrated which is many more lines including multimappers which skews my results.

Thanks,

Bruce.

Created 2014-06-10 20:57:36 | Updated | Tags: variantannotator best-practices analyst vcf developer performance walkers java gatk

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

Created 2013-12-31 02:11:46 | Updated 2013-12-31 02:12:36 | Tags: vqsr best-practices non-human

Hello there! Thanks as always for the lovely tools, I continue to live in them.

• Been wondering how best to interpret my VQSLOD plots/tranches and subsequent VQSLOD scores. Attached are those plots, and a histogram of my VQSLOD scores as they are found across my replicate samples.

Methods Thus Far

We have HiSeq reads of "mutant" and wt fish, three replicates of each. The sequences were captured by size selected digest, so some have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (TiTv independent).

As per my interpretation of the best practices, I did an initial calling of the variants (HaplotypeCaller) and filtered them very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again with that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors of each from the suggested human ones. The rest of my pipeline follows the best practices fairly closely, GATK version was 2.7-2, and my mapping was with BWA MEM.

My semi-educated guess..

The spike in VQSLOD I see for samples found across all six replicates are simply the rediscovery of those in my truth set, and those with amazing coverage, which is probably fine/good. The part that worries me are the plots and tranches. The plots don't ever really show a section where the "known" set clusters with one set of obviously good variants but not with another. Is that OK or does that and my inflated VQSLOD values ring of poor practice?

Created 2013-09-05 09:53:11 | Updated 2013-09-05 09:54:32 | Tags: unifiedgenotyper best-practices

We have 50 human whole genome data (12x) from the same population and we have already used GATK to call variants and followed the best practice for population level calling. We have then sequenced 20 more individual (12x) from the same population. For the population level variant calling do we have to put all the BAM files (50+20) together and then call the variants(in UnifiedGenotyper) ? Or we can somehow use the VCF file which was created from the previous 50 individual and use them as training set and call the variant for the new 20 individuals?

Created 2013-08-19 08:54:23 | Updated | Tags: best-practices webinar callvariants

I am not sure where I should ask this question, but the GATK forum seemed the most appropriate place. I am currently viewing the webinars from the BroadE Workshop 2013 July 9-10 and the "Call Variants" streaming doesn't seem to work:

Would you have a work around to view this file? or is the file inaccessible for some reason?

Created 2013-02-06 05:56:12 | Updated 2013-02-06 16:21:59 | Tags: best-practices callset

Hi, I am estimating SNP number for a genome of a speceis.

1. After mapping the illumina reads to a reference seqence (of a related species), I used picard to remove duplicates and used samtools (mpileup) to convert the bam file to a fastq file.
2. I also use GATK to do local reglimant and recalication from the mappnig bam file. Then I used samtools (mpileup) to convert the bam files output by GATK to a fastq file.

I found that the number of SNP in the fastq going through GATK is 10 times more than the first fastq. Interestingly, if I use picard to do duplicats-removomg again to the GATK bam and used samtools to convert the bam to fastq file. The SNP jumps back to 10 time fewer.

What can be the reason that the SNP number can be 10 time different between the two methods? Actually, I expect the GATK output file has fewer SNP given the effect of recaliraiton or relingement. But the result is opposite.

Chih-Ming

Created 2012-11-16 16:17:13 | Updated 2012-11-16 21:54:20 | Tags: best-practices

Hi:

for the best practice V4, for the step: 2. Raw reads to analysis-ready reads at phase I, I only have lane-level sample data (each sample in one lane, no sample in multi-lanes), so I only did the Fast: lane-level realignment (at known sites only) and lane-level recalibration, which is

for each lane.bam
dedup.bam <- MarkDuplicate(lane.bam)
realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
recal.bam <- recal(realigned.bam)


Since I do not have multi-lane samples, I do not have to run Fast + per-sample processing, Better: per-sample realignment with known indels and recalibration, or Best: multi-sample realignment with known sites and recalibration, right? It seems that all of these schemes, Better, Best etc, are only for situation when some samples run at multiple lanes, is my understanding correct?

at the paragraph for Fast + per-sample processing, it mentioned: cross-lane realignment, not sure what it means exactly. if every sample of mine are in a single lane, it does not need, right?

Thanks

Mike

Created 2012-11-13 10:21:02 | Updated 2013-01-07 20:04:51 | Tags: best-practices exome

Hi all, I have a 4 exome experiment (high coverage). According to best practices I can't apply VQSR, so I must filter on SNP properties. One thing is not clear: should all the option apply? In other words should I select using boolean "and" (i.e. QD  < 2.0 & MQ < 40.0 & FS > 60.0 & HaplotypeScore > 13.0 & MQRankSum < -12.5 & ReadPosRankSum < -8.0 for SNP) or "or" (i.e. QD  < 2.0 | MQ < 40.0 | FS > 60.0 | HaplotypeScore > 13.0 | MQRankSum < -12.5 | ReadPosRankSum < -8.0 for SNP)?

Created 2012-11-12 22:55:36 | Updated 2013-01-07 20:06:43 | Tags: indelrealigner bqsr best-practices baserecalibration

Hello,

I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:

I have been running GATK v1.6.2 on several samples. It seems the way I had initially had run GATK for indel-realignment and quality re-calibration steps are reversed. For example, in order of processing, I ran:

• MarkDuplicates
• Count Covariates
• Table Recalibration
• Realigner Target Creator
• Indel Realigner

What are the consequences and effect to SNP and INDEL calling if the GATK steps are ran as above?. I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at BaseQuality scores when making calls.

Any help on this would be appreciated.

Mike

Created 2012-11-09 20:38:55 | Updated 2013-01-07 20:06:07 | Tags: best-practices solid

Hi, I am working on SOLiD 5500xl data and used SHRiMP2.2.3 for performing mapping. The library type is paired-end. I have read some discussions regarding SOLiD problem but I still have some doubts regarding some steps in best practices

1. Local-Realignment at In-dels: Since local realignments take place in base space instead of color-space I doubt the accuracy of the alignment
2. Mark/Remove Duplicates: Reads just lying in the same position (end to end) may not necessarily be duplicates. Some of these reads may have putative variants, which otherwise may be filtered out.
3. Base quality score recalibration: I am not sure whether this is applicable for 5500xl as well, since quality values have slightly changed on 5500 from previous SOLiD versions as far as I know.

So after mapping, I simply used GATK UnifiedGenotyper to call SNPs and InDels under default values. I end up getting around 40 million variants. Is there any way I can get a more refined variant calling? Do you still consider me applying the above pre-processing steps or do you recommend me applying some variant filteratiion on the called variants? If yes for the previous, then could you explain how my above concerns are taken care of? I was trying to look at some general recommended filter values on INFO fields in VCF format such as BQ, MQ, MQ0, DP, SB etc. Do you recommend some generally used values of these fields on which I can filter and hence refine my variant data?

I may have posted a subset of the above question, which I am not sure was posted successfully since at that time I just created an account. If you have already answered this question then I apologize for that. Could you then provide me the link where you answered it?

Created 2012-10-27 07:32:18 | Updated 2012-10-29 21:37:57 | Tags: unifiedgenotyper variantrecalibrator haplotypecaller best-practices multi-sample

Hello,

I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.

The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?

Is there any difference or reason to choose one of the following two workflows over the other?

1. Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller and then with my multi-sample VCF perform the variant recalibration procedure and then split the individual call sets out of the multi-sample vcf with SelectVariants?

2. Take 30+ individual vcf files, merge them together, and then perform variant recalibration on the merged vcf and then split the individual call sets out of the multi-sample vcf with SelectVariants?

3. Or some third option I'm missing

Any help is appreciated.

Thanks

Created 2012-10-23 06:21:29 | Updated 2012-10-25 15:27:41 | Tags: bqsr best-practices known special-cases

Hi all!

I'm currently working on high-coverage non-human data (mammals).

After mapping via BWA, soritng and merging, my pipeline goes like this:

1. Remove duplicates via MarkDuplicates.
2. RealignerTargetCreator
3. IndelRealigner using the --consensusDeterminationModel USE_SW flag
4. Fix Mates

I currently want to begin the recalibration step before doing the actual variant calls via UnifiedGenotyper.

Since I'm working on non-human data, there is no thorough database I can trust as an input vcf file for the recalibration step.

What is your recommendation for this for non-human data?

Thank you very much!

Sagi

Created 2012-10-18 11:59:05 | Updated 2012-10-19 01:11:34 | Tags: best-practices genotype threshold

Hello,

I'm sorry if I'm being dense (I'm new to all this and it is making me feel very dense indeed!), but having read the section on 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear as to whether you mean I should be looking for a QUAL score of at least 30 in a deep coverage data set and should filter out any suggested SNPs that don't meet this, or a GQ score of 30 in each individual sample genotyped at the SNP in question and I only need to filter out individual samples that don't meet this threshold.

I have pasted the bit of text I read below, just to make it clear to which bit I am referring.

Many thanks!

A common question is the confidence score threshold to use for variant detection. We recommend:

Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.

Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.

Created 2012-10-16 14:35:13 | Updated 2012-10-16 21:22:21 | Tags: best-practices iontorrent

Hi,

I wonder how well GATK works with Ion Torrent data. Is there any recommended practice to handle Ion Torrent data, especially SNP and indel calling, with GATK?

Thanks, XZ

Created 2012-08-29 10:02:02 | Updated 2013-01-07 20:42:47 | Tags: best-practices readgroup realignment baserecalibration

Hello,

I am new at using GATK (v 2.1-3). I do exome sequencing by sample using the following steps: Alignment with BWA (0.6.2) GATK :Local realignment around INDELs PICARD (1.67): FixMateInformation GATK: Recalibration (BaseRecalibrator + PrintReads -BQSR) Samtools for calling variants

## Samtools seems to run properly but no file (*.vcf and *.bcf) are created and no error message is prompted :

cd Sample_09 + samtools mpileup -BE -ug -q 20 -Q 20 -D -f human_g1k_v37.fasta realigned_fixed_recal.bam -C50 + bcftools view -bvcg - [mpileup] 1 samples in 1 input files Set max per-file depth to 8000 [bcfview] 100000 sites processed. [afs] 0:89274.054 1:6411.053 2:4314.893 [bcfview] 200000 sites processed. [afs] 0:89100.642 1:6125.883 2:4773.474 [bcfview] 300000 sites processed. [afs] 0:87374.996 1:7439.238 2:5185.766 [bcfview] 400000 sites processed. [afs] 0:87890.186 1:7087.628 2:5022.185 [bcfview] 500000 sites processed. [afs] 0:85322.061 1:8454.843 2:6223.096 [bcfview] 600000 sites processed. [afs] 0:85864.368 1:8410.777 2:5724.854 [bcfview] 700000 sites processed. [afs] 0:88813.814 1:6828.001 2:4358.185 [bcfview] 800000 sites processed. [afs] 0:89070.318 1:6302.924 2:4626.758 [bcfview] 900000 sites processed. [afs] 0:88364.380 1:6796.962 2:4838.658 [bcfview] 1000000 sites processed. [afs] 0:86892.531 1:7268.088 2:5839.381 [bcfview] 1100000 sites processed. [afs] 0:87184.845 1:7153.073 2:5662.081 [bcfview] 1200000 sites processed. [afs] 0:86762.756 1:7241.236 2:5996.008 [bcfview] 1300000 sites processed. [afs] 0:89346.143 1:6159.989 2:4493.868 [bcfview] 1400000 sites processed. [afs] 0:88658.355 1:7160.555 2:4181.089 [bcfview] 1500000 sites processed. [afs] 0:85985.305 1:8308.039 2:5706.656 [bcfview] 1600000 sites processed. [afs] 0:87346.636 1:7708.883 2:4944.480 [afs] 0:63097.202 1:3950.127 2:3572.670 + bcftools view .bcf

## + cd ..

I have seen that some groups use after realignment Picard:AddOrReplaceReadGroups and I wonder if I should use before calling the variant with samtools.