# Tagged with #best-practices 14 documentation articles | 16 announcements | 48 forum discussions

Created 2016-01-08 21:06:41 | Updated 2016-02-08 17:17:59 | Tags: best-practices bam markduplicates duplicates

Here we discuss two tools, MarkDuplicates and MarkDuplicatesWithMateCigar, that flag duplicates. We provide example data and example commands for you to follow along the tutorial (section 1) and include tips in estimating library complexity for PCR-free samples and patterned flow cell technologies. In section 2, we point out special memory considerations for these tools. In section 3, we highlight the similarities and differences between the two tools. Finally, we get into some details that may be of interest to some that includes comments on the metrics file (section 4).

To mark duplicates in RNA-Seq data, use MarkDuplicates. Reasons are explained in section 2 and section 3. And if you are considering using MarkDuplicatesWithMateCigar for your DNA data, be sure insert lengths are short and you have a low percentage of split or multi-mapping records.

Obviously, expect more duplicates for samples prepared with PCR than for PCR-free preparations. Duplicates arise from various sources, including within the sequencing run. As such, even PCR-free data can give rise to duplicates, albeit at low rates, as illustrated here with our example data.

#### Prerequisites

• Installed Picard tools
• Coordinate-sorted and indexed BAM alignment data. Secondary/supplementary alignments are flagged appropriately (256 and 2048 flags) and additionally with the mate unmapped (8) flag. See the MergeBamAlignment section (3C) of Tutorial#6483 for a description of how MergeBamAlignment ensures such flagging.
• For MarkDuplicatesWithMateCigar, pre-computed Mate CIGAR (MC) tags. Data produced according to Tutorial#6483 will have the MC tags added by MergeBamAlignment. Alternatively, see tools RevertOriginalBaseQualitiesAndAddMateCigar and FixMateInformation.
• Appropriately assigned Read Group (RG) information. Read Group library (RGLB) information is factored for molecular duplicate detection. Optical duplicates are limited to those from the same RGID.

• Use the advanced tutorial bundle's human_g1k_v37_decoy.fasta as reference. This same reference is available to load in IGV.
• tutorial_6747.tar.gz data contain human paired 2x150 whole genome sequence reads originally aligning at ~30x depth of coverage. The sample is a PCR-free preparation of the NA12878 individual run on the HiSeq X platform. This machine type, along with HiSeq 4000, has the newer patterned flow cell that differs from the typical non-patterned flow cell. I took the reads aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) and sanitized and realigned the reads (BWA-MEM -M) to the entire genome according to the workflow presented in Tutorial#6483 to produce snippet.bam. The data has (i) no supplementary records; (ii) secondary records flagged with the 256 flag and the mate-unmapped (8) flag; and (iii) unmapped records (4 flag) with mapped mates (mates have 8 flag), zero MAPQ (column 5) and asterisks for CIGAR (column 6). The notation allows read pairs where one mate maps and the other does not to sort and remain together when we apply genomic intervals such as in the generation of the snippet.

## 1. Commands for MarkDuplicates and MarkDuplicatesWithMateCigar

The following commands take a coordinate-sorted and indexed BAM and return (i) a BAM with the same records in coordinate order and with duplicates marked by the 1024 flag, (ii) a duplication metrics file, and (iii) an optional matching BAI index.

For a given file with all MC (mate CIGAR) tags accounted for:

• and where all mates are accounted for, each tool--MarkDuplicates and MarkDuplicatesWithMateCigar--examines the same duplicate sets but prioritize which inserts get marked duplicate differently. This situation is represented by our snippet example data.
• but containing missing mates records, MarkDuplicates ignores the records, while MarkDuplicatesWithMateCigar still considers them for duplicate marking using the MC tag for mate information. Again, the duplicate scoring methods differ for each tool.

Use the following commands to flag duplicates for 6747_snippet.bam. These commands produce qualitatively different data.

Score duplicate sets based on the sum of base qualities using MarkDuplicates:

java -Xmx32G -jar picard.jar MarkDuplicates \
INPUT=6747_snippet.bam \ #specify multiple times to merge
OUTPUT=6747_snippet_markduplicates.bam \
METRICS_FILE=6747_snippet_markduplicates_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \ #changed from default of 100
CREATE_INDEX=true \ #optional
TMP_DIR=/tmp

Score duplicate sets based on total mapped reference length using MarkDuplicatesWithMateCigar:

java -Xmx32G -jar picard.jar MarkDuplicatesWithMateCigar \
INPUT=6747_snippet.bam \ #specify multiple times to merge
OUTPUT=6747_snippet_markduplicateswithmatecigar.bam \
METRICS_FILE=6747_snippet_markduplicateswithmatecigar_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \ #changed from default of 100
CREATE_INDEX=true \ #optional
TMP_DIR=/tmp

• Each tool has a distinct default DUPLICATE_SCORING_STRATEGY. For MarkDuplicatesWithMateCigar it is TOTAL_MAPPED_REFERENCE_LENGTH and this is the only scoring strategy available. For MarkDuplicates you can switch the DUPLICATE_SCORING_STRATEGY between the default SUM_OF_BASE_QUALITIES and TOTAL_MAPPED_REFERENCE_LENGTH. Both scoring strategies use information pertaining to both mates in a pair, but in the case of MarkDuplicatesWithMateCigar the information for the mate comes from the read's MC tag and not from the actual mate.
• To merge multiple files into a single output, e.g. when aggregating a sample from across lanes, specify the INPUT parameter for each file. The tools merge the read records from the multiple files into the single output file. The tools marks duplicates for the entire library (RGLB) and accounts for optical duplicates per RGID. INPUT files must be coordinate sorted and indexed.
• The Broad's production workflow increases OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500, to better estimate library complexity. The default setting for this parameter is 100. Changing this parameter does not alter duplicate marking. It only changes the count for optical duplicates and the library complexity estimate in the metrics file in that whatever is counted as an optical duplicate does not factor towards library complexity. The increase has to do with the fact that our example data was sequenced in a patterned flow cell of a HiSeq X machine. Both HiSeq X and HiSeq 4000 technologies decrease pixel unit area by 10-fold and so the equivalent pixel distance in non-patterned flow cells is 250. You may ask why are we still counting optical duplicates for patterned flow cells that by design should have no optical duplicates. We are hijacking this feature of the tools to account for other types of duplicates arising from the sequencer. Sequencer duplicates are not limited to optical duplicates and should be differentiated from PCR duplicates for more accurate library complexity estimates.
• By default the tools flag duplicates and retain them in the output file. To remove the duplicate records from the resulting file, set the REMOVE_DUPLICATES parameter to true. However, given you can set GATK tools to include duplicates in analyses by adding -drf DuplicateRead to commands, a better option for value-added storage efficiency is to retain the resulting marked file over the input file.
• To optionally create a .bai index, add and set the CREATE_INDEX parameter to true.

For snippet, the duplication metrics are identical whether marked by MarkDuplicates or MarkDuplicatesWithMateCigar. We have 13.4008% duplication, with 255 unpaired read duplicates and 18,254 paired read duplicates. However, as the screenshot at the top of this page illustrates, and as section 4 explains, the data qualitatively differ.

## 2. Slow or out of memory error? Special memory considerations for duplicate marking tools

The seemingly simple task of marking duplicates is one of the most memory hungry processes, especially for paired end reads. Both tools are compute-intensive and require upping memory compared to other processes.

Because of the single-pass nature of MarkDuplicatesWithMateCigar, for a given file its memory requirements can be greater than for MarkDuplicates. What this means is that MarkDuplicatesWithMateCigar streams the duplicate marking routine in a manner that allows for piping. Due to these memory constraints for MarkDuplicatesWithMateCigar, we recommend MarkDuplicates for alignments that have large reference skips, e.g. spliced RNA alignments.

For large files, (1) use the Java -Xmx setting and (2) set the environmental variable TMP_DIR for a temporary directory. These options allow the tool to run without slowing down as well as run without causing an out of memory error. For the purposes of this tutorial, commands are given as if the example data is a large file, which we know it is not.

    java -Xmx32G -jar picard.jar MarkDuplicates \
... \
TMP_DIR=/tmp 

These options can be omitted for small files such as the example data and the equivalent command is as follows.

    java -jar picard.jar MarkDuplicates ...   

### Set the java maxheapsize, specified by the -Xmx#G option, to the maximum your system allows.

The high memory cost, especially for MarkDuplicatesWithMateCigar, is in part because the tool systematically traverses genomic coordinate intervals for inserts in question, and for every read it marks as a duplicate it must keep track of the mate, which may or may not map nearby, so that reads are marked as pairs with each record emitted in its coordinate turn. In the meanwhile, this information is held in memory, which is the first choice for faster processing, until the memory limit is reached, at which point memory spills to disk. We set this limit high to minimize instances of memory spilling to disk.

In the example command, the -Xmx32G Java option caps the maximum heap size, or memory usage, to 32 gigabytes, which is the limit on the server I use. This is in contrast to the 8G setting I use for other processes on the same sample data--a 75G BAM file. To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize.

### Set an additional temporary directory with the TMP_DIR parameter for memory spillage.

When the tool hits the memory limit, memory spills to disk. This causes data to traverse in and out of the processor's I/O device, slowing the process down. Disk is a location you specify with the TMP_DIR parameter. If you work on a server separate from where you read and write files to, setting TMP_DIR to the server's local temporary directory (typically /tmp) can reduce processing time compared to setting it to the storage disk. This is because the tool then additionally avoids traversing the network file system when spilling memory. Be sure the TMP_DIR location you specify provides enough storage space. Use df -h to see how much is available.

## 3. Conceptual overview of duplicate flagging

The aim of duplicate marking is to flag all but one of a duplicate set as duplicates and to use duplicate metrics to estimate library complexity. Duplicates have a higher probability of being non-independent measurements from the exact same template DNA. Duplicate inserts are marked by the 0x400 bit (1024 flag) in the second column of a SAM record, for each mate of a pair. This allows downstream GATK tools to exclude duplicates from analyses (most do this by default). Certain duplicates, i.e. PCR and sequencer duplicates, violate assumptions of variant calling and also potentially amplify errors. Removing these, even at the cost of removing serendipitous biological duplicates, allows us to be conservative in calculating the confidence of variants.

GATK tools allow you to disable the duplicate read filter with -drf DuplicateRead so you can include duplicates in analyses.

For a whole genome DNA sample, duplicates arise from three sources: (i) in DNA shearing from distinct molecular templates identical in insert mapping, (ii) from PCR amplification of a template (PCR duplicates), and (iii) from sequencing, e.g. optical duplicates. The tools cannot distinguish between these types of duplicates with the exception of optical duplicates. In estimating library complexity, the latter two types of duplicates are undesirable and should each factor differently.

When should we not care about duplicates? Given duplication metrics, we can make some judgement calls on the quality of our sample preparation and sequencer run. Of course, we may not expect a complex library if our samples are targeted amplicons. Also, we may expect minimal duplicates if our samples are PCR-free. Or it may be that because of the variation inherent in expression level data, e.g. RNA-Seq, duplicate marking becomes ritualistic. Unless you are certain of your edge case (amplicon sequencing, RNA-Seq allele-specific expression analysis, etc.) where duplicate marking adds minimal value, you should go ahead and mark duplicates. You may find yourself staring at an IGV session trying to visually calculate the strength of the evidence for a variant. We can pat ourselves on the back for having the forethought to systematically mark duplicates and turn on the IGV duplicate filter.

The Broad's Genomics Platform uses MarkDuplicates twice for multiplexed samples. Duplicates are flagged first per sample per lane to estimate lane-level library complexity, and second to aggregate data per sample while marking all library duplicates. In the second pass, duplicate marking tools again assess all reads for duplicates and overwrite any prior flags.

Our two duplicate flagging tools share common features but differ at the core. As the name implies, MarkDuplicatesWithMateCigar uses the MC (mate CIGAR) tag for mate alignment information. Unlike MarkDuplicates, it is a single-pass tool that requires pre-computed MC tags.

• For RNA-Seq data mapped against the genome, use MarkDuplicates. Specifically, MarkDuplicatesWithMateCigar will refuse to process data with large reference skips frequent in spliced RNA transcripts where the gaps are denoted with an N in the CIGAR string.
• Both tools only consider primary mappings, even if mapped to different contigs, and ignore secondary/supplementary alignments (256 flag and 2048 flag) altogether. Because of this, before flagging duplicates, be sure to mark primary alignments according to a strategy most suited to your experimental aims. See MergeBamAlignment's PRIMARY_ALIGNMENT_STRATEGY parameter for strategies the tool considers for changing primary markings made by an aligner.
• Both tools identify duplicate sets identically with the exception that MarkDuplicatesWithMateCigar additionally considers reads with missing mates. Missing mates occur for example when aligned reads are filtered using an interval list of genomic regions. This creates divorced reads whose mates aligned outside the targeted intervals.
• Both tools identify duplicates as sets of read pairs that have the same unclipped alignment start and unclipped alignment end. The tools intelligently factor for discordant pair orientations given these start and end coordinates. Within a duplicate set, with the exception of optical duplicates, read pairs may have either pair orientation--F1R2 or F2R1. For optical duplicates, pairs in the set must have the same orientation. Why this is is explained in section 4.
• Both tools take into account clipped and gapped alignments and singly mapping reads (mate unmapped and not secondary/supplementary).
• Each tool flags duplicates according to different priorities. MarkDuplicatesWithMateCigar prioritizes which pair to leave as the representative nondup based on the total mapped length of a pair while MarkDuplicates can prioritize based on the sum of base qualities of a pair (default) or the total mapped length of a pair. Duplicate inserts are marked at both ends.

## 4. Details of interest to some

To reach a high target coverage depth, some fraction of sequenced reads will by stochastic means be duplicate reads.

Let us hope the truth of a variant never comes down to so few reads that duplicates should matter so. Keep in mind the better evidence for a variant is the presence of overlapping reads that contain the variant. Also, take estimated library complexity at face value--an estimate.

### Don't be duped by identical numbers. Data from the two tools qualitatively differ.

First, let me reiterate that secondary and supplementary alignment records are skipped and never flagged as duplicate.

Given a file with no missing mates, each tool identifies the same duplicate sets from primary alignments only and therefore the same number of duplicates. To reiterate, the number of identical loci or duplicate sets and the records within each set are the same for each tool. However, each tool differs in how it decides which insert(s) within a set get flagged and thus which insert remains the representative nondup. Also, if there are ties, the tools may break them differently in that tie-breaking can depend on the sort order of the records in memory.

• MarkDuplicates by default prioritizes the sum of base qualities for both mates of a pair. The pair with the highest sum of base qualities remains as the nondup.
• As a consequence of using the mate's CIGAR string (provided by the MC tag), MarkDuplicatesWithMateCigar can only prioritize the total mapped reference length, as provided by the CIGAR string, in scoring duplicates in a set. The pair with the longest mapping length remains as the nondup.
• If there are ties after applying each scoring strategy, both tools break the ties down a chain of deterministic factors starting with read name.

### Duplicate metrics in brief

We can break down the metrics file into two parts: (1) a table of metrics that counts various categories of duplicates and gives the library complexity estimate, and (2) histogram values in two columns.

See DuplicationMetrics for descriptions of each metric. For paired reads, duplicates are considered for the insert. For single end reads, duplicates are considered singly for the read, increasing the likelihood of being identified as a duplicate. Given the lack of insert-level information for these singly mapping reads, the insert metrics calculations exclude these.

The library complexity estimate only considers the duplicates that remain after subtracting out optical duplicates. For the math to derive estimated library size, see formula (1.2) in Mathematical Notes on SAMtools Algorithms.

The histogram values extrapolate the calculated library complexity to a saturation curve plotting the gains in complexity if you sequence additional aliquots of the same library. The first bin's value represents the current complexity.

### Pair orientation F1R2 is distinct from F2R1 for optical duplicates

Here we refer you to a five minute video illustrating what happens at the molecular level in a typical sequencing by synthesis run.

What I would like to highlight is that each strand of an insert has a chance to seed a different cluster. I will also point out, due to sequencing chemistry, F1 and R1 reads typically have better base qualities than F2 and R2 reads.

Optical duplicate designation requires the same pair orientation.

Let us work out the implications of this for a paired end, unstranded DNA library. During sequencing, within the flow cell, for a particular insert produced by sample preparation, the strands of the insert are separated and each strand has a chance to seed a different cluster. Let's say for InsertAB, ClusterA and ClusterB and for InsertCD, ClusterC and ClusterD. InsertAB and InsertCD are identical in sequence and length and map to the same loci. It is possible InsertAB and InsertCD are PCR duplicates and also possible they represent original inserts. Each strand is then sequenced in the forward and reverse to give four pieces of information in total for the given insert, e.g. ReadPairA and ReadPairB for InsertAB. The pair orientation of these two pairs are reversed--one cluster will give F1R2 and the other will give F2R1 pair orientation. Both read pairs map exactly to the same loci. Our duplicate marking tools consider ReadPairA and ReadPairB in the same duplicate set for regular duplicates but not for optical duplicates. Optical duplicates require identical pair orientation.

Created 2015-08-17 23:35:00 | Updated 2015-11-25 10:02:35 | Tags: best-practices workshop presentations slides

Laura Gauthier, Yossi Farjoun and Geraldine Van der Auwera presented this workshop in Pretoria, South Africa, upon invitation from the University of Pretoria.

This workshop covered the core steps involved in calling germline and somatic variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

Additional considerations were covered, such as calling cohorts efficiently, as well as dealing with non-human data, RNAseq data, whole-genome vs. exome, basic quality control, and performance.

This was complemented by sets of hands-on tutorials aiming to teach basic GATK usage to new users, as well as introduce pipelining concepts using Queue.

The workshop was structured into five modules:

• Introductory materials
• Best Practices Phase 1: Pre-processing
• Best Practices Phase 2A: Calling germline variants
• Best Practices Phase 2B: Calling somatic variants
• Best Practices Phase 3: Preliminary analyses

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

Created 2015-08-14 01:57:37 | Updated 2015-08-14 01:58:03 | Tags: best-practices workshop workflow presentations

Joel Thibault, Valentin Ruano-Rubio and Geraldine Van der Auwera presented this workshop in Edinburgh, Scotland, and Cambridge, England, upon invitation from the Universities of Edinburgh and Cambridge.

This workshop included two modules:

• Best Practices for Variant Calling with the GATK

The core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

• Beyond the Best Practices

Additional considerations such as calling variants in RNAseq data and calling cohorts efficiently, as well as dealing with non-human data, RNAseq data, whole-genome vs. exome, basic quality control, and performance.

This was complemented by a set of hands-on exercises aiming to teach basic GATK usage to new users.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

Created 2015-08-14 01:50:16 | Updated 2015-08-14 01:56:56 | Tags: best-practices workshop workflow presentations

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

Created 2014-10-31 23:11:22 | Updated 2014-11-01 00:01:23 | Tags: best-practices presentations ashg

Ami Levy-Moonshine presented this condensed 90-minute workshop given at ASHG 2014 in San Diego, CA on October 21.

This workshop covered all the core steps involved in calling variants with the GATK, using the “Best Practices” developed by the GATK team. The presentation materials outline why each step is essential to the calling process and what are the key operations performed on the data at each step. This includes specific information about variant calling in RNAseq data and efficient analysis of cohorts.

His slide deck is available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

GATK was also featured in another mini-workshop at ASHG which covered the iSeqTools network, focused on cloud-based analysis. The presentation slides will be posted to the iSeqTools website in the near future.

Best Practices for variant discovery in DNA:

Best Practices for variant discovery in RNAseq:

Excerpt from Ami's ASHG poster:

Created 2014-10-31 22:45:38 | Updated 2014-10-31 23:12:02 | Tags: best-practices workshop presentations

Eric Banks, Sheila Chandran and Geraldine Van der Auwera presented this workshop in Philadelphia, PA, upon invitation from the School of Medicine at UPenn.

This workshop covered all the core steps involved in calling variants with the GATK, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset. This includes specific information about variant calling in RNAseq data and efficient analysis of cohorts.

The material was presented over two days, organized in the following modules:

• Data pre-processing: From FASTQ to analysis-ready BAM
• Variant Discovery: From BAM to analysis-ready VCF

This was complemented by a set of hands-on exercises aiming to teach basic GATK usage to new users.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

Created 2014-04-16 22:09:49 | Updated 2015-05-16 06:58:15 | Tags: best-practices rnaseq

This is our recommended workflow for calling variants in RNAseq data from single samples, in which all steps are performed per-sample. In future we will provide cohort analysis recommendations, but these are not yet available.

The workflow is divided in three main sections that are meant to be performed sequentially:

• Variant discovery: from reads (BAM files) to variants (VCF files)
• Refinement and evaluation: genotype refinement, functional annotation and callset QC

Compared to the DNAseq Best Practices, the key adaptations for calling variants in RNAseq focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller, which are highlighted in the figure below.

### Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make it suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Split'N'Trim; Local Realignment Around Indels (optional); and Base Quality Score Recalibration (BQSR); performed in that order.

#### Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using STAR aligner (2-pass protocol) to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.

#### Split'N'Trim

Then, an RNAseq-specific step is applied: reads with N operators in the CIGAR strings (which denote the presence of a splice junction) are split into component reads and trimmed to remove any overhangs into splice junctions, which reduces the occurrence of artifacts. At this step, we also reassign mapping qualities from 255 (assigned by STAR) to 60 which is more meaningful for GATK tools.

#### Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads. This step is considered optional for RNAseq.

#### Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.

### Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample) and variant filtering (also performed per-sample). The first step is designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

Our current recommendation for RNAseq is to run all these steps per-sample. At the moment, we do not recommend applying the GVCF-based workflow to RNAseq data because although there is no obvious obstacle to doing so, we have not validated that configuration. Therefore, we cannot guarantee the quality of results that this would produce.

#### Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample VCFs containing raw SNP and indel calls.

#### Per-Sample Variant Filtering

For RNAseq, it is not appropriate to apply variant recalibration in its present form. Instead, we provide hard-filtering recommendations to filter variants based on specific annotation value thresholds. This produces a VCF of calls annotated with fiiltering information that can then be used in downstream analyses.

### Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.

Important note on GATK versions The [Best Practices](http://www.broadinstitute.org/gatk/guide/best-practices) have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the [Version History](http://www.broadinstitute.org/gatk/guide/version-history) section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the [Version History](http://www.broadinstitute.org/gatk/guide/version-history) section) to ensure that you are using the appropriate recommendations for your version.

Created 2014-04-16 21:58:52 | Updated 2016-01-27 05:21:52 | Tags: best-practices gatk3

The "GATK Best Practices" are workflow descriptions that provide step-by-step recommendations for getting the best analysis results possible out of high-throughput sequencing data. At present, we provide the following Best Practice workflows:

These recommendations have been developed by the GATK development team over years of analysis work on many of the Broad Institute's sequencing projects, and are applied in the Broad's production pipelines. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.

#### Important notes on context and caveats

Our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately we are not able to provide official recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent).

In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the Guide for help dealing with other experimental designs.

Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution.

Important note on GATK versions The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

Created 2014-03-06 08:12:28 | Updated 2015-05-16 07:34:51 | Tags: best-practices joint-discovery reference-model

This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.

### Overview

This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.

In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).

In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort. This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.

This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".

### Workflow details

This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.

#### 1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option -emitRefConfidence GVCF, and using the .g.vcf extension for the output file.

Note that versions older than 3.4 require passing the options --variant_index_type LINEAR --variant_index_parameter 128000 to set the correct index strategy for the output gVCF.

#### 2. Optional data aggregation step

If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.

#### 3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.

#### 4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.

That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!

Created 2014-03-06 07:15:51 | Updated 2015-12-07 11:08:30 | Tags: best-practices rnaseq

### Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

### Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.

### The workflow

#### 1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir STAR --runMode genomeGenerate --genomeDir$genomeDir --genomeFastaFiles hg19.fa\  --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir cd$runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n> 3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass: genomeDir=/path/to/hg19_2pass mkdir$genomeDir
STAR --runMode genomeGenerate --genomeDir genomeDir --genomeFastaFiles hg19.fa \ --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n> 4) The resulting index is then used to produce the final alignments as follows: runDir=/path/to/2pass mkdirrunDir
cd $runDir STAR --genomeDir$genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample

java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics 

#### 3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Please note that we recently (6/11/14) edited this to fix a documentation error regarding the filter to use. See this announcement for details.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

#### 4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

#### 5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.

#### 6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we lower the minimum phred-scaled confidence threshold for calling variants on RNAseq data, so we use a default of 20 (instead of 30 in DNA-seq data).

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output.vcf

#### 7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf 

Please note that we selected these hard filtering values in attempting to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positives calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

### Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (splitNCigarReads) cleans many false positive calls that are caused by splicing inaccuracies by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, improve true positive and minimize false positive rates, as well as developing statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.

[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013

Created 2013-10-25 16:35:32 | Updated 2014-10-31 23:13:13 | Tags: best-practices workshop presentations videos

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop included two modules:

• Best Practices for Variant Calling with the GATK

The core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

• Building Analysis Pipelines with Queue

An introduction to the Queue pipelining system. View the workshop materials to learn about how to use Queue to create analysis pipelines, scatter-gather them and run them locally or in parallel on a computing farm.

Created 2013-09-13 20:41:24 | Updated 2015-05-16 06:53:50 | Tags: best-practices dnaseq

This is our recommended workflow for calling variants in DNAseq data from cohorts of samples, in which steps from data processing up to variant calling are performed per-sample, and subsequent steps are performed jointly on all the individuals in the cohort.

The workflow is divided in three main sections that are meant to be performed sequentially:

• Variant discovery: from reads (BAM files) to variants (VCF files)
• Refinement and evaluation: genotype refinement, functional annotation and callset QC

### Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make it suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Local Realignment Around Indels; and Base Quality Score Recalibration (BQSR); performed in that order.

#### Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using BWA mem to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.

#### Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads.

#### Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.

### Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample), joint genotyping (performed per-cohort) and variant filtering (also performed per-cohort). The first two steps are designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

#### Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs. If there are more than a few hundred samples, we combine the gVCFs in batches of ~200 gVCFs using a specialized tool, CombineGVCFs. This will make the next step more tractable and reflects that the processing bottleneck lies with the number of input files and not the number of samples in those files.

#### Joint Genotyping

All available samples are then jointly genotyped by taking the gVCFs produced earlier and running GenotypeGVCFs on all of them together to create a set of raw SNP and indel calls. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites.

#### Variant Quality Score Recalibration

Variant recalibration involves using a machine learning method to assign a well-calibrated probability to each variant call in a raw call set. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

### Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.

Important note on GATK versions The [Best Practices](http://www.broadinstitute.org/gatk/guide/best-practices) have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the [Version History](http://www.broadinstitute.org/gatk/guide/version-history) section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the [Version History](http://www.broadinstitute.org/gatk/guide/version-history) section) to ensure that you are using the appropriate recommendations for your version.

Created 2013-08-13 07:37:07 | Updated 2014-10-31 23:13:24 | Tags: best-practices workshop

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

Created 2013-01-05 01:17:21 | Updated 2014-10-31 23:13:36 | Tags: best-practices workshop presentations videos

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

Created 2016-02-12 01:18:16 | Updated | Tags: best-practices workshop

For the last few years, we have taught GATK workshops at institutions in various countries around the world. It started out as a small-scale effort with one or two workshops a year in addition to the annual workshop hosted at the Broad itself, but it seems that word has gone around because our dance card is filling up faster every year. Here's a quick recap of where we stand for the 2016 season.

### The tour so far

We kicked off this year's World Tour in style -- on the other side of the world! In two back-to-back 2-day workshops in Sydney and Melbourne, our 3-person workshop crew walked a diverse assortment of researchers and bioinformaticians through the essentials of GATK Best Practices for Variant Discovery, in theory on Day 1 and in practice on Day 2. Lots of great questions were asked, most were answered, and a lot of coffee was consumed. Our thanks to Bioplatforms Australia for setting up these workshops at UNSW in Sydney and at the University of Melbourne, and taking great care of us (so much good coffee!). We look forward to hearing from our Aussie users on the forum.

### Upcoming locations/dates

We have more workshops coming up in the US and in Europe, and registration is open for some of them! Here's the list up to June:

• March 2-4: Los Angeles, USA (UCLA) -- register here
• April 14-15: Edinburgh, UK (University of Edinburgh) -- register here
• April 18-19: Oxford, UK (University of Oxford)
• June 13-14: Cambridge, UK (University of Cambridge) -- register here
• June 16-17: Helsinki, Finland (CSC-IT Center for Science)

We are looking into possible workshop dates for later in the year (September - December).

### What if you can't attend?

If you can't make it to one of our workshops in person, check out the Presentations section of the documentation guide. There you can watch the workshop videos from the BroadE workshops, as well as download the workshop slides and tutorial materials from all workshops that we have taught so far. Some of the most recent haven't been posted yet but we'll get that done soon. In the meantime if you're in a hurry you can find those materials in previous posts; look for the "workshops" or "presentations" tags.

Created 2016-01-31 22:32:46 | Updated 2016-02-01 15:07:30 | Tags: best-practices workshop presentations slides

This week we are teaching a couple of GATK Best Practices workshops in Sydney (one Opera House, check) and Melbourne, Australia. Joining me are Methods Team developers Mark and Megan.

The presentation slides from Day 1 as well as the materials for the Day 2 hands-on tutorial are available at this Google Drive link.

Learned so far: the fierceness of the Australian sun is no laughing matter

Created 2015-10-15 14:36:34 | Updated | Tags: best-practices presentations gvcf

Geraldine Van der Auwera presented this talk as part of the Broad Institute's Medical and Population Genetics (MPG) Primers series.

This talk provides a high-level overview of the workflow for performing variant discovery on high-throughput sequencing data, as described in the GATK Best Practices and implemented in the Broad's production pipelines.

The following points emphasized in this presentation are:

• Informational content of data file formats and flow of information throughout the pipeline
• Concepts involved in the data transformations (processing steps and analysis methods)
• Motivation and key mechanics of the GVCF workflow for scalable joint variant discovery
• Relation of the GATK Best Practices to the Broad's production pipeline implementation

The presentation slide deck is available at this link.

Created 2015-04-19 15:47:54 | Updated 2015-04-21 23:52:33 | Tags: best-practices workshop presentations topstory

The presentation slides from the 2015 "GATK in the UK" workshop (April 20-24) are available on DropBox here.

All slides will ultimately be posted in the Presentations section of the Guide.

Created 2015-02-23 21:13:29 | Updated 2015-04-02 07:31:25 | Tags: best-practices workshop

Registration is now LIVE for our upcoming BroadE Workshop: Best Practices for Variant Calling with the GATK.

WHEN: Thursday, March 19 & Friday, March 20, 2015

10:00 AM - 5:00 PM (Lecture, March 19)
2:00PM - 5:00 PM (Optional Tutorial, March 20)

Auditorium (lecture)/Yellowstone (Tutorial)
415 Main Street
Cambridge, Massachusetts 02142

Registration Schedule:

Registration closes February 27 at 5:00 PM.
Notification of acceptance or wait list status sent by March 4.

REGISTER HERE

This workshop will focus on the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The GATK development team and invited guests will give talks explaining the rationale, theory and real-life applications of the Best Practices. You will learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

An optional hands-on session will be available to select participants. In this session, the GATK team will help beginners work through interactive exercises and tutorials to learn how to use GATK and apply the Best Practices to real data.

Workshop attendees will gain broad insight into the rationale of the GATK Best Practices for variant discovery, as well as a solid understanding of how individual GATK tools work and how to apply them in practice. Novices to the GATK will come out of the workshop knowing enough to identify which questions they can use address using GATK tools, how to get started on designing their experiment and analytical workflow, and how to run the tools on their own computer. Existing GATK users will come out with a deeper understanding of how the GATK works under the hood and how to improve their results further, especially as regards the latest innovations.

Created 2014-12-16 21:48:12 | Updated 2014-12-17 17:06:58 | Tags: vqsr best-practices

The Best Practices recommendations for Variant Quality Score Recalibration have been slightly updated to use the new(ish) StrandOddsRatio (SOR) annotation, which complements FisherStrand (FS) as indicator of strand bias (only available in GATK version 3.3-0 and above).

While we were at it we also reconciled some inconsistencies between the tutorial and the FAQ document. As a reminder, if you ever find differences between parameters given in the VQSR docs, let us know, but FYI that the FAQ is the ultimate source of truth=true. Note also that the command line example given in VariantRecalibrator tool doc tends to be out of date because it can only be updated with the next release (due to a limitation of the tool doc generation system) and, well, we often forget to do it in time -- so it should never be used as a reference for Best Practice parameter values, as indicated in the caveat right underneath it which no one ever reads.

Speaking of caveats, there's no such thing as too much repetition of the fact that whole genomes and exomes have subtle differences that require some tweaks to your command lines. In the case of VQSR, that means dropping Coverage (DP) from your VQSR command lines if you're working with exomes.

Finally, keep in mind that the values we recommend for tranches are really just examples; if there's one setting you should freely experiment with, that's the one. You can specify as many tranche cuts as you want to get really fine resolution.

Created 2014-08-28 13:44:35 | Updated 2014-08-28 13:46:08 | Tags: best-practices workshop topstory

Here is the official announcement for the upcoming workshop in Philadelphia. Registration is not necessary for the lecture sessions, but it is required for the hands-on sessions (see link further below).

We look forward to seeing you there!

The Center for Genetics and Complex Traits (CGACT) and the Institute for Biomedical Informatics (IBI) of the University of Pennsylvania Perelman School of Medicine announce a Workshop on Variants Discovery in Next Generation Sequence Data on September 18 and 19, 2014.

This workshop will focus on the core steps involved in calling variants with the Broad Institute¹s Genome Analysis Toolkit (GATK), using the "Best Practices" developed by the GATK team, and will be presented by Dr. Geraldine Van der Auwera of the Broad Institute and other instructors from the GATK team. Participants will learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of their dataset.

The workshop will take place over two consecutive days (September 18 and 19, 2014). In the morning lecture sessions, attendees will learn the rationale, theory, and real-life applications of GATK Best Practices for variant discovery in high-throughput DNA sequencing data, including recommendations for additional experimental designs and datatypes such as RNAseq. In the afternoon hands-on sessions, attendees will learn to interact with the GATK tools and apply them effectively through interactive exercises and tutorials.

The morning lecture sessions will take place on Thursday, September 18, from 9:00 am to 12:30 pm, and Friday, September 19, from 9:00 am to 11:30 am, in the Dunlop Auditorium of Stemmler Hall, University of Pennsylvania, 3450 Hamilton Walk, Philadelphia, PA 19104. Both morning sessions are open to all participants and registration is not required.

The afternoon hands-on sessions will take place on Thursday, September 18, from 2:00 pm to 5:30 pm, and Friday, September 19, from 1:00 pm to 4:30 pm. The September 18 hands-on session is aimed mainly at beginners (though familiarity with the command line environment is expected). The September 19 hands-on session is aimed at more advanced users who are already familiar with the basic GATK functions. Attendance to the hands-on sessions is limited to 20 participants each day, and precedence will be given to members of the University of Pennsylvania or its affiliated hospitals and research institutes (HUP, CHOP, Monell, Wistar, etc.).

Registration for the hands-on sessions is mandatory and open through Friday, August 29th at http://ibi.upenn.edu/?p=996

Created 2014-06-11 21:20:08 | Updated | Tags: best-practices bug error rnaseq topstory

We discovered today that we made an error in the documentation article that describes the RNAseq Best Practices workflow. The error is not critical but is likely to cause an increased rate of False Positive calls in your dataset.

The error was made in the description of the "Split & Trim" pre-processing step. We originally wrote that you need to reassign mapping qualities to 60 using the ReassignMappingQuality read filter. However, this causes all MAPQs in the file to be reassigned to 60, whereas what you want to do is reassign MAPQs only for good alignments which STAR identifies with MAPQ 255. This is done with a different read filter, called ReassignOneMappingQuality. The correct command is therefore:

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

In our hands we see a bump in the rate of FP calls from 4% to 8% when the wrong filter is used. We don't see any significant amount of false negatives (lost true positives) with the bad command, although we do see a few more true positives show up in the results of the bad command. So basically the effect is to excessively increase sensitivity, at the expense of specificity, because poorly mapped reads are taken into account with a "good" mapping quality, where they would normally be discarded.

This effect will be stronger in datasets with lower overall quality, so your results may vary. Let us know if you observe any really dramatic effects, but we don't expect that to happen.

To be clear, we do recommend re-processing your data if you can, but if that is not an option, keep in mind how this affects the rate of false positive discovery in your data.

We apologize for this error (which has now been corrected in the documentation) and for the inconvenience it may cause you.

Created 2014-06-11 20:34:05 | Updated 2014-06-11 20:34:51 | Tags: best-practices workshop topstory

Calling all Belgians! (and immediate neighbors)

In case you didn't hear of this through your local institutions, I'm excited to announce that we are doing a GATK workshop in Belgium in two weeks (June 24-26 to be precise). The workshop, which is open and free to the scientific community, will be held at the Royal Institute of Natural Sciences in Brussels.

This is SUPER EXCITING to me because as a small child I spent many hours drooling in front of the Institute Museum's stunningly beautiful Iguanodons, likely leaving grubby handprints all over the glass cases, to the shame and annoyance of my parents. I also happen to have attended the Lycee Emile Jacqmain which is located in the same park, right next to the Museum (also within a stone's throw of the more recently added European Parliament) so for me this is a real trip into the past. Complete with dinosaurs!

That said, I expect you may find this workshop exciting for very different reasons, such as learning how the GATK can empower your research and hearing about the latest cutting-edge developments that you can expect for version 3.2.

See this website or the attached flyer for practical details (but note that the exact daily program may be slightly different than announced due to the latest changes in GATK) and be sure to register (it's required for admission!) by emailing cvangestel at naturalsciences.be with your name and affiliation.

Please note that the hands-on sessions (to be held on the third day) are already filled to capacity. The tutorial materials will be available on our website in the days following the workshop.

Created 2014-03-06 07:24:03 | Updated | Tags: best-practices rnaseq topstory

We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.

### Best Practices workflow for RNAseq

This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.

Please see the new document here for full details about how to run this workflow in practice.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.

Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!

Created 2013-12-20 22:57:53 | Updated 2014-02-07 17:01:39 | Tags: best-practices workshop queue presentations videos topstory

The presentation videos for:

• Day 1 (Best Practices for Variant Calling with GATK)
• Day 2 (Building Analysis Pipelines with Queue)

Created 2013-10-22 02:40:27 | Updated 2013-10-22 14:49:48 | Tags: best-practices workshop queue presentations slides

Slides for :

• Day 1 (Best Practices for Variant Calling with GATK)
• Day 2 (Building Analysis Pipelines with Queue)

are available at this link in our DropBox slide archive :

https://www.dropbox.com/sh/ryz7bx4aeqc7mkl/44Q_2jvIyH

Created 2013-09-17 18:18:29 | Updated | Tags: best-practices workshop queue pipeline

Register now for a spot at the upcoming GATK workshop, which will be held in Cambridge, MA on October 21-22.

This workshop will cover the following topics:

• GATK Best Practices for Variant Detection
• Building Analysis Pipelines with Queue

The workshop is scheduled right before ASHG Boston, so if you're going to be in town for the conference, make sure you come a couple of days early and attend the GATK workshop!

Created 2013-09-13 21:22:16 | Updated 2014-02-07 18:45:22 | Tags: best-practices topstory

We're very pleased to announce that we have finally finished our big rewrite of the Best Practices documentation. We hope that the new format, which you can find here, will prove more user-friendly, searchable and overall more helpful than the previous version.

We have a few more improvements in mind (e.g. a clickable image map of the workflow) and there may be a few bugs here and there to iron out. So please feel free to comment on this announcement and give us feedback, whether flattering or critical, so we can improve it to help you as much as possible.

Created 2013-07-09 15:54:20 | Updated 2013-08-15 16:14:59 | Tags: best-practices workshop presentations

The slides from the July 2013 Best Practices workshop are available here:

https://www.dropbox.com/sh/jhlk451jntywfdy/vJqbKTZZd_

The videos will be put online once they are processed.

Created 2013-01-30 04:43:29 | Updated | Tags: best-practices workshop videos

These are videos of the presentations given at the GATK workshop on 4-5 Dec 2012 on "Best Practices for variant calling with the GATK".

Created 2016-02-10 04:42:23 | Updated | Tags: best-practices genotypegvcfs gvcf

I am using the GATK pipeline to call variants by aligning reads to a draft quality reference genome that is ~367000 scaffolds. I split the scaffolds up into 50 intervals and successfully (and pretty quickly) generated GVCFs for 25 individuals using the -L option. However, I am having the worst of times with GenotypeGVCFs. After running for nearly 2 days on the first interval list, GenotypeGVCFs has not even output a file. Based on another post in the forum, I removed the scaffolds that are NOT in the interval from the GVCF header, and that sped up the process slightly - I have a combined VCF file with just the header generated after about 18 hours. Not sure how much longer the process has as the progress meter doesn't seem to be making any sense.

Is there any known way(s) to optimize this process?

Currently using the following command: java -Djava.io.tmpdir=/data/lwwvd/genoGVCF.tmp -XX:ParallelGCThreads=4 -Xmx15g -jar /usr/local/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -nt 16 -T GenotypeGVCFs -R ../ref_genomes/bbu_ref_UMD_CASPUR_WB_2.0.fa -L interval_lists/bbub.refctgs.49.interval_list -V ./1095/1095.49.g.vcf.gz -V ./189/189.49.g.vcf.gz -V ./190/190.49.g.vcf.gz -V ./196/196.49.g.vcf.gz -V ./246/246.49.g.vcf.gz -V ./337/337.49.g.vcf.gz -V ./581/581.49.g.vcf.gz -V ./583/583.49.g.vcf.gz -V ./662/662.49.g.vcf.gz -V ./701/701.49.g.vcf.gz -V ./850/850.49.g.vcf.gz -V ./92764/92764.49.g.vcf.gz -V ./92765/92765.49.g.vcf.gz -V ./92766/92766.49.g.vcf.gz -V ./92767/92767.49.g.vcf.gz -V ./92768/92768.49.g.vcf.gz -V ./92769/92769.49.g.vcf.gz -V ./92770/92770.49.g.vcf.gz -V ./92771/92771.49.g.vcf.gz -V ./92774/92774.49.g.vcf.gz -V ./92775/92775.49.g.vcf.gz -V ./92776/92776.49.g.vcf.gz -V ./92777/92777.49.g.vcf.gz -V ./92778/92778.49.g.vcf.gz -V ./92795/92795.49.g.vcf.gz -o BBUB.combined.49.vcf

Created 2016-02-09 13:06:54 | Updated | Tags: haplotypecaller best-practices rna-editing mutec2

Dear GATK staff

I would like to use the GATK tool for the detection of possible RNA editing events. I followed the RNA-seq best practice up to the variant calling step itself. There I hesitate to use the haplotype caller because I would not assume that the editing sites follow any kind of allelic ratio. Therefore I wanted to ask if it might be better to use MuTec2 at this stage? I would call it like ...

java -jar GenomeAnalysisTK.jar -T MuTect2 -R reference.fasta -I:tumor normal1.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 --dbsnp dbSNP.vcf --artifact_detection_mode -o output.normal1.vcf
java -jar GenomeAnalysisTK.jar -T CombineVariants -R reference.fasta -V output.normal1.vcf -V output.normal2.vcf -minN 2 --setKey "null" --filteredAreUncalled --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED -o MuTect2_PON.vcf

Can you comment if this is a suitable modification of the best practice in the case of RNA editing calls?

bw, Fabian

Created 2016-02-05 09:25:43 | Updated | Tags: haplotypecaller best-practices merge rna-seq

Hello, and thanks for making all the GATK tools! I have recently started to try my hand at variant calling of my RNA-seq data, following the GATK Best Practices more or less verbatim, only excluding indel alignment (because I am only interested in SNPs at this point) and the BQSR (partly because I have very high quality data, but mostly because I couldn't get it to work in the workflow).

I have three replicates for each of my samples, and my question is where, if at all, I should merge the data from them. I am not sure if I can (or even should!) merge the FASTQ files before the alignment step, or merge the aligned BAM files, or something else. I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow using STAR... What would be the "correct" way to do it, if at all? How would merging (at some level) affect the speed of the workflow, and can I optimise that somehow?

If it's a bad idea to do merging, how would I determine the "true" variant from my three resulting VCF-files at the end, for cases where they differ?

Created 2016-01-29 20:13:08 | Updated 2016-01-29 20:13:53 | Tags: best-practices exome targetome

Hi, I was wondering if there is a set of best practices developed specifically for calling variants from a small region target capture sample. I looked but I could not find a one-stop comprehensive documentation. If not, are there some specific recommendations (such as certain values of parameters for the exome/targetome analysis? For eg., -L.

Thanks!

Created 2016-01-28 15:08:47 | Updated | Tags: indelrealigner variantrecalibrator vqsr haplotypecaller best-practices

The release notes for 3.5 state "Added new MQ jittering functionality to improve how VQSR handles MQ". My understanding is that in order to use this, we will need to use the --MQCapForLogitJitterTransform argument in VariantRecalibrator. I have a few questions on this: 1) Is the above correct, i.e. the new MQ jittering functionality is only used if --MQCapForLogitJitterTransform is set to something other than the default value of zero? 2) Is the use of MQCapForLogitJitterTransform recommended? 3) If we do use MQCapForLogitJitterTransform, the tool documentation states "We recommend to either use --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or use a MQCap=max+10 to take that into account". Is one of these to be preferred over the other? Given that it seems that sites that have been realigned can have values up to 70, but sites that have not can have values no higher than 60, it seems to me that the ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller option might be preferred, but I would like to check before running.

Created 2016-01-26 01:25:56 | Updated | Tags: best-practices

we began GATK with "dentify read group information", however,i'm still wondering that the format "@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1 ",does the "LB:lib1" and "PU:unit1" should be different when the sample changed?

Created 2016-01-06 13:01:18 | Updated | Tags: combinevariants haplotypecaller best-practices dbsnp gatk combinegvcfs gvcf

Hi guys, I have recently jointly called 27 full genome data using GenotypeGVCFs approach. While i was trying to extract some chromosomes from the final file, i got this error The provided VCF file is malformed at approximately line number 16076: Unparsable vcf record with allele *.

I look into the file and I found some of the multi-allellic sites having * as seen in the attached picture.

I feel the problem could be that the program realised that more than one allele is present at that position but could not ascertain which allele. I may be wrong but please what do you think I can do to solve this problem. LAWAL

Created 2015-11-19 16:02:02 | Updated | Tags: variantfiltrationwalker best-practices sor

Hi,

I'm trying to figure out a good threshold for the SOR for hard-filtering a SNP data set. The best practices documentation suggests using a cutoff threshold of 4, but in reading a bit more about odds-ratios, it seems that taking the natural logarithm of the odds-ratio may be more ideal. Is the SOR reported in the vcf file actually the natural logarithm of the odds ratio? This information is not provided in the page on statistical tests or the documentation for the SOR.

Thanks, Jen

Created 2015-09-25 23:52:49 | Updated | Tags: best-practices bam gatk

Hi,

So far, I have only tried running Best Practices on bam files where each bam file corresponds to a certain chromosome. How do I go about partitioning the bam files further and running Best Practices on parts of bam files? I want to make my pipeline even more efficient.

Created 2015-09-17 21:56:03 | Updated | Tags: bqsr best-practices bam gatk

Hi,

I recently re-generated my realigned bam files (realigned.bam) from the GATK Best Practices, but I encountered an error when I validated the bam files using the validation tool (ValidateSamFile.jar) in Picard:

ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate alignment does not match alignment start of mate ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate negative strand flag does not match read negative strand flag of mate

etc.

What exactly is going on? Where in the pipeline could this have occurred? I do not have the intermediary files so I cannot diagnose this accurately. Before I started the GATK Best Practices, my bam files had no problems; when I validated them I did not get any error messages. Does this mean that I have to regenerate the files again and start from the very beginning?

I tried to use the Picard tool FixMateInformation.jar, but then I got an error message regarding the index files. I tried deleting the bam index for one bam file and then recreating it to see if there was a problem with the indexes, but doing so did not resolve the issue. So it seems that something went wrong with the bam files at some earlier step.

Has this error been encountered before? Not sure how to proceed next, except restart everything.

Best, Alva

Created 2015-09-03 08:51:02 | Updated | Tags: indelrealigner malformedreadfilter best-practices

I recently came across(blog post) a scenario in which a subset of my reads had triggered the MalformedReadFilter during indel realignment. I'm curious to know about the definition of "malformed", as whilst invalid, I wouldn't have described incorrect mate names and alignment positions for unmapped reads as "malformed" and capable of causing job crashes...

Not a bug - I'm just curious to know of any implementation details or reasoning about the MalformedReadFilter.

Created 2015-08-31 11:58:59 | Updated | Tags: baserecalibrator best-practices ploidy pooling

Dear GATK Team,

baserecalibrator is in the newest Best Practices recommendations. Does this tool support or would you recommend it's usage for samples with ploidy higher then 2 (e.g. pooled samples)?

Thanks! Bernt

Created 2015-08-09 11:38:20 | Updated | Tags: best-practices dataprocessingpipeline cancer somatic-variants

Hello, We're currently running a research project involving exome sequencing of tumors and paired normal blood samples from brain cancer patients. I'm relatively new to NGS processing but thanks to this forum (and the excellent documentation provided by the Broad) I was able to put together the following pipeline for this project: I just wanted to check with the Cancer Team that everything looks OK, as well as get any suggestions, if possible. Looking forward to your valuable response, -E

Created 2015-07-22 11:46:00 | Updated | Tags: combinevariants haplotypecaller best-practices vcf gatk genotypegvcfs combinegvcfs gvcf

I was trying to do combine sets of vcf files for all my samples so that I have one single vcf output using this command option below java -d64 -Xmx48g -jar ${GATK}/GenomeAnalysisTK.jar \ -R${REF} \ -T GenotypeGVCFs \ --variant A.g.vcf \ --variant B.g.vcf \ --variant C.g.vcf \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -o genotype.vcf

but I got this error message “The following invalid GT allele index was encountered in the file: END=21994810”. I have tried to locate where the problem could be coming from but I do not understand this. Could you please advise me.

Created 2015-07-22 05:07:12 | Updated | Tags: bqsr baserecalibrator best-practices gatk

Hi,

I am running GATK Best Practices again and just wanted to assure myself that I am proceeding correctly. I remember reading somewhere on here that BQSR performs best when given the whole genome to work with (similar to VQSR in that respect). Better to just run using the whole genome rather than giving it single chromosomes. I couldn't really find this comment/recommendation when I tried to look for it again, so wanted to see what the final word was on this. How should one proceed?

Thanks, Alva

Created 2015-06-05 09:10:36 | Updated | Tags: bqsr best-practices

We are using a licensed version of GATK here. The version is GATK version -2014.3-3.2.2-7-g f9cba99.

While using the tool for analysis of exome data I had few questions.

1: Are there different parameters used in GATK at different stages for whole and targeted exom sequencing data ?

2: What are the guidelines being followed for targeted exome seq analysis if I use BQSR ?

I was reading the page ("https://www.broadinstitute.org/gatk/guide/tagged?tag=exome") and it was mentioned as " The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets."

Request you to kindly clear the doubts.

Created 2015-06-01 10:54:07 | Updated | Tags: best-practices picard mapping

I am trying to follow the best practices for mapping my (Paired-end Illumina HiSeq) reads to the reference, by following this presentation:

From what I understand, I should use MergeBamAlignment to clean up the output from bwa, and then use this cleaned up output for the rest of the analysis. However, when I run ValidateSamFile after running MergeBamAlignment I get a lot of errors, and running CleanSam on the file does not resolve any of them. What am I doing wrong? I've tried searching the web for more details about MergeBamAlignment but I haven't been able to find much. Please let me know if you require any additional information.

How I ran MergeBamAlignment picard-tools MergeBamAlignment \ UNMAPPED_BAM=unmapped_reads.sam \ ALIGNED_BAM=aligned_reads.sam \ OUTPUT=aligned_reads.merged.bam \ REFERENCE_SEQUENCE=/path/to/reference.fasta \ PAIRED_RUN=true # Why is this needed?

Error report from ValidateSamFile

## HISTOGRAM java.lang.String

Error Type Count
ERROR:INVALID_CIGAR 5261
ERROR:MATES_ARE_SAME_END 30
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 30

Created 2015-05-23 21:49:38 | Updated | Tags: bqsr haplotypecaller best-practices dnaseq

Hi, I have ~150 WGS all sequenced in batches on Illumina HiSeq over a number of years, on a number of machines/flowcells.

I want to perform BQSR on my BAM files before calling variants. However for my organism I do not have access to a resource like dbSNP for true variants. So I am following the protocol by doing a first round of variant calling, subsetting to the SNPs with highest confidence and using these as the known variants for BQSR.

My question is, should I carry this out on samples individually, i.e. one BAM per sample, on which I use HaplotypeCaller for initial variant calling, then subset to best SNPs, use these for BaseRecalibrator and apply the calibration to the original sample before carrying on with single sample variant finding and then joint genotyping as per best practices....

OR

As I have multiple samples from the same flowcells and lanes, would I gain more information by combining those samples into a multisample BAM first, before carrying out BQSR? I'm a little unsure of how the covariates used by BQSR and actually calculated and whether I can increase the accuracy of my recalibration in this way? Each sample has between 500M and 5 billion nucleotides sequenced.

Many thanks.

Created 2015-05-21 19:53:49 | Updated 2015-05-21 19:54:06 | Tags: bqsr best-practices

Hi,

I am new to NGS/GATK & have a paired-end targeted (targeted to 1 region on 1 autosome) illumina sequencing project on roughly 470 samples. I was wondering whether Indel Realignment & BQSR were appropriate for my setting, or what considerations should be taken into account when judging whether these steps are appropriate?

I'm curious because I noticed this text here: Small targeted experiments The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

I was unsure whether my project qualified as a 'small dataset' where BQSR should not be run.

Created 2015-05-09 14:48:21 | Updated | Tags: best-practices bam third-party-tools picard

Following GATK's best practices, I have individually realigned/recalibrated my sample-lane bams and merged them by sample:

sample1_lane1.dedup.realn.recal.bam --> sample1_lane2.dedup.realn.recal.bam --> sample1.merged.bam sample2_lane1.dedup.realn.recal.bam --> sample2_lane2.dedup.realn.recal.bam --> sample2.merged.bam ...

I am ready to dedup and realign my sample merged bams, however I am uncertain on the best approach. Is the consensus to convert back to fastq via Picard (MarkDuplicates, SortSam, and SamToFastq) and then run bwa mem? Or is it more expedient/accurate to realign the sample-merged bam using bwa aln followed by bwa sampe?

Created 2015-04-20 19:51:25 | Updated 2015-04-20 19:52:54 | Tags: best-practices combinegvcfs gatk3-3

Dear @Sheila & @Geraldine_VdAuwera,

I hope you had a good weekend! My question, to start this week (hopefully the only thread!), is how to use CombineGVCFs. As you may remember I currently have ~100 WES samples for joint genotyping, and I have been testing GenotypeGVCFs directly and VQSR. However, in the near future I am going to have several hundred more samples in my hand, and hence I understand that making batches using CombineGVCFs is the way forward, as this is the only way to allow for subsequent merging of new gVCFs once sample size passes 200.

So I have been trying to run it this afternoon on a node that has 16 CPUs and 128Gb of RAM. Initial attempt was with -nt 14, but this gave the following message.

ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis CombineGVCFs currently does not support parallel execution with nt. Please run your analysis without the nt option.

So, therefore I started it without any threading, but then it appears it is going to take ~75hrs by it's own estimate after running for almost 1hr.

Is this really how long it should take? I have been trying to decipher the various possible issues looking at old threads on the forum, but I am not sure if there is any way I should be able to speed this up (short of splitting out the chromosomes and running them all individually)?

Also, I have seen mention somewhere that it may be because the files have been zipped in an incompatible manner - however, the input gVCFs are exactly the same that were successfully passed through GenotypeGVCFs, so this seems unlikely. I saw on one old thread that CombineGVCFs was super-slow, but I don't know if this is still the situation? For reference, my command was as shown below:

java -Xmx1g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs -V vcfs.chr${numchr}.new.list -o final_chr${numchr}.vcf -L ${numchr} # CatVariants (generate 1 vcf file with all inds and all chrs) java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted # VQSR java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches # Used excludeFiltered here java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches # Used excludeFiltered here java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf # Genotype Refinement Workflow java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS) The entire example is below: 1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0 Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high? Thank you very much in advance, Alva Created 2015-03-11 20:38:44 | Updated | Tags: vqsr best-practices variantfiltration variant-calling Hi all - I'm stumped and need your help. I'm following the GATK best practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 others samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but are not listed in the VCF. I discovered that the variant is being filtered out..reason being VQSRTranchesSNP99.00to99.90. The genotype is homozygous variant, DP is 243, Qual is 524742.54 and its known in dbSNP. I suspect this is happening to other variants. How do I adjust VQSR or how tranches are used and variants get placed in? I supposed I need to fine tune my parameters...but I would think something as obvious as this variant would pass Filtering. Created 2015-01-27 21:59:14 | Updated 2015-01-27 21:59:46 | Tags: best-practices snp gatk combinegvcfs gvcf Hi, I am combining gcvf files into single gvcf files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first gvcf file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing. For example, at position 16050036, this is what comes up in the final gvcf file: 22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0 But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site: 22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0 The command I'm using to generate these files is: java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L${numchr} where numchr is a variable previously defined (indicating the chromosome number).

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?

Thanks, Alva

Created 2015-01-16 06:37:45 | Updated | Tags: haplotypecaller best-practices next-generation-sequencing

Excuse me:

After calling the genotype using Haplotyper Caller in gatk, i manually check the reads covering the variant sites. But i found one exception: The genotype of one sample was 0/0，wild type,but the reads for this site were .,,,..,,c.,c.c,.cccc. I think almost all the reads should be . or ,. Is this case normal?

Created 2014-12-05 21:10:20 | Updated | Tags: bqsr best-practices gatk

Hi,

I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former document as my guide since it is more recent, but I am curious as to why these two are different. The first step is the same for the 2 docs, but then for the second step, the output is a recal.bam file in the brussels document instead of a post_recal.table file in the (howto) document. Then, I am confused about the (howto) document because it seems that the 4th step is exactly the same as the second step, except we generate a recal.bam file. Is the brussels document presenting steps that are more efficient? (since there is only 1 use of -BQSR to generate the recalibrated bam file we need, as opposed to 2 uses of the option)

Thanks, Alva

Created 2014-12-05 15:27:21 | Updated 2014-12-05 18:08:01 | Tags: haplotypecaller best-practices vcf gatk gvcf

Hi,

I used HaplotypeCaller in GVCF mode to generate a single sample GVCF, but when I checked my vcf file I see that the reference allele is not showing up:

22  1   .   N   <NON_REF>   .   .   END=16050022    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
22 16050023    .   C   <NON_REF>   .   .   END=16050023    GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,37
22  16050024    .   A   <NON_REF>   .   .   END=16050026    GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,73
22  16050027    .   A   <NON_REF>   .   .   END=16050035    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,110
22  16050037    .   G   <NON_REF>   .   .   END=16050037    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,109
22  16050038    .   A   <NON_REF>   .   .   END=16050039    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,153

I am not sure where to start troubleshooting for this, since all the steps prior to using HaplotypeCaller did not generate any obvious errors.

The basic command that I used was: java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Have you encountered this problem before? Where should I start troubleshooting?

Thanks very much in advance, Alva

Created 2014-11-25 14:27:16 | Updated | Tags: best-practices gatk

Chaps,

I am a newbie to GATK. I have started the best practices pipline but stuck in bwa. From where can I get the reference.fa file? The GATK bundle has .fasta .bam .fai but no .fa. Anyone please!

Created 2014-11-20 11:41:42 | Updated | Tags: best-practices gatk

Where can I get the fasta and ref files for experimenting on GATK?

Created 2014-09-30 19:53:52 | Updated | Tags: best-practices vcf

Hi, I am following the Best Practices for DNAseq analysis and have 2 quick questions:

1. I wanted to confirm if the VCF files produced after the VariantRecalibrator and ApplyRecalibration steps for SNP and for Indels are completely independent of each other. In other words, the VCF file produced after these two steps for SNPs (mode SNP) is just for SNPs and for Indels (mode INDEL) is just for Indels.
2. What is the source of the ALT alleles in the VCF file - is it the various annotation files that were used during the analysis? Thanks,
• Pankaj

Created 2014-08-14 19:37:20 | Updated | Tags: best-practices rnaseq variant-calling

Hi,

Thank you for providing guidelines on RNA-Seq variant discovery. For our data, we are currently playing with multiple mapping methods and have noticed that 2-step alignments work "better" than 1-step alignments. By 2-step alignments, I mean using STAR as step1 and then take the unmapped from this and use another aligner (like Bowtie2) for alignment. If I use such a methodology, will there be an issue in variant calling when during splitting cigar strings I ask it convert the 255 MAPQ to another value (like 60 in the best practices example), since bowtie2 gives different MAPQ scores. Sorry if this seems like a stupid question, but I am just a little curious how such a thing might affect the variant calls. Any insights/comment on this will be greatly appreciated.

Thanks!

Created 2014-07-31 14:05:07 | Updated | Tags: haplotypecaller best-practices

Hi,

I am wondering how HaplotypeCaller in GATK 3.1 handles multimappers? I thought I read that they are passed over for variant calling but stay in the realigned, recalibrated BAM for 'completions sake', like marking duplicates not removing them but cannot find supporting info on the website or from farther afield,

Presumably it is the NotPrimaryAlignmentFilter but there is no info on that posted yet. I know I can output a BAM from HC with haplotype info in there but can I just get reads used in variant calls? Or should I trim the BAM myself to retain reads I want used? I do this for mark duplicates (removed) but for multimappers I would like to know how you define so I can do the same. The reason is for coverage estimates, using bamtobed or such means I take all realigned, recalibrated which is many more lines including multimappers which skews my results.

Thanks,

Bruce.

Created 2014-06-10 20:57:36 | Updated | Tags: variantannotator best-practices vcf developer performance walkers java gatk gatk-walkers

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

Created 2013-12-31 02:11:46 | Updated 2013-12-31 02:12:36 | Tags: vqsr best-practices non-human

Hello there! Thanks as always for the lovely tools, I continue to live in them.

• Been wondering how best to interpret my VQSLOD plots/tranches and subsequent VQSLOD scores. Attached are those plots, and a histogram of my VQSLOD scores as they are found across my replicate samples.

Methods Thus Far

We have HiSeq reads of "mutant" and wt fish, three replicates of each. The sequences were captured by size selected digest, so some have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (TiTv independent).

As per my interpretation of the best practices, I did an initial calling of the variants (HaplotypeCaller) and filtered them very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again with that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors of each from the suggested human ones. The rest of my pipeline follows the best practices fairly closely, GATK version was 2.7-2, and my mapping was with BWA MEM.

My semi-educated guess..

The spike in VQSLOD I see for samples found across all six replicates are simply the rediscovery of those in my truth set, and those with amazing coverage, which is probably fine/good. The part that worries me are the plots and tranches. The plots don't ever really show a section where the "known" set clusters with one set of obviously good variants but not with another. Is that OK or does that and my inflated VQSLOD values ring of poor practice?

Created 2013-09-05 09:53:11 | Updated 2013-09-05 09:54:32 | Tags: unifiedgenotyper best-practices

We have 50 human whole genome data (12x) from the same population and we have already used GATK to call variants and followed the best practice for population level calling. We have then sequenced 20 more individual (12x) from the same population. For the population level variant calling do we have to put all the BAM files (50+20) together and then call the variants(in UnifiedGenotyper) ? Or we can somehow use the VCF file which was created from the previous 50 individual and use them as training set and call the variant for the new 20 individuals?

Created 2013-08-19 08:54:23 | Updated | Tags: best-practices webinar callvariants

I am not sure where I should ask this question, but the GATK forum seemed the most appropriate place. I am currently viewing the webinars from the BroadE Workshop 2013 July 9-10 and the "Call Variants" streaming doesn't seem to work:

Would you have a work around to view this file? or is the file inaccessible for some reason?

Created 2013-02-06 05:56:12 | Updated 2013-02-06 16:21:59 | Tags: best-practices callset

Hi, I am estimating SNP number for a genome of a speceis.

1. After mapping the illumina reads to a reference seqence (of a related species), I used picard to remove duplicates and used samtools (mpileup) to convert the bam file to a fastq file.
2. I also use GATK to do local reglimant and recalication from the mappnig bam file. Then I used samtools (mpileup) to convert the bam files output by GATK to a fastq file.

I found that the number of SNP in the fastq going through GATK is 10 times more than the first fastq. Interestingly, if I use picard to do duplicats-removomg again to the GATK bam and used samtools to convert the bam to fastq file. The SNP jumps back to 10 time fewer.

What can be the reason that the SNP number can be 10 time different between the two methods? Actually, I expect the GATK output file has fewer SNP given the effect of recaliraiton or relingement. But the result is opposite.

Chih-Ming

Created 2012-11-16 16:17:13 | Updated 2012-11-16 21:54:20 | Tags: best-practices

Hi:

for the best practice V4, for the step: 2. Raw reads to analysis-ready reads at phase I, I only have lane-level sample data (each sample in one lane, no sample in multi-lanes), so I only did the Fast: lane-level realignment (at known sites only) and lane-level recalibration, which is

for each lane.bam
dedup.bam <- MarkDuplicate(lane.bam)
realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
recal.bam <- recal(realigned.bam)

Since I do not have multi-lane samples, I do not have to run Fast + per-sample processing, Better: per-sample realignment with known indels and recalibration, or Best: multi-sample realignment with known sites and recalibration, right? It seems that all of these schemes, Better, Best etc, are only for situation when some samples run at multiple lanes, is my understanding correct?

at the paragraph for Fast + per-sample processing, it mentioned: cross-lane realignment, not sure what it means exactly. if every sample of mine are in a single lane, it does not need, right?

Thanks

Mike

Created 2012-11-13 10:21:02 | Updated 2013-01-07 20:04:51 | Tags: best-practices exome

Hi all, I have a 4 exome experiment (high coverage). According to best practices I can't apply VQSR, so I must filter on SNP properties. One thing is not clear: should all the option apply? In other words should I select using boolean "and" (i.e. QD  < 2.0 & MQ < 40.0 & FS > 60.0 & HaplotypeScore > 13.0 & MQRankSum < -12.5 & ReadPosRankSum < -8.0 for SNP) or "or" (i.e. QD  < 2.0 | MQ < 40.0 | FS > 60.0 | HaplotypeScore > 13.0 | MQRankSum < -12.5 | ReadPosRankSum < -8.0 for SNP)?

Created 2012-11-12 22:55:36 | Updated 2013-01-07 20:06:43 | Tags: indelrealigner bqsr best-practices baserecalibration

Hello,

I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:

I have been running GATK v1.6.2 on several samples. It seems the way I had initially had run GATK for indel-realignment and quality re-calibration steps are reversed. For example, in order of processing, I ran:

• MarkDuplicates
• Count Covariates
• Table Recalibration
• Realigner Target Creator
• Indel Realigner

What are the consequences and effect to SNP and INDEL calling if the GATK steps are ran as above?. I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at BaseQuality scores when making calls.

Any help on this would be appreciated.

Mike

Created 2012-11-09 20:38:55 | Updated 2013-01-07 20:06:07 | Tags: best-practices solid

Hi, I am working on SOLiD 5500xl data and used SHRiMP2.2.3 for performing mapping. The library type is paired-end. I have read some discussions regarding SOLiD problem but I still have some doubts regarding some steps in best practices

1. Local-Realignment at In-dels: Since local realignments take place in base space instead of color-space I doubt the accuracy of the alignment
2. Mark/Remove Duplicates: Reads just lying in the same position (end to end) may not necessarily be duplicates. Some of these reads may have putative variants, which otherwise may be filtered out.
3. Base quality score recalibration: I am not sure whether this is applicable for 5500xl as well, since quality values have slightly changed on 5500 from previous SOLiD versions as far as I know.

So after mapping, I simply used GATK UnifiedGenotyper to call SNPs and InDels under default values. I end up getting around 40 million variants. Is there any way I can get a more refined variant calling? Do you still consider me applying the above pre-processing steps or do you recommend me applying some variant filteratiion on the called variants? If yes for the previous, then could you explain how my above concerns are taken care of? I was trying to look at some general recommended filter values on INFO fields in VCF format such as BQ, MQ, MQ0, DP, SB etc. Do you recommend some generally used values of these fields on which I can filter and hence refine my variant data?

I may have posted a subset of the above question, which I am not sure was posted successfully since at that time I just created an account. If you have already answered this question then I apologize for that. Could you then provide me the link where you answered it?

Created 2012-10-27 07:32:18 | Updated 2012-10-29 21:37:57 | Tags: unifiedgenotyper variantrecalibrator haplotypecaller best-practices multi-sample

Hello,

I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.

The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?

Is there any difference or reason to choose one of the following two workflows over the other?

1. Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller and then with my multi-sample VCF perform the variant recalibration procedure and then split the individual call sets out of the multi-sample vcf with SelectVariants?

2. Take 30+ individual vcf files, merge them together, and then perform variant recalibration on the merged vcf and then split the individual call sets out of the multi-sample vcf with SelectVariants?

3. Or some third option I'm missing

Any help is appreciated.

Thanks

Created 2012-10-23 06:21:29 | Updated 2012-10-25 15:27:41 | Tags: bqsr best-practices known special-cases

Hi all!

I'm currently working on high-coverage non-human data (mammals).

After mapping via BWA, soritng and merging, my pipeline goes like this:

1. Remove duplicates via MarkDuplicates.
2. RealignerTargetCreator
3. IndelRealigner using the --consensusDeterminationModel USE_SW flag
4. Fix Mates

I currently want to begin the recalibration step before doing the actual variant calls via UnifiedGenotyper.

Since I'm working on non-human data, there is no thorough database I can trust as an input vcf file for the recalibration step.

What is your recommendation for this for non-human data?

Thank you very much!

Sagi

Created 2012-10-18 11:59:05 | Updated 2012-10-19 01:11:34 | Tags: best-practices genotype threshold

Hello,

I'm sorry if I'm being dense (I'm new to all this and it is making me feel very dense indeed!), but having read the section on 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear as to whether you mean I should be looking for a QUAL score of at least 30 in a deep coverage data set and should filter out any suggested SNPs that don't meet this, or a GQ score of 30 in each individual sample genotyped at the SNP in question and I only need to filter out individual samples that don't meet this threshold.

I have pasted the bit of text I read below, just to make it clear to which bit I am referring.

Many thanks!

A common question is the confidence score threshold to use for variant detection. We recommend:

Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.

Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.

Created 2012-10-16 14:35:13 | Updated 2012-10-16 21:22:21 | Tags: best-practices iontorrent

Hi,

I wonder how well GATK works with Ion Torrent data. Is there any recommended practice to handle Ion Torrent data, especially SNP and indel calling, with GATK?

Thanks, XZ

Created 2012-08-29 10:02:02 | Updated 2013-01-07 20:42:47 | Tags: best-practices readgroup realignment baserecalibration

Hello,

I am new at using GATK (v 2.1-3). I do exome sequencing by sample using the following steps: Alignment with BWA (0.6.2) GATK :Local realignment around INDELs PICARD (1.67): FixMateInformation GATK: Recalibration (BaseRecalibrator + PrintReads -BQSR) Samtools for calling variants

## Samtools seems to run properly but no file (.vcf and .bcf) are created and no error message is prompted :

cd Sample_09

• samtools mpileup -BE -ug -q 20 -Q 20 -D -f human_g1k_v37.fasta realigned_fixed_recal.bam -C50
• bcftools view -bvcg - [mpileup] 1 samples in 1 input files Set max per-file depth to 8000 [bcfview] 100000 sites processed. [afs] 0:89274.054 1:6411.053 2:4314.893 [bcfview] 200000 sites processed. [afs] 0:89100.642 1:6125.883 2:4773.474 [bcfview] 300000 sites processed. [afs] 0:87374.996 1:7439.238 2:5185.766 [bcfview] 400000 sites processed. [afs] 0:87890.186 1:7087.628 2:5022.185 [bcfview] 500000 sites processed. [afs] 0:85322.061 1:8454.843 2:6223.096 [bcfview] 600000 sites processed. [afs] 0:85864.368 1:8410.777 2:5724.854 [bcfview] 700000 sites processed. [afs] 0:88813.814 1:6828.001 2:4358.185 [bcfview] 800000 sites processed. [afs] 0:89070.318 1:6302.924 2:4626.758 [bcfview] 900000 sites processed. [afs] 0:88364.380 1:6796.962 2:4838.658 [bcfview] 1000000 sites processed. [afs] 0:86892.531 1:7268.088 2:5839.381 [bcfview] 1100000 sites processed. [afs] 0:87184.845 1:7153.073 2:5662.081 [bcfview] 1200000 sites processed. [afs] 0:86762.756 1:7241.236 2:5996.008 [bcfview] 1300000 sites processed. [afs] 0:89346.143 1:6159.989 2:4493.868 [bcfview] 1400000 sites processed. [afs] 0:88658.355 1:7160.555 2:4181.089 [bcfview] 1500000 sites processed. [afs] 0:85985.305 1:8308.039 2:5706.656 [bcfview] 1600000 sites processed. [afs] 0:87346.636 1:7708.883 2:4944.480 [afs] 0:63097.202 1:3950.127 2:3572.670
• bcftools view .bcf
• ## cd ..

I have seen that some groups use after realignment Picard:AddOrReplaceReadGroups and I wonder if I should use before calling the variant with samtools.