Tagged with #workflow
3 documentation articles | 0 announcements | 4 forum discussions



Created 2015-08-14 01:57:37 | Updated 2015-08-14 01:58:03 | Tags: best-practices workshop workflow presentations
Comments (0)

Joel Thibault, Valentin Ruano-Rubio and Geraldine Van der Auwera presented this workshop in Edinburgh, Scotland, and Cambridge, England, upon invitation from the Universities of Edinburgh and Cambridge.

This workshop included two modules:

  • Best Practices for Variant Calling with the GATK

    The core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, which key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

  • Beyond the Best Practices

    Additional considerations such as calling variants in RNAseq data and calling cohorts efficiently, as well as dealing with non-human data, whole-genome vs. exome, basic quality control, and performance.

This was complemented by a set of hands-on exercises aiming to teach basic GATK usage to new users.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.


Created 2015-08-14 01:50:16 | Updated 2015-08-14 01:56:56 | Tags: best-practices workshop workflow presentations
Comments (4)

The full GATK team presented this workshop at the Broad Institute with support from the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, which key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.


Created 2012-07-23 17:05:10 | Updated 2013-03-25 22:18:53 | Tags: dataprocessingpipeline queue workflow pacbio qscript intermediate
Comments (17)

Introduction

Processing of data originating from the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented on numerous occasions. The guidelines we describe in this document are the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. For the datasets tested, these guidelines produced better results than the alternative pipelines available at the time (September 2011), but there is no guarantee that they will be the best for every dataset or that other pipelines won't supersede them in the future.

The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA's bwasw algorithm and relax the gap open penalty to account for the excess of insertions and deletions, which are known to be typical error modes of the data. For an idea of which parameters to use, check the Pacbio-specific suggestions given by the BWA author in the BWA manual page. The goal is to account for the known error mode of the Pacific Biosciences RS and to benefit from the long reads for a high-scoring overall match. (For older versions, you can use filtered_subreads.fasta and combine it with the base quality scores extracted from the h5 files using the Pacific Biosciences SMRT pipeline Python tools.)
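As a rough illustration of the alignment step, the sketch below assembles a `bwa bwasw` command line in Python without running it. The options `-b`, `-q`, `-r` and `-t` are bwasw's mismatch penalty, gap open penalty, gap extension penalty and thread count. The numeric values and the reference file name are placeholders chosen for illustration only, not settings recommended by this document; check the BWA manual for Pacbio-specific suggestions.

```python
def bwasw_command(reference, reads, mismatch_penalty, gap_open, gap_extend, threads=1):
    """Build (but do not run) a `bwa bwasw` command as an argument list."""
    return [
        "bwa", "bwasw",
        "-b", str(mismatch_penalty),  # mismatch penalty
        "-q", str(gap_open),          # gap open penalty (relaxed for indel-rich reads)
        "-r", str(gap_extend),        # gap extension penalty
        "-t", str(threads),           # number of threads
        reference,
        reads,
    ]

# Placeholder values and file names, for illustration only.
cmd = bwasw_command(
    "reference.fasta", "filtered_subreads.fq",
    mismatch_penalty=5, gap_open=2, gap_extend=1, threads=4,
)
print(" ".join(cmd))
```

Because bwasw writes SAM to standard output, running the command would typically mean redirecting that output to a file, for example with `subprocess.run(cmd, stdout=sam_file, check=True)`.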

To produce a BAM file that is sorted by coordinate and has adequate read group information, we use two Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practice to have your BAM file conform to these specifications.
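The two Picard steps could be scripted along these lines. This is a minimal sketch that only builds the command lines; it assumes a Picard release that ships a single `picard.jar` (older releases shipped one jar per tool, in which case the tool name argument is dropped and the jar is, for example, `SortSam.jar`). All file names and read group values are hypothetical.

```python
def sort_sam_command(picard_jar, in_sam, out_bam):
    """Command to sort alignments by coordinate with Picard SortSam."""
    return [
        "java", "-jar", picard_jar, "SortSam",
        f"INPUT={in_sam}",
        f"OUTPUT={out_bam}",
        "SORT_ORDER=coordinate",
    ]

def add_read_groups_command(picard_jar, in_bam, out_bam,
                            sample, library, platform_unit,
                            platform="PACBIO", rg_id="1"):
    """Command to attach read group information with AddOrReplaceReadGroups."""
    return [
        "java", "-jar", picard_jar, "AddOrReplaceReadGroups",
        f"INPUT={in_bam}",
        f"OUTPUT={out_bam}",
        f"RGID={rg_id}",
        f"RGLB={library}",
        f"RGPL={platform}",
        f"RGPU={platform_unit}",
        f"RGSM={sample}",
    ]

# Hypothetical file names and read group values.
sort_cmd = sort_sam_command("picard.jar", "aligned.sam", "sorted.bam")
rg_cmd = add_read_groups_command(
    "picard.jar", "sorted.bam", "sorted.rg.bam",
    sample="sample1", library="lib1", platform_unit="unit1",
)
print(" ".join(sort_cmd))
print(" ".join(rg_cmd))
```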

Best Practices for Variant Calling

Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. the latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according to the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.
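To make the idea of an empirical quality score concrete, here is a simplified sketch of the underlying arithmetic. It is not the GATK's implementation: it bins bases only by read group and reported quality (the dinucleotide covariate would add the previous and current base to the key), it assumes sites in the known callset have already been excluded, and the add-one smoothing is an assumption made to avoid taking the log of zero.

```python
import math
from collections import defaultdict

def empirical_qualities(observations):
    """Estimate a Phred-scaled empirical quality per (read group, reported quality) bin.

    observations: iterable of (read_group, reported_quality, is_mismatch) tuples
    for bases at sites that are NOT in the known callset.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for read_group, reported_quality, is_mismatch in observations:
        bin_counts = counts[(read_group, reported_quality)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1

    qualities = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing keeps the error rate strictly between 0 and 1.
        error_rate = (mismatches + 1) / (total + 2)
        qualities[key] = -10 * math.log10(error_rate)
    return qualities

# Made-up data: 998 bases reported as Q30, of which 9 mismatch the reference.
obs = [("rg1", 30, i < 9) for i in range(998)]
quals = empirical_qualities(obs)
print(round(quals[("rg1", 30)], 2))
```

With these made-up counts the smoothed error rate is 10/1000, so bases reported as Q30 come out at about Q20 empirically, which is the kind of discrepancy recalibration is meant to correct.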

Problems with Variant Calling with Pacific Biosciences

  • Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

  • Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score, and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here: you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold, and modify as necessary.

  • Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences instrument, we often have to set the gap open penalty lower than the base mismatch penalty in order to maximize alignment performance. Although most of the reads align successfully, this has the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note, however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Biosciences data, but we expect new tools to handle this problem in the future. Ultimately, reference bias will mask real calls and you will have to inspect these by hand.
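The reference bias described in the last point can be illustrated with a little scoring arithmetic. The sketch below assumes an affine gap model in which a gap of length k costs gap_open + k * gap_extend, and considers just one simple way a SNP can be hidden: the aligner represents it as a one-base insertion plus a one-base deletion instead of a mismatch. The penalty values are made up for illustration.

```python
def gap_cost(gap_open, gap_extend, length):
    """Affine gap cost: pay the open penalty once, then extend per base."""
    return gap_open + gap_extend * length

def snp_hidden_in_indels(mismatch_penalty, gap_open, gap_extend):
    """True if a 1-base insertion plus a 1-base deletion scores better than a mismatch."""
    indel_pair_cost = 2 * gap_cost(gap_open, gap_extend, 1)
    return indel_pair_cost < mismatch_penalty

# Made-up penalties: permissive gaps versus strict gaps, same mismatch penalty.
permissive = snp_hidden_in_indels(mismatch_penalty=5, gap_open=1, gap_extend=1)
strict = snp_hidden_in_indels(mismatch_penalty=5, gap_open=5, gap_extend=2)
print(permissive, strict)
```

With the permissive penalties the indel pair costs 4 against a mismatch cost of 5, so the SNP would be reported as indels; with the strict penalties the pair costs 14 and the mismatch is kept.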


Created 2015-07-21 22:33:32 | Updated 2015-07-21 22:33:53 | Tags: workflow pipeline exome cancer
Comments (1)

I am new to Bioinformatics, and would like some advice on changes to the GATK workflow for cancer. I was told that the cancer workflow is different, and see that several different tools are available.

I have Exome data from tumor and normal. I have aligned them, and have BAM files for each sample. I am interested in identifying somatic variants.

The current workflow as I understand it is:

  • (Non-GATK) Picard MarkDuplicates or Samtools rmdup
  • Indel Realignment (RealignerTargetCreator + IndelRealigner)
  • Base Quality Score Recalibration (BaseRecalibrator + PrintReads)
  • UnifiedGenotyper
  • Annotation using Oncotator (?)
  • MuTect (identify somatic mutations)

My questions are:

  1. Is the above workflow reasonable/correct for what I'm trying to do?
  2. Is there any difference running samples one pair at a time, or running them all together? (I have 57 pairs. Should I do 57 runs of normal-tumor pairs, or 1 run of all 57 pairs?)

Thank you, Gaius


Created 2014-02-12 20:31:19 | Updated | Tags: variantrecalibrator workflow multi-sample variant-calling
Comments (1)

Hi there,

We are sequencing a set of regions that covers about 1.5 megabases in total. We're running into problems with VQSR -- VariantRecalibrator says there are too few variants to do recalibration. To give a sense of numbers, in one sample we have about 3000 SNVs and 600 indels.

We seem to have too few indels to do VQSR on them and have a couple of questions:

  1. Can we combine multiple samples to increase the number of variants, or does VariantRecalibrator need to work on each sample individually?

  2. If we do not use VQSR for indels, should we also avoid VQSR for the SNPs?

  3. The other question is whether joint or batch variant calling across several samples would help us in this case?

Thanks in advance!


Created 2013-10-30 08:31:12 | Updated | Tags: license workflow graphics permission
Comments (3)

I have used GATK in my PhD project, and was wondering if I could get the permission to use the Best Practices workflow graphics [1] in my doctoral dissertation? How should I attribute your copyright?

[1] http://www.broadinstitute.org/gatk/img/BP_workflow.png


Created 2013-06-12 16:09:40 | Updated | Tags: commandlinegatk workflow rnaseq
Comments (15)

Hi all: I find that among all the GATK workflows at http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows there are none for RNA-seq analysis. I understand that GATK mainly focuses on variant calling. Can anyone tell me how to use GATK for RNA-seq analysis?

thanks daniel