Tagged with #tutorial
4 documentation articles | 2 announcements | 0 forum discussions



Created 2016-04-27 16:02:35 | Updated | Tags: tutorials tutorial firecloud

Comments (0)

FireCloud Webinars are now available on the FireCloud YouTube Channel.


Created 2016-04-21 17:14:27 | Updated 2016-04-27 15:18:12 | Tags: tutorials tutorial mutect picard qc oncotator contest mutect2 firecloud

Comments (0)

Broad Mutation Calling Workflow

Overview

The Broad Mutation Calling Workflow is a somatic mutation analysis and annotation pipeline. It is broken down into four Best Practice workspaces in FireCloud:

  • Broad_MutationCalling_MuTect_Workflow_V1_BestPractice

  • Broad_MutationCalling_QC_Workflow_V1_BestPractice

  • Broad_MutationCalling_Copy-Number_Workflow_V1_BestPractice (under construction)

  • Broad_MutationCalling_Filtering_Workflow_V1_BestPractice (under construction)

Broad_MutationCalling_MuTect_Workflow_V1_BestPractice

The Broad Mutation Calling MuTect workflow runs on a tumor-normal pair or pair set and consists of three main steps: 1) preparation, 2) somatic mutation analysis, and 3) somatic mutation annotation. As part of this workflow, MuTect is run in force-call mode to search for somatic variants at a set of specified loci of clinical relevance (for example, the TP53 gene on chromosome 17).

The three main steps are described here:

1. Preparation: The tumor and normal BAMs are split up by chromosome, conditioned on interval overlap: for each chromosome, the BAM files are decomposed only if the supplied interval file overlaps that chromosome. In this part of the workflow, the supplied intervals come from an Agilent targeted sequencing system. This step may be known as "scatter preparation."

2. Somatic Mutation Analysis: Two somatic variant analyses are carried out in parallel: MuTect1 and MuTect2. MuTect1 is used for its somatic single-nucleotide variant detection capabilities; MuTect2 is used for its indel detection capabilities. The programs are run in parallel over the BAM and interval splits produced in the preparation step. This step may be known as "scatter."

3. Somatic Mutation Gather: This final step annotates the variants found in the previous step. First, the results are gathered and merged. Then the indel variants from the MuTect2 output are separated and merged with the single-nucleotide variants from MuTect1. Next, only variants tagged with PASS are fed to Oncotator for annotation. Finally, all variants are passed to the VEP (Variant Effect Predictor) program for further annotation.

Runtime

The total expected runtime for the Broad Mutation Calling MuTect workflow depends on the size of the pair you select for analysis. Pair HCC1954 runs on "mini" BAMs and the expected runtime is roughly 45 minutes. Pair HCC1143 runs on larger BAMs and the expected runtime is roughly 1.5 hours. If you run both pairs as a pair set, each pair completes in roughly its individual expected runtime, with HCC1954 finishing first.

MuTect1

MuTect1 identifies somatic point mutations in next-generation sequencing data from cancer genomes. It takes sequencing data from matched normal and tumor tissue samples as input, and outputs mutation calls and, optionally, coverage results.

In a nutshell, the analysis itself consists of three steps:

  • Pre-process the aligned reads in the tumor and normal sequencing data

  • Identify, using statistical analysis, sites that are likely to carry somatic mutations with high confidence

  • Post-process the candidate somatic mutations

For complete details, please see the 2013 publication in Nature Biotechnology.
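In FireCloud, MuTect1 runs inside the workflow's tasks, but it can help to see the shape of a standalone invocation. The sketch below is illustrative only: the jar path, file names, and intervals are placeholders, and you should confirm the exact options against the MuTect documentation for your version.

java -jar muTect.jar \
    --analysis_type MuTect \
    --reference_sequence reference.fasta \
    --dbsnp dbsnp.vcf \
    --cosmic cosmic.vcf \
    --intervals targets.interval_list \
    --input_file:normal normal.bam \
    --input_file:tumor tumor.bam \
    --out MuTect1.call_stats.txt \
    --coverage_file coverage.wig

The --out and --coverage_file arguments correspond to the CallStatsFile and CoverageFile outputs listed under Inputs and Outputs below.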

MuTect2

MuTect2 is a somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect (Cibulskis et al., 2013) with the assembly-based machinery of HaplotypeCaller.

The basic operation of MuTect2 proceeds similarly to that of the HaplotypeCaller. While the HaplotypeCaller relies on a ploidy assumption (diploid by default) to inform its genotype likelihood and variant quality calculations, MuTect2 allows for a varying allelic fraction for each variant, as is often seen in tumors with purity less than 100%, multiple subclones, and/or copy number variation (either local or aneuploidy). MuTect2 also differs from the HaplotypeCaller in that it applies some hard filters to variants before producing output.
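MuTect2 ships as a GATK tool from GATK 3.5 onward, so a standalone run follows the usual GATK command structure. A minimal sketch, assuming a GATK 3.x jar and placeholder file names; check your version's documentation for the supported arguments:

java -jar GenomeAnalysisTK.jar \
    -T MuTect2 \
    -R reference.fasta \
    -I:tumor tumor.bam \
    -I:normal normal.bam \
    --dbsnp dbsnp.vcf \
    --cosmic cosmic.vcf \
    -L targets.interval_list \
    -o MuTect2_calls.vcf

Unlike MuTect1, which writes a call-stats table, MuTect2 emits a VCF; it is from this VCF that the workflow extracts the indel calls merged with the MuTect1 SNVs.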

Oncotator

Oncotator is a tool for annotating information onto genomic point mutations (SNPs/SNVs) and indels. It is primarily used for human genome variant callsets. However, the tool can also be used to annotate any kind of information onto variant callsets from any organism.

By default, Oncotator takes a simple TSV file (e.g., MAFLITE) as input and produces a TCGA MAF as output. Oncotator also supports VCF as both an input and an output format. It can additionally be configured to produce HTML reports of the annotated point mutations, as it does in this workflow.
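As an illustration of the default TSV-to-MAF mode, a standalone Oncotator run looks roughly like the following. The flag spellings and database directory are assumptions based on typical Oncotator usage and may differ between releases:

oncotator -v --db-dir oncotator_db/ \
    --input_format MAFLITE --output_format TCGAMAF \
    input_calls.maflite annotated_calls.maf hg19

The three positional arguments are the input callset, the output file, and the genome build.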

VEP (Variant Effect Predictor)

Ensembl’s VEP (Variant Effect Predictor) program processes variants for further annotation. The tool annotates variants and determines their effect on relevant transcripts and proteins.

VEP accepts as input the coordinates and alleles of identified variants, or variant identifiers. A full list of input formats is available here. If a variant that you enter as input causes a change in the protein sequence, VEP calculates the possible amino acids at that position and assigns the variant a consequence type of missense.
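As a sketch, the script-based VEP of this era can be run on a VCF roughly as follows, assuming a local annotation cache has been installed (option names may differ between VEP releases):

perl variant_effect_predictor.pl \
    --input_file filtered_calls.vcf \
    --output_file variant_effect_output.txt \
    --cache --offline

By default VEP also writes an HTML summary next to the main output, which is where the variant_effect_output.txt and variant_effect_output.txt_summary.html files in the outputs below come from.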

Inputs and Outputs

Below are the tool-specific inputs and outputs for this workflow. Note that these inputs and outputs may vary from the “standard” version of the tool.

MuTect1 and MuTect2 Inputs

  • normalBam

  • normalBamIndex

  • tumorBam

  • tumorBamIndex

  • ReferenceFasta

  • NormalPanel

  • ReferenceFastaIndex

  • ReferenceFastaDict

  • COSMICVCF

  • DBSNPVCF

  • fracContam

  • MutectIntervals

MuTect1 and MuTect2 Outputs

  • MuTect PowerFile

  • MuTect CoverageFile

  • MuTect CallStatsFile

Oncotator and VEP Inputs

  • MuTect PowerFile

  • MuTect CoverageFile

  • MuTect CallStatsFile

  • oncoDBTarBall

  • VEP_File

  • ReferenceFastaIndex

  • ReferenceFasta

  • ReferenceFastaDict

Oncotator and VEP Outputs

  • Array[File] all_outs=glob("*")

  • File oncotator.log

  • File MuTect.1.2.call_stats.M1All.M2IndelsOnly.filtered.vcf.annotated.vcf

  • File MuTect1.call_stats.txt

  • File MuTect2.call_stats.txt

  • File variant_effect_output.txt

  • File variant_effect_output.txt_summary.html

How to Run this Workflow in FireCloud

  • Access the Broad_MutationCalling_MuTect_Workflow_V1_BestPractice workspace to run this workflow.
  • Navigate to the Method Configurations tab and click on the method, MutationCalling_MuTect.
  • Click Launch Analysis.
  • In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.
  • Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.
  • When the status displays Done, select the most recent analysis run, e.g., HCC1143.
  • Click on Outputs: Show, then select output files to view the results of this analysis.

Use of Tutorial Workspaces

Tutorial workspaces contain open access data and workflows. These Tutorial workspaces will appear in your account after your account is activated.

Tutorial workspaces may ONLY be used for running the tutorial exercises. You must not upload your own data sets to these workspaces, nor should you add tools (Method Configs). If you do not follow these guidelines, your FireCloud account may be deactivated.


Broad_MutationCalling_QC_Workflow_V1_BestPractice

Runtime

The total expected runtime for this workflow depends on the size of the pair or pair set you select for analysis. Pair HCC1954 runs on "mini" BAMs and the expected runtime is roughly 15 minutes. Pair HCC1143 runs on larger BAMs and the expected runtime is roughly 2.5 hours. If you run both pairs as a pair set, each pair completes in roughly its individual expected runtime, with HCC1954 finishing first.

QC Task

The QC task counts reads overlapping specified regions in the tumor and normal BAM files. The task concludes with a report of the counts over the BAMs and lanes. Correlation values are included for comparison purposes.

ContEST Task

ContEst uses a Bayesian approach to calculate the posterior probability of the contamination level and determine the maximum a posteriori probability (MAP) estimate of the contamination level. ContEst supports an array-free mode, in which genotypes are called on the fly from the matched normal and used as the source of homozygous variant calls. It currently calls any site with at least 50X coverage where more than 80% of bases support the alternate allele as a homozygous alternate site.
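ContEst is distributed as a GATK walker, so a standalone invocation follows the familiar GATK pattern. A minimal sketch with placeholder file names, where the -I:eval and -I:genotype tags designate the tumor and the matched normal used for on-the-fly genotyping; consult the ContEst documentation for your GATK version:

java -jar GenomeAnalysisTK.jar \
    -T ContEst \
    -R reference.fasta \
    -I:eval tumor.bam \
    -I:genotype normal.bam \
    --popfile hapmap_population_af.vcf \
    -L exome_targets.interval_list \
    -isr INTERSECTION \
    -o contamination.txt

A contamination estimate of this kind is the sort of value supplied as the fracContam input of the MuTect workflow described above.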

Picard Metrics Tasks

Picard Metrics Tasks invoke multiple metrics reporting routines from the Picard toolkit. The following routines are invoked:

  • CollectAlignmentSummaryMetrics

  • CollectInsertSizeMetrics

  • QualityScoreDistribution

  • MeanQualityByCycle

  • CollectBaseDistributionByCycle

  • CollectSequencingArtifactMetrics

  • CollectQualityYieldMetrics

  • CollectOxoGMetrics

SAM/BAM validation routines (ValidateSamFile) are called in both summary and verbose mode.

Other routines, such as CollectHsMetrics and MarkDuplicates, are also called.
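Each routine is a standard Picard command-line program. For example, the alignment summary and validation steps look roughly like this when run by hand (jar path and file names are placeholders):

java -jar picard.jar CollectAlignmentSummaryMetrics \
    R=reference.fasta I=tumor.bam O=tumor.alignment_summary_metrics.txt

java -jar picard.jar ValidateSamFile I=tumor.bam MODE=SUMMARY

Running ValidateSamFile once with MODE=SUMMARY and once with MODE=VERBOSE produces the validation_summary.txt and validation_verbose.txt outputs listed below.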

Inputs and Outputs

Below are the tool-specific inputs and outputs for this workflow. Note that these inputs and outputs may vary from the “standard” version of the tool.

QC Inputs

  • tumorBam

  • tumorBamIdx

  • normalBam

  • normalBamIdx

  • regionFile

  • readGroupBlackList

  • captureNormalsDBRCLZip

  • caseIdNoSpace

  • controlIdNoSpace

QC Outputs

  • tumorBamLaneList

  • normalBamLaneList

  • tumorRCL

  • normalRCL

  • CopyNumQC

ContEST Inputs

  • tumorBam

  • tumorBamIdx

  • normalBam

  • normalBamIdx

  • refFasta

  • refFastaIdx

  • refFastaDict

  • exomeIntervals

  • SNP6Bed

  • HapMapVCF

  • pairName

ContEST Outputs

  • contamination_validation.array_free.txt

  • contamination.af.txt

  • contamination.base_report.txt

  • contest_validation.output.tsv

Picard Metrics Inputs (Tumor)

  • tumorBam

  • tumorBamIdx

  • refFasta

  • HapMapVCF

  • picardHapMap

  • picardBaitIntervals

  • picardTargetIntervals

  • DB_SNP_VCF

  • DB_SNP_VCF_IDX

Picard Metrics Inputs (Normal)

  • normalBam

  • normalBamIdx

  • refFasta

  • HapMapVCF

  • picardHapMap

  • picardBaitIntervals

  • picardTargetIntervals

  • DB_SNP_VCF

  • DB_SNP_VCF_IDX

Picard Metrics Outputs

  • HSMetrics.txt

  • picard_crosscheck_report.txt

  • validation_summary.txt

  • validation_verbose.txt

  • picard_multiple_metrics.zip

  • oxoG_metrics.txt

  • duplicate_records.bam

  • duplicate_metrics.txt

  • sequencing_artifact_metrics.txt.bait_bias_detail_metrics

  • sequencing_artifact_metrics.txt.bait_bias_summary_metrics

  • sequencing_artifact_metrics.txt.pre_adapter_detail_metrics

  • sequencing_artifact_metrics.txt.pre_adapter_summary_metrics

How to Run this Workflow in FireCloud

  • Access the Broad_MutationCalling_QC_Workflow_V1_BestPractice workspace to run this workflow.
  • Navigate to the Method Configurations tab and click on the method, MutationCalling_QC.
  • Click Launch Analysis.
  • In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.
  • Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.
  • You can view results and outputs by clicking on the most recent analysis run, e.g., HCC1143.
  • Click on Outputs: Show, then select output files to view the results of this analysis.

Use of Tutorial Workspaces

Tutorial workspaces contain open access data and workflows. These Tutorial workspaces will appear in your account after your account is activated.

Tutorial workspaces may ONLY be used for running the tutorial exercises. You must not upload your own data sets to these workspaces, nor should you add tools (Method Configs). If you do not follow these guidelines, your FireCloud account may be deactivated.



Created 2016-01-20 15:38:23 | Updated 2016-04-27 15:16:06 | Tags: tutorials tutorial firecloud gistic

Comments (0)

Overview

GISTIC identifies genomic regions that are significantly gained or lost across a set of tumors. It takes segmented copy number ratios as input, separates arm-level events from focal events, and then performs two tests: (i) identifies significantly amplified/deleted chromosome arms; and (ii) identifies regions that are significantly focally amplified or deleted. For the focal analysis, the significance levels (Q values) are calculated by comparing the observed gains/losses at each locus to those obtained by randomly permuting the events along the genome to reflect the null hypothesis that they are all 'passengers' and could have occurred anywhere. The locus-specific significance levels are then corrected for multiple hypothesis testing. The arm-level significance is calculated by comparing the frequency of gains/losses of each arm to the expected rate given its size. The method outputs genomic views of significantly amplified and deleted regions, as well as a table of genes with gain or loss scores.

CNV Description

Regions of the genome that are prone to germ line variations in copy number are excluded from the GISTIC analysis using a list of germ line copy number variations (CNVs). A CNV is a DNA sequence that may be found at different copy numbers in the germ line of two different individuals. Such germ line variations can confound a GISTIC analysis, which finds significant somatic copy number variations in cancer. A more in-depth discussion is provided in [6]. GISTIC currently uses two CNV exclusion lists. One is based on the literature describing copy number variation, and a second comes from an analysis of significant variations among the blood normals in the TCGA data set.

Inputs

Segmentation File: The segmentation file contains the segmented data for all the samples identified by GLAD, CBS, or some other segmentation algorithm. (See the GLAD file format in the GenePattern file formats documentation.) It is a six-column, tab-delimited file with an optional first line identifying the columns. Positions are in base pair units. The column headers are: (1) Sample (sample name), (2) Chromosome (chromosome number), (3) Start Position (segment start position, in bases), (4) End Position (segment end position, in bases), (5) Num markers (number of markers in segment), (6) Seg.CN (log2(copy number) - 1, i.e., the log2 copy-number ratio relative to a diploid genome).
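For illustration, the first few lines of a minimal segmentation file might look like this (sample names and values are hypothetical; fields are tab-separated):

Sample    Chromosome    Start Position    End Position    Num markers    Seg.CN
TUMOR_01    1    61735    16844489    2793    -0.216
TUMOR_01    1    16844540    121500000    8321    0.058
TUMOR_02    1    61735    121500000    11114    0.012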

Markers File: The markers file identifies the marker names and positions of the markers in the original dataset (before segmentation). It is a three-column, tab-delimited file with an optional header. The column headers are: (1) Marker Name, (2) Chromosome, (3) Marker Position (in bases).
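A matching markers file could begin like this (marker names are hypothetical):

Marker Name    Chromosome    Marker Position
MRK_0001    1    61735
MRK_0002    1    77810
MRK_0003    1    82734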

Reference Genome: The reference genome file contains information about the location of genes and cytobands on a given build of the genome. Reference genome files are created in Matlab and are not viewable with a text editor.

CNV Files: There are two options for the CNV file. The first option allows CNVs to be identified by marker name. The second option allows the CNVs to be identified by genomic location. Option #1: A two-column, tab-delimited file with an optional header row. The marker names given in this file must match the marker names given in the markers file. The CNV identifiers are for user use and can be arbitrary. The column headers are: (1) Marker Name, (2) CNV Identifier. Option #2: A six-column, tab-delimited file with an optional header row. The 'CNV Identifier' is for user use and can be arbitrary. 'Narrow Region Start' and 'Narrow Region End' are also not used. The column headers are: (1) CNV Identifier, (2) Chromosome, (3) Narrow Region Start, (4) Narrow Region End, (5) Wide Region Start, (6) Wide Region End.

Amplification Threshold: Threshold for copy number amplifications. Regions with a log2 ratio above this value are considered amplified.

Deletion Threshold: Threshold for copy number deletions. Regions with a log2 ratio below the negative of this value are considered deletions.

Cap Values: Minimum and maximum cap values on analyzed data. Regions with a log2 ratio greater than the cap are set to the cap value; regions with a log2 ratio less than the negative of the cap are set to -cap. Values must be positive.

Broad Length Cutoff: Threshold used to distinguish broad from focal events, given in units of fraction of chromosome arm.

Remove X-Chromosome: Flag indicating whether to remove data from the X-chromosome before analysis. Allowed values = {1,0} (1: remove X-chromosome; 0: do not remove X-chromosome).

Confidence Level: Confidence level used to calculate the region containing a driver.

Join Segment Size: Smallest number of markers to allow in segments from the segmented data. Segments that contain fewer than this number of markers are joined to the neighboring segment that is closest in copy number.

Arm Level Peel Off: Flag set to enable arm-level peel-off of events during peak definition. The arm-level peel-off enhancement to the arbitrated peel-off method assigns all events in the same chromosome arm of the same sample to a single peak. It is useful when peaks are split by noise or chromothripsis. Allowed values = {1,0} (1: use arm-level peel-off; 0: use normal arbitrated peel-off).

Maximum Sample Segments: Maximum number of segments allowed for a sample in the input data. Samples with more segments than this threshold are excluded from the analysis.

Gene GISTIC: When enabled (value = 1), this option causes GISTIC to analyze deletions using genes instead of array markers to locate the lesion. In this mode, the copy number assigned to a gene is the lowest copy number among the markers that represent the gene.
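In the stand-alone GISTIC 2.0 distribution, these inputs and parameters map onto command-line flags of the run script. The sketch below uses flag names from the GISTIC 2.0 example run script and placeholder file names; confirm the spellings against the documentation for your release:

./gistic2 -b results_dir -seg segmentation.seg -mk markers.txt \
    -refgene hg19.mat -cnv cnv.txt \
    -ta 0.1 -td 0.1 -cap 1.5 -brlen 0.7 -rx 1 \
    -conf 0.99 -js 4 -armpeel 1 -maxseg 2500 -genegistic 1

Here -ta and -td are the amplification and deletion thresholds, -cap the cap value, -brlen the broad length cutoff, -rx the remove X-chromosome flag, -conf the confidence level, -js the join segment size, -armpeel the arm-level peel-off flag, -maxseg the maximum sample segments, and -genegistic the gene-level deletion analysis flag.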

Outputs

All Lesions File

E.g., all_lesions.conf_XX.txt, where XX is the confidence level

The all lesions file summarizes the results from the GISTIC run. It contains data about the significant regions of amplification and deletion as well as which samples are amplified or deleted in each of these regions. The identified regions are listed down the first column, and the samples are listed across the first row, starting in column 10.

Region Data

Columns 1-9 present the data about the significant regions as follows:

(1) Unique Name: A name assigned to identify the region

(2) Descriptor: The genomic descriptor of that region.

(3) Wide Peak Limits: The "wide peak" boundaries most likely to contain the targeted genes. These are listed in genomic coordinates and marker (or probe) indices.

(4) Peak Limits: The boundaries of the region of maximal amplification or deletion.

(5) Region Limits: The boundaries of the entire significant region of amplification or deletion.

(6) q-values: The q-value of the peak region.

(7) Residual q-values: The q-value of the peak region after removing ("peeling off") amplifications or deletions that overlap other, more significant peak regions in the same chromosome.

(8) Broad or Focal: Identifies whether the region reaches significance due primarily to broad events (called "broad"), focal events (called “focal”), or independently significant broad and focal events (called “both”).

(9) Amplitude Threshold: Key giving the meaning of values in the subsequent columns associated with each sample.

Sample Data

Each of the analyzed samples is represented in one of the columns following the lesion data (columns 10 through end). The data contained in these columns varies slightly by section of the file.

The first section can be identified by the key given in column 9 – it starts in row 2 and continues until the row that reads "Actual Copy Change Given." This section contains summarized data for each sample. A ‘0’ indicates that the copy number of the sample was not amplified or deleted beyond the threshold amount in that peak region. A ‘1’ indicates that the sample had low-level copy number aberrations (exceeding the low threshold indicated in column 9), and a ‘2’ indicates that the sample had high-level copy number aberrations (exceeding the high threshold indicated in column 9).

The second section can be identified by the rows in which column 9 reads "Actual Copy Change Given." The second section exactly reproduces the first section, except that here the actual changes in copy number are provided rather than zeroes, ones, and twos.

The final section is similar to the first section, except that here only broad events are included. A 1 in the sample columns (columns 10+) indicates that the median copy number of the sample across the entire significant region exceeded the threshold given in column 9. That is, it indicates whether the sample had a geographically extended event, rather than a focal amplification or deletion covering little more than the peak region.

Amplification Genes File

E.g., amp_genes.conf_XX.txt, where XX is the confidence level

The amp genes file contains one column for each amplification peak identified in the GISTIC analysis. The first four rows are:

(1) cytoband

(2) q-value

(3) residual q-value

(4) wide peak boundaries

These rows identify the lesion in the same way as the all lesions file. The remaining rows list the genes contained in each wide peak. For peaks that contain no genes, the nearest gene is listed in brackets.

Deletion Genes File

E.g., del_genes.conf_XX.txt, where XX is the confidence level

The del genes file contains one column for each deletion peak identified in the GISTIC analysis. The file format for the del genes file is identical to the format for the amp genes file.

GISTIC Scores File

E.g., scores.gistic

The scores file lists the q-values [presented as -log10(q)], G-scores, average amplitudes among aberrant samples, and frequency of aberration, across the genome for both amplifications and deletions. The scores file is viewable with the GenePattern SNPViewer module and may be imported into the Integrative Genomics Viewer (IGV).

Segmented Copy Number

E.g., raw_copy_number.pdf

A .pdf or .png file containing a heatmap image of the genomic profiles of the segmented input copy number data. The genome is represented along the vertical axis and samples are arranged horizontally.

Amplification Score/q-value GISTIC plot

E.g., amp_qplot.pdf

The amplification pdf is a plot of the G-scores (top) and q-values (bottom) with respect to amplifications for all markers over the entire region analyzed.

Deletion Score/q-value GISTIC plot

E.g., del_qplot.pdf

The deletion pdf is a plot of the G-scores (top) and q-values (bottom) with respect to deletions for all markers over the entire region analyzed.

How to run GISTIC2.0 in FireCloud

1. Navigate to the workspace broad-firecloud-tutorials/TCGA_ACC_OpenAccess_2016_01_01_TUTORIAL-GISTIC.


2. Go to the Method Configurations tab and select copy_number_gistic.


3. Click Launch Analysis.


4. Toggle to sample_set, then click Launch.


5. In the Monitor tab, view the status of your analysis. Initially, the status displays Submitted.


6. When the status displays Done, click on ACC-TP (sample_set).

7. Click on Outputs: Show, then select output files to view the results of this analysis.

8. You can also view results, including Nozzle HTML reports and graphical analyses, by viewing attributes in the Data tab.

Use of Tutorial Workspaces

Tutorial workspaces contain open access data and workflows. These Tutorial workspaces will appear in your account upon registration and will be active for 30 days.

Tutorial workspaces may ONLY be used for running the tutorial exercises. You must not upload your own data sets to these workspaces, nor should you add tools (Method Configs). If you do not follow these guidelines, your FireCloud account may be deactivated.

References

  • Mermel C, Schumacher S, et al. (2011). "GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers." Genome Biology, 12:R41.

  • Beroukhim R, Mermel C, et al. (2010). "The landscape of somatic copy-number alteration across human cancers." Nature, 463:899-905.

  • Beroukhim R, Getz G, et al. (2007). "Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma." Proc Natl Acad Sci, 104:20007-20012.

  • McCarroll S, et al. (2008). "Integrated detection and population-genetic analysis of SNPs and copy number variation." Nat Genet, 40(10):1166-1174.

  • GISTIC Version 1

  • GISTIC Version 2

  • The Sanger Institute: Cancer Gene Census

Created 2012-07-25 21:19:45 | Updated 2013-08-27 18:55:56 | Tags: intro tutorial developer run

Comments (3)

NOTICE:

This tutorial is slightly out of date so the output is a little different. We'll update this soon, but in the meantime, don't freak out if you get a result that reads something like

INFO 18:32:38,826 CountReads - CountReads counted 33 reads in the traversal 

instead of

INFO  16:17:46,061 Walker - [REDUCE RESULT] Traversal result is: 33 

You're doing the right thing and getting the right result.

And of course, if in doubt, just post a comment on this article; we're here to answer your questions.


Objective

Run a basic analysis command on example data.

Prerequisites

Steps

  1. Invoke the GATK CountReads command
  2. Further exercises

1. Invoke the GATK CountReads command

A very simple analysis that you can do with the GATK is getting a count of the reads in a BAM file. The GATK is capable of much more powerful analyses, but this is a good starting example because there are very few things that can go wrong.

So we are going to count the reads in the file exampleBAM.bam, which you can find in the GATK resource bundle along with its associated index (same file name with .bai extension), as well as the example reference exampleFASTA.fasta and its associated index (same file name with .fai extension) and dictionary (same file name with .dict extension). Copy them to your working directory so that your directory contents look like this:

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x  9 vdauwera  CHARLES\Domain Users     306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera  CHARLES\Domain Users     204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users    3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users  101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users      20 Apr 10 07:39 exampleFASTA.fasta.fai

Action

Type the following command:

java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam 

where -T CountReads specifies which analysis tool we want to use, -R exampleFASTA.fasta specifies the reference sequence, and -I exampleBAM.bam specifies the file of aligned reads we want to analyze.

For any analysis that you want to run on a set of aligned reads, you will always need to use at least these three arguments:

  • -T for the tool name, which specifies the corresponding analysis
  • -R for the reference sequence file
  • -I for the input BAM file of aligned reads

They don't have to be in that order in your command, but this way you can remember that you need them if you TRI...

Expected Result

After a few seconds you should see output that looks like this:

INFO  16:17:45,945 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:17:45,946 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41 
INFO  16:17:45,947 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:17:45,947 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:17:45,947 HelpFormatter - Program Args: -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam 
INFO  16:17:45,947 HelpFormatter - Date/Time: 2012/07/25 16:17:45 
INFO  16:17:45,947 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:17:45,948 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:17:45,950 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:17:45,982 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:17:45,993 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  16:17:46,060 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  16:17:46,060 TraversalEngine -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining 
INFO  16:17:46,061 Walker - [REDUCE RESULT] Traversal result is: 33 
INFO  16:17:46,061 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours 
INFO  16:17:46,100 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  16:17:46,729 GATKRunReport - Uploaded run statistics report to AWS S3 

Depending on the GATK release, you may see slightly different information output, but you know everything is running correctly if you see the line:

INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33 

somewhere in your output.

If you don't see this, check your spelling (GATK commands are case-sensitive), check that the files are in your working directory, and if necessary, re-check that the GATK is properly installed.

If you do see this output, congratulations! You just successfully ran your first GATK analysis!

Basically the output you see means that the CountReadsWalker (which you invoked with the command line option -T CountReads) counted 33 reads in the exampleBAM.bam file, which is exactly what we expect to see.

Wait, what is this walker thing?

In the GATK jargon, we call the tools walkers because the way they work is that they walk through the dataset --either along the reference sequence (LocusWalkers), or down the list of reads in the BAM file (ReadWalkers)-- collecting the requested information along the way.


2. Further Exercises

Now that you're rocking the read counts, you can start to expand your use of the GATK command line.

Let's say you don't care about counting reads anymore; now you want to know the number of loci (positions on the genome) that are covered by one or more reads. The name of the tool, or walker, that does this is CountLoci. Since the structure of the GATK command is basically always the same, you can simply switch the tool name, right?

Action

Instead of this command, which we used earlier:

java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam 

this time you type this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam 

See the difference?

Result

You should see something like this output:

INFO  16:18:26,183 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:18:26,185 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41 
INFO  16:18:26,185 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:18:26,185 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:18:26,186 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam 
INFO  16:18:26,186 HelpFormatter - Date/Time: 2012/07/25 16:18:26 
INFO  16:18:26,186 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:18:26,186 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:18:26,189 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:18:26,222 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:18:26,233 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  16:18:26,351 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  16:18:26,351 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
2052
INFO  16:18:26,411 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours 
INFO  16:18:26,450 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  16:18:27,124 GATKRunReport - Uploaded run statistics report to AWS S3 

Great! But wait -- where's the result? Last time the result was given on this line:

INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33 

But this time there is no line that says [REDUCE RESULT]! Is something wrong?

Not really. The program ran just fine -- but we forgot to give it an output file name. You see, the CountLoci walker is set up to output the result of its calculations to a text file, unlike CountReads, which is perfectly happy to output its result to the terminal screen.

Action

So we repeat the command, but this time we specify an output file, like this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt

where -o (lowercase o, not zero) is used to specify the output.

Result

You should get essentially the same output on the terminal screen as previously (but notice the difference in the line that contains Program Args -- the new argument is included):

INFO  16:29:15,451 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:29:15,453 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41 
INFO  16:29:15,453 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:29:15,453 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:29:15,453 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt 
INFO  16:29:15,454 HelpFormatter - Date/Time: 2012/07/25 16:29:15 
INFO  16:29:15,454 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:29:15,454 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:29:15,457 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:29:15,488 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:29:15,499 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  16:29:15,618 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  16:29:15,618 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  16:29:15,679 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours 
INFO  16:29:15,718 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  16:29:16,712 GATKRunReport - Uploaded run statistics report to AWS S3 

This time however, if we look inside the working directory, there is a newly created file there called output.txt.

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x  9 vdauwera  CHARLES\Domain Users     306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera  CHARLES\Domain Users     204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users    3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users  101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users      20 Apr 10 07:39 exampleFASTA.fasta.fai
-rw-r--r--  1 vdauwera  CHARLES\Domain Users       5 Jul 25 16:29 output.txt

This file contains the result of the analysis:

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% cat output.txt 
2052

This means that there are 2052 loci in the reference sequence that are covered by one or more reads in the BAM file.

Discussion

Okay then, but why not show the full, correct command in the first place? Because this was a good opportunity for you to learn a few of the caveats of the GATK command system, which may save you a lot of frustration later on.

Beyond the common basic arguments that almost all GATK walkers require, most of them also have specific requirements or options that are important to how they work. You should always check which arguments are required, recommended, and/or optional for the walker you want to use before starting an analysis.

Fortunately the GATK is set up to complain (i.e. terminate with an error message) if you try to run it without specifying a required argument. For example, if you try to run this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta

the GATK will spit out a wall of text, including the basic usage guide that you can invoke with the --help option, and more importantly, the following error message:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.0-22-g40f97eb): 
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Walker requires reads but none were provided.
##### ERROR ------------------------------------------------------------------------------------------

You see the line that says ERROR MESSAGE: Walker requires reads but none were provided? This tells you exactly what was wrong with your command.

So the GATK will not run if a walker does not have all the required inputs. That's a good thing! But in the case of our first attempt at running CountLoci, the -o argument is not required by the GATK to run -- it's just highly desirable if you actually want the result of the analysis!

There will be many other cases of walkers with arguments that are not strictly required, but highly desirable if you want the results to be meaningful.

So, at the risk of getting repetitive, always read the documentation of each walker that you want to use!


Created 2016-03-04 22:20:20 | Updated | Tags: tutorial workshop

Comments (0)

The presentation slide decks and hands-on tutorial materials can be downloaded at this Google Drive link.


Created 2015-11-16 07:07:45 | Updated 2015-11-19 18:09:19 | Tags: tutorial workshop

Comments (5)

We are scheduled to do 2 hands-on modules at the BroadE GATK workshop at the Broad Institute this Thursday, Nov. 19.

The tutorial materials for each module include the following:

  • Workshop dataset and an appendix document containing detailed instructions for preparing for the workshop
  • Tutorial worksheet containing the actual exercises done in the workshop

Attendees will receive a printout of the worksheets for the module to which they have registered. We will not provide printouts of the appendix documents.

If you are registered to attend, you must download the materials and follow the instructions before the workshop starts; otherwise you will not be able to follow along and your workshop experience will be unsatisfying. We certainly don't want that to happen, so be sure to do your homework as described below the fold. Be sure to identify correctly which workshop you are registered for!


Variant Discovery (morning session)

Get the dataset + appendix bundle: 5 minutes

Note that the file and directory names make reference to ASHG 2015 because these materials were originally developed for the GATK workshop we taught at the ASHG 2015 meeting. There, that's one mystery cleared up!

  • Download the 52 Mb zip file to the laptop you will bring to the workshop, and save it where you want the tutorial files to sit.
  • Open the zip file; this will create a directory called ASGH15_GATK.
  • Open the ASGH15_GATK directory, look at the contents and read the extremely brief README.txt file.

Refresh your memory of the topic and scientific context of the workshop: 10-20 minutes

  • Open the PDF document called ASGH15_GATK-Appendix.
  • Read the introduction. If any of the content is especially new or confusing to you, consider following the links included to the additional documentation on our website. We will cover this content briefly at the start of the tutorial, but the more you prepare, the better your workshop experience will be.

Install and test software: 30-45 minutes

Go over the second part of the ASGH15_GATK-Appendix document to acquaint yourself with the technical requirements of the workshop and follow the detailed installation instructions. In particular:

  • Make sure you have the correct version of Java installed (Java 7 / JRE 1.7).
  • Download all 3 software packages (GATK, Samtools, IGV) to the laptop you will bring to the workshop.
  • Copy the program files of all three to the ASGH15_GATK directory.
  • Run the test commands for GATK and Samtools, and launch IGV.
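If you want a quick sanity check before working through the appendix's test commands, invocations along these lines should print version and usage information (assuming the GATK jar sits in your ASGH15_GATK directory and samtools is on your PATH):

java -jar GenomeAnalysisTK.jar --help
samtools

Both should produce help text rather than an error, and IGV can simply be launched to confirm it starts.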

If you are new to the command line or GATK, we strongly recommend reviewing the sections of the document about tool syntax (pages 8-10) and practicing basic Unix commands. You may also benefit from spending some time gaining familiarity with IGV.


Callset Filtering & Evaluation (afternoon session)

The appendix file is now available from this link. The data bundle is available here.

Be sure to have done the workshop preparation homework as outlined below and described in the Appendix!

Refresh your memory of the topic and scientific context of the workshop: 10-20 minutes

  • Open the PDF document called BroadE2015-Filtering_Tutorial-Appendix (linked above).
  • Read the introduction. If any of the content is especially new or confusing to you, consider following the links included to the additional documentation on our website. We will cover this content briefly at the start of the tutorial, but the more you prepare, the better your workshop experience will be.

Install and test software: 30-45 minutes

Go over the second part of the BroadE2015-Filtering_Tutorial-Appendix document to acquaint yourself with the technical requirements of the workshop and follow the detailed installation instructions. In particular:

  • Make sure you have the correct version of Java installed (Java 7 / JRE 1.7).
  • Download all software packages listed (GATK, RStudio, IGV) to the laptop you will bring to the workshop.
  • Create a directory called BroadE2015 where you will put tutorial files.
  • Copy the program files of all three to the BroadE2015 directory.
  • Run the test commands for GATK and RStudio/ggplot, and launch IGV.

If you are new to the command line or GATK, we strongly recommend reviewing the sections of the document about tool syntax (pages 8-10) and practicing basic Unix commands. You may also benefit from spending some time gaining familiarity with IGV and RStudio.
