# Tagged with #tutorials 7 documentation articles | 2 announcements | 5 forum discussions

Created 2016-04-27 16:02:35 | Updated | Tags: tutorials tutorial firecloud

FireCloud Webinars are now available on the FireCloud YouTube Channel.

Created 2016-04-21 17:14:27 | Updated 2016-04-27 15:18:12 | Tags: tutorials tutorial mutect picard qc oncotator contest mutect2 firecloud

### Overview

The Broad Mutation Calling Workflow is a somatic mutation analysis and annotation pipeline, broken down into four Best Practice workspaces in FireCloud.

The Broad Mutation Calling MuTect workflow runs on a tumor and normal pair or pair set, and consists of three main steps: 1) preparation, 2) somatic mutation analysis, and 3) somatic mutation annotation. As part of this workflow, MuTect is run in force-call mode to search for somatic variants at a set of specified loci of clinical relevance (for example, the TP53 gene on chromosome 17).

The three main steps are described here:

1. Preparation: The tumor and normal BAMs are split up by chromosome, conditioned on interval overlap: for each chromosome, the BAM files are decomposed only if the supplied interval file overlaps that chromosome. In this workflow, the supplied intervals come from an Agilent targeted sequencing system. This step may be known as “scatter preparation.”

2. Somatic Mutation Analysis: Two somatic variant analyses are carried out in parallel: MuTect1 and MuTect2. MuTect1 is used for its somatic single-nucleotide variant detection capabilities; MuTect2 is used for its indel detection capabilities. The programs are run in parallel over the split BAMs and intervals from the preceding preparation step. This step may be known as “scatter.”

3. Somatic Mutation Gather: This final step serves to annotate the variants found in the previous step. First, the results are gathered and merged. Then, the indel variants from the MuTect2 output are separated out and merged with the single-nucleotide variants from MuTect1. Next, only variants tagged with PASS are fed to Oncotator for annotation. Finally, all variants are passed to the VEP (Variant Effect Predictor) program for further annotation.
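The PASS-only handoff described in the gather step can be sketched as a simple filter on the VCF FILTER column. The two-record VCF below is made up for illustration; the real workflow operates on the merged MuTect output, typically with a dedicated tool rather than awk.

```shell
# Hypothetical merged VCF with one PASS and one rejected record (illustrative data).
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n17\t7578406\t.\tC\tT\t.\tPASS\tSOMATIC\n17\t7579472\t.\tG\tC\t.\tREJECT\tSOMATIC\n' > merged_calls.vcf

# Keep header lines plus records whose FILTER column is PASS --
# only these would be handed to Oncotator for annotation.
awk -F'\t' '/^#/ || $7 == "PASS"' merged_calls.vcf > pass_only.vcf
```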

#### Runtime

The total expected runtime for the Broad Mutation Calling MuTect workflow depends on the size of the pair you select for analysis. Pair HCC1954 runs on "mini" BAMs and the expected runtime is roughly 45 minutes. Pair HCC1143 runs on larger BAMs and the expected runtime is roughly 1.5 hours. If you run both pairs as a pair set, each pair completes in roughly its individual expected runtime, with HCC1954 finishing first.

#### MuTect1

MuTect1 identifies somatic point mutations in next generation sequencing data of cancer genomes. It takes sequencing data for matched normal and tumor tissue samples as input, and outputs mutation calls and, optionally, coverage results.

In a nutshell, the analysis itself consists of three steps:

• Pre-process the aligned reads in the tumor and normal sequencing data

• Identify, using statistical analysis, sites that are likely to carry somatic mutations with high confidence

• Post-process the candidate somatic mutations

For complete details, please see the 2013 publication in Nature Biotechnology.

#### MuTect2

MuTect2 is a somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect (Cibulskis et al., 2013) with the assembly-based machinery of HaplotypeCaller.

The basic operation of MuTect2 proceeds similarly to that of the HaplotypeCaller. While the HaplotypeCaller relies on a ploidy assumption (diploid by default) to inform its genotype likelihood and variant quality calculations, MuTect2 allows for a varying allelic fraction for each variant, as is often seen in tumors with purity less than 100%, multiple subclones, and/or copy number variation (either local or aneuploidy). MuTect2 also differs from the HaplotypeCaller in that it does apply some hard filters to variants before producing output.

#### Oncotator

Oncotator is a tool for annotating information onto genomic point mutations (SNPs/SNVs) and indels. It is primarily used for human genome variant callsets. However, the tool can also be used to annotate any kind of information onto variant callsets from any organism.

By default, Oncotator uses a simple TSV file (e.g., MAFLITE) as an input and produces a TCGA MAF as an output. Oncotator also supports VCF files as an input and output format. By extension, Oncotator can be configured to annotate genomic point mutation data with HTML reports as it does in this workflow.
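A MAFLITE input is just a tab-separated file of coordinates and alleles. The sketch below writes a minimal single-variant example; the column set and the Oncotator invocation in the comment are illustrative assumptions, not this workflow's actual configuration.

```shell
# Minimal MAFLITE-style TSV: coordinates plus reference/alternate alleles
# (single made-up SNV; column names follow the usual MAFLITE convention).
printf 'chr\tstart\tend\tref_allele\talt_allele\n' >  input.maflite
printf '17\t7578406\t7578406\tC\tT\n'             >> input.maflite

# Oncotator would then convert this to a TCGA MAF, e.g. (illustrative, not run here):
#   oncotator -i MAFLITE -o TCGAMAF input.maflite output.maf hg19
```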

#### VEP (Variant Effect Predictor)

Ensembl’s VEP (Variant Effect Predictor) program processes variants for further annotation. The tool annotates variants and determines their effect on relevant transcripts and proteins.

VEP accepts as input the coordinates of variants and their alleles, or known variant identifiers. A full list of input formats is available here. If a variant that you enter as input causes a change in the protein sequence, the VEP will calculate the possible amino acids at that position and the variant is given a consequence type of missense.
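For reference, VEP's whitespace-separated "default" input format lists chromosome, start, end, reference/alternate alleles, strand, and an optional identifier. The line below is a made-up example of that format, not data from this workflow.

```shell
# One made-up variant in VEP's default input format:
# chromosome start end ref/alt strand identifier
echo '17 7578406 7578406 C/T + test_snv' > vep_input.txt
```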

#### Inputs and Outputs

Below are the tool-specific inputs and outputs for this workflow. Note that these inputs and outputs may vary from the “standard” version of the tool.

MuTect1 and MuTect2 Inputs

• normalBam

• normalBamIndex

• tumorBam

• tumorBamIndex

• ReferenceFasta

• NormalPanel

• ReferenceFastaIndex

• COSMICVCF

• DBSNPVCF

• fracContam

• MutectIntervals

MuTect1 and MuTect2 Outputs

• MuTect PowerFile

• MuTect CoverageFile

• MuTect CallStatsFile

Oncotator and VEP Inputs

• MuTect PowerFile

• MuTect CoverageFile

• MuTect CallStatsFile

• File oncoDBTarBall

• VEP_File

• ReferenceFastaIndex

• ReferenceFasta

Oncotator and VEP Outputs

• Array[File] all_outs=glob("*")

• File oncotator.log

• File MuTect.1.2.call_stats.M1All.M2IndelsOnly.filtered.vcf.annotated.vcf

• File MuTect1.call_stats.txt

• File MuTect2.call_stats.txt

• File variant_effect_output.txt

• File variant_effect_output.txt_summary.html

#### How to Run this Workflow in FireCloud

• Access the Broad_MutationCalling_MuTect_Workflow_V1_BestPractice workspace to run this workflow.
• Navigate to the Method Configurations tab and click on the method, MutationCalling_MuTect.
• Click Launch Analysis.
• In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.
• Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.
• When the status displays Done, select the most recent analysis run, e.g., HCC1143.
• Click on Outputs: Show, then select output files to view the results of this analysis.

#### Use of Tutorial Workspaces

Tutorial workspaces contain open access data and workflows. These Tutorial workspaces will appear in your account after your account is activated.

Tutorial workspaces may ONLY be used for running the tutorial exercises. You must not upload your own data sets to these workspaces, nor should you add tools (Method Configs). If you do not follow these guidelines, your FireCloud account may be deactivated.

#### Runtime

The total expected runtime for this workflow depends on the size of the pair or pair set you select for analysis. Pair HCC1954 runs on "mini" BAMs and the expected runtime is roughly 15 minutes. Pair HCC1143 runs on larger BAMs and the expected runtime is roughly 2.5 hours. If you run both pairs as a pair set, each pair completes in roughly its individual expected runtime, with HCC1954 finishing first.

The QC task counts reads overlapping regions for tumor and normal BAM files. The task concludes with a report of the counts over the BAMs and lanes. Correlation values are included for comparison purposes.

ContEst uses a Bayesian approach to calculate the posterior probability of the contamination level and determine the maximum a posteriori probability (MAP) estimate of the contamination level. ContEst supports array-free mode, where we genotype on the fly from matched normals, and use that as our source of homozygous variant calls. It currently calls any site with at least 50X coverage where more than 80% of bases support the alternate allele a homozygous alternate site.
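The homozygous-alternate heuristic above (more than 80% of bases supporting the alternate allele, at 50X or greater coverage) can be sketched directly; the per-site depths and alternate counts below are invented.

```shell
# Columns: site, total depth, alternate-allele count (made-up values).
printf 'siteA 60 55\nsiteB 100 70\nsiteC 40 39\n' > sites.txt

# Call a site homozygous-alternate when depth >= 50X and
# the alternate fraction exceeds 0.8, per the heuristic in the text.
awk '$2 >= 50 && $3 / $2 > 0.8 { print $1, "hom_alt" }' sites.txt
```

Here siteA qualifies (55/60 ≈ 0.92 at 60X), siteB fails the fraction test, and siteC fails the depth test.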

Picard Metrics Tasks invoke multiple metrics reporting routines from the Picard toolkit. The following routines are invoked:

• CollectAlignmentSummaryMetrics

• CollectInsertSizeMetrics

• QualityScoreDistribution

• MeanQualityByCycle

• CollectBaseDistributionByCycle

• CollectSequencingArtifactMetrics

• CollectQualityYieldMetrics

• CollectOxoGMetrics

SAM/BAM validation (ValidateSamFile) is run in both summary and verbose modes.

Other routines, such as CollectHsMetrics and MarkDuplicates, are also called.

#### Inputs and Outputs

Below are the tool-specific inputs and outputs for this workflow. Note that these inputs and outputs may vary from the “standard” version of the tool.

QC Inputs

• tumorBam

• tumorBamIdx

• normalBam

• normalBamIdx

• regionFile

• captureNormalsDBRCLZip

• caseIdNoSpace

• controlIdNoSpace

QC Outputs

• tumorBamLaneList

• normalBamLaneList

• tumorRCL

• normalRCL

• CopyNumQC

ContEST Inputs

• tumorBam

• tumorBamIdx

• normalBam

• normalBamIdx

• refFasta

• refFastaIdx

• exomeIntervals

• SNP6Bed

• HapMapVCF

• pairName

ContEST Outputs

• contamination_validation.array_free.txt

• contamination.af.txt

• contamination.base_report.txt

• contest_validation.output.tsv

Picard Metrics Inputs (Tumor)

• tumorBam

• tumorBamIdx

• refFasta

• HapMapVCF

• picardHapMap

• picardBaitIntervals

• picardTargetIntervals

• DB_SNP_VCF

• DB_SNP_VCF_IDX

Picard Metrics Inputs (Normal)

• normalBam

• normalBamIdx

• refFasta

• HapMapVCF

• picardHapMap

• picardBaitIntervals

• picardTargetIntervals

• DB_SNP_VCF

• DB_SNP_VCF_IDX

Picard Metrics Outputs

• HSMetrics.txt

• picard_crosscheck_report.txt

• validation_summary.txt

• validation_verbose.txt

• picard_multiple_metrics.zip

• oxoG_metrics.txt

• duplicate_records.bam

• duplicate_metrics.txt

• sequencing_artifact_metrics.txt.bait_bias_detail_metrics

• sequencing_artifact_metrics.txt.bait_bias_summary_metrics

#### How to Run this Workflow in FireCloud

• Access the Broad_MutationCalling_QC_Workflow_V1_BestPractice workspace to run this workflow.
• Navigate to the Method Configurations tab and click on the method, MutationCalling_QC.
• Click Launch Analysis.
• In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.
• Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.
• You can view results and outputs by clicking on the most recent analysis run, e.g., HCC1143.
• Click on Outputs: Show, then select output files to view the results of this analysis.

#### Use of Tutorial Workspaces

Tutorial workspaces contain open access data and workflows. These Tutorial workspaces will appear in your account after your account is activated.

Tutorial workspaces may ONLY be used for running the tutorial exercises. You must not upload your own data sets to these workspaces, nor should you add tools (Method Configs). If you do not follow these guidelines, your FireCloud account may be deactivated.


Created 2016-01-20 15:38:23 | Updated 2016-04-27 15:16:06 | Tags: tutorials tutorial firecloud gistic

### Overview

GISTIC identifies genomic regions that are significantly gained or lost across a set of tumors. It takes segmented copy number ratios as input, separates arm-level events from focal events, and then performs two tests: (i) identifies significantly amplified/deleted chromosome arms; and (ii) identifies regions that are significantly focally amplified or deleted. For the focal analysis, the significance levels (Q values) are calculated by comparing the observed gains/losses at each locus to those obtained by randomly permuting the events along the genome to reflect the null hypothesis that they are all 'passengers' and could have occurred anywhere. The locus-specific significance levels are then corrected for multiple hypothesis testing. The arm-level significance is calculated by comparing the frequency of gains/losses of each arm to the expected rate given its size. The method outputs genomic views of significantly amplified and deleted regions, as well as a table of genes with gain or loss scores.

CNV Description: Regions of the genome that are prone to germ line variations in copy number are excluded from the GISTIC analysis using a list of germ line copy number variations (CNVs). A CNV is a DNA sequence that may be found at different copy numbers in the germ line of two different individuals. Such germ line variations can confound a GISTIC analysis, which finds significant somatic copy number variations in cancer. A more in depth discussion is provided in [6]. GISTIC currently uses two CNV exclusion lists. One is based on the literature describing copy number variation, and a second one comes from an analysis of significant variations among the blood normals in the TCGA data set.

### Inputs

Segmentation File: The segmentation file contains the segmented data for all the samples identified by GLAD, CBS, or some other segmentation algorithm. (See GLAD file format in the GenePattern file formats documentation.) It is a six-column, tab-delimited file with an optional first line identifying the columns. Positions are in base pair units. The column headers are: (1) Sample (sample name), (2) Chromosome (chromosome number), (3) Start Position (segment start position, in bases), (4) End Position (segment end position, in bases), (5) Num markers (number of markers in segment), (6) Seg.CN (log2 of the copy number, minus 1).
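As a concrete check on column (6): with Seg.CN defined as log2 of the copy number minus 1, a normal diploid segment (copy number 2) maps to 0. A minimal sketch:

```shell
# Seg.CN = log2(copy number) - 1, so CN=2 -> 0, CN=4 -> 1, CN=1 -> -1.
# awk has no log2, so use log(x)/log(2).
awk 'BEGIN {
  n = split("1 2 4", cn, " ")
  for (i = 1; i <= n; i++)
    printf "CN=%s Seg.CN=%.1f\n", cn[i], log(cn[i]) / log(2) - 1
}'
```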

Markers File: The markers file identifies the marker names and positions of the markers in the original dataset (before segmentation). It is a three column, tab-delimited file with an optional header. The column headers are: (1) Marker Name, (2) Chromosome, (3) Marker Position (in bases).

Reference Genome: The reference genome file contains information about the location of genes and cytobands on a given build of the genome. Reference genome files are created in Matlab and are not viewable with a text editor.

CNV Files: There are two options for the CNV file. The first option allows CNVs to be identified by marker name. The second option allows the CNVs to be identified by genomic location. Option #1: A two-column, tab-delimited file with an optional header row. The marker names given in this file must match the marker names given in the markers file. The CNV identifiers are for user use and can be arbitrary. The column headers are: (1) Marker Name, (2) CNV Identifier. Option #2: A six-column, tab-delimited file with an optional header row. The 'CNV Identifier' is for user use and can be arbitrary. 'Narrow Region Start' and 'Narrow Region End' are also not used. The column headers are: (1) CNV Identifier, (2) Chromosome, (3) Narrow Region Start, (4) Narrow Region End, (5) Wide Region Start, (6) Wide Region End.

Amplification Threshold: Threshold for copy number amplifications. Regions with a log2 ratio above this value are considered amplified.

Deletion Threshold: Threshold for copy number deletions. Regions with a log2 ratio below the negative of this value are considered deletions.

Cap Values: Minimum and maximum cap values on analyzed data. Regions with a log2 ratio greater than the cap are set to the cap value; regions with a log2 ratio less than -cap value are set to -cap. Values must be positive.
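The thresholds and cap can be illustrated together: clip each log2 ratio to ±cap, then compare it against the amplification/deletion thresholds (the ordering, the threshold of 0.1, the cap of 1.5, and the segment values are all invented for this sketch).

```shell
# Classify segment log2 ratios: clip to [-cap, cap], then call
# amp (> amp threshold), del (< -del threshold), or neutral.
awk -v amp=0.1 -v del=0.1 -v cap=1.5 '{
  v = $2
  if (v > cap) v = cap; else if (v < -cap) v = -cap   # apply cap values
  if (v > amp) c = "amp"; else if (v < -del) c = "del"; else c = "neutral"
  print $1, c
}' <<'EOF'
seg1 0.45
seg2 -2.30
seg3 0.02
EOF
```

seg2 is first capped to -1.5, then classified as a deletion; seg3 falls inside both thresholds and stays neutral.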

Broad Length Cutoff: Threshold used to distinguish broad from focal events, given in units of fraction of chromosome arm.

Remove X-Chromosome: Flag indicating whether to remove data from the X-chromosome before analysis. Allowed values = {1,0} (1: remove the X-chromosome; 0: do not remove the X-chromosome).

Confidence Level: Confidence level used to calculate the region containing a driver.

Join Segment Size: Smallest number of markers to allow in segments from the segmented data. Segments that contain fewer than this number of markers are joined to the neighboring segment that is closest in copy number.

Arm Level Peel Off: Flag set to enable arm-level peel-off of events during peak definition. The arm-level peel-off enhancement to the arbitrated peel-off method assigns all events in the same chromosome arm of the same sample to a single peak. It is useful when peaks are split by noise or chromothripsis. Allowed values= {1,0} (1: Use arm level peel off, 0: Use normal arbitrated peel-off).

Maximum Sample Segments: Maximum number of segments allowed for a sample in the input data. Samples with more segments than this threshold are excluded from the analysis.

Gene GISTIC: When enabled (value = 1), this option causes GISTIC to analyze deletions using genes instead of array markers to locate the lesion. In this mode, the copy number assigned to a gene is the lowest copy number among the markers that represent the gene.

### Outputs

All Lesions File

E.g., all_lesions.conf_XX.txt, where XX is the confidence level

The all lesions file summarizes the results from the GISTIC run. It contains data about the significant regions of amplification and deletion as well as which samples are amplified or deleted in each of these regions. The identified regions are listed down the first column, and the samples are listed across the first row, starting in column 10.

Region Data

Columns 1-9 present the data about the significant regions as follows:

(1) Unique Name: A name assigned to identify the region

(2) Descriptor: The genomic descriptor of that region.

(3) Wide Peak Limits: The "wide peak" boundaries most likely to contain the targeted genes. These are listed in genomic coordinates and marker (or probe) indices.

(4) Peak Limits: The boundaries of the region of maximal amplification or deletion.

(5) Region Limits: The boundaries of the entire significant region of amplification or deletion.

(6) q-values: The q-value of the peak region.

(7) Residual q-values: The q-value of the peak region after removing ("peeling off") amplifications or deletions that overlap other, more significant peak regions in the same chromosome.

(8) Broad or Focal: Identifies whether the region reaches significance due primarily to broad events (called "broad"), focal events (called “focal”), or independently significant broad and focal events (called “both”).

(9) Amplitude Threshold: Key giving the meaning of values in the subsequent columns associated with each sample.

Sample Data

Each of the analyzed samples is represented in one of the columns following the lesion data (columns 10 through end). The data contained in these columns varies slightly by section of the file.

The first section can be identified by the key given in column 9 – it starts in row 2 and continues until the row that reads "Actual Copy Change Given." This section contains summarized data for each sample. A ‘0’ indicates that the copy number of the sample was not amplified or deleted beyond the threshold amount in that peak region. A ‘1’ indicates that the sample had low-level copy number aberrations (exceeding the low threshold indicated in column 9), and a ‘2’ indicates that the sample had high-level copy number aberrations (exceeding the high threshold indicated in column 9).

The second section can be identified by the rows in which column 9 reads "Actual Copy Change Given." The second section exactly reproduces the first section, except that here the actual changes in copy number are provided rather than zeroes, ones, and twos.

The final section is similar to the first section, except that here only broad events are included. A 1 in the samples columns (columns 10+) indicates that the median copy number of the sample across the entire significant region exceeded the threshold given in column 9. That is, it indicates whether the sample had a geographically extended event, rather than a focal amplification or deletion covering little more than the peak region.
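Given that per-sample codes start in column 10, tallying high-level events (code 2) for one region is a one-liner. The row below is a mock all-lesions line with nine placeholder region columns and five sample columns; real files are wider and the values are invented.

```shell
# Mock all-lesions row: 9 region-data placeholders, then sample codes
# (0 = below threshold, 1 = low-level, 2 = high-level).
row='AP1 1q21 wp pk rg q rq focal key 0 2 1 2 0'

# Count samples with high-level copy-number aberrations (code 2).
echo "$row" | awk '{ n = 0; for (i = 10; i <= NF; i++) if ($i == 2) n++; print n }'
```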

Amplification Genes File

E.g., amp_genes.conf_XX.txt, where XX is the confidence level

The amp genes file contains one column for each amplification peak identified in the GISTIC analysis. The first four rows are:

(1) cytoband

(2) q-value

(3) residual q-value

(4) wide peak boundaries

These rows identify the lesion in the same way as the all lesions file. The remaining rows list the genes contained in each wide peak. For peaks that contain no genes, the nearest gene is listed in brackets.

Deletion Genes File

E.g., del_genes.conf_XX.txt, where XX is the confidence level

The del genes file contains one column for each deletion peak identified in the GISTIC analysis. The file format for the del genes file is identical to the format for the amp genes file.

GISTIC Scores File

E.g., scores.gistic

The scores file lists the q-values [presented as -log10(q)], G-scores, average amplitudes among aberrant samples, and frequency of aberration, across the genome for both amplifications and deletions. The scores file is viewable with the GenePattern SNPViewer module and may be imported into the Integrative Genomics Viewer (IGV).
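The -log10 presentation means smaller q-values appear as larger scores; for example, q = 0.01 is reported as 2. Sketched with awk's natural log (the q-values below are arbitrary examples):

```shell
# Convert q-values to the -log10(q) form used in scores.gistic.
awk 'BEGIN {
  n = split("0.05 0.01 0.001", q, " ")
  for (i = 1; i <= n; i++)
    printf "q=%s -> %g\n", q[i], -log(q[i]) / log(10)
}'
```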

Segmented Copy Number

E.g., raw_copy_number.pdf

A .pdf or .png file containing a heatmap image of the genomic profiles of the segmented input copy number data. The genome is represented along the vertical axis and samples are arranged horizontally.

Amplification Score GISTIC plot

E.g., amp_qplot.pdf

The amplification pdf is a plot of the G-scores (top) and q-values (bottom) with respect to amplifications for all markers over the entire region analyzed.

Deletion Score/q-value GISTIC plot

E.g., del_qplot.pdf

The deletion pdf is a plot of the G-scores (top) and q-values (bottom) with respect to deletions for all markers over the entire region analyzed.

### How to run GISTIC2.0 in FireCloud

1. Navigate to the workspace broad-firecloud-tutorials/TCGA_ACC_OpenAccess_2016_01_01_TUTORIAL-GISTIC.

2. Go to the Method Configurations tab and select copy_number_gistic.

3. Click Launch Analysis.

4. Toggle to sample_set, then click Launch.

5. In the Monitor tab, view the status of your analysis. Initially, the status displays Submitted.

6. When the status displays Done, click on ACC-TP (sample_set).

7. Click on Outputs: Show, then select output files to view the results of this analysis.

8. You can also view results, including Nozzle HTML reports and graphical analyses, by viewing attributes in the Data tab.

### Use of Tutorial Workspaces

Tutorial workspaces contain open access data and workflows. These Tutorial workspaces will appear in your account upon registration and will be active for 30 days.

Tutorial workspaces may ONLY be used for running the tutorial exercises. You must not upload your own data sets to these workspaces, nor should you add tools (Method Configs). If you do not follow these guidelines, your FireCloud account may be deactivated.

### References

• Mermel C, Schumacher S, et al. (2011). "GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers." Genome Biology, 12:R41.

• Beroukhim R, Mermel C, et al. (2010). "The landscape of somatic copy-number alteration across human cancers." Nature, 463:899-905.

• Beroukhim R, Getz G, et al. (2007). "Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma." Proc Natl Acad Sci, 104:20007-20012.

• McCarroll S, et al. (2008). "Integrated detection and population-genetic analysis of SNPs and copy number variation." Nat Genet, 40(10):1166-1174.

• GISTIC Version 1

• GISTIC Version 2

• The Sanger Institute: Cancer Gene Census

Created 2016-01-11 18:30:23 | Updated 2016-01-11 18:31:25 | Tags: tutorials

This document is intended to be a record of how the tutorial files were prepared for the ASHG 2015 hands-on workshop.

### Reference genome

Extracting just chromosome 20 produces a 64 MB file (uncompressed), which is small enough for our purposes, so we don't need to truncate it further; this simplifies future data file preparations.

# Extract just chromosome 20
samtools faidx /humgen/gsa-hpprojects/GATK/bundle/current/b37/human_g1k_v37.fasta 20 > human_g1k_b37_20.fasta

# Create the reference index
samtools faidx human_g1k_b37_20.fasta

# Create sequence dictionary
java -jar $PICARD CreateSequenceDictionary R=human_g1k_b37_20.fasta O=human_g1k_b37_20.dict

# Recap files
-rw-rw-r-- 1 vdauwera wga      164 Oct  1 14:56 human_g1k_b37_20.dict
-rw-rw-r-- 1 vdauwera wga 64075950 Oct  1 14:41 human_g1k_b37_20.fasta
-rw-rw-r-- 1 vdauwera wga       20 Oct  1 14:46 human_g1k_b37_20.fasta.fai

### Sequence data

We are using the 2nd generation CEU Trio of NA12878 and her husband and child in a WGS dataset produced at Broad, with files named after the library preps, Solexa-xxxxxx.bam.

#### 1. Extract just chromosome 20:10M-20M bp and filter out chimeric pairs with -rf BadMate

java -jar $GATK -T PrintReads -R /path/to/bundle/current/b37/human_g1k_v37_decoy.fasta -I /path/to/Solexa-272221.bam -o NA12877_wgs_20_10M20M.bam -L 20:10000000-20000000 -rf BadMate

java -jar $GATK -T PrintReads -R /path/to/bundle/current/b37/human_g1k_v37_decoy.fasta -I /path/to/Solexa-272222.bam -o NA12878_wgs_20_10M20M.bam -L 20:10000000-20000000 -rf BadMate

java -jar $GATK -T PrintReads -R /path/to/bundle/current/b37/human_g1k_v37_decoy.fasta -I /path/to/Solexa-272228.bam -o NA12882_wgs_20_10M20M.bam -L 20:10000000-20000000 -rf BadMate

# Recap files
-rw-rw-r-- 1 vdauwera wga     36240 Oct  2 11:55 NA12877_wgs_20_10M20M.bai
-rw-rw-r-- 1 vdauwera wga 512866085 Oct  2 11:55 NA12877_wgs_20_10M20M.bam
-rw-rw-r-- 1 vdauwera wga     36176 Oct  2 11:53 NA12878_wgs_20_10M20M.bai
-rw-rw-r-- 1 vdauwera wga 502282846 Oct  2 11:53 NA12878_wgs_20_10M20M.bam
-rw-rw-r-- 1 vdauwera wga     36464 Oct  2 12:00 NA12882_wgs_20_10M20M.bai
-rw-rw-r-- 1 vdauwera wga 505001668 Oct  2 12:00 NA12882_wgs_20_10M20M.bam

#### 2. Extract headers and edit manually to remove all contigs except 20 and sanitize internal filepaths

samtools view -H NA12877_wgs_20_10M20M.bam > NA12877_header.txt

samtools view -H NA12878_wgs_20_10M20M.bam > NA12878_header.txt

samtools view -H NA12882_wgs_20_10M20M.bam > NA12882_header.txt

Manual editing is not represented here; basically just delete unwanted contig SQ lines and remove identifying info from internal filepaths.

#### 3. Flip BAM to SAM

java -jar $PICARD SamFormatConverter I=NA12877_wgs_20_10M20M.bam O=NA12877_wgs_20_10M20M.sam

java -jar $PICARD SamFormatConverter I=NA12878_wgs_20_10M20M.bam O=NA12878_wgs_20_10M20M.sam

java -jar $PICARD SamFormatConverter I=NA12882_wgs_20_10M20M.bam O=NA12882_wgs_20_10M20M.sam

# Recap files
-rw-rw-r-- 1 vdauwera wga 1694169101 Oct  2 12:28 NA12877_wgs_20_10M20M.sam
-rw-rw-r-- 1 vdauwera wga 1661483309 Oct  2 12:30 NA12878_wgs_20_10M20M.sam
-rw-rw-r-- 1 vdauwera wga 1696553456 Oct  2 12:31 NA12882_wgs_20_10M20M.sam

#### 4. Re-header the SAMs

java -jar $PICARD ReplaceSamHeader I=NA12877_wgs_20_10M20M.sam O=NA12877_wgs_20_10M20M_RH.sam HEADER=NA12877_header.txt

java -jar $PICARD ReplaceSamHeader I=NA12878_wgs_20_10M20M.sam O=NA12878_wgs_20_10M20M_RH.sam HEADER=NA12878_header.txt

java -jar $PICARD ReplaceSamHeader I=NA12882_wgs_20_10M20M.sam O=NA12882_wgs_20_10M20M_RH.sam HEADER=NA12882_header.txt

# Recap files
-rw-rw-r-- 1 vdauwera wga 1694153715 Oct  2 12:35 NA12877_wgs_20_10M20M_RH.sam
-rw-rw-r-- 1 vdauwera wga 1661467923 Oct  2 12:37 NA12878_wgs_20_10M20M_RH.sam
-rw-rw-r-- 1 vdauwera wga 1696538104 Oct  2 12:38 NA12882_wgs_20_10M20M_RH.sam

#### 5. Sanitize the SAMs to get rid of MATE_NOT_FOUND errors

java -jar $PICARD RevertSam I=NA12877_wgs_20_10M20M_RH.sam O=NA12877_wgs_20_10M20M_RS.sam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=false REMOVE_DUPLICATE_INFORMATION=false REMOVE_ALIGNMENT_INFORMATION=false ATTRIBUTE_TO_CLEAR=null SANITIZE=true MAX_DISCARD_FRACTION=0.001

java -jar $PICARD RevertSam I=NA12878_wgs_20_10M20M_RH.sam O=NA12878_wgs_20_10M20M_RS.sam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=false REMOVE_DUPLICATE_INFORMATION=false REMOVE_ALIGNMENT_INFORMATION=false ATTRIBUTE_TO_CLEAR=null SANITIZE=true MAX_DISCARD_FRACTION=0.001

java -jar $PICARD RevertSam I=NA12882_wgs_20_10M20M_RH.sam O=NA12882_wgs_20_10M20M_RS.sam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=false REMOVE_DUPLICATE_INFORMATION=false REMOVE_ALIGNMENT_INFORMATION=false ATTRIBUTE_TO_CLEAR=null SANITIZE=true MAX_DISCARD_FRACTION=0.001

# Recap files
-rw-rw-r-- 1 vdauwera wga 1683827201 Oct  2 12:45 NA12877_wgs_20_10M20M_RS.sam
-rw-rw-r-- 1 vdauwera wga 1652093793 Oct  2 12:49 NA12878_wgs_20_10M20M_RS.sam
-rw-rw-r-- 1 vdauwera wga 1688143091 Oct  2 12:54 NA12882_wgs_20_10M20M_RS.sam

#### 6. Sort the SAMs, convert back to BAM and create index

java -jar $PICARD SortSam I=NA12877_wgs_20_10M20M_RS.sam O=NA12877_wgs_20_10M20M_V.bam SORT_ORDER=coordinate CREATE_INDEX=TRUE

java -jar $PICARD SortSam I=NA12878_wgs_20_10M20M_RS.sam O=NA12878_wgs_20_10M20M_V.bam SORT_ORDER=coordinate CREATE_INDEX=TRUE

java -jar $PICARD SortSam I=NA12882_wgs_20_10M20M_RS.sam O=NA12882_wgs_20_10M20M_V.bam SORT_ORDER=coordinate CREATE_INDEX=TRUE

# Recap files
-rw-rw-r-- 1 vdauwera wga     35616 Oct  2 13:08 NA12877_wgs_20_10M20M_V.bai
-rw-rw-r-- 1 vdauwera wga 508022682 Oct  2 13:08 NA12877_wgs_20_10M20M_V.bam
-rw-rw-r-- 1 vdauwera wga     35200 Oct  2 13:06 NA12878_wgs_20_10M20M_V.bai
-rw-rw-r-- 1 vdauwera wga 497742417 Oct  2 13:06 NA12878_wgs_20_10M20M_V.bam
-rw-rw-r-- 1 vdauwera wga     35632 Oct  2 13:04 NA12882_wgs_20_10M20M_V.bai
-rw-rw-r-- 1 vdauwera wga 500446729 Oct  2 13:04 NA12882_wgs_20_10M20M_V.bam

#### 7. Validate BAMs; should all output "No errors found"

java -jar $PICARD ValidateSamFile I=NA12877_wgs_20_10M20M_V.bam M=SUMMARY

java -jar $PICARD ValidateSamFile I=NA12878_wgs_20_10M20M_V.bam M=SUMMARY

java -jar PICARD ValidateSamFile I=NA12882_wgs_20_10M20M_V.bam M=SUMMARY Created 2015-04-26 02:02:50 | Updated 2015-04-26 02:08:43 | Tags: tutorials haplotypecaller bamout ### 1. Overview As you may know, HaplotypeCaller performs a local reassembly and realignment of the reads in the region surrounding potential variant sites (see the HaplotypeCaller method docs for more details on why and how this is done). So it often happens that during the calling process, the reads get moved to different mapping positions than what you can observe in the BAM file that you originally provided to HC as input. These remappings usually explain most discordances between calls that are expected based on the original data and actual calls made by HaplotypeCaller, so it's very useful to be able to visualize what rearrangements the tool has made. ### 2. Generating the bamout for a single site or interval To generate the bamout file for a specific site or interval, just run HaplotypeCaller on the region around the site or interval of interest using the -L argument to restrict the analysis to that region (adding about 500 bp on either side) and using the -bamout argument to specify the name of the bamout file that will be generated. java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -L 20:10255630-10255840 -bamout bamout.bam If you were using any additional parameters in your original variant calling (including -ERC and related arguments), make sure to include them in this command as well so that you can make an apples-to-apples comparison. Then you open up both the original bam and the bamout file together in a genome browser such as IGV. On some test data from our favorite sample, NA12878, this is what you would see: You can see that the bamout file, on top, contains data only for the ActiveRegion that was within the analysis interval specified by -L. 
The two blue reads represent the artificial haplotypes constructed by HaplotypeCaller (you may need to adjust your IGV settings to see the same thing on your machine). You can see a whole group of reads neatly aligned, with an insertion in the middle. In comparison, the original data shown in the lower track has fewer reads with insertions, but has several reads with mismapped ends. This is a classic example of a site where realignment through reassembly has provided additional evidence for an indel, allowing HaplotypeCaller to call it confidently. In contrast, UnifiedGenotyper was not able to call this insertion confidently.

### 3. Generating the bamout for multiple intervals or the whole genome

Although we don't recommend doing this by default because it will cause slower performance and take up a lot of storage space, you can generate a bamout that contains many more intervals, or even covers the whole genome. To do so, just run the same command, but this time pass your list of intervals to -L, or simply omit it if you want the entire genome to be included.

```
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -bamout bamout.bam
```

This time, if you zoom out a bit in IGV, you will see multiple stacks of reads corresponding to the various ActiveRegions that were identified and processed.

### 4. Forcing an output in a region that is not covered in the bamout

In some cases HaplotypeCaller does not complete processing on an ActiveRegion that it has started. This is typically because either there is almost no evidence of variation once the remapping has been done, or, on the contrary, the region is very messy and there is too much complexity. In both cases, the program is designed to give up in order to avoid wasting time. This is a good thing most of the time, but it does mean that sometimes you will have no output in the bamout for the site you are trying to troubleshoot.
The good news is that in most cases it is possible to force HaplotypeCaller to go through with the full processing so that it will produce bamout output for your site of interest. To do so, simply add the flags -forceActive and -disableOptimizations to your command line, in addition to the -bamout argument of course.

```
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -L 20:10371667-10375021 -o hc_forced.vcf -bamout force_bamout.bam -forceActive -disableOptimizations
```

In this other region, you can see that the original mapping (middle track) was a bit messy with some possible evidence of variation, and in fact UnifiedGenotyper called a SNP in this region (top variant track). But HaplotypeCaller did not call the SNP, and did not output anything in our first bamout file (top read track). When you force an output in that region using the two new flags, you see in the forced bamout (bottom read track) that the remapped data is a lot cleaner and the evidence for variation is essentially gone. It is also possible to force an ActiveRegion to be triggered at specific intervals; see the HaplotypeCaller tool docs for more details on how this is done.

Created 2014-09-16 00:04:00 | Updated 2016-03-24 02:32:22 | Tags: tutorials bundle workshop

So you're going to a GATK workshop, and you've been selected to participate in a hands-on session? Fantastic! We're looking forward to walking you through some exercises that will help you master the tools. However -- in order to make the best of the time we have together, we'd like to ask you to come prepared. Specifically, please complete the following steps:

#### - Download and install all necessary software as described in this tutorial.

We don't always get around to using RStudio, but all the others are required.
Note that if you are a Mac user, you may need to install Apple's Xcode Tools, which are free but fairly large, so plan ahead because it can take a loooong time to download them if your connection is anything less than super-fast.

#### - Download the tutorial bundle from the link provided by the workshop organizers.

This will typically be provided by email two to three weeks before the date of the workshop. At the start of the session, we'll give you handouts with a walkthrough of the session so you can follow along and take notes (highly recommended!).

With that, you should be all set. See you soon!

Created 2013-06-17 21:18:18 | Updated 2016-02-11 15:12:57 | Tags: tutorials

#### Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

#### Prerequisites

• TBD

#### Steps

1. Analyze patterns of covariation in the sequence dataset
2. Do a second pass to analyze covariation remaining after recalibration
3. Generate before/after plots
4. Apply the recalibration to your sequence data

### 1. Analyze patterns of covariation in the sequence dataset

#### Action

Run the following GATK command:

```
java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -o recal_data.table
```

#### Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data. It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.

### 2. Do a second pass to analyze covariation remaining after recalibration

#### Action

Run the following GATK command:

```
java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -BQSR recal_data.table \
    -o post_recal_data.table
```

#### Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.

### 3. Generate before/after plots

#### Action

Run the following GATK command:

```
java -jar GenomeAnalysisTK.jar \
    -T AnalyzeCovariates \
    -R reference.fa \
    -L 20 \
    -before recal_data.table \
    -after post_recal_data.table \
    -plots recalibration_plots.pdf
```

#### Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.

### 4. Apply the recalibration to your sequence data

#### Action

Run the following GATK command:

```
java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -BQSR recal_data.table \
    -o recal_reads.bam
```

#### Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag --emit_original_quals to the PrintReads command, in which case the original qualities will also be written to the file, tagged OQ. Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What's happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on the fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.

Created 2015-10-04 07:15:53 | Updated | Tags: tutorials workshop ashg

Are you excited? We sure are. Especially in the secondary sense defined by dictionary.reference.com as "stimulated to activity; brisk". We're presenting a poster on Thursday and a 90-minute workshop on Friday, but neither is ready yet. Good thing the weather this weekend is crappy; if we were missing out on proper New England fall foliage / leaf-peeping weather we'd be pretty cheesed off. But we ain't afraid of no deadline -- we'll be ready. We developed a completely new workshop tutorial for the occasion, and we're going to have a big room full of people rocking GVCFs. It's going to be epic.

The tutorial data bundle, sans worksheet (because that's the part that's not quite ready yet), is already available here (not a direct link to the data because we want you to read the part about the homework). It does have an appendix document with installation instructions and some context info about the tutorial objectives, which you must read through (and act on) before the workshop, if you're attending.

If you're at ASHG but you can't make it to the workshop, you can still come see our poster, which covers the same topic (the GVCF workflow part of the Best Practices), but flatter and less interactive. Although Sheila and I will be there to answer questions one on one, so in that sense it will be more interactive. Just with less keyboard action.
So, Thursday 9 Oct between 12 and 1 pm, at the Bioinformatics and Genomic Technology session in the Exhibit Hall, Level 1, Convention Center, poster #1664/T. Be there. We'll talk. We'll also be around at the Broad Institute Genomic Services booth in the Exhibit Hall (booth #1720, right around the corner from Qiagen). Not sure yet when we'll be there, but send me a private message if you'd like to chat and we can figure out a time. See you there!

Created 2013-07-09 13:26:11 | Updated 2013-07-17 01:59:08 | Tags: tutorials workshop

Before the workshop, you should run through this tutorial to install all the software on your laptop:

We use a small test dataset that you can download from this link (1.1 Gb):

During the hands-on session of the workshop, we walk through the following tutorials, with some minor modifications:

Created 2016-05-17 01:26:22 | Updated | Tags: tutorials best-practices bug

In the "Realign Indels" step of the GATK Best Practices (https://www.broadinstitute.org/gatk/guide/bp_step.php?p=1), the link pointing to the step-by-step tutorial "(howto) Perform local realignment around indels" (https://www.broadinstitute.org/gatk/guide/article?id=2800) no longer works. I believe the new URL is https://www.broadinstitute.org/gatk/guide/article?id=7156

Created 2013-08-27 18:39:22 | Updated 2013-08-27 18:49:11 | Tags: tutorials countreads

Have I done this right? These are the results I received. According to the tutorial, if I saw the line that had the word "Walker" in it, then I did it right. But I'm not sure whether I'm right or wrong, because CountReads gave me the number of reads counted (2075853).

```
INFO  18:32:02,582 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:32:03,761 GenomeAnalysisEngine - Strictness is SILENT
INFO  18:32:04,457 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  18:32:04,469 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  18:32:04,569 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.08
INFO  18:32:04,933 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  18:32:04,939 GenomeAnalysisEngine - Done preparing for traversal
INFO  18:32:04,940 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  18:32:04,940 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO  18:32:04,944 ReadShardBalancer$1 - Loading BAM index data
INFO  18:32:35,157 ProgressMeter - NODE_375_length_263320_cov_14.647926:263299        1.75e+06   30.0 s       17.0 s     84.0%        35.0 s     5.0 s
INFO  18:32:38,828 ProgressMeter -            done        2.08e+06   33.0 s       16.0 s    100.0%        33.0 s     0.0 s
INFO  18:32:38,828 ProgressMeter - Total runtime 33.89 secs, 0.56 min, 0.01 hours
INFO  18:32:38,960 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 2075853 total reads (0.00%)
INFO  18:32:38,960 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  18:32:39,806 GATKRunReport - Uploaded run statistics report to AWS S3
```
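For the record, the totals reported in a log like the one above can be pulled out with standard text tools. A quick sketch — the log file name (countreads.log) is hypothetical, and the sample line written below is the MicroScheduler line quoted in the post:

```shell
# Write the summary line quoted above to a scratch log file (hypothetical
# name), then extract the approximate total read count from it.
printf '%s\n' 'INFO  18:32:38,960 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 2075853 total reads (0.00%)' > countreads.log
grep -o 'approximately [0-9]* total reads' countreads.log
# prints: approximately 2075853 total reads
```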

Created 2013-07-05 16:57:08 | Updated | Tags: tutorials baserecalibrator analyzecovariates

In Step 3, the example code still has the deprecated walker

```
-T AnalyzeCovariates
```

which, when used, generates this:

"ERROR MESSAGE: Walker AnalyzeCovariates is no longer available in the GATK; it has been deprecated since version 2.0 (use BaseRecalibrator instead; see documentation for usage)"

Created 2013-05-29 16:52:40 | Updated | Tags: tutorials reducereads

I am a complete newb. Even with help and support from my lab mates, I need to read your materials. I was sent by the GATK Guide Book (page 10, section 4) to the Dropbox location https://www.dropbox.com/sh/e31kvbg5v63s51t/ajQmlTL6YH where I picked up ReduceReads.pdf. On page 11 of that document there are ten graphs. The resolution of the .pdf file is so low that I cannot read the legends on the left side and bottom of these ten graphs. Could you point me to a high-resolution version of this .pdf?

Thanks

Created 2013-05-21 18:08:59 | Updated | Tags: tutorials error

On the forum page there are two examples. The first runs fine. The second generates this error:

```
MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'
```

but the input files are the same. I only changed "Reads" to "Loci" in the command (I am running Unix, so I did not need to retype the entire command). This command works fine:

```
java -jar GenomeAnalysisTK.jar -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
```

This command produces the error:

```
java -jar GenomeAnalysisTK.jar -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt
```

Any suggestions?