So you're going to a GATK workshop, and you've been selected to participate in a hands-on session? Fantastic! We're looking forward to walking you through some exercises that will help you master the tools. However -- in order to make the best of the time we have together, we'd like to ask you to come prepared. Specifically, please complete the following steps:

#### - Download and install all necessary software as described in this tutorial.

No need to install HTSlib, but all others are required. Note that if you are a Mac user, you may need to install Apple's XCode Tools, which are free but fairly large, so plan ahead because it can take a loooong time to download them if your connection is anything less than super-fast.

#### - Download the basic tutorial data bundle from our FTP server.

Speaking of long downloads, this one is also pretty big (740 MB), so again, don't leave it until the last minute. This mini-bundle contains chromosome 20 of the human genome reference, a BAM file snippet, and the accompanying dbSNP and known indels files.

If you are attending the advanced hands-on session (if you're not sure, it's usually the one on the second or third day of the workshop), you'll need some extra files that aren't in the basic tutorial bundle. This add-on bundle is also quite large (870 MB) because it contains the complete human genome and a complete whole-genome callset. Note that it will take around 4 GB of space on your hard drive once it's uncompressed, so make sure you have plenty of space available on your machine.
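To check ahead of time whether your machine has room for both bundles, something like the following sketch may help. The 6 GB budget is an assumption (roughly the two downloads plus the ~4 GB uncompressed add-on), and `free_kb` is a hypothetical helper name, not part of any GATK tooling.

```shell
#!/bin/sh
# Assumed budget: about 6 GB, expressed in KB (adjust to your situation).
required_kb=$((6 * 1024 * 1024))

# free_kb DIR: print the free space, in KB, of the filesystem holding DIR.
# Uses the POSIX output format of df so that column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

available_kb=$(free_kb .)
if [ "$available_kb" -ge "$required_kb" ]; then
    echo "OK: ${available_kb} KB free"
else
    echo "WARNING: only ${available_kb} KB free, ${required_kb} KB suggested"
fi
```

Run it from the directory where you plan to unpack the bundles, since the check applies to the filesystem that holds the current directory.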

At the start of the session, we'll give you handouts with a walkthrough of the session so you can follow along and take notes (highly recommended!).

With that, you should be all set. See you soon!

#### Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

• TBD

#### Steps

1. Analyze patterns of covariation in the sequence dataset
2. Do a second pass to analyze covariation remaining after recalibration
3. Generate before/after plots
4. Apply the recalibration to your sequence data

### 1. Analyze patterns of covariation in the sequence dataset

#### Action

Run the following GATK command (input_reads.bam is a placeholder name for your own BAM file):

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-I input_reads.bam \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-o recal_data.table


#### Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.
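To get a feel for what an "empirical base quality" is, here is a minimal sketch of the standard Phred relationship between an observed error rate and a quality score. It is only the textbook definition, Q = -10 * log10(errors / observations); the estimator GATK actually uses is more involved, and `phred_from_counts` is a hypothetical helper name.

```shell
#!/bin/sh
# phred_from_counts MISMATCHES OBSERVATIONS
# Prints the Phred-scaled quality implied by an observed mismatch rate.
# Simplified illustration only: no smoothing or prior, so MISMATCHES must be > 0.
phred_from_counts() {
    awk -v m="$1" -v n="$2" 'BEGIN { printf "%.1f\n", -10 * log(m / n) / log(10) }'
}

phred_from_counts 1 1000   # prints 30.0
phred_from_counts 1 100    # prints 20.0
```

So if bases reported as Q30 actually mismatch the reference once per 100 observations at sites that are not known variants, their empirical quality is closer to Q20, which is the kind of discrepancy the recalibration tables capture.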

### 2. Do a second pass to analyze covariation remaining after recalibration

#### Action

Run the following GATK command (input_reads.bam is a placeholder name for your own BAM file):

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-I input_reads.bam \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-BQSR recal_data.table \
-o post_recal_data.table


#### Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.

### 3. Generate before/after plots

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates \
-R reference.fa \
-L 20 \
-before recal_data.table \
-after post_recal_data.table \
-plots recalibration_plots.pdf


#### Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.

### 4. Apply the recalibration to your sequence data

#### Action

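Run the following GATK command (input_reads.bam is a placeholder name for your own BAM file):

java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fa \
-I input_reads.bam \
-L 20 \
-BQSR recal_data.table \
-o recal_reads.bam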


#### Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag --emit_original_quals to the PrintReads command, in which case the original qualities will also be written to the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.
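The four steps of this tutorial can be laid out in one script. The sketch below is a dry run: it only prints the commands so you can review them before running anything. The name input_reads.bam and the helpers `gatk_cmd` and `bqsr_plan` are assumptions made for illustration; the tools and flags are the ones used in the steps above, plus -I for the input BAM.

```shell
#!/bin/sh
# Placeholder file names; replace them with your own.
REF=reference.fa
BAM=input_reads.bam   # assumed name for the input BAM

# Dry run: print the command instead of executing it.
gatk_cmd() {
    echo java -jar GenomeAnalysisTK.jar "$@"
}

bqsr_plan() {
    # 1. First pass: model covariation in the original data
    gatk_cmd -T BaseRecalibrator -R "$REF" -I "$BAM" -L 20 \
        -knownSites dbsnp.vcf -knownSites gold_indels.vcf -o recal_data.table
    # 2. Second pass: model what remains after on-the-fly recalibration
    gatk_cmd -T BaseRecalibrator -R "$REF" -I "$BAM" -L 20 \
        -knownSites dbsnp.vcf -knownSites gold_indels.vcf \
        -BQSR recal_data.table -o post_recal_data.table
    # 3. Before/after plots
    gatk_cmd -T AnalyzeCovariates -R "$REF" -L 20 \
        -before recal_data.table -after post_recal_data.table \
        -plots recalibration_plots.pdf
    # 4. Write the recalibrated reads
    gatk_cmd -T PrintReads -R "$REF" -I "$BAM" -L 20 \
        -BQSR recal_data.table -o recal_reads.bam
}

bqsr_plan
```

To execute the steps for real, drop the `echo` from `gatk_cmd` and add `set -e` at the top so the script stops at the first failing step.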

Before the workshop, you should run through this tutorial to install all the software on your laptop:

• https://s3.amazonaws.com/gatk-workshop/mini-bundle.zip

During the hands-on session of the workshop, we walk through the following tutorials, with some minor modifications:

Have I done this right? These are the results I received. According to the tutorial, if I saw a line containing the word "Walker", then I did it right. But I'm not sure whether I'm right or wrong, because CountReads gave me the number of reads counted (2075853).

-----------------------------------
INFO  18:32:02,582 HelpFormatter - --------------------------------------------------------------------------------
INFO  18:32:03,761 GenomeAnalysisEngine - Strictness is SILENT
INFO  18:32:04,457 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  18:32:04,469 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  18:32:04,569 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.08
INFO  18:32:04,933 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  18:32:04,939 GenomeAnalysisEngine - Done preparing for traversal
INFO  18:32:04,940 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  18:32:04,944 ReadShardBalancer$1 - Loading BAM index data
INFO  18:32:04,945 ReadShardBalancer$1 - Done loading BAM index data
INFO  18:32:35,157 ProgressMeter - NODE_375_length_263320_cov_14.647926:263299        1.75e+06   30.0 s       17.0 s     84.0%        35.0 s     5.0 s
INFO  18:32:38,828 ProgressMeter -            done        2.08e+06   33.0 s       16.0 s    100.0%        33.0 s     0.0 s
INFO  18:32:38,828 ProgressMeter - Total runtime 33.89 secs, 0.56 min, 0.01 hours
INFO  18:32:38,960 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 2075853 total reads (0.00%)
INFO  18:32:38,960 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  18:32:39,806 GATKRunReport - Uploaded run statistics report to AWS S3
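One way to cross-check a run like this is to pull the approximate read total out of the MicroScheduler summary line and compare it with the number the walker reported. This is a minimal sketch that assumes the log wording matches the line pasted above; `total_reads_from_log` is a hypothetical helper name.

```shell
#!/bin/sh
# total_reads_from_log [FILE...]
# Prints the number that follows "out of approximately" in a GATK log.
# Reads standard input when no file is given.
total_reads_from_log() {
    sed -n 's/.*out of approximately \([0-9][0-9]*\) total reads.*/\1/p' "$@"
}

# Example using the MicroScheduler line from the log above
echo 'INFO  18:32:38,960 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 2075853 total reads (0.00%)' \
    | total_reads_from_log   # prints 2075853
```

Here the extracted total matches the 2075853 reads that CountReads reported, and the ProgressMeter reached "done" at 100.0%, both of which suggest the traversal ran to completion.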


In Step 3, the code example still uses the deprecated walker
-T AnalyzeCovariates
which, when used, generates this:
"ERROR MESSAGE: Walker AnalyzeCovariates is no longer available in the GATK; it has been deprecated since version 2.0 (use BaseRecalibrator instead; see documentation for usage)"

I am a complete newb. Even with help and support from my lab mates, I need to read your materials. The GATK Guide Book (page 10, section 4) sent me to the Dropbox location https://www.dropbox.com/sh/e31kvbg5v63s51t/ajQmlTL6YH where I picked up ReduceReads.pdf. On page 11 of that document there are ten graphs. The resolution of the PDF is so low that I cannot read the legends on the left side and bottom of these ten graphs. Could you point me to a high-resolution version of this PDF?

Thanks

On the forum page there are two examples. The first runs fine. The second generates this error:

MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'

but the input files are the same. I only changed "Reads" to "Loci" in the command. I am running Unix so I do not need to retype the entire command. This command works fine

java -jar GenomeAnalysisTK.jar -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam

This command produces the error

java -jar GenomeAnalysisTK.jar -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt

Any suggestions?
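A possible lead, offered as an assumption rather than a confirmed diagnosis: the value '10' in the message is the byte code of the ASCII line feed character, which hints that a newline was read where a base was expected. That could happen if the FASTA contains blank lines or if its index no longer matches the file. CountLoci walks reference positions, so it may read reference bases that CountReads never touches. A quick check for blank lines might look like this, where `fasta_blank_lines` is a hypothetical helper name:

```shell
#!/bin/sh
# fasta_blank_lines FILE
# Prints how many completely empty lines FILE contains.
fasta_blank_lines() {
    awk 'length($0) == 0 { n++ } END { print n + 0 }' "$1"
}

# Demonstration on a tiny throwaway FASTA with one blank line in it
tmp=$(mktemp)
printf '>chr_demo\nACGT\n\nACGT\n' > "$tmp"
fasta_blank_lines "$tmp"   # prints 1
rm -f "$tmp"
```

If running it on exampleFASTA.fasta reports a count above zero, removing the blank lines and regenerating the index and dictionary files would be a reasonable next thing to try.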