This document is intended to be a record of how the tutorial files were prepared for the ASHG 2015 hands-on workshop.
This produces a 64 MB file (uncompressed), which is small enough for our purposes, so we don't need to truncate it further; that simplifies future data file preparations.
# Extract just chromosome 20
samtools faidx /humgen/gsa-hpprojects/GATK/bundle/current/b37/human_g1k_v37.fasta 20 > human_g1k_b37_20.fasta

# Create the reference index
samtools faidx human_g1k_b37_20.fasta

# Create sequence dictionary
java -jar $PICARD CreateSequenceDictionary R=human_g1k_b37_20.fasta O=human_g1k_b37_20.dict

# Recap files
-rw-rw-r-- 1 vdauwera wga 164 Oct 1 14:56 human_g1k_b37_20.dict
-rw-rw-r-- 1 vdauwera wga 64075950 Oct 1 14:41 human_g1k_b37_20.fasta
-rw-rw-r-- 1 vdauwera wga 20 Oct 1 14:46 human_g1k_b37_20.fasta.fai
We are using the 2nd-generation CEU Trio of NA12878, her husband and their child, from a WGS dataset produced at Broad, with files named after the library preps (Solexa-xxxxxx.bam).
java -jar $GATK -T PrintReads -R /path/to/bundle/current/b37/human_g1k_v37_decoy.fasta -I /path/to/Solexa-272221.bam -o NA12877_wgs_20_10M20M.bam -L 20:10000000-20000000 -rf BadMate

java -jar $GATK -T PrintReads -R /path/to/bundle/current/b37/human_g1k_v37_decoy.fasta -I /path/to/Solexa-272222.bam -o NA12878_wgs_20_10M20M.bam -L 20:10000000-20000000 -rf BadMate

java -jar $GATK -T PrintReads -R /path/to/bundle/current/b37/human_g1k_v37_decoy.fasta -I /path/to/Solexa-272228.bam -o NA12882_wgs_20_10M20M.bam -L 20:10000000-20000000 -rf BadMate

# Recap files
-rw-rw-r-- 1 vdauwera wga 36240 Oct 2 11:55 NA12877_wgs_20_10M20M.bai
-rw-rw-r-- 1 vdauwera wga 512866085 Oct 2 11:55 NA12877_wgs_20_10M20M.bam
-rw-rw-r-- 1 vdauwera wga 36176 Oct 2 11:53 NA12878_wgs_20_10M20M.bai
-rw-rw-r-- 1 vdauwera wga 502282846 Oct 2 11:53 NA12878_wgs_20_10M20M.bam
-rw-rw-r-- 1 vdauwera wga 36464 Oct 2 12:00 NA12882_wgs_20_10M20M.bai
-rw-rw-r-- 1 vdauwera wga 505001668 Oct 2 12:00 NA12882_wgs_20_10M20M.bam
samtools view -H NA12877_wgs_20_10M20M.bam > NA12877_header.txt
samtools view -H NA12878_wgs_20_10M20M.bam > NA12878_header.txt
samtools view -H NA12882_wgs_20_10M20M.bam > NA12882_header.txt
The manual editing is not shown here; basically, we delete the unwanted contig @SQ lines and remove identifying information from internal file paths.
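For the record, here is a minimal sketch of what that cleanup could look like if scripted. The actual edits were made by hand in a text editor, so the patterns and file names below are illustrative assumptions, not the commands we ran:

# Keep all non-@SQ header lines, plus only the @SQ line for chromosome 20
awk '!/^@SQ/ || /SN:20[[:space:]]/' NA12877_header.txt > NA12877_header_clean.txt

# Scrub internal file paths from @PG command lines (the replacement is a placeholder)
sed 's|/humgen/[^[:space:]]*|/path/to/input.bam|g' NA12877_header_clean.txt > NA12877_header_final.txt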
java -jar $PICARD SamFormatConverter I=NA12877_wgs_20_10M20M.bam O=NA12877_wgs_20_10M20M.sam
java -jar $PICARD SamFormatConverter I=NA12878_wgs_20_10M20M.bam O=NA12878_wgs_20_10M20M.sam
java -jar $PICARD SamFormatConverter I=NA12882_wgs_20_10M20M.bam O=NA12882_wgs_20_10M20M.sam

# Recap files
-rw-rw-r-- 1 vdauwera wga 1694169101 Oct 2 12:28 NA12877_wgs_20_10M20M.sam
-rw-rw-r-- 1 vdauwera wga 1661483309 Oct 2 12:30 NA12878_wgs_20_10M20M.sam
-rw-rw-r-- 1 vdauwera wga 1696553456 Oct 2 12:31 NA12882_wgs_20_10M20M.sam
java -jar $PICARD ReplaceSamHeader I=NA12877_wgs_20_10M20M.sam O=NA12877_wgs_20_10M20M_RH.sam HEADER=NA12877_header.txt
java -jar $PICARD ReplaceSamHeader I=NA12878_wgs_20_10M20M.sam O=NA12878_wgs_20_10M20M_RH.sam HEADER=NA12878_header.txt
java -jar $PICARD ReplaceSamHeader I=NA12882_wgs_20_10M20M.sam O=NA12882_wgs_20_10M20M_RH.sam HEADER=NA12882_header.txt

# Recap files
-rw-rw-r-- 1 vdauwera wga 1694153715 Oct 2 12:35 NA12877_wgs_20_10M20M_RH.sam
-rw-rw-r-- 1 vdauwera wga 1661467923 Oct 2 12:37 NA12878_wgs_20_10M20M_RH.sam
-rw-rw-r-- 1 vdauwera wga 1696538104 Oct 2 12:38 NA12882_wgs_20_10M20M_RH.sam
java -jar $PICARD RevertSam I=NA12877_wgs_20_10M20M_RH.sam O=NA12877_wgs_20_10M20M_RS.sam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=false REMOVE_DUPLICATE_INFORMATION=false REMOVE_ALIGNMENT_INFORMATION=false ATTRIBUTE_TO_CLEAR=null SANITIZE=true MAX_DISCARD_FRACTION=0.001

java -jar $PICARD RevertSam I=NA12878_wgs_20_10M20M_RH.sam O=NA12878_wgs_20_10M20M_RS.sam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=false REMOVE_DUPLICATE_INFORMATION=false REMOVE_ALIGNMENT_INFORMATION=false ATTRIBUTE_TO_CLEAR=null SANITIZE=true MAX_DISCARD_FRACTION=0.001

java -jar $PICARD RevertSam I=NA12882_wgs_20_10M20M_RH.sam O=NA12882_wgs_20_10M20M_RS.sam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=false REMOVE_DUPLICATE_INFORMATION=false REMOVE_ALIGNMENT_INFORMATION=false ATTRIBUTE_TO_CLEAR=null SANITIZE=true MAX_DISCARD_FRACTION=0.001

# Recap files
-rw-rw-r-- 1 vdauwera wga 1683827201 Oct 2 12:45 NA12877_wgs_20_10M20M_RS.sam
-rw-rw-r-- 1 vdauwera wga 1652093793 Oct 2 12:49 NA12878_wgs_20_10M20M_RS.sam
-rw-rw-r-- 1 vdauwera wga 1688143091 Oct 2 12:54 NA12882_wgs_20_10M20M_RS.sam
java -jar $PICARD SortSam I=NA12877_wgs_20_10M20M_RS.sam O=NA12877_wgs_20_10M20M_V.bam SORT_ORDER=coordinate CREATE_INDEX=TRUE
java -jar $PICARD SortSam I=NA12878_wgs_20_10M20M_RS.sam O=NA12878_wgs_20_10M20M_V.bam SORT_ORDER=coordinate CREATE_INDEX=TRUE
java -jar $PICARD SortSam I=NA12882_wgs_20_10M20M_RS.sam O=NA12882_wgs_20_10M20M_V.bam SORT_ORDER=coordinate CREATE_INDEX=TRUE

# Recap files
-rw-rw-r-- 1 vdauwera wga 35616 Oct 2 13:08 NA12877_wgs_20_10M20M_V.bai
-rw-rw-r-- 1 vdauwera wga 508022682 Oct 2 13:08 NA12877_wgs_20_10M20M_V.bam
-rw-rw-r-- 1 vdauwera wga 35200 Oct 2 13:06 NA12878_wgs_20_10M20M_V.bai
-rw-rw-r-- 1 vdauwera wga 497742417 Oct 2 13:06 NA12878_wgs_20_10M20M_V.bam
-rw-rw-r-- 1 vdauwera wga 35632 Oct 2 13:04 NA12882_wgs_20_10M20M_V.bai
-rw-rw-r-- 1 vdauwera wga 500446729 Oct 2 13:04 NA12882_wgs_20_10M20M_V.bam
java -jar $PICARD ValidateSamFile I=NA12877_wgs_20_10M20M_V.bam M=SUMMARY
java -jar $PICARD ValidateSamFile I=NA12878_wgs_20_10M20M_V.bam M=SUMMARY
java -jar $PICARD ValidateSamFile I=NA12882_wgs_20_10M20M_V.bam M=SUMMARY
As you may know, HaplotypeCaller performs a local reassembly and realignment of the reads in the region surrounding potential variant sites (see the HaplotypeCaller method docs for more details on why and how this is done). So it often happens that, during the calling process, reads get moved to mapping positions that differ from what you can observe in the BAM file you originally provided to HC as input.
These remappings usually explain most discordances between calls that are expected based on the original data and actual calls made by HaplotypeCaller, so it's very useful to be able to visualize what rearrangements the tool has made.
To generate the bamout file for a specific site or interval, just run HaplotypeCaller on the region around the site or interval of interest using the
-L argument to restrict the analysis to that region (adding about 500 bp on either side) and using the
-bamout argument to specify the name of the bamout file that will be generated.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -L 20:10255630-10255840 -bamout bamout.bam
If you were using any additional parameters in your original variant calling (including
-ERC and related arguments), make sure to include them in this command as well so that you can make an apples-to-apples comparison.
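For example, if your original calls were made in GVCF mode, you would carry the -ERC argument over into the bamout command, something like this (the file names here are just placeholders):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -ERC GVCF -o hc_variants.g.vcf -L 20:10255630-10255840 -bamout bamout.bam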
Then you open both the original BAM and the bamout file together in a genome browser such as IGV. On some test data from our favorite sample, NA12878, this is what you would see:
You can see that the bamout file, on top, contains data only for the ActiveRegion that was within the analysis interval specified by
-L. The two blue reads represent the artificial haplotypes constructed by HaplotypeCaller (you may need to adjust your IGV settings to see the same thing on your machine).
You can see a whole group of reads neatly aligned, with an insertion in the middle. In comparison, the original data shown in the lower track has fewer reads with insertions, but has several reads with mismapped ends. This is a classic example of a site where realignment through reassembly has provided additional evidence for an indel, allowing HaplotypeCaller to call it confidently. In contrast, UnifiedGenotyper was not able to call this insertion confidently.
Although we don't recommend doing this by default because it will cause slower performance and take up a lot of storage space, you can generate a bamout that contains many more intervals, or even covers the whole genome. To do so, just run the same command, but this time, pass your list of intervals to
-L, or simply omit it if you want the entire genome to be included.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -bamout bamout.bam
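If instead you want a specific set of intervals, pass them to -L as a list file; something like this should work (my_intervals.list is a placeholder for your own interval list):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -L my_intervals.list -bamout bamout.bam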
This time, if you zoom out a bit in IGV, you will see multiple stacks of reads corresponding to the various ActiveRegions that were identified and processed.
In some cases HaplotypeCaller does not complete processing on an ActiveRegion that it has started. This is typically because there is either almost no evidence of variation once the remapping has been done, or on the contrary, the region is very messy and there is too much complexity. In both cases, the program is designed to give up in order to avoid wasting time. This is a good thing most of the time, but it does mean that sometimes you will have no output in the bamout for the site you are trying to troubleshoot.
The good news is that in most cases it is possible to force HaplotypeCaller to go through with the full processing so that it will produce bamout output for your site of interest. To do so, simply add the flags
-forceActive and -disableOptimizations to your command line, in addition to the
-bamout argument of course.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -L 20:10371667-10375021 -o hc_forced.vcf -bamout force_bamout.bam -forceActive -disableOptimizations
In this other region, you can see that the original mapping (middle track) was a bit messy with some possible evidence of variation, and in fact UnifiedGenotyper called a SNP in this region (top variant track). But HaplotypeCaller did not call the SNP, and did not output anything in our first bamout file (top read track). When you force an output in that region using the two new flags, you see in the forced bamout (bottom read track) that the remapped data is a lot cleaner and the evidence for variation is essentially gone.
It is also possible to force an ActiveRegion to be triggered at specific intervals; see the HaplotypeCaller tool docs for more details on how this is done.
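As a rough sketch, assuming GATK 3's engine-level --activeRegionIn (-AR) argument and a hypothetical interval list file called force_regions.intervals, that might look like the following; do check the tool docs for the exact syntax in your version:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -AR force_regions.intervals -o hc_forced_regions.vcf -bamout force_regions_bamout.bam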
So you're going to a GATK workshop, and you've been selected to participate in a hands-on session? Fantastic! We're looking forward to walking you through some exercises that will help you master the tools. However -- in order to make the best of the time we have together, we'd like to ask you to come prepared. Specifically, please complete the following steps:
We don't always get around to using RStudio, but all the others are required. Note that if you are a Mac user, you may need to install Apple's Xcode Tools, which are free but fairly large, so plan ahead because it can take a loooong time to download them if your connection is anything less than super-fast.
The basic tutorial data bundle
Speaking of long downloads, this one is also pretty big (740 MB), so again, don't leave it until the last minute. This mini-bundle contains chromosome 20 of the human genome reference, a BAM file snippet, and the accompanying dbSNP and known-indels files.
At the start of the session, we'll give you handouts with a walkthrough of the session so you can follow along and take notes (highly recommended!).
With that, you should be all set. See you soon!
Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -o recal_data.table
This creates a GATKReport file called
recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.
It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -BQSR recal_data.table \
    -o post_recal_data.table
This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the
-BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
    -T AnalyzeCovariates \
    -R reference.fa \
    -L 20 \
    -before recal_data.table \
    -after post_recal_data.table \
    -plots recalibration_plots.pdf
This generates a document called
recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -BQSR recal_data.table \
    -o recal_reads.bam
This creates a file called
recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag
--emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.
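That variant of the command would look like this (the same as above, with the flag added):

java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fa \
    -I realigned_reads.bam \
    -L 20 \
    -BQSR recal_data.table \
    --emit_original_quals \
    -o recal_reads.bam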
Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the
-BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.
Are you excited? We sure are. Especially in the secondary sense defined by dictionary.reference.com as "stimulated to activity; brisk:". We're presenting a poster on Thursday and a 90-minute workshop on Friday, but neither is ready yet. Good thing the weather this weekend is crappy; if we were missing out on proper New England fall foliage / leaf-peeping weather we'd be pretty cheesed off.
But we ain't afraid of no deadline -- we'll be ready. We developed a completely new workshop tutorial for the occasion, and we're going to have a big room full of people rocking GVCFs. It's going to be epic. The tutorial data bundle, sans worksheet (because that's the part that's not quite ready yet) is already available here (not a direct link to the data because we want you to read the part about the homework). It does have an appendix document with installation instructions and some context info about the tutorial objectives, which you must read through (and act on) before the workshop, if you're attending.
If you're at ASHG but you can't make it to the workshop, you can still come see our poster, which covers the same topic (the GVCF workflow part of the Best Practices), but flatter and less interactive. Although Sheila and I will be there to answer questions one on one, so in that sense it will be more interactive. Just with less keyboard action. So, Thursday 9 Oct between 12 and 1 pm at the Bioinformatics and Genomic Technology session in the Exhibit Hall, Level 1; Convention Center, poster #1664/T. Be there. We'll talk.
We'll also be around at the Broad Institute Genomic Services booth in the Exhibit Hall (booth #1720, right around the corner from Qiagen). Not sure yet when we'll be there, but send me a private message if you'd like to chat and we can figure out a time.
See you there!
Before the workshop, you should run through this tutorial to install all the software on your laptop:
We use a small test dataset that you can download from this link (1.1 GB):
During the hands-on session of the workshop, we walk through the following tutorials, with some minor modifications:
Have I done this right? These are the results I received. According to the tutorial, if I saw a line containing the word "Walker", then I did it right. But I'm not sure whether I'm right or wrong, because CountReads gave me the number of reads counted (2075853):
-----------------------------------
INFO 18:32:02,582 HelpFormatter - --------------------------------------------------------------------------------
INFO 18:32:03,761 GenomeAnalysisEngine - Strictness is SILENT
INFO 18:32:04,457 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 18:32:04,469 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 18:32:04,569 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.08
INFO 18:32:04,933 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 18:32:04,939 GenomeAnalysisEngine - Done preparing for traversal
INFO 18:32:04,940 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 18:32:04,940 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 18:32:04,944 ReadShardBalancer$1 - Loading BAM index data
INFO 18:32:04,945 ReadShardBalancer$1 - Done loading BAM index data
INFO 18:32:35,157 ProgressMeter - NODE_375_length_263320_cov_14.647926:263299 1.75e+06 30.0 s 17.0 s 84.0% 35.0 s 5.0 s
INFO 18:32:38,826 CountReads - CountReads counted 2075853 reads in the traversal
INFO 18:32:38,828 ProgressMeter - done 2.08e+06 33.0 s 16.0 s 100.0% 33.0 s 0.0 s
INFO 18:32:38,828 ProgressMeter - Total runtime 33.89 secs, 0.56 min, 0.01 hours
INFO 18:32:38,960 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 2075853 total reads (0.00%)
INFO 18:32:38,960 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 18:32:39,806 GATKRunReport - Uploaded run statistics report to AWS S3
In Step 3, the code example still uses the deprecated walker,
which, when used, generates this error:
"ERROR MESSAGE: Walker AnalyzeCovariates is no longer available in the GATK; it has been deprecated since version 2.0 (use BaseRecalibrator instead; see documentation for usage)"
I am a complete newb. Even with help and support from my lab mates, I need to read your materials. I was sent by the GATK Guide Book (page 10, section 4) to the Dropbox location https://www.dropbox.com/sh/e31kvbg5v63s51t/ajQmlTL6YH, where I picked up ReduceReads.pdf. On page 11 of that document there are ten graphs. The resolution of the PDF is so low that I cannot read the legends on the left side and bottom of these ten graphs. Could you point me to a high-resolution version of this PDF?
On the forum page
there are two examples. The first runs fine; the second generates this error:
MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'
but the input files are the same. I only changed "Reads" to "Loci" in the command. I am running Unix, so I do not need to retype the entire command. This command works fine:
java -jar GenomeAnalysisTK.jar -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
This command produces the error:
java -jar GenomeAnalysisTK.jar -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt