The Best Practices variant discovery workflow depends on having sequence data in the form of reads that are aligned to a reference genome. So the very first step is of course to map your reads to the reference to produce a file in SAM/BAM format. We recommend using BWA, but depending on your data and how it was sequenced, you may need to use a different aligner. Once you have mapped the reads, you'll need to make sure they are sorted in the proper order (by coordinate).
Then you can proceed to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called **dedupping** in bioinformatics slang) identifies these reads as such so that the GATK tools know they should ignore them.
These steps are performed with tools such as Samtools and Picard that are not part of GATK, so we don't provide detailed documentation of all the options available. For more details, please see those tools' respective documentations.
Map the read data to the reference and mark duplicates.
The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, take a look at the SAM specification.
Compose the read group identifier in the following format:
\t stands for the tab character.
Run the following BWA command:
In this command, replace read group info by the read group identifier composed in the previous step.
bwa mem -M -R ’<read group info>’ -p reference.fa raw_reads.fq > aligned_reads.sam
<read group info> bit with the read group identifier you composed at the previous step.
-M flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).
This creates a file called
aligned_reads.sam containing the aligned reads from all input files, combined, annotated and aligned to the same reference.
Note that here we are using a command that is specific for pair ended data in an interleaved fastq file, which is what we are providing to you as a tutorial file. To map other types of datasets (e.g. single-ended or pair-ended in forward/reverse read files) you will need to adapt the command accordingly. Please see the BWA documentation for exact usage and more options for these commands.
These initial pre-processing operations format the data to suit the requirements of the GATK tools.
Run the following Picard command to sort the SAM file and convert it to BAM:
java -jar SortSam.jar \ INPUT=aligned_reads.sam \ OUTPUT=sorted_reads.bam \ SORT_ORDER=coordinate
This creates a file called
sorted_reads.bam containing the aligned reads sorted by coordinate.
Run the following Picard command to mark duplicates:
java -jar MarkDuplicates.jar \ INPUT=sorted_reads.bam \ OUTPUT=dedup_reads.bam \ METRICS=metrics.txt
This creates a sorted BAM file called
dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called
metrics.txt containing (can you guess?) metrics.
Run the following Picard command to index the BAM file:
java -jar BuildBamIndex.jar \ INPUT=dedup_reads.bam
This creates an index file for the BAM file called
Hi, Does GATK2 provide a walker/option to summarize the read alignment in a given BAM file? The summary including total reads, reads mapped/%, reads uniquely mapped/%, reads uniquely mapped with 0mm/%, reads mapped on-target/%, reads uniquely mapped on-target%, etc is of great use to assess the mapping quality for whole genome or targeted analysis. Please advice me on how I can obtain this using any of the walkers available. Thanks, Raj