Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.
Let's say we have this example data:
At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.
The example data becomes:
Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.
The example data becomes:
You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.
The example data becomes:
This gets you two big wins:
People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.
Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.
Processing data originated in the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented in numerous occasions. The guidelines we describe in this document were the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (september 2011) for the datasets tested, but there is no guarantee that it will be the best for every dataset and that other pipelines won't supersede it in the future.
The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.
First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the
bwasw algorithm and allow for relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea on what parameters to use check suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for Pacific Biosciences RS known error mode and benefit from the long reads for a high scoring overall match. (for older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using Pacific Biosciences SMRT pipeline python tools)
To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practices to have your BAM file conform to these specifications.
Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.
You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.
Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.
To account for the high insertion and deletion error rate of the Pacific Biosciences data instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Bioscience data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.
We are sequencing a set of regions that covers about 1.5 megabases in total. We're running into problems with VQSR -- VariantRecalibrator says there are too few variants to do recalibration. To give a sense of numbers, in one sample we have about 3000 SNVs and 600 indels.
We seem to have too few indels to do VQSR on them and have a couple of questions:
Can we combine multiple samples to increase the number of variants, or does VariantRecalibrator need to work on each sample individually?
If we do not use VQSR for indels, should we also avoid VQSR for the SNPs?
The other question is whether joint or batch variant calling across several samples would help us in this case?
Thanks in advance!
I have used GATK in my PhD project, and was wondering if I could get the permission to use the Best Practices workflow graphics  in my doctoral dissertation? How should I attribute your copyright?
Hi all: I find that among all the work flows of GATK http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows there are no workflows for RNA-seq analysis. I understand that GATK mainly focuses on variant calling, can anyone tell me how to use GATK for RNA-seq analysis?