
Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Let's say we have this example data:

  • sample1_lane1.fq
  • sample1_lane2.fq
  • sample2_lane1.fq
  • sample2_lane2.fq

1. Run all core steps per-lane once

At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal (a sketch of these commands follows the file list below).

The example data becomes:

  • sample1_lane1.dedup.realn.recal.bam
  • sample1_lane2.dedup.realn.recal.bam
  • sample2_lane1.dedup.realn.recal.bam
  • sample2_lane2.dedup.realn.recal.bam
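For concreteness, here is a minimal sketch of these per-lane steps for the first lane, assuming single-ended reads, BWA, Picard's MarkDuplicates and the GATK realignment and recalibration walkers. File names, jar paths and resource files (ref.fasta, dbsnp.vcf) are placeholders, and exact arguments may vary with your tool versions:

# map with a read group identifying the lane, then convert and sort (old-style samtools shown)
bwa aln ref.fasta sample1_lane1.fq > sample1_lane1.sai
bwa samse -r "@RG\tID:sample1_lane1\tSM:sample1\tLB:lib1\tPL:ILLUMINA" ref.fasta sample1_lane1.sai sample1_lane1.fq \
    | samtools view -bSu - | samtools sort - sample1_lane1

# mark duplicates (Picard)
java -jar MarkDuplicates.jar INPUT=sample1_lane1.bam OUTPUT=sample1_lane1.dedup.bam \
    METRICS_FILE=sample1_lane1.metrics CREATE_INDEX=true

# realign around indels
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I sample1_lane1.dedup.bam -o sample1_lane1.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I sample1_lane1.dedup.bam \
    -targetIntervals sample1_lane1.intervals -o sample1_lane1.dedup.realn.bam

# recalibrate base quality scores (GATK 2.x style)
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I sample1_lane1.dedup.realn.bam \
    -knownSites dbsnp.vcf -o sample1_lane1.recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I sample1_lane1.dedup.realn.bam \
    -BQSR sample1_lane1.recal.table -o sample1_lane1.dedup.realn.recal.bam

Repeat for each lane of each sample.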

2. Merge lanes per sample

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.

The example data becomes:

  • sample1.merged.bam
  • sample2.merged.bam
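A minimal sketch of this merge with Picard's MergeSamFiles (jar path is a placeholder; any merger that preserves read group information works):

java -jar MergeSamFiles.jar \
    INPUT=sample1_lane1.dedup.realn.recal.bam \
    INPUT=sample1_lane2.dedup.realn.recal.bam \
    OUTPUT=sample1.merged.bam \
    SORT_ORDER=coordinate CREATE_INDEX=true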

3. Per-sample refinement

You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level (a sketch of the commands follows the file list below). It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at the Broad.

The example data becomes:

  • sample1.merged.dedup.realn.bam
  • sample2.merged.dedup.realn.bam
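These are the same dedupping and realignment tools as in step 1, now pointed at the merged BAM. A brief sketch with placeholder file names:

java -jar MarkDuplicates.jar INPUT=sample1.merged.bam OUTPUT=sample1.merged.dedup.bam \
    METRICS_FILE=sample1.metrics CREATE_INDEX=true
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I sample1.merged.dedup.bam -o sample1.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I sample1.merged.dedup.bam \
    -targetIntervals sample1.intervals -o sample1.merged.dedup.realn.bam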

This gets you two big wins:

  • Dedupping per-sample eliminates PCR duplicates across all lanes in addition to optical duplicates (which are by definition only per-lane)
  • Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

People also often ask whether it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That said, for contrastive calling projects -- such as cancer tumor/normal pairs -- we do generally recommend realigning the tumor and the normal together, to avoid slight alignment differences between the two tissue types.

Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.
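For example, a merged BAM that keeps one read group per lane might carry header lines like these (tag values are illustrative). BaseRecalibrator recalibrates each ID separately, while downstream tools group both lanes under the same SM sample:

@RG    ID:sample1_lane1    SM:sample1    LB:lib1    PL:ILLUMINA
@RG    ID:sample1_lane2    SM:sample1    LB:lib1    PL:ILLUMINA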


What is a synthetic read?

When running ReduceReads, the algorithm finds regions of low variation in the genome and compresses them together. To represent a compressed region, we use a synthetic read that carries all the information necessary for downstream tools to perform likelihood calculations over the reduced data.

They are called synthetic because they are not read by a sequencer; these reads are generated automatically by the GATK and can be extremely long. In a synthetic read, each base represents the consensus base for that genomic location, and each base has its consensus quality score stored at the equivalent offset in the quality score string.

Consensus Bases

ReduceReads has several filtering parameters for consensus regions. Consensus is created based on base qualities, mapping qualities and other parameters adjustable from the command line. All filters are described in the technical documentation of ReduceReads.

Consensus Quality Scores

The consensus quality score of a consensus base is essentially the mean of the quality scores of all bases that passed all the filters and represent an observation of that base. It is stored in the quality score field of the SAM format:

    Q_consensus = (1/n) * sum_{i=1}^{n} q_i

where n is the number of bases that contributed to the consensus base and q_i is the corresponding quality score of each base.

Insertion quality scores and Deletion quality scores (generated by BQSR) will undergo the same process and will be represented the same way.

Mapping Quality

The mapping quality of a synthetic read is a value representative of the mapping qualities of all the reads that contributed to it: the root mean square (RMS) of the mapping qualities of all contributing reads. It is stored in the mapping quality field of the SAM format:

    MQ = sqrt( (1/n) * sum_{i=1}^{n} x_i^2 )

where n is the number of reads and x_i is the mapping quality of each read.

Original Alignments

A synthetic read may come with up to two extra tags representing its original alignment information. Due to the many filters in ReduceReads, reads are hard-clipped to the area of interest. These hard clips are always represented in the CIGAR string with the H operator and the length of the clipping in genomic coordinates. Sometimes hard clipping makes it impossible to recover the original alignment start or end of a read. In those cases, the read carries extra tags with integer values representing its original alignment start or end.

Here are the two integer tags:

  • OP -- original alignment start
  • OE -- original alignment end

For all other reads, where this information can still be obtained through the CIGAR string (i.e. using getAlignmentStart() or getUnclippedStart()), these tags are not created.

The RR Tag

The RR tag holds the observed depth (after filters) of every base that contributed to a reduced read: that is, all bases that passed the mapping and base quality filters and matched the observation recorded in the reduced read.

The RR tag carries an array of bytes and, for better compression, works like this: the first number represents the depth of the first base in the reduced read, and every subsequent number represents the offset of that base's depth from the first. Therefore, to calculate the depth of base i (for i > 0) from the RR array, use:

RR[0] + RR[i]

For example, RR = [40, 0, -2, 3] encodes depths 40, 40, 38 and 43. Here is the code we use to return the depth of the i'th base:

// firstCount is RR[0] and offsetCount is RR[i]; the result is capped at Byte.MAX_VALUE (127)
return (i == 0) ? firstCount : (byte) Math.min(firstCount + offsetCount, Byte.MAX_VALUE);


Using Synthetic Reads with GATK tools

The GATK is 100% compatible with synthetic reads. You can use reduced BAM files in combination with non-reduced BAM files in any GATK analysis tool and it will work seamlessly.

Programming in the GATK

If you are programming using the GATK framework, the GATKSAMRecord class carries all the necessary functionality to use synthetic reads transparently, with methods like:

  • public final byte getReducedCount(final int i)
  • public int getOriginalAlignmentStart()
  • public int getOriginalAlignmentEnd()
  • public boolean isReducedRead()

WARNING: unfortunately we do not have the resources to directly support the HLA typer at this time. As such this tool is no longer under active development or supported by our group. The source code is available in the GATK as is. This tool may or may not work without substantial experimentation by an analyst.

Introduction

Inherited DNA sequence variation in the major histocompatibility complex (MHC) on human chromosome 6 significantly influences the inherited risk for autoimmune diseases and the host response to pathogenic infections. Collecting allelic sequence information at the classical human leukocyte antigen (HLA) genes is critical for matching in organ transplantation and for genetic association studies, but is complicated due to the high degree of polymorphism across the MHC. Next-generation sequencing offers a cost-effective alternative to Sanger-based sequencing, which has been the standard for classical HLA typing. To bridge the gap between traditional typing and newer sequencing technologies, we developed a generic algorithm to call HLA alleles at 4-digit resolution from next-generation sequence data.

Downloading the HLA tools

The HLA-specific walkers/tools (FindClosestHLA, CalculateBaseLikelihoods, and HLACaller) are available as a separate download from our FTP site and as source code only. Instructions for obtaining and compiling them are as follows:

Download the source code (in a tar ball):

location: ftp://gsapubftp-anonymous@ftp.broadinstitute.org
password: <blank>
subdirectory: HLA/

Untar the file, 'cd' into the untar'ed directory and compile with 'ant'
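A minimal sketch of those steps, assuming the tarball is named HLA.tar.gz (check the actual file name in the HLA/ subdirectory on the FTP site):

tar xzf HLA.tar.gz
cd HLA
ant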

Remember that we no longer support this tool, so if you encounter issues with any of these steps please do NOT post them to our support forum.

The algorithm

Algorithmic components of the HLA caller:

HLA caller

The HLA caller algorithm, developed as part of the open-source GATK, takes SAM/BAM-formatted files as input, examines sequence reads aligned to the classical HLA loci, and calculates, for each locus, the posterior probabilities for all pairs of classical alleles based on three key considerations: (1) genotype calls at each base position, (2) phase information of nearby variants, and (3) population-specific allele frequencies. The output of the algorithm is a list of HLA allele pairs with the highest posterior probabilities.

Functionally, the HLA caller was designed to run in three steps:

  1. The "FindClosestAllele" walker detects misaligned reads by comparing each read to the dictionary of HLA alleles (reads with < 75% SNP homology to the closest matching allele are removed)

  2. The "CalculateBaseLikelihoods" walker calculates the likelihoods for each genotype at each position within the HLA loci and finds the polymorphic sites in relation to the reference

  3. The "HLAcaller" walker reads the output of the previous steps, and makes the likelihood / probability calculations based on base genotypes, phase information, and allele frequencies.

Required inputs

These inputs are required:

  • Aligned sequence (.bam) file - input data
  • Genomic reference (.fasta) file - human genome build 36.
  • HLA exons (HLA.intervals) file - list of HLA loci / exons to examine.
  • HLA dictionary - list of HLA alleles, DNA sequences, and genomic positions.
  • HLA allele frequencies - allele frequencies for HLA alleles across multiple populations.
  • HLA polymorphic sites - list of polymorphic sites (used by FindClosestHLA walker)

You can download the latter four here.

Usage and Arguments

Standard GATK arguments (applicable to all subsequent functions)

The GATK contains a wealth of tools for analysis of sequencing data. Required inputs include an aligned bam file and reference fasta file. The following example shows how to calculate depth of coverage.

Usage:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I input.bam -R ref.fasta -L input.intervals > output.doc

Arguments:

  • -T (required) name of walker/function
  • -I (required) Input (.bam) file.
  • -R (required) Genomic reference (.fasta) file.
  • -L (optional) Interval or list of genomic intervals to run the genotyper on.

FindClosestHLA

The FindClosestHLA walker traverses each read, compares it to all overlapping HLA alleles (at specific polymorphic sites), and identifies the closest matching alleles. This is useful for detecting misalignments (low concordance with best-matching alleles), and it helps narrow the list of candidate alleles for subsequent analysis by the HLACaller walker (narrowing the search space reduces the computational burden). Inputs include the HLA dictionary, a list of polymorphic sites in the HLA, and the exons of interest. Output is a file (output.filter) that includes the closest matching alleles and statistics for each read.

Usage:

java -jar GenomeAnalysisTK.jar -T FindClosestHLA -I input.bam -R ref.fasta -L HLA_EXONS.intervals -HLAdictionary HLA_DICTIONARY.txt \ 
    -PolymorphicSites HLA_POLYMORPHIC_SITES.txt -useInterval HLA_EXONS.intervals | grep -v INFO > output.filter

Arguments:

  • -HLAdictionary (required) HLA_DICTIONARY.txt file
  • -PolymorphicSites (required) HLA_POLYMORPHIC_SITES.txt file
  • -useInterval (required) HLA_EXONS.intervals file

CalculateBaseLikelihoods

The CalculateBaseLikelihoods walker traverses each base position to determine the likelihood for each of the 10 diploid genotypes. These calculations are used later by HLACaller to determine likelihoods for HLA allele pairs based on genotypes, as well as to determine the polymorphic sites used in the phasing algorithm. Inputs include the aligned bam file, (optionally) the results of FindClosestHLA (to remove misalignments), and cutoff values for inclusion or exclusion of specific reads. Output is a file (output.baselikelihoods) that contains base likelihoods at each position.

Usage:

java -jar GenomeAnalysisTK.jar -T CalculateBaseLikelihoods -I input.bam -R ref.fasta -L HLA_EXONS.intervals -filter output.filter \
-maxAllowedMismatches 6 -minRequiredMatches 0  | grep -v "INFO"  | grep -v "MISALIGNED" > output.baselikelihoods

Arguments:

  • -filter (optional) file = output of FindClosestHLA walker (output.filter - to exclude misaligned reads in genotype calculations)
  • -maxAllowedMismatches (optional) max number of mismatches tolerated between a read and the closest allele (default = 6)
  • -minRequiredMatches (optional) min number of base matches required between a read and the closest allele (default = 0)

HLACaller

The HLACaller walker calculates the likelihoods for observing pairs of HLA alleles given the data, based on genotype, phasing, and allele frequency information. It traverses each read as part of the phasing algorithm to determine likelihoods based on phase information. The inputs include an aligned bam file, the outputs of FindClosestHLA and CalculateBaseLikelihoods, the HLA dictionary and allele frequencies, and optional cutoffs for excluding specific reads due to misalignment (maxAllowedMismatches and minRequiredMatches).

Usage:

java -jar GenomeAnalysisTK.jar -T HLACaller -I input.bam -R ref.fasta -L HLA_EXONS.intervals -filter output.filter -baselikelihoods output.baselikelihoods\
-maxAllowedMismatches 6 -minRequiredMatches 5  -HLAdictionary HLA_DICTIONARY.txt -HLAfrequencies HLA_FREQUENCIES.txt | grep -v "INFO"  > output.calls

Arguments:

  • -baseLikelihoods (required) output of CalculateBaseLikelihoods walker (output.baselikelihoods - genotype likelihoods / list of polymorphic sites from the data)
  • -HLAdictionary (required) HLA_DICTIONARY.txt file
  • -HLAfrequencies (required) HLA_FREQUENCIES.txt file
  • -useInterval (required) HLA_EXONS.intervals file
  • -filter (optional) file = output of FindClosestHLA walker (to exclude misaligned reads in genotype calculations)
  • -maxAllowedMismatches (optional) max number of mismatched bases tolerated between a read and the closest allele (default = 6)
  • -minRequiredMatches (optional) min number of base matches required between a read and the closest allele (default = 5)
  • -minFreq (optional) minimum allele frequency required to consider the HLA allele (default = 0.0).

Example workflow (genome-wide HiSeq data in NA12878 from HapMap; computations were performed on the Broad servers)

Extract sequences from the HLA loci and make a new bam file:

use Java-1.6

set HLA=/seq/NKseq/sjia/HLA_CALLER
set GATK=/seq/NKseq/sjia/Sting/dist/GenomeAnalysisTK.jar
set REF=/humgen/1kg/reference/human_b36_both.fasta

cp $HLA/samheader NA12878.HLA.sam

java -jar $GATK -T PrintReads \
-I /seq/dirseq/ftp/NA12878_exome/NA12878.bam -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta \
-L $HLA/HLA.intervals | grep -v RESULT | sed 's/chr6/6/g' >> NA12878.HLA.sam

/home/radon01/sjia/bin/SamToBam.csh NA12878.HLA

Use FindClosestHLA to find closest matching HLA alleles and to detect possible misalignments:

java -jar $GATK -T FindClosestHLA -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals -useInterval $HLA/HLA_EXONS.intervals \
-HLAdictionary $HLA/HLA_DICTIONARY.txt -PolymorphicSites $HLA/HLA_POLYMORPHIC_SITES.txt | grep -v INFO > NA12878.HLA.filter

READ_NAME   START-END       S   %Match  Matches Discord Alleles
20FUKAAXX100202:7:63:8309:75917 30018423-30018523   1.0 1.000   1   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,...
20GAVAAXX100126:3:24:13495:18608    30018441-30018541   1.0 1.000   3   0   HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,...
20FUKAAXX100202:8:44:16857:92134    30018442-30018517   1.0 1.000   1   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,HLA_A*010105,...
20FUKAAXX100202:8:5:4309:85338  30018452-30018552   1.0 1.000   3   0   HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,HLA_A*110105,...
20GAVAAXX100126:3:28:7925:160832    30018453-30018553   1.0 1.000   3   0   HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,HLA_A*110105,...
20FUKAAXX100202:1:2:10539:169258    30018459-30018530   1.0 1.000   1   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,.
20FUKAAXX100202:8:43:18611:44456    30018460-30018560   1.0 1.000   3   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,...

Use CalculateBaseLikelihoods to determine genotype likelihoods at every base position:

java -jar $GATK -T CalculateBaseLikelihoods -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals \
-filter NA12878.HLA.filter -maxAllowedMismatches 6 -minRequiredMatches 0 | grep -v INFO  | grep -v MISALIGNED > NA12878.HLA.baselikelihoods

chr:pos     Ref Counts          AA  AC  AG  AT  CC  CG  CT  GG  GT  TT
6:30018513  G   A[0]C[0]T[1]G[39]   -113.58 -113.58 -13.80  -113.29 -113.58 -13.80  -113.29 -3.09   -13.50  -113.11
6:30018514  C   A[0]C[39]T[0]G[0]   -119.91 -13.00  -119.91 -119.91 -2.28   -13.00  -13.00  -119.91 -119.91 -119.91
6:30018515  T   A[0]C[0]T[39]G[0]   -118.21 -118.21 -118.21 -13.04  -118.21 -118.21 -13.04  -118.21 -13.04  -2.35
6:30018516  C   A[0]C[38]T[1]G[0]   -106.91 -13.44  -106.91 -106.77 -3.05   -13.44  -13.30  -106.91 -106.77 -106.66
6:30018517  C   A[0]C[38]T[0]G[0]   -103.13 -13.45  -103.13 -103.13 -3.64   -13.45  -13.45  -103.13 -103.13 -103.13
6:30018518  C   A[0]C[38]T[0]G[0]   -112.23 -12.93  -112.23 -112.23 -2.71   -12.93  -12.93  -112.23 -112.23 -112.23
...

Run HLACaller using outputs from previous steps to determine the most likely alleles at each locus:

java -jar $GATK -T HLACaller -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals -useInterval $HLA/HLA_EXONS.intervals \
-bl NA12878.HLA.baselikelihoods -filter NA12878.HLA.filter -maxAllowedMismatches 6 -minRequiredMatches 5 \
-HLAdictionary $HLA/HLA_DICTIONARY.txt -HLAfrequencies $HLA/HLA_FREQUENCIES.txt > NA12878.HLA.info

grep -v INFO NA12878.HLA.info > NA12878.HLA.calls

Locus   A1  A2  Geno    Phase   Frq1    Frq2    L   Prob    Reads1  Reads2  Locus   EXP White   Black   Asian
A   0101    1101    -1229.5 -15.2   -0.82   -0.73   -1244.7 1.00    180 191 229 1.62    -1.99   -3.13   -2.07
B   0801    5601    -832.3  -37.3   -1.01   -2.15   -872.1  1.00    58  59  100 1.17    -3.31   -4.10   -3.95
C   0102    0701    -1344.8 -37.5   -0.87   -0.86   -1384.2 1.00    91  139 228 1.01    -2.35   -2.95   -2.31
DPA1    0103    0201    -842.1  -1.8    -0.12   -0.79   -846.7  1.00    72  48  120 1.00    -0.90   -INF    -1.27
DPB1    0401    1401    -991.5  -18.4   -0.45   -1.55   -1010.7 1.00    64  48  113 0.99    -2.24   -3.14   -2.64
DQA1    0101    0501    -1077.5 -15.9   -0.90   -0.62   -1095.4 1.00    160 77  247 0.96    -1.53   -1.60   -1.87
DQB1    0201    0501    -709.6  -18.6   -0.77   -0.76   -729.7  0.95    50  87  137 1.00    -1.76   -1.54   -2.23
DRB1    0101    0301    -1513.8 -317.3  -1.06   -0.94   -1832.6 1.00    52  32  101 0.83    -1.99   -2.83   -2.34

Make a SAM/BAM file of the called alleles:

awk '{if (NR > 1){print $1 "*" $2 "\n" $1 "*" $3}}' NA12878.HLA.calls | sort -u > NA12878.HLA.calls.unique
cp $HLA/samheader NA12878.HLA.calls.sam
awk '{split($1,a,"*"); print "grep \"" a[1] "[*]" a[2] "\" '$HLA/HLA_DICTIONARY.sam' >> 'NA12878.HLA'.tmp";}' NA12878.HLA.calls.unique | sh
sort -k4 -n NA12878.HLA.tmp >> NA12878.HLA.calls.sam
/home/radon01/sjia/bin/SamToBam.csh NA12878.HLA.calls
rm NA12878.HLA.tmp

Performance Considerations / Tradeoffs

There exist a few performance / accuracy tradeoffs in the HLA caller, as in any algorithm. The following are a few key considerations that the user should keep in mind when using the software for HLA typing.

Robustness to sequencing/alignment artifact vs. Ability to recognize rare alleles

In polymorphic regions of the genome like the HLA, misaligned reads (presence of erroneous reads or lack of proper sequences) and sequencing errors (indels, systematic PCR errors) may cause the HLA caller to call rare alleles with polymorphisms at the affected bases. The user can manually spot these errors when the algorithm calls a rare allele (the Frq1 and Frq2 columns in the output of HLACaller indicate log10 of the allele frequencies). Alternatively, the user can choose to consider only non-rare alleles (use the "-minFreq 0.001" option in HLACaller) to make the algorithm (faster and) more robust against sequencing or alignment errors. The drawback to this approach is that the algorithm may not be able to correctly identify rare alleles when they are truly present. We recommend using the -minFreq option for genome-wide sequencing datasets, but not for high-quality data (e.g. targeted PCR 454) captured specifically for HLA typing in large cohorts.
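For example, here is a hedged HLACaller invocation that restricts the search to non-rare alleles; arguments are as described in the Usage section above, and file names are placeholders:

java -jar GenomeAnalysisTK.jar -T HLACaller -I input.bam -R ref.fasta -L HLA_EXONS.intervals -useInterval HLA_EXONS.intervals \
    -filter output.filter -baseLikelihoods output.baselikelihoods -HLAdictionary HLA_DICTIONARY.txt \
    -HLAfrequencies HLA_FREQUENCIES.txt -minFreq 0.001 | grep -v "INFO" > output.calls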

Misalignment Detection and Data Pre-Processing

The FindClosestHLA walker (optional step) is recommended for two reasons:

  1. The ability to detect misalignments for reads that don't match well to the closest-matching HLA allele - removing these misaligned reads improves calling accuracy.

  2. Creating a list of closest-matching HLA alleles reduces the search space (over 3,000 HLA alleles across the class I and class II loci) that HLACaller has to iterate through, reducing the computational burden.

However, using this pre-processing step is not without costs:

  1. Any cutoff chosen for %concordance, min base matches, or max base mismatches will not distinguish between correctly aligned and misaligned reads 100% of the time - there is a chance that correctly aligned reads may be removed, and misaligned reads not removed.

  2. The list of closest-matching alleles may in some cases not contain the true allele if there is sufficient sequencing error, in which case the true allele will not be considered by the HLACaller walker. In our experience, the advantages of using this pre-processing step greatly outweigh the disadvantages, and we recommend including it in the pipeline as long as the user understands the possible risks of using this function.

Contributions

The HLA caller algorithm was developed by Xiaoming (Sherman) Jia with the generous support of the GATK team (especially Mark Depristo and Eric Banks), and Paul de Bakker.

  • xiaomingjia at gmail dot com
  • depristo at broadinstitute dot org
  • ebanks at broadinstitute dot org
  • pdebakker at rics dot bwh dot harvard dot edu


Introduction

Processing data originating from the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented on numerous occasions. The guidelines we describe in this document are the result of a systematic technology development experiment on several datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than alternative pipelines available to date (September 2011) for the datasets tested, but there is no guarantee that they will be the best for every dataset, or that other pipelines won't supersede them in the future.

The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and relax the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea of what parameters to use, check the PacBio-specific suggestions given by the BWA author on the BWA manual page. The goal is to account for the known error modes of the Pacific Biosciences RS and to benefit from the long reads for a high-scoring overall match. (For older versions, you can use the filtered_subreads.fasta file and combine it with the base quality scores extracted from the h5 files using the Pacific Biosciences SMRT pipeline python tools.)
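For instance, the BWA manual has suggested bwasw settings along these lines for PacBio subreads; treat them as a hedged starting point to tune on your own data (ref.fasta and the output file name are placeholders):

bwa index ref.fasta
bwa bwasw -b5 -q2 -r1 -z10 ref.fasta filtered_subreads.fq > pacbio.sam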

To produce a BAM file that is sorted by coordinate and has adequate read group information, we use the Picard tools SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practice to have your BAM file conform to these specifications.
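A hedged sketch of those two Picard steps (jar paths per your install; read group values are placeholders):

java -jar SortSam.jar INPUT=pacbio.sam OUTPUT=pacbio.sorted.bam SORT_ORDER=coordinate
java -jar AddOrReplaceReadGroups.jar INPUT=pacbio.sorted.bam OUTPUT=pacbio.rg.bam \
    RGID=run1 RGSM=sample1 RGLB=lib1 RGPL=PACBIO RGPU=run1 SORT_ORDER=coordinate CREATE_INDEX=true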

Best Practices for Variant Calling

Once we have a proper BAM file, it is important to recalibrate the base quality scores empirically, using a known callset (e.g. the latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according to the type of data you have, with the exception of indel realignment, because that tool has not been adapted for Pacific Biosciences RS data.
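A sketch of the recalibration with the covariates named above, using the GATK 1.x walkers of that era (CountCovariates / TableRecalibration); argument names may differ in your GATK version, so check its documentation:

java -jar GenomeAnalysisTK.jar -T CountCovariates -R ref.fasta -I pacbio.rg.bam -knownSites dbsnp.vcf \
    -cov ReadGroupCovariate -cov QualityScoreCovariate -cov DinucCovariate -recalFile pacbio.recal.csv
java -jar GenomeAnalysisTK.jar -T TableRecalibration -R ref.fasta -I pacbio.rg.bam \
    -recalFile pacbio.recal.csv -o pacbio.recal.bam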

Problems with Variant Calling with Pacific Biosciences

  • Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

  • Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score, and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here; you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold, and modify as necessary. (A sketch covering this and the previous point appears after this list.)

  • Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. This aligns most of the reads successfully, but it has the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note, however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Biosciences data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.
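To make the first two adjustments above concrete, here is a hedged UnifiedGenotyper sketch that lowers the minimum base quality cutoff and raises the tolerated fraction of reads with spanning deletions; the values are illustrative starting points, not recommendations, and the flag names should be checked against your GATK version's documentation:

java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref.fasta -I pacbio.recal.bam \
    -glm BOTH -mbq 10 -deletions 0.50 -o pacbio.calls.vcf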


Please note that the DataProcessingPipeline qscript is no longer available.

The DPP script was only provided as an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have taken the difficult decision of removing the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what the processing steps are and what parameters you need to set/adjust.

Data Processing Pipeline

The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.

Introduction

Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking and base quality score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis-ready BAM files (one per sample sequenced).

Requirements

This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.

Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs.

Command-line arguments

Required Parameters

  • -i / --input <BAM file / BAM list> - input BAM file, or list of BAM files.
  • -R / --reference <fasta> - reference fasta file.
  • -D / --dbsnp <vcf> - dbSNP ROD to use (must be in VCF format).

Optional Parameters

  • -indels / --extra_indels <vcf> - VCF files to use as reference indels for Indel Realignment.
  • -bwa / --path_to_bwa <path> - path to the bwa binary (usually BAM files have already been mapped, but use this option if you want to remap).
  • -outputDir / --output_directory <path> - output path for the processed BAM files.
  • -L / --gatk_interval_string <GATK interval string> - the -L interval string to be used by the GATK (output BAMs at this interval only).
  • -intervals / --gatk_interval_file <GATK interval file> - an intervals file to be used by the GATK (output BAMs at these intervals only).

Modes of Operation (also optional parameters)

  • -p / --project <name> - the project name determines the final output (BAM file) base name; for example, NA12878 yields NA12878.processed.bam.
  • -knowns / --knowns_only - perform cleaning on known indels only.
  • -sw / --use_smith_waterman - perform cleaning using Smith-Waterman.
  • -bwase / --use_bwa_single_ended - decompose the input BAM file and fully realign it using BWA, assuming single-ended reads.
  • -bwape / --use_bwa_pair_ended - decompose the input BAM file and fully realign it using BWA, assuming pair-ended reads.

The Pipeline

Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis-ready reads (bam file):

[flowchart: the data processing pipeline]

Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:

BWA alignment

This option is for datasets that have already been processed using a different pipeline or different criteria, and that you want to reprocess using this pipeline. One example is a BAM file that has been processed at the lane level, or that did not go through some of the best practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample-level bams and entering the pipeline.

Sample Level Processing

This is where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.

Indel Realignment

This is a two-step process. First we create targets using the Realigner Target Creator (either for known indels only, or including indels seen in the data), then we realign the targets using the Indel Realigner (see [Local realignment around indels]), with optional Smith-Waterman realignment. The Indel Realigner also fixes mate-pair information for reads that get realigned.

Base Quality Score Recalibration

This is a crucial step that re-adjusts the quality scores using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate and Dinucleotide Covariate.

The Outputs

The Data Processing Pipeline produces three types of output for each sample: a fully processed BAM file, a validation report on the input and output BAM files, and an analysis of the base quality scores before and after recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.

Processed Bam File

The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.

Validation Files

We validate each unprocessed sample-level BAM file and each final processed sample-level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't enforce that the input BAM pass all validation, but we provide the log of the validation as an informative companion to your input. The validation files are named <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.

Notice that even if your BAM file fails validation, the pipeline can still run through successfully. The validation is a strict report on the state of your BAM file; some errors are not critical, and the output files (both pre.validation and post.validation) should give you guidance on how to better organize your dataset in the BAM format.
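Outside the pipeline, the same check can be run by hand with Picard's ValidateSamFile; a minimal sketch (jar path and file names are placeholders):

java -jar ValidateSamFile.jar INPUT=mySample.bam OUTPUT=mySample.validation MODE=VERBOSE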

Base Quality Score Recalibration Analysis

PDF plots of the base qualities are generated before and after recalibration for further analysis on the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named : <project name>.<sample name>.pre and <project name>.<sample name>.post.

Examples

  1. Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)

    java \
      -Xmx4g \
      -Djava.io.tmpdir=/path/to/tmpdir \
      -jar path/to/GATK/Queue.jar \
      -S path/to/DataProcessingPipeline.scala \
      -p myFancyProjectName \
      -i myDataSet.list \
      -R reference.fasta \
      -D dbSNP.vcf \
      -run

  2. Performing realignment and the full data processing pipeline in one pair-ended bam file

    java \
      -Xmx4g \
      -Djava.io.tmpdir=/path/to/tmpdir \
      -jar path/to/Queue.jar \
      -S path/to/DataProcessingPipeline.scala \
      -bwa path/to/bwa \
      -i test.bam \
      -R reference.fasta \
      -D dbSNP.vcf \
      -p myProjectWithRealignment \
      -bwape \
      -run
