Whole genome, deep coverage v1

From GSA

(Redirected from Whole genome, deep coverage)
Jump to: navigation, search

Warning: the material on this page is considered out of date by the GSA team.


The current version is Best Practice Variant Detection with the GATK v3


Our current best practice for making SNP calls in whole genome, deep coverage data is 5 steps: base quality recalibration, indel realignment, indel calling, SNP calling, and SNP filtering. Below, we provide the precise commands that we used on a recent project for this pipeline; however, please be aware that the specific values used in each of the commands are not necessarily right for each project. Care should be taken by the analyst running our tools to understand what each parameter does and to evaluate which value best fits his/her data.


Contents

Base Quality Score Recalibration

[see here for details]

Bam files produced through e.g. the sequencing pipeline at the Broad or published by the 1000 Genomes Project have already been base-quality-score recalibrated. If that is the case with your particular bam file, you can skip to the Indel Realignment section. The commands provided below use the standard values from the Broad production pipeline.

CountCovariates

We determine (or 'count') the covariates affecting base quality scores in the initial BAM file, emitting a CSV file when done.

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -R resources/Homo_sapiens_assembly18.fasta \
  --DBSNP resources/dbsnp_129_hg18.rod \
  -I original.bam \
  -T CountCovariates \ 
   -cov ReadGroupCovariate \
   -cov QualityScoreCovariate \
   -cov CycleCovariate \
   -cov DinucCovariate \
   -recalFile recal_data.csv

TableRecalibration

After counting covariates in the initial BAM file, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field), using the data from the CSV file, into a new BAM file.

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -R resources/Homo_sapiens_assembly18.fasta \
  -I original.bam \
  -T TableRecalibration \
  -outputBam recalibrated.bam \
  -recalFile recal_data.csv


Local Realignment Around Indels

[see here for details]

The Indel Realigner (or 'Cleaner') has been tested both with shorter reads (on which it performs very well) and will reads >= 100bp (on which its performance is satisfactory). It is becoming apparent that the realigner has trouble with indels longer than ~10-20bp (which generally don't align with reads < 100bp). Until the next version of the realigner is ready, our current practice is both to 'clean' in this step and also to filter out SNPs near called indels (in the last step).

Creating Intervals

First, we determine (small) suspicious intervals which are likely in need of realignment.

 java -jar GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -I recalibrated.bam \
  -R resources/Homo_sapiens_assembly18.fasta \
  -o forRealigner.intervals \
  -D resources/dbsnp_129_hg18.rod

Realigning

Next, we actually run the realigner over the intervals to create a 'cleaned' version of the bam file.

java -Djava.io.tmpdir=/path/to/tmpdir \  [this argument is highly recommended]
  -jar GenomeAnalysisTK.jar \
  -I recalibrated.bam \
  -R resources/Homo_sapiens_assembly18.fasta \
  -T IndelRealigner \
  -targetIntervals forRealigner.intervals \
  --output cleaned.bam \
  -D resources/dbsnp_129_hg18.rod

Fixing Mates

[see here for details]

Making Indel Calls

[see here for details]

IndelGenotyper

We first run the genotyper to generate raw indel calls.

java -jar GenomeAnalysisTK.jar \
   -T IndelGenotyperV2 \
   -R resources/Homo_sapiens_assembly18.fasta \
   -I cleaned.bam \
   -O indels.raw.bed \
   -o detailed.output.bed \
   --verbose \
   -mnr 1000000  

Post-processing filter

We then need to filter the raw indel calls to get the final set of confident calls. Note that this tool is only available in the full SVN checkout of the GATK.

perl/filterSingleSampleCalls.pl \
   --calls detailed.output.bed \
   --max_cons_av_mm 3.0 \
   --max_cons_nqs_av_mm 0.5 \
   --mode ANNOTATE > indels.filtered.bed


Making SNP Calls

[see here for details]

To make SNP calls, we run the Unified Genotyper.

java -jar GenomeAnalysisTK.jar \
 -R resources/Homo_sapiens_assembly18.fasta \
 -T UnifiedGenotyper \
 -I cleaned.bam \
 -D resources/dbsnp_129_hg18.rod \
 -varout snps.raw.vcf \
 -stand_call_conf 30.0

Filtering Your SNP Calls

[see here for details]

Some pre-processing

We need to get the intervals over which to mask out SNPs near indels. To do this, we have a script that creates a bed file representing the masking intervals based on the output of the Indel Genotyper. Note that this tool is only available in the full SVN checkout of the GATK, although the strategy is simple: for each indel, create an interval which extends N bases to either side of it.

python python/makeIndelMask.py <raw_indels> <mask_window> <output>
e.g.
python python/makeIndelMask.py indels.raw.bed 10 indels.mask.bed

Filtering

Finally, we need to filter the SNP callset for those calls with suspicious attributes (e.g. their proximity to indel calls or strange allele balance). In this step especially it is crucial to choose filter parameters geared to your own particular data set. For example, filtering calls at loci with depth > 100 made sense in the example chosen below, but it might not make sense for your data set.

java -jar GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R resources/Homo_sapiens_assembly18.fasta \
  -o snps.filtered.vcf \
  -B:variant,VCF snps.raw.vcf \
  -B:mask,Bed indels.mask.bed \
  --maskName InDel \
  --clusterWindowSize 10 \
  --filterExpression "AB > 0.75 && DP > 40 || DP > 100 || MQ0 > 40 || SB > -0.10" \
  --filterName "StandardFilters"
  --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
  --filterName "HARD_TO_VALIDATE"
Personal tools