Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.
Note that there is also a proportional "downsample to fraction" mechanism that is mostly intended for testing the effect of different overall coverage means on analysis results.
See below for details of how this is implemented and controlled in GATK.
The principle of this downsampling type is to downsample reads to a given capping threshold coverage. Its purpose is to get rid of excessive coverage, because above a certain depth, having additional data is not informative and imposes unreasonable computational costs. The downsampling process takes two different forms depending on the type of analysis it is used with. For locus-based traversals (LocusWalkers like UnifiedGenotyper and ActiveRegionWalkers like HaplotypeCaller), downsample_to_coverage controls the maximum depth of coverage at each locus. For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position. For ReadWalkers you will typically need to use much lower dcov values than you would with LocusWalkers to see an effect. Note that this downsampling option does not produce an unbiased random sampling from all available reads at each locus: instead, the primary goal of the to-coverage downsampler is to maintain an even representation of reads from all alignment start positions when removing excess coverage. For a truly unbiased random sampling of reads, use -dfrac instead. Also note that the coverage target is an approximate goal that is not guaranteed to be met exactly: the downsampling algorithm will under some circumstances retain slightly more or less coverage than requested.
The GATK's default downsampler (invoked by
-dcov) exhibits the following properties:
By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.
From the command line:
To modify the walker's default behavior:
by=<value>. Override the downsampling depth by changing the
The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:
For each sample:
Now walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.
Reads will be downsampled so the specified fraction remains; e.g. if you specify -dfrac 0.25, three-quarters of the reads will be removed, and the remaining one quarter will be used in the analysis. This method of downsampling is truly unbiased and random. It is typically used to simulate the effect of generating different amounts of sequence data for a given sample. For example, you can use this in a pilot experiment to evaluate how much target coverage you need to aim for in order to obtain enough coverage in all loci of interest.
I am doing an analysis on chimp variants, and do not have enough variants to train the model in order to perform VQSR. I am using hard filters at this point, as per the recommendation. I keep reading that the recommended cutoff values are only recommendations and that I should expect to tweak the parameters further. However, how do I do this in an effective manner? I'm just not sure how to approach exploring the parameter space.
Should I try out different values around the recommended cutoffs and, down the line, analyze how many mutations I get/calculate certain values for quality control and check that they agree with what is expected (CpG transition rate, and so on)? Is there a way to inspect the data to see whether it is behaving well, or do I have to wait until down the line to see how things are going? Any insight into this would help tremendously.
Hi I am new to using HC. I am just using the HapMap sample NA12878 to validate my pipeline. I ran the following command on a BAM file that I generated. I noticed that certain SNPs were filtered out most of which I can understand the rationale.
java -Xms454m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/home/tx/tmpYdAKAC -jar /usr/local/share/gatk/3.4-46-gbc02625/GenomeAnalysisTK.jar \ -R /usr/local/share/genomes/Hsapiens/GRCh37/seq/GRCh37.fa \ -I /home/INDELanalysis/validationrun/NA12878-sort-11_0_15994663-prep.bam \ --dbsnp /usr/local/share/genomes/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz \ -L 11:1628982-1651500 \ -T HaplotypeCaller \ -o /home/INDELanalysis/validationrun/NA12878-11_0_15994663-raw.vcf.gz
I ran the same analysis using Samtools and the following SNP was in the final filtered output
Samtools output 11 1651228 rs71454096 C G 117 PASS AC=1;AN=2;BQB=0.61207;BaseQRankSum=-0.458;DB;DP=21;DP4=11,0,5,3;FS=7.225;GC=71.29;HOB=0.5;HRun=3;ICB=1;MQ=56.97;MQ0=0;MQ0F=0;MQB=0.395294;MQRankSum=-1.091;MQSB=0.798516;QD=5.57;RPB=0.902014;ReadPosRankSum=-1.162;SGB=-0.651104;VDB=0.0103095;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCg/gGg|A53G|237|KRTAP5-5|protein_coding|CODING|ENST00000399676|1|G) GT:DP:PL 0/1:19:150,0,158
However when I run it through HC with the above settings the above variant is filtered out of the analysis. When I ran HC with
--emitRefConfidence GVCF I got the following output for this variant:
HC output 11 1651228 rs71454096 C G,<NON_REF> 0 . DB;DP=20;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.22 GT:AD:DP:GQ:PGT:PID:PL:SB 0/0:10,0,0:10:34:0|1:1651120_G_T:0,34,442,34,442,442:10,0,0,0
I am a little confused why it got a QUAL score of 0. So looking nearby I also noticed that there were some sizeable INDELS
11 1651198 . GAGGCTGTGGGGGCTGTGGCTCCGGCTGTGC G,<NON_REF> 0 . DP=25;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.38 GT:AD:DP:GQ:PL:SB 0/0:14,0,0:14:37:0,37,585,42,589,594:12,2,0,0 11 1651199 rs71025763 A AGGCTGTGGCTCC,<NON_REF> 0 . DB;DP=18;MLEAC=0,0;MLEAF=0.00,0.00;MQ=57.74 GT:AD:DP:GQ:PL:SB 0/0:6,0,0:6:28:0,28,332,28,332,332:5,1,0,0
However neither of these INDELS were called either. Can anyone shed some light as to why nothing is called in this region? If I look at the site 11 1651228 in the bam file I can see that most of the calls were made on the forward strand so there is likely to be some strand bias here and in actual fact we can see that HC called the genotype 0/0 with only 10 reads contributing to this genotype call. I presume the remaining reads were filtered out. However when I look at the pileup for this position I don't understand why over half the reads were filtered
11 1651228 C CCCCGGGGCCCGGCCCGGCGG BBFIFB<FFIIFFFFFFFBBB
Any advice help would be much appreciated.
I have 51bp reads and am trying to filter out those where about 20-30 bases have been soft-clipped. The command I have been using is:
java -Xmx8g -jar $GATK_JAR -R $REFERENCE -T PrintReads -rf OverclippedRead --filter_is_too_short_value 40 -o /dev/stdout --disable_bam_indexing -I snippet.bam
Using GATK version v3.4-46, human reference hg19.
The problem I'm encountering is that this doesn't actually filter out any reads. I'm looking at many reads with 27 soft-clipped bases and 24 matches, but those are included in the output. Also the log is unambiguous:
-> 0 reads (0.00% of total) failing OverclippedReadFilter MicroScheduler - 0 reads were filtered out during the traversal out of approximately 2474 total reads (0.00%)
I am wondering how HaplotypeCaller in GATK 3.1 handles multimappers? I thought I read that they are passed over for variant calling but stay in the realigned, recalibrated BAM for 'completions sake', like marking duplicates not removing them but cannot find supporting info on the website or from farther afield,
Presumably it is the NotPrimaryAlignmentFilter but there is no info on that posted yet. I know I can output a BAM from HC with haplotype info in there but can I just get reads used in variant calls? Or should I trim the BAM myself to retain reads I want used? I do this for mark duplicates (removed) but for multimappers I would like to know how you define so I can do the same. The reason is for coverage estimates, using bamtobed or such means I take all realigned, recalibrated which is many more lines including multimappers which skews my results.
Hi, I would like to analyze a dataset consisting of RADseq (Restriction-site Associated DNA) tags from tumor and normal samples. By nature of the technique, all of the reads start at restriction enzyme cut sites in the genome - therefore the assumptions that mutations will be covered by reads from both directions and staggered with respect to position in the read are violated. Is there a way to override the strand bias and clustered position filters in the MuTect pipeline?
We've called variants from our own exome data sets. When compaired with the protocol of 1000 genomes, we found that they just marked duplicates and replicates after local realignment and bqsr rather than removing them before, which is the case of ours. Since we wanna use the CHB and CHS as controls, will the distinct calling strategy affect the final output a lot?