I would like to ask you about some inconsistencies (maybe!) that I am observing regarding the downsampling of reads in GATK. This is the first time I am dealing with real high coverage (several 1000/sample) and it starts to matter for me to know how to adjust these parameters. All of the below is from GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar.
I was anticipating the need for some tuning, so I read the CommandLine GATK documentation, where three options related to this issue are listed: --downsample_to_coverage --downsample_to_fraction --downsampling_type All of them have NA as default value, so I assumed, that all my reads would count for the variant calling.
In the documentation for the UnifiedGenotyper, there are no options related to downsampling. My command: java -jar GenomeAnalysisTKLite.jar -T UnifiedGenotyper -R ref.fa -I 1.bam -I 2.bam -o GATK.vcf -glm BOTH -stand_call_conf 1 -stand_emit_conf 1
Neither are in the VariantAnnotator which I used subsequently on the UG output. My command: java -jar GenomeAnalysisTKLite.jar -T VariantAnnotator -R ref.fa -I 1.bam -I 2.bam -o GATK_anno.vcf --variant GATK.vcf -A DepthPerAlleleBySample -A BaseCounts -A AlleleBalanceBySample -A AlleleBalance -A DepthOfCoverage -A SampleList
Nevertheless, my variant-called and annotated vcf file features the following:
UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA RD gatk_key=null read_filter= intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null [etc]
VariantAnnotator="analysis_type=VariantAnnotator input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA RD gatk_key=null read_filter= intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null [etc]
Due to the different thresholds applied by UG and VA, I have now these inconsistenf values for per-sample AD and DP.
Is this supposed to work like this, or am I using a "non-cannonical" sequence of operations by following up the UG with the VA? Is there a way to switch OFF downsampling?
Many thanks as always for your comments!
I'm running the Haplotype Caller on 80 bams, multithreaded (-nt 14), with either vcfs or bed files as the intervals file. I am trying to downsample to approximately 200 reads (-dcov 200). However, my output vcfs clearly have much higher depth (over 300-500 reads) in some areas. Also, HC drastically slows down whenever it hits a region of very deep pile-ups (over 1000 reads).
Is there any way to speed up HC on regions of very deep pileups? Is there a way to make downsampling stricter?
Thank you for your help. Elise
I haven't been using GATK for long, but I assumed that downsample_to_coverage feature wouldn't ever be a cause for concern. I just tried running UnifiedGenotyper with -dcov set at 500, 5,000, and 50,000 on the same 1-sample BAM file. One would expect the results to be similar. However, 500 yielded 26 variants, 5,000 yielded 13, and 50,000 yielded just 1. Depth of that one variant was about 1,300 in the 50,000 cutoff. Why are the results so different?
Most of the other variants are in the biggest set were cut off at 500, so some reads were filtered. A few of them are at relatively low frequency, but most are at 25% or higher. If they are appearing by chance, they should not be at such high frequencies.
In addition, there are some variants that are below 500, so they should not be affected by the cutoff. Why are those showing up with the low cutoff and not the higher cutoff?
I am using GATK 2.1-8. I am looking at a single gene only, so that is why there are so few variants and such high coverage.
I'm trying to print a bam file downsampled to a certain coverage at given loci. What I've tried thus far is to use PrintReads with -dcov XXX, but as the documentation states, it will downsample to that coverage based on the beginning of the aligned read. Is there a way to get a bam file that has been downsampled to a (nearly) precise coverage at specified loci?
I am not using GATK to call variants or any other purpose, so I can't do it while calling variants.
Thanks for your help!