Tagged with #downsample
0 documentation articles | 0 events or announcements | 3 forum discussions


Sorry, there are no publicly available documents of this type with the tag #downsample. Try one of the other types.
Sorry, there are no publicly available documents of this type with the tag #downsample. Try one of the other types.

Dear all,

I would like to ask you about some inconsistencies (maybe!) that I am observing regarding the downsampling of reads in GATK. This is the first time I am dealing with real high coverage (several 1000/sample) and it starts to matter for me to know how to adjust these parameters. All of the below is from GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar.

  1. I was anticipating the need for some tuning, so I read the CommandLine GATK documentation, where three options related to this issue are listed: --downsample_to_coverage --downsample_to_fraction --downsampling_type All of them have NA as default value, so I assumed, that all my reads would count for the variant calling.

  2. In the documentation for the UnifiedGenotyper, there are no options related to downsampling. My command: java -jar GenomeAnalysisTKLite.jar -T UnifiedGenotyper -R ref.fa -I 1.bam -I 2.bam -o GATK.vcf -glm BOTH -stand_call_conf 1 -stand_emit_conf 1

  3. Neither are in the VariantAnnotator which I used subsequently on the UG output. My command: java -jar GenomeAnalysisTKLite.jar -T VariantAnnotator -R ref.fa -I 1.bam -I 2.bam -o GATK_anno.vcf --variant GATK.vcf -A DepthPerAlleleBySample -A BaseCounts -A AlleleBalanceBySample -A AlleleBalance -A DepthOfCoverage -A SampleList

Nevertheless, my variant-called and annotated vcf file features the following:

  1. In the header, I see the full set of options used for the UG and the annotations:

UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA RD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null [etc]

VariantAnnotator="analysis_type=VariantAnnotator input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA RD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null [etc]

  1. My INFO and FORMAT fields for one position look like this:

ABHom=0.997;AC=4;AF=1.00;AN=4;BaseCounts=1990,4,5,0;BaseQRankSum=-0.059;DP=2000;DS;Dels=0.00;FS=0.000;HaplotypeScore=6.1837;MLEAC=4;MLEAF=1.00;MQ=35 .00;MQ0=0;MQRankSum=1.551;OND=5.000e-03;QD=33.99;ReadPosRankSum=1.133;SB=-9.057e-03;Samples=F323,F325

GT:AD:DP:GQ:PL

1/1:1,998:250:99:8592,700,0

1/1:4,992:250:99:8405,719,0

Due to the different thresholds applied by UG and VA, I have now these inconsistenf values for per-sample AD and DP.

Is this supposed to work like this, or am I using a "non-cannonical" sequence of operations by following up the UG with the VA? Is there a way to switch OFF downsampling?

Many thanks as always for your comments!

cheers,

Sophia

I haven't been using GATK for long, but I assumed that downsample_to_coverage feature wouldn't ever be a cause for concern. I just tried running UnifiedGenotyper with -dcov set at 500, 5,000, and 50,000 on the same 1-sample BAM file. One would expect the results to be similar. However, 500 yielded 26 variants, 5,000 yielded 13, and 50,000 yielded just 1. Depth of that one variant was about 1,300 in the 50,000 cutoff. Why are the results so different?

Most of the other variants are in the biggest set were cut off at 500, so some reads were filtered. A few of them are at relatively low frequency, but most are at 25% or higher. If they are appearing by chance, they should not be at such high frequencies.

In addition, there are some variants that are below 500, so they should not be affected by the cutoff. Why are those showing up with the low cutoff and not the higher cutoff?

I am using GATK 2.1-8. I am looking at a single gene only, so that is why there are so few variants and such high coverage.

Hi,

I'm trying to print a bam file downsampled to a certain coverage at given loci. What I've tried thus far is to use PrintReads with -dcov XXX, but as the documentation states, it will downsample to that coverage based on the beginning of the aligned read. Is there a way to get a bam file that has been downsampled to a (nearly) precise coverage at specified loci?

I am not using GATK to call variants or any other purpose, so I can't do it while calling variants.

Thanks for your help!