Downsampling
Posted in Ask the GATK team | Last updated on 2012-07-31 16:53:56


Comments (3)

Dear all,

I would like to ask you about some inconsistencies (maybe!) that I am observing regarding the downsampling of reads in GATK. This is the first time I am dealing with real high coverage (several 1000/sample) and it starts to matter for me to know how to adjust these parameters. All of the below is from GenomeAnalysisTK-1.6.596/GenomeAnalysisTKLite.jar.

  1. I was anticipating the need for some tuning, so I read the CommandLine GATK documentation, where three options related to this issue are listed: --downsample_to_coverage --downsample_to_fraction --downsampling_type All of them have NA as default value, so I assumed, that all my reads would count for the variant calling.

  2. In the documentation for the UnifiedGenotyper, there are no options related to downsampling. My command: java -jar GenomeAnalysisTKLite.jar -T UnifiedGenotyper -R ref.fa -I 1.bam -I 2.bam -o GATK.vcf -glm BOTH -stand_call_conf 1 -stand_emit_conf 1

  3. Neither are in the VariantAnnotator which I used subsequently on the UG output. My command: java -jar GenomeAnalysisTKLite.jar -T VariantAnnotator -R ref.fa -I 1.bam -I 2.bam -o GATK_anno.vcf --variant GATK.vcf -A DepthPerAlleleBySample -A BaseCounts -A AlleleBalanceBySample -A AlleleBalance -A DepthOfCoverage -A SampleList

Nevertheless, my variant-called and annotated vcf file features the following:

  1. In the header, I see the full set of options used for the UG and the annotations:

UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA RD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null [etc]

VariantAnnotator="analysis_type=VariantAnnotator input_file=[F323/lib.218C/C0TV2ACXX_7_12_1.fastq_process_MQ35.bam, F325/lib.219C/C0TV2ACXX_7_13_1.fastq_process_MQ35.bam] read_buffer_size=null phone_home=STANDA RD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/project/production/Indexes/samtools/hsapiens_v37_chrM.fa no nDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null [etc]

  1. My INFO and FORMAT fields for one position look like this:

ABHom=0.997;AC=4;AF=1.00;AN=4;BaseCounts=1990,4,5,0;BaseQRankSum=-0.059;DP=2000;DS;Dels=0.00;FS=0.000;HaplotypeScore=6.1837;MLEAC=4;MLEAF=1.00;MQ=35 .00;MQ0=0;MQRankSum=1.551;OND=5.000e-03;QD=33.99;ReadPosRankSum=1.133;SB=-9.057e-03;Samples=F323,F325

GT:AD:DP:GQ:PL

1/1:1,998:250:99:8592,700,0

1/1:4,992:250:99:8405,719,0

Due to the different thresholds applied by UG and VA, I have now these inconsistenf values for per-sample AD and DP.

Is this supposed to work like this, or am I using a "non-cannonical" sequence of operations by following up the UG with the VA? Is there a way to switch OFF downsampling?

Many thanks as always for your comments!

cheers,

Sophia


Return to top Comment on this article in the forum