I have noticed some strange structure in the samples produced by the LevelingDownSampler class and was hoping someone might be nice enough to explain what is going on and how to avoid it.
I was attempting to use the unified genotyper to infer frequencies of heteroplasmic mutations in mitochondrial DNA by setting a high ploidy level to account for the multiplicity of mtDNA in a sample. However, when test datasets with different mitochondrial sequences were combined at known frequencies, the frequency inferred by the GATK (alleles at different levels of ploidy) was consistently different from the known mixing frequency. In other words, the recovered frequencies were biased.
As the bias was reduced as the dcov parameter was increased, I looked at the downsampler class to check for issues. It appeared that the levelGroups function was not randomly sampling reads, as there was a lot of structure in the groups it was leveling as well as the way it chose to level them.
I created a toy dataset by placing a SNP in a 300 bp window with high coverage and then merging two BAM files with samtools. I then had the class output at the end of the GATK analysis the number of reads that came from each original alignment position and from each file (or those reads returned by the consumeFinalizedItems method). The two issues were:
First, the downsampler appeared to sample too many reads at the start of the genome segment before leveling off to correct level, before dropping again at the end of the segment being sequenced (though plenty of reads were available). Is it possible to avoid this? Second, although in the middle of the genomic segment the read counts were uniformly distributed, their source in the BAM file was not. At low downsampling settings (<75) there would be bursts where one section of reads came entirely from one of the source bam files, followed by another burst where all the reads came from a different bam file (though these files were combined in the merge). Does anyone know if it is the case that spatial structure in the BAM file will appear as spatial structure in the downsampled reads? Is there any way to avoid this behavior?