Dear GATK team,
I'm trying to get genotype calls for whole-genome data from one individual, sequenced on an Illumina HiSeq (~1.2B reads). I want to test how the number of input reads affects GATK calls, so I first divide the unsorted reads into parcels of ~200M, then use 1 parcel, 2 parcels, 3 parcels, and so on as input to the GATK pipeline to simulate different read counts. When I ran this process on hg18-aligned data, the VCF files grew as the number of input parcels increased, as expected. However, when I ran the same process on hg19-aligned data, the VCF file sizes were all similar, and the loci and read counts in all the files were similar too. The input files to UnifiedGenotyper varied in size as expected, so the problem is likely at that last step.
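In case it helps, here is a toy sketch of the parceling scheme, scaled down to 1,200 dummy read names standing in for the ~1.2B reads (in the real run the parcels are read files, not name lists, and the file names below are placeholders):

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

# 1,200 toy read names standing in for ~1.2B reads.
seq 1 1200 | sed 's/^/read_/' > reads.txt

# Split into 6 parcels of 200 "reads" each (parcel_aa .. parcel_af),
# mirroring the ~200M-read parcels in the real run.
split -l 200 reads.txt parcel_

# Build cumulative inputs: run 1 uses 1 parcel, run 2 uses 2 parcels, etc.
i=0
run=""
for p in parcel_*; do
  i=$((i+1))
  run="$run $p"
  cat $run > run_${i}.txt
done

# Each run_k.txt should hold k * 200 reads.
wc -l run_*.txt
```

Each run_k.txt then goes through the same alignment/GATK pipeline, so the only variable is input size.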
The only difference between the hg18 and hg19 runs is in how I handled reads with N in the CIGAR string. For hg18, using the -L target.intervals option at the RealignerTargetCreator step (and later on wherever required) solved the problem. That somehow didn't work for the hg19 run, so I had to use the filter-N-CIGAR option instead. Could this affect the genotyping step?
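Concretely, the step in question differed roughly like this (the file names are placeholders, not my actual paths, and the commands are a sketch rather than my exact invocations):

```shell
GATK="java -jar GenomeAnalysisTK.jar"

# hg18 run: restricting to target intervals avoided the reads with N in the CIGAR.
hg18_cmd="$GATK -T RealignerTargetCreator -R hg18.fasta -I merged_parcels.bam -L target.intervals -o hg18.intervals"

# hg19 run: -L alone didn't help, so the N-CIGAR read filter was used instead.
hg19_cmd="$GATK -T RealignerTargetCreator -R hg19.fasta -I merged_parcels.bam --filter_reads_with_N_cigar -o hg19.intervals"

printf '%s\n' "$hg18_cmd" "$hg19_cmd"
```

Everything downstream (IndelRealigner through UnifiedGenotyper) was otherwise the same between the two runs.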
A diagram of the workflow is attached.
Thank you! Stephanie