I am working on a large project and I am stuck on whether I should use cohort in variant calling or call samples individually. Here are my sample info:
My comparison results on SNP:
Table 1: cohort vs individual calling performance cohort filter Sample NRS NRD OGC f42 filter s5 0.920 0.003 0.997 s5 filter s5 0.937 0.001 0.999 f42 filter s1 0.921 0.002 0.998 s1 filter s1 0.945 0.001 0.999 f42 noFilter s5 0.932 0.003 0.997 s5 noFilter s5 0.947 0.001 0.999 f42 noFilter s1 0.934 0.003 0.997 s1 noFilter s1 0.951 0.001 0.999
note: 1. NRS: Non-Reference Sensitivity; NRD: Non-Reference Discrepancy; OGC: OverallGenotype Concordance; 2. cohort s5 means calling s5 individually
I do not see a filter option with VariantEval, so I suppose the following results are for all the called sites
Table 2. Number of variants and Presence in dbSNP cohort sample variants novelSites f42 s5 41,978 48 s5 s5 42,714 47 f42 s1 42,042 77 s1 s1 43,088 68 Table 3. Number of common and unique variants in cohort and individual calling sample cohort 1 cohort 2 intersection unique 1 unique 2 s5 f42 s5 41,284 694 1430 s1 f42 s1 41,653 389 1435
Basically, the three tables suggest the following: Cohort calling leads to lower calling sensitivity and concordance rate, higher discrepancy rate, and fewer called variants, although it does find more novel sites.
Did I do something wrong here or the benefit of a cohort does not apply to samples at 80X depth? The papers demonstrating improved accuracy in multi-sample calling use samples of low depth, such as 4X. Did anyone check that on deeply sequenced samples?
I appreciate any comments!
I'm trying to measure the effect of the composition of a cohort on the calling of individual sample. So my cohort includes 46 Caucasian and 1 Japanese (J1) exome seq samples from 1KG. I want to check if the Japanese specific variants will be buried in such a cohort.
I'm only using chr22 and the average depth of coverage on all samples are around 10X. The input files to HC are the reduced bam files. Here are the results and my questions:
1). the command:
java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T HaplotypeCaller \ -R $refGenome \ --dbsnp $dbSNP \ -stand_call_conf 50.0 \ -stand_emit_conf 10.0 \ -o $cohort.raw.var.vcf \ -I $cohort.list
2). While the calls to the Caucasians seem normal, all calls to J1 are "./."
3). Then I run HC with J1 alone, the resulting vcf file only contains headers, no content; i.e., the last line in that file is the following:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT jpt.ERR034603
The HC finished w/o reporting any error. Here is a portion from the output of HC:
INFO 15:13:25,888 MicroScheduler - 136025 reads were filtered out during the traversal out of approximately 2686707 total reads (5.06%) INFO 15:13:25,888 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 15:13:25,889 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 15:13:25,889 MicroScheduler - -> 136025 reads (5.06% of total) failing HCMappingQualityFilter INFO 15:13:25,889 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 15:13:25,890 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 15:13:25,890 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 15:13:25,890 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
4). I run HC with one Caucasian alone, the vcf file looks normal.
5). Then I run HC using a cohort including J1 and four other Japanese samples, the calling to J1 seems normal.
Could anyone explain 2. 3, and 4? Should I increase the percentage of Japanese in the Caucasian cohort in order to get calling on J1? Thanks a lot!