Hi GATK team,

I was calling variants individually to generate a gVCF file for each sample. Then I thought I should also try using multiple samples as input, starting from the 'RealignerTargetCreator' step, to generate a joint BAM file containing alignments from different samples (I used a different SM tag for each sample); my commands are pasted below. In the 'HaplotypeCaller' step I then got the message 'emitRefConfidence has a bad value', because my BAM file mixes reads with different SM tags. I suppose I shouldn't change the SM tags to be the same, or I won't be able to tell which variants come from which individual. I think I misunderstood the meaning of cohort calling from the beginning, so I want to clarify two points to see if I understand correctly now.

1. 'HaplotypeCaller' uses only the SM tag to identify different individuals; other tags like ID, PL, and LB are not considered. If samples from different individuals all share the same SM tag, HC will treat them as coming from one individual. Is that correct?

2. My goal is to find variants that differ between individuals, rather than variants present in all individuals. So I should call variants one sample at a time (in GVCF mode) and then run 'GenotypeGVCFs' to joint-genotype all samples together before hard filtering. Do I still misunderstand something?

Thanks. The following is the command line I used.
java -Xmx32g -jar $GATK_JARS/GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R ucsc.hg19.fasta \
    -I aligned_TKSAHB.dedup.sorted.bam \
    -I aligned_TKSAHV.dedup.sorted.bam \
    -I aligned_TKSASA.dedup.sorted.bam \
    -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    -known 1000G_phase1.indels.hg19.sites.vcf \
    -o target_interval_TKSA.list \
&& java -Xmx32g -jar $GATK_JARS/GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R ucsc.hg19.fasta \
    -I aligned_TKSAHB.dedup.sorted.bam \
    -I aligned_TKSAHV.dedup.sorted.bam \
    -I aligned_TKSASA.dedup.sorted.bam \
    -targetIntervals target_interval_TKSA.list \
    -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    -known 1000G_phase1.indels.hg19.sites.vcf \
    -o realigned_TKSA.dedup.sorted.bam
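For reference, the per-sample workflow I mean in point 2 would look roughly like the sketch below (file and sample names are just placeholders from my project; this is my understanding of the GVCF approach, not a verified pipeline):

```shell
# Per-sample calling in GVCF mode: one HaplotypeCaller run per BAM,
# each with its own SM tag, producing one gVCF per sample.
for S in TKSAHB TKSAHV TKSASA; do
    java -Xmx32g -jar $GATK_JARS/GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R ucsc.hg19.fasta \
        -I aligned_${S}.dedup.sorted.bam \
        --emitRefConfidence GVCF \
        -o ${S}.g.vcf
done

# Joint genotyping across all samples, before hard filtering.
java -Xmx32g -jar $GATK_JARS/GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R ucsc.hg19.fasta \
    -V TKSAHB.g.vcf -V TKSAHV.g.vcf -V TKSASA.g.vcf \
    -o joint_TKSA.vcf
```

If I understand correctly, this keeps each sample's SM tag intact (so genotypes stay attributable per individual) while still letting the samples inform each other at the joint genotyping step.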
Dear GATK team,
Is there value in cohort calling for RNA-Seq similar to what is recommended in the GATK DNA-Seq workflow? I am trying to understand why cohort calling is strongly emphasized for DNA-Seq but not mentioned in the RNA-Seq workflow.
I am working on a large project and I am stuck on whether I should use a cohort in variant calling or call samples individually. Here is my sample info:
My comparison results on SNPs:
Table 1: Cohort vs. individual calling performance

cohort   filter    sample   NRS     NRD     OGC
f42      filter    s5       0.920   0.003   0.997
s5       filter    s5       0.937   0.001   0.999
f42      filter    s1       0.921   0.002   0.998
s1       filter    s1       0.945   0.001   0.999
f42      noFilter  s5       0.932   0.003   0.997
s5       noFilter  s5       0.947   0.001   0.999
f42      noFilter  s1       0.934   0.003   0.997
s1       noFilter  s1       0.951   0.001   0.999
Note: 1. NRS: Non-Reference Sensitivity; NRD: Non-Reference Discrepancy; OGC: Overall Genotype Concordance. 2. A cohort value of s5 (or s1) means that sample was called individually.
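To make the metrics concrete, here is a toy calculation of the three quantities. The counts are made up (not from my data), and the formulas reflect my understanding of the definitions, so treat them as approximate:

```shell
# Toy illustration of the Table 1 metrics (all counts hypothetical):
#   NRS ~ truth non-ref genotypes recovered / all truth non-ref genotypes
#   NRD ~ discordant non-ref genotypes / all compared non-ref genotypes
#   OGC ~ concordant genotypes / all compared genotypes
awk 'BEGIN {
  truth_nonref = 1000; called_nonref_match = 920    # sensitivity inputs
  compared_nonref = 900; discordant_nonref = 3      # discrepancy inputs
  compared_all = 1200; concordant_all = 1196        # overall concordance inputs
  printf "NRS=%.3f NRD=%.3f OGC=%.3f\n",
    called_nonref_match / truth_nonref,
    discordant_nonref / compared_nonref,
    concordant_all / compared_all
}'
```

With these counts the output matches the scale of the table (NRS=0.920 NRD=0.003 OGC=0.997), which is why a small absolute change in NRD corresponds to very few discordant genotypes.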
I do not see a filter option with VariantEval, so I suppose the following results are for all the called sites
Table 2: Number of variants and presence in dbSNP

cohort   sample   variants   novelSites
f42      s5       41,978     48
s5       s5       42,714     47
f42      s1       42,042     77
s1       s1       43,088     68

Table 3: Number of common and unique variants in cohort vs. individual calling

sample   cohort 1   cohort 2   intersection   unique 1   unique 2
s5       f42        s5         41,284         694        1,430
s1       f42        s1         41,653         389        1,435
Basically, the three tables suggest the following: cohort calling leads to lower calling sensitivity and concordance, a higher discrepancy rate, and fewer called variants, although it does find more novel sites.
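For what it's worth, the Table 3 overlap counts come from intersecting the site lists of the two callsets. A toy version of that computation with standard tools (the site lists here are hypothetical placeholders, not my actual chr:pos lists):

```shell
# Hypothetical chrom:pos site lists extracted from the two callsets.
printf 'chr22:100\nchr22:200\nchr22:300\n' | sort > cohort_sites.txt
printf 'chr22:200\nchr22:300\nchr22:400\n' | sort > individual_sites.txt

# comm needs sorted input; -12 keeps only shared lines,
# -23 keeps lines unique to file 1, -13 lines unique to file 2.
echo "intersection:         $(comm -12 cohort_sites.txt individual_sites.txt | wc -l)"
echo "unique to cohort:     $(comm -23 cohort_sites.txt individual_sites.txt | wc -l)"
echo "unique to individual: $(comm -13 cohort_sites.txt individual_sites.txt | wc -l)"
```

On the real data I compared full VCF records rather than bare positions, but the counting logic is the same.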
Did I do something wrong here, or does the benefit of cohort calling simply not apply to samples at 80X depth? The papers demonstrating improved accuracy in multi-sample calling use low-depth samples, such as 4X. Has anyone checked this on deeply sequenced samples?
I appreciate any comments!
I'm trying to measure the effect of cohort composition on the calling of an individual sample. My cohort includes 46 Caucasian and 1 Japanese (J1) exome-seq samples from 1KG. I want to check whether the Japanese-specific variants will be buried in such a cohort.
I'm only using chr22, and the average depth of coverage across all samples is around 10X. The input files to HC are the reduced BAM files. Here are the results and my questions:
1). The command:
java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R $refGenome \
    --dbsnp $dbSNP \
    -stand_call_conf 50.0 \
    -stand_emit_conf 10.0 \
    -o $cohort.raw.var.vcf \
    -I $cohort.list
2). While the calls for the Caucasians seem normal, all calls for J1 are "./.".
3). Then I ran HC with J1 alone; the resulting VCF file contains only headers and no records, i.e., the last line in the file is the following:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT jpt.ERR034603
HC finished without reporting any error. Here is a portion of the HC output:
INFO 15:13:25,888 MicroScheduler - 136025 reads were filtered out during the traversal out of approximately 2686707 total reads (5.06%)
INFO 15:13:25,888 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 15:13:25,889 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 15:13:25,889 MicroScheduler - -> 136025 reads (5.06% of total) failing HCMappingQualityFilter
INFO 15:13:25,889 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 15:13:25,890 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 15:13:25,890 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 15:13:25,890 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
4). I ran HC with one Caucasian alone, and the resulting VCF file looks normal.
5). Then I ran HC using a cohort including J1 and four other Japanese samples, and the calls for J1 look normal.
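As a simple sanity check on the log in 3), the headline filtering fraction does match the per-filter counts, so I don't think excessive read filtering alone explains the empty VCF:

```shell
# Recompute the MicroScheduler summary: 136025 reads failing
# HCMappingQualityFilter out of ~2686707 total reads.
awk 'BEGIN { printf "%.2f%%\n", 100 * 136025 / 2686707 }'
```

This prints the same 5.06% the log reports, i.e., ~95% of J1's reads survived filtering.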
Could anyone explain observations 2), 3), and 4)? Should I increase the proportion of Japanese samples in the Caucasian cohort in order to get calls for J1? Thanks a lot!