Tagged with #cohort
0 documentation articles | 0 announcements | 4 forum discussions

No articles to display.

Created 2015-08-18 13:57:14 | Updated | Tags: haplotypecaller multiple-inputs cohort

Comments (1)

Hi GATK team, I was calling variants individually to generate gVCF files for each sample. Then I thought I should also try using multiple samples as input from the 'RealignerTargetCreator' step onward, to generate a joint BAM file containing alignments from different samples (I used a different SM tag for each sample); I paste my command below. In the 'HaplotypeCaller' step I got the message 'emitRefConfidence has a bad value' because my BAM file mixes different SM tags. I suppose I shouldn't change the SM tags to be the same, or I would no longer be able to tell which variants come from which individual. I think I misunderstood the meaning of cohort calling from the beginning, so I want to clarify two points to check whether I understand correctly now.

1. 'HaplotypeCaller' uses only the SM tag to distinguish individuals; other read-group tags such as ID, PL, and LB are not considered. If samples from different individuals all share the same SM tag, HC will treat them as one individual. Is that correct?

2. My goal is to find variants between individuals, rather than variants present in all individuals. So I should call variants one sample at a time and then run 'GenotypeGVCFs' to join all samples together before hard filtering. Am I still misunderstanding something?

Thanks. The following is the command line I used.
java -Xmx32g -jar $GATK_JARS/GenomeAnalysisTK.jar \
 -T RealignerTargetCreator \
 -R ucsc.hg19.fasta \
 -I aligned_TKSAHB.dedup.sorted.bam \
 -I aligned_TKSAHV.dedup.sorted.bam \
 -I aligned_TKSASA.dedup.sorted.bam \
 -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
 -known 1000G_phase1.indels.hg19.sites.vcf \
 -o target_interval_TKSA.list \
&& java -Xmx32g -jar $GATK_JARS/GenomeAnalysisTK.jar \
 -T IndelRealigner \
 -R ucsc.hg19.fasta \
 -I aligned_TKSAHB.dedup.sorted.bam \
 -I aligned_TKSAHV.dedup.sorted.bam \
 -I aligned_TKSASA.dedup.sorted.bam \
 -targetIntervals target_interval_TKSA.list \
 -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
 -known 1000G_phase1.indels.hg19.sites.vcf \
 -o realigned_TKSA.dedup.sorted.bam
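For reference, the per-sample gVCF workflow described in point 2 would look roughly like the sketch below (GATK 3 syntax assumed; output file names are placeholders). `-ERC GVCF` requires single-sample input, which is why it rejects a BAM mixing multiple SM tags:

```shell
# Per-sample calling in gVCF mode -- one HaplotypeCaller run per sample (sketch)
java -Xmx32g -jar $GATK_JARS/GenomeAnalysisTK.jar \
 -T HaplotypeCaller \
 -R ucsc.hg19.fasta \
 -I realigned_TKSAHB.dedup.sorted.bam \
 --emitRefConfidence GVCF \
 -o TKSAHB.g.vcf

# Joint genotyping across all per-sample gVCFs
java -Xmx32g -jar $GATK_JARS/GenomeAnalysisTK.jar \
 -T GenotypeGVCFs \
 -R ucsc.hg19.fasta \
 --variant TKSAHB.g.vcf \
 --variant TKSAHV.g.vcf \
 --variant TKSASA.g.vcf \
 -o joint_TKSA.vcf
```

This keeps one SM tag per HaplotypeCaller run while still producing a jointly genotyped multi-sample VCF at the end.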

Created 2014-12-11 18:37:39 | Updated | Tags: rnaseq cohort rna-seq

Comments (2)

Dear GATK team,

Is there a value in cohort calling in RNA-Seq similar to what is recommended in the GATK DNA-Seq workflow? I am trying to understand why cohort calling is highly emphasized in DNA-Seq but not mentioned in the RNA-Seq workflow.

Thank you!


Created 2014-01-14 17:24:36 | Updated 2014-01-14 17:30:10 | Tags: cohort

Comments (33)

I am working on a large project and I am stuck on whether I should use a cohort for variant calling or call samples individually. Here is my sample info:

  1. Cohort f42: 42 female Caucasian samples
  2. Among the 42, s1 and s5 are NA12878
  3. All samples were sequenced by the same vendor using the Agilent v4pUTR exome capture kit.
  4. Mean exome sequencing depth: f42: ~80X for each sample; s1: 108X; s5: 70X
  5. Gold standard: Genome in a Bottle (GIB) NA12878 call set; the bed file is the intersection of the GIB bed file and the v4pUTR region.

My comparison results on SNP:

Table 1: cohort vs individual calling performance
cohort    filter    Sample    NRS      NRD      OGC
f42       filter    s5       0.920    0.003    0.997
s5        filter    s5       0.937    0.001    0.999
f42       filter    s1       0.921    0.002    0.998
s1        filter    s1       0.945    0.001    0.999
f42     noFilter    s5       0.932    0.003    0.997
s5      noFilter    s5       0.947    0.001    0.999
f42     noFilter    s1       0.934    0.003    0.997
s1      noFilter    s1       0.951    0.001    0.999


  1. NRS: Non-Reference Sensitivity; NRD: Non-Reference Discrepancy; OGC: Overall Genotype Concordance
  2. "cohort s5" means calling s5 individually
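As a toy illustration of how these three metrics are computed (assuming their usual concordance definitions; the genotype pairs below are made up, not taken from the tables in this post):

```shell
# Paired truth/eval genotypes, one site per line (columns: truth_GT eval_GT)
cat > gts.txt <<'EOF'
0/1 0/1
1/1 1/1
0/1 0/0
0/0 0/0
0/1 0/1
EOF

awk '
  # NRS: of the sites where truth is non-ref, how many does eval also call non-ref?
  $1 != "0/0" { nonref++; if ($2 != "0/0") found++ }
  # NRD: among sites where either genotype is non-ref, the fraction that disagree
  $1 != "0/0" || $2 != "0/0" { n_nrd++; if ($1 != $2) disc++ }
  # OGC: exact genotype matches over all sites
  $1 == $2 { conc++ }
  END { printf "NRS=%.2f NRD=%.2f OGC=%.2f\n", found/nonref, disc/n_nrd, conc/NR }
' gts.txt
# -> NRS=0.75 NRD=0.25 OGC=0.80
```

With these definitions, only NRD penalizes genotype disagreement at non-ref sites, which is why it moves opposite to NRS and OGC in Table 1.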

I do not see a filter option in VariantEval, so I suppose the following results are for all the called sites.

Table 2. Number of variants and Presence in dbSNP
cohort  sample    variants    novelSites
f42       s5       41,978       48
s5        s5       42,714       47
f42       s1       42,042       77
s1        s1       43,088       68

Table 3. Number of common and unique variants in cohort and individual calling
sample    cohort 1    cohort 2    intersection    unique 1    unique 2
s5         f42          s5          41,284         694          1430
s1         f42          s1          41,653         389          1435

Basically, the three tables suggest the following: Cohort calling leads to lower calling sensitivity and concordance rate, higher discrepancy rate, and fewer called variants, although it does find more novel sites.

Did I do something wrong here, or does the benefit of a cohort not apply to samples at 80X depth? The papers demonstrating improved accuracy with multi-sample calling use samples at low depth, such as 4X. Has anyone checked this on deeply sequenced samples?

I appreciate any comments!

Created 2013-10-28 20:13:09 | Updated 2013-10-28 20:34:07 | Tags: haplotypecaller cohort

Comments (15)

I'm trying to measure the effect of the composition of a cohort on the calling of individual samples. My cohort includes 46 Caucasian and 1 Japanese (J1) exome-seq samples from 1KG. I want to check whether the Japanese-specific variants will be buried in such a cohort.

I'm only using chr22, and the average depth of coverage across all samples is around 10X. The input files to HC are the reduced BAM files. Here are the results and my questions:

1). the command:

java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T HaplotypeCaller \
 -R $refGenome \
 --dbsnp $dbSNP \
 -stand_call_conf 50.0 \
 -stand_emit_conf 10.0 \
 -o $cohort.raw.var.vcf \
 -I $cohort.list

2). While the calls to the Caucasians seem normal, all calls to J1 are "./."

3). Then I ran HC with J1 alone; the resulting VCF file contains only headers and no records, i.e., the last line in that file is the following:


HC finished without reporting any errors. Here is a portion of the HC output:

INFO  15:13:25,888 MicroScheduler - 136025 reads were filtered out during the traversal out of approximately 2686707 total reads (5.06%)
INFO  15:13:25,888 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  15:13:25,889 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  15:13:25,889 MicroScheduler -   -> 136025 reads (5.06% of total) failing HCMappingQualityFilter
INFO  15:13:25,889 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  15:13:25,890 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  15:13:25,890 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO  15:13:25,890 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

4). I ran HC with one Caucasian sample alone; the resulting VCF file looks normal.

5). Then I ran HC using a cohort including J1 and four other Japanese samples; the calls for J1 now look normal.

Could anyone explain 2), 3), and 4)? Should I increase the proportion of Japanese samples in the Caucasian cohort in order to get calls on J1? Thanks a lot!
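One diagnostic worth trying (a sketch only; file names are placeholders): at ~10X coverage, the QUAL of a variant supported by a single sample can fall below the `-stand_call_conf 50.0` threshold used above. Rerunning J1 alone with lenient thresholds would show whether sites are being suppressed by the confidence cutoffs rather than never being called at all:

```shell
# Hypothetical diagnostic rerun for J1 alone: lower the calling and emission
# thresholds so that low-confidence sites are still written to the VCF.
java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T HaplotypeCaller \
 -R $refGenome \
 --dbsnp $dbSNP \
 -stand_call_conf 10.0 \
 -stand_emit_conf 0.0 \
 -I J1.reduced.bam \
 -o J1.lenient.raw.var.vcf
```

If records now appear (perhaps flagged LowQual), the empty single-sample VCF in 3) is a thresholding effect rather than a calling failure.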