Tagged with #cohort
0 documentation articles | 0 announcements | 2 forum discussions


Comments (26)

I am working on a large project and I am stuck on whether I should use cohort calling or call samples individually for variant calling. Here is my sample info:

  1. Cohort f42: 42 female Caucasian
  2. Among the 42, s1 and s5 are NA12878
  3. All samples were sequenced by the same vendor using the Agilent v4pUTR exome capture kit.
  4. Mean exome sequencing depth: f42: ~80X per sample; s1: 108X; s5: 70X.
  5. Gold standard: the Genome in a Bottle (GIB) NA12878 call set; the evaluation BED file is the intersection of the GIB BED file and the v4pUTR target regions.
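For point 5, building the evaluation BED is just an interval intersection; in practice a tool like bedtools would do this, but the core sweep is simple. A minimal Python sketch over half-open [start, end) intervals on a single chromosome (the single-chromosome simplification is mine):

```python
def intersect_intervals(a, b):
    """Intersect two sorted, non-overlapping interval lists in half-open
    [start, end) coordinates on one chromosome, as when restricting a
    confident-region BED to a capture-kit target BED."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # the two intervals actually overlap
            out.append((start, end))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

For multi-chromosome BED files the same sweep would run per chromosome after grouping intervals.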

My comparison results on SNPs:

Table 1: cohort vs individual calling performance
cohort    filter      sample    NRS      NRD      OGC
f42       filter      s5        0.920    0.003    0.997
s5        filter      s5        0.937    0.001    0.999
f42       filter      s1        0.921    0.002    0.998
s1        filter      s1        0.945    0.001    0.999
f42       noFilter    s5        0.932    0.003    0.997
s5        noFilter    s5        0.947    0.001    0.999
f42       noFilter    s1        0.934    0.003    0.997
s1        noFilter    s1        0.951    0.001    0.999

Notes: 1. NRS: Non-Reference Sensitivity; NRD: Non-Reference Discrepancy; OGC: Overall Genotype Concordance. 2. A cohort value of "s5" (or "s1") means that sample was called individually.
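For reference, these three metrics can be computed from paired (truth, call) genotypes roughly as follows. This is a sketch of the standard definitions, not GATK's exact implementation, and the hom-ref/het/hom-alt integer encoding is my own:

```python
def concordance_metrics(pairs):
    """Compute NRS, NRD, and OGC from (truth, eval) genotype pairs.

    Genotypes are encoded as non-reference allele counts: 0 = hom-ref,
    1 = het, 2 = hom-alt. Missing calls are assumed excluded upstream.
    """
    # NRS: of the sites non-ref in truth, fraction also called non-ref
    truth_nonref = sum(1 for t, e in pairs if t > 0)
    found_nonref = sum(1 for t, e in pairs if t > 0 and e > 0)
    nrs = found_nonref / truth_nonref

    # NRD: among sites non-ref in either set, fraction with discordant genotypes
    nonref_sites = [(t, e) for t, e in pairs if t > 0 or e > 0]
    nrd = sum(1 for t, e in nonref_sites if t != e) / len(nonref_sites)

    # OGC: fraction of all compared sites with identical genotypes
    ogc = sum(1 for t, e in pairs if t == e) / len(pairs)
    return nrs, nrd, ogc
```

So NRD ignores sites that are hom-ref in both sets, which is why it moves more sharply than 1 - OGC when a few non-ref genotypes disagree.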

I do not see a filter option in VariantEval, so I assume the following results are for all called sites.

Table 2. Number of variants and Presence in dbSNP
cohort  sample    variants    novelSites
f42       s5       41,978       48
s5        s5       42,714       47
f42       s1       42,042       77
s1        s1       43,088       68

Table 3. Number of common and unique variants in cohort and individual calling
sample    cohort 1    cohort 2    intersection    unique 1    unique 2
s5         f42          s5          41,284         694          1430
s1         f42          s1          41,653         389          1435
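The counts in Table 3 reduce to set operations over variant sites. A minimal sketch, assuming sites are keyed by (chrom, pos, ref, alt) tuples; how the comparison was actually keyed is not stated above:

```python
def compare_call_sets(sites_a, sites_b):
    """Count shared and set-specific variant sites between two call sets.

    Each site is a hashable key such as a (chrom, pos, ref, alt) tuple.
    Returns (intersection, unique to A, unique to B).
    """
    a, b = set(sites_a), set(sites_b)
    return len(a & b), len(a - b), len(b - a)
```

With the cohort call set as A and the individual call set as B, the three returned counts correspond to the intersection, unique 1, and unique 2 columns.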

Basically, the three tables suggest the following: cohort calling leads to lower calling sensitivity and concordance, a higher discrepancy rate, and fewer called variants overall, although it does find more novel sites.

Did I do something wrong here, or does the benefit of a cohort simply not apply to samples at ~80X depth? The papers demonstrating improved accuracy in multi-sample calling use low-depth samples, e.g. 4X. Has anyone checked this on deeply sequenced samples?

I appreciate any comments!

Comments (15)

I'm trying to measure the effect of cohort composition on the calling of an individual sample. My cohort includes 46 Caucasian exome-seq samples and 1 Japanese sample (J1), all from 1KG. I want to check whether Japanese-specific variants get buried in such a cohort.

I'm only using chr22, and the average depth of coverage across all samples is around 10X. The input files to HC are the reduced BAM files. Here are the results and my questions:

1). the command:

java -Xmx4g -jar $gatkDir/GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R $refGenome \
    --dbsnp $dbSNP \
    -stand_call_conf 50.0 \
    -stand_emit_conf 10.0 \
    -o $cohort.raw.var.vcf \
    -I $cohort.list

2). While the calls for the Caucasian samples look normal, all genotype calls for J1 are "./." (missing).

3). I then ran HC on J1 alone; the resulting VCF file contains only headers and no records, i.e., the last line of the file is the following:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT jpt.ERR034603

HC finished without reporting any errors. Here is a portion of the HC log output:

INFO  15:13:25,888 MicroScheduler - 136025 reads were filtered out during the traversal out of approximately 2686707 total reads (5.06%)
INFO  15:13:25,888 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  15:13:25,889 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  15:13:25,889 MicroScheduler -   -> 136025 reads (5.06% of total) failing HCMappingQualityFilter
INFO  15:13:25,889 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  15:13:25,890 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  15:13:25,890 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO  15:13:25,890 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

4). When I ran HC on one Caucasian sample alone, the VCF file looks normal.

5). When I ran HC on a cohort including J1 and four other Japanese samples, the calling for J1 looks normal.
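To quantify the "./." pattern in observations 2) and 5), one can tally missing genotypes per sample column directly from the cohort VCF. A minimal Python sketch, assuming tab-separated records with GT as the first FORMAT field (the usual VCF layout):

```python
def missing_genotype_counts(vcf_lines):
    """Count './.' (missing) genotype calls per sample in a VCF body.

    vcf_lines: the #CHROM header line followed by record lines,
    tab-separated, with GT as the first FORMAT field.
    """
    header = vcf_lines[0].rstrip("\n").split("\t")
    samples = header[9:]  # sample columns start after FORMAT
    counts = {s: 0 for s in samples}
    for line in vcf_lines[1:]:
        fields = line.rstrip("\n").split("\t")
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[0]  # GT is the first sub-field
            if gt in ("./.", "."):
                counts[sample] += 1
    return counts
```

Running this on the 47-sample cohort VCF would show whether J1 is missing at every site or only at a subset.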

Could anyone explain points 2), 3), and 4)? Should I increase the proportion of Japanese samples in the Caucasian cohort in order to get calls for J1? Thanks a lot!