Tagged with #ug
0 documentation articles | 0 announcements | 3 forum discussions

Created 2014-12-11 00:55:58 | Updated 2014-12-11 01:47:11 | Tags: unifiedgenotyper ug

Comments (3)


I've been trying to use the --allSitePLs option with UnifiedGenotyper in GATK v3.2-2-gec30cee.

My command is:

java -Xmx${MEM} -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} -T UnifiedGenotyper \
    --min_base_quality_score 30 \
    -I ${in_dir}/"34E_"${REGION}"_ddkbt_RS74408_recalibrated_3.bam" \
    -I ${in_dir}/"34E_"${REGION}"_ddber_RS86405_recalibrated_3.bam" \
    --intervals $sites_dir/$file \
    -o ${out_dir}/"38F_"${REGION}"_UG_allPL.vcf" \
    --output_mode EMIT_ALL_SITES \
    --allSitePLs

I ran the same command with and without the --allSitePLs option and compared, and the output seems strange.

Specifically, with --allSitePLs the FILTER column shows 5053700 sites as LowQual and 19303 as ".", and 5059568 sites had QUAL < 30. By contrast, without --allSitePLs, 5067411 sites are LowQual and 5592 are ".", and 11460 sites had QUAL < 30. I don't understand why adding this option changes QUAL so much when I'm running UG on exactly the same data and only requesting a PL for all sites.
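For anyone wanting to reproduce these tallies, here is a minimal sketch of how the FILTER and QUAL counts could be computed from a VCF. The function name `tally_vcf` and the handling of a missing QUAL are assumptions made for illustration, not part of GATK.

```python
from collections import Counter


def tally_vcf(lines, qual_cutoff=30.0):
    """Count FILTER values and records whose QUAL is below qual_cutoff."""
    filters = Counter()
    below_cutoff = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        # Whitespace splitting is enough here: the first seven VCF columns
        # never contain spaces, and it also copes with text whose tabs were lost.
        fields = line.split()
        qual, filt = fields[5], fields[6]
        filters[filt] += 1
        # Assumption: a missing QUAL (".") is not counted as below the cutoff.
        if qual != "." and float(qual) < qual_cutoff:
            below_cutoff += 1
    return filters, below_cutoff


# Truncated copies of the two records quoted in this post.
example = [
    "chr01 1753201 . C A 0 LowQual AC=0;AF=0.00;AN=4;DP=43",
    "chr01 1753202 . T A 0 LowQual AC=0;AF=0.00;AN=4;DP=43",
]
filters, below_cutoff = tally_vcf(example)
print(dict(filters), below_cutoff)
```

On a real file the same function can be fed an open file handle, so the whole VCF is streamed line by line without being loaded into memory.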

Lastly, how is the specific ALT allele selected? Is it random? All the alternatives seem equally unlikely in the example below, where only one allele was seen in both my samples.

## 2 lines from the output
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ddber_RS86405   ddkbt_RS74408
chr01   1753201 .   C   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:655,51,0,655,51,655,655,51,655,655:17:51:0,51,655  0/0:26,0:1010,78,0,1010,78,1010,1010,78,1010,1010:26:78:0,78,1010
chr01   1753202 .   T   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:630,630,630,630,630,630,48,48,48,0:17:48:0,48,630  0/0:26,0:1027,1027,1027,1027,1027,1027,78,78,78,0:26:78:0,78,1027
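The ten APL values can be given genotype labels with a short sketch like the one below. It assumes the values cover the alleles A, C, G, T in the standard VCF genotype-likelihood order (AA, AC, CC, AG, CG, GG, AT, CT, GT, TT). That ordering is inferred from the two records above, where it reproduces the reported PL triplets, and is not confirmed from the GATK documentation. The helper names are hypothetical.

```python
def genotype_labels(alleles=("A", "C", "G", "T")):
    """Diploid genotypes in VCF likelihood order: (j, k) with j <= k sits at k*(k+1)/2 + j."""
    return [alleles[j] + alleles[k] for k in range(len(alleles)) for j in range(k + 1)]


def label_apl(apl_field):
    """Map a comma-separated APL string to {genotype: PL} under the assumed ordering."""
    values = [int(v) for v in apl_field.split(",")]
    labels = genotype_labels()
    if len(values) != len(labels):
        raise ValueError("expected %d APL values, got %d" % (len(labels), len(values)))
    return dict(zip(labels, values))


# First sample of the chr01:1753201 record (REF=C, ALT=A, PL=0,51,655).
apl = label_apl("655,51,0,655,51,655,655,51,655,655")
best = min(apl, key=apl.get)
runners_up = sorted(g for g, pl in apl.items() if pl == 51)
print(best, runners_up)
```

Under this reading the best genotype is the homozygous reference, and the three heterozygous genotypes that contain the reference allele are exactly tied, which matches the observation that no ALT allele is favoured by the data. Both records report A as ALT, but the sketch says nothing about how UG breaks the tie.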

Created 2014-01-23 10:42:02 | Updated | Tags: exome ug ulimit

Comments (3)

Hi there

I am trying to run UG across just over 2,200 individuals (exome sequencing). I have successfully done this on our computing cluster with just over 1,000 samples without issues, apart from having to get the limit on the number of open files (ulimit) increased.

I got another increase in ulimit to allow me to run UG on the larger set. However, our IO is being pushed over the edge with the 2,200 input samples. I have two questions:

  • Does UG open all of the input BAM files at the same time? It seems so, since a ulimit of 2048 was not sufficient for 2,200 input files.
  • Is there a way to optimise this, possibly by getting UG to open files sequentially, or do they all have to be open at the same time? I suspect this will become more of a problem as the size of the available datasets increases.
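As a rough way to check whether a given open-file limit could accommodate a run, the sketch below compares an estimated descriptor count with the current soft limit. The `per_input` and `overhead` values are placeholders chosen for illustration; how many descriptors UG actually holds per BAM (for example, whether the index file counts separately) is an assumption that would need confirming.

```python
import resource


def descriptors_needed(n_inputs, per_input=1, overhead=64):
    """Estimate open descriptors: per_input per BAM plus a fixed allowance (both hypothetical)."""
    return n_inputs * per_input + overhead


def fits_in_limit(n_inputs, soft_limit, per_input=1, overhead=64):
    """True if the estimated descriptor count stays within soft_limit."""
    return descriptors_needed(n_inputs, per_input, overhead) <= soft_limit


# Query the limit that applies to this process (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
unlimited = soft == resource.RLIM_INFINITY
print("soft limit:", "unlimited" if unlimited else soft)
print("2,200 inputs fit:", unlimited or fits_in_limit(2200, soft))
```

With these placeholder numbers, 2,200 inputs do not fit under a limit of 2048 while 1,000 inputs do, which is consistent with what was observed on the cluster.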

I would appreciate any advice you have on getting this to run on a dataset of this size. Thanks!

Created 2013-08-05 07:43:36 | Updated | Tags: ug

Comments (9)

Hi there,

I am using UG (GATK version 2.5-2-gf57256b) to call variants across ~1,100 samples (made up of both cases and controls). I follow this with VariantAnnotator, VariantRecalibrator, and ApplyRecalibration (separately on SNPs and indels). While doing some downstream analysis on the SNP sites tagged with the PASS filter, I am finding that some heterozygous haploid sites have been called (~9,000 sites; PLINK complains about them). What would be the cause of this? Should I treat this as a red flag, or just ignore those sites?
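To see where such calls sit in the VCF, a minimal sketch like the following could count heterozygous genotypes in male samples on chromosomes expected to be haploid. It assumes the chromosomes are named "X" and "Y", ignores the pseudoautosomal regions, and takes the set of male sample names as an input; the function name and the example records are made up for illustration.

```python
import re


def count_het_haploid(vcf_lines, males, haploid_chroms=("X", "Y")):
    """Return {(chrom, pos): n} for sites with n heterozygous calls in male samples."""
    samples = []
    hits = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if fields[0] not in haploid_chroms:
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        n = 0
        for sample, call in zip(samples, fields[9:]):
            if sample not in males:
                continue
            alleles = re.split(r"[/|]", call.split(":")[gt_idx])
            # Heterozygous: two called alleles that differ.
            if len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]:
                n += 1
        if n:
            hits[(fields[0], int(fields[1]))] = n
    return hits


# Made-up records: m1 is male, f1 is female.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tm1\tf1",
    "X\t100\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\t0/1",
    "X\t200\t.\tG\tA\t50\tPASS\t.\tGT\t1/1\t0/1",
    "1\t300\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/1",
]
hits = count_het_haploid(example, males={"m1"})
print(hits)
```

If most flagged sites fall in the pseudoautosomal regions or cluster in a few samples, that would point to different explanations than calls spread evenly across the X chromosome.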

The UG command I am using is:

java -Xmx8g -jar $path2Gatk/GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -l INFO \
    -R $path2SeqIndex \
    $bam_args \
    -o $name.vcf \
    --dbsnp:vcf $path2Dbsnp \
    -stand_call_conf 10 \
    -stand_emit_conf 10 \
    -rf BadCigar \
    -glm BOTH \
    --intervals:bed $intfile \
    --pedigree $ped \
    --pedigreeValidationType SILENT \
    -dcov 250

Would appreciate your thoughts - thank you.