Tagged with #ug
0 documentation articles | 0 announcements | 3 forum discussions


No posts found with the requested search criteria.
Comments (3)

Hi

I've been trying to use the --allSitePLs option with UnifiedGenotyper in GATK v3.2-2-gec30cee.

My command is:

java -Xmx${MEM} -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} -T UnifiedGenotyper \
    --min_base_quality_score 30 \
    -I ${in_dir}/"34E_"${REGION}"_ddkbt_RS74408_recalibrated_3.bam" \
    -I ${in_dir}/"34E_"${REGION}"_ddber_RS86405_recalibrated_3.bam" \
    --intervals $sites_dir/$file \
    -o ${out_dir}/"38F_"${REGION}"_UG_allPL.vcf" \
    --output_mode EMIT_ALL_SITES \
    --allSitePLs

I ran the same command with and without the --allSitePLs option and compared, and the output seems strange.

Specifically, with --allSitePLs, the FILTER column shows 5053700 sites as LowQual and 19303 as ".", and 5059568 sites had QUAL < 30. By contrast, without --allSitePLs, 5067411 sites are LowQual and 5592 are ".", and 11460 had QUAL < 30. I don't understand why adding this option changes QUAL so much when I'm running UG on exactly the same data and only requesting a PL for all sites.
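For anyone wanting to reproduce this kind of comparison, tallies like the ones above can be produced with a short script. This is only a minimal sketch: the function name `summarize_vcf` and the default threshold of 30 are illustrative choices, not part of GATK.

```python
from collections import Counter

def summarize_vcf(lines, qual_threshold=30.0):
    """Tally FILTER values and count records whose QUAL is below the threshold.

    `lines` is any iterable of VCF text lines (for example an open file handle).
    """
    filters = Counter()
    below_threshold = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        # Only the first seven columns are needed; leave the rest unsplit.
        fields = line.split(None, 7)
        qual, filt = fields[5], fields[6]
        filters[filt] += 1
        if qual != "." and float(qual) < qual_threshold:
            below_threshold += 1
    return filters, below_threshold

# Usage (the file name is hypothetical):
# with open("UG_allPL.vcf") as handle:
#     print(summarize_vcf(handle))
```

Running it on both output files gives the FILTER breakdown and the QUAL < 30 count side by side.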

Lastly, how is the specific ALT allele selected? Is it random? All of them seem equally unlikely in the example below, where only one allele was seen in both my samples.

## 2 lines from the output
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ddber_RS86405   ddkbt_RS74408
chr01   1753201 .   C   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:655,51,0,655,51,655,655,51,655,655:17:51:0,51,655  0/0:26,0:1010,78,0,1010,78,1010,1010,78,1010,1010:26:78:0,78,1010
chr01   1753202 .   T   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:630,630,630,630,630,630,48,48,48,0:17:48:0,48,630  0/0:26,0:1027,1027,1027,1027,1027,1027,78,78,78,0:26:78:0,78,1027
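To make the APL values in these two records easier to read, here is a small sketch that labels each of the ten values with a genotype. It assumes (this is not confirmed anywhere in this thread) that APL is listed over the alleles A, C, G, T in the standard VCF genotype order, where the genotype with allele indices j <= k sits at position k*(k+1)/2 + j. That assumption is at least consistent with both records: the 0 lands on C/C for the REF=C site and on T/T for the REF=T site. The function names are illustrative only.

```python
def genotype_order(alleles):
    """Genotypes in VCF order: allele indices j <= k map to position k*(k+1)//2 + j."""
    return [alleles[j] + alleles[k] for k in range(len(alleles)) for j in range(k + 1)]

def label_apl(apl, alleles="ACGT"):
    """Pair each APL value with a genotype label.

    ASSUMPTION: APL covers all genotypes of `alleles` in VCF genotype order.
    """
    values = [int(v) for v in apl.split(",")]
    names = genotype_order(alleles)
    if len(values) != len(names):
        raise ValueError("expected %d values, got %d" % (len(names), len(values)))
    return dict(zip(names, values))

# First sample at chr01:1753201 (REF=C)
labelled = label_apl("655,51,0,655,51,655,655,51,655,655")
for genotype, pl in sorted(labelled.items(), key=lambda item: item[1]):
    print(genotype, pl)
```

Under this reading, the three heterozygous genotypes that include the reference allele (AC, CG, CT) all share the same value at each site, which matches the observation that no alternate allele is favoured over the others.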
Comments (3)

Hi there

I am trying to run UG across just over 2,200 individuals (exome sequencing). I have successfully done this on our computing cluster with just over 1,000 samples without issues (apart from having to get the limit on no. of open files (ulimit) increased).

I got another increase in ulimit to allow me to run UG on the larger set. However, our IO is being pushed over the edge with the 2,200 input samples. I have two questions:

  • does UG open all of the input bam files at the same time? It seems like it, since a ulimit of 2048 was not sufficient for 2,200 input files.
  • is there a way to optimise this, possibly by getting UG to open files sequentially - or do they have to be all open at the same time? I suspect this will become more of a problem as the size of the datasets available increases.

Would appreciate any advice you would have on getting this to run on this size of data. Thanks!
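As a quick sanity check before submitting a job, the open-file limit can be compared against the number of inputs. The sketch below uses Python's standard `resource` module (Unix only). The `per_input` and `overhead` figures are guesses for illustration, not measured UG behaviour.

```python
import resource

def open_file_headroom(n_inputs, soft_limit, per_input=1, overhead=64):
    """Descriptors left over if every input is open at once.

    ASSUMPTIONS: `per_input` descriptors per BAM and `overhead` descriptors for
    the JVM, reference, and outputs are illustrative values only.
    """
    return soft_limit - (n_inputs * per_input + overhead)

def current_soft_limit():
    """Soft limit on open files for this process (what `ulimit -n` reports)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft

if __name__ == "__main__":
    soft = current_soft_limit()
    if soft == resource.RLIM_INFINITY:
        print("no soft limit on open files")
    else:
        print("headroom for 2200 inputs:", open_file_headroom(2200, soft))
```

A negative result means the limit would be exceeded if all inputs were held open together, as appears to be the case with 2,200 files and a limit of 2048.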

Comments (9)

Hi there,

I am using UG (GATK version 2.5-2-gf57256b) to call variants across ~1,100 samples (made up of both cases and controls). I follow this with VariantAnnotator, VariantRecalibrator, and ApplyRecalibration (separately on SNPs and indels). While doing some downstream analysis (with the SNP sites tagged with the PASS filter), I am finding that some heterozygous haploid sites have been called (~9,000 sites; PLINK warns about them). What would cause this? Should I treat this as a red flag, or just ignore those sites?

The UG command I am using is:

java -Xmx8g -jar $path2Gatk/GenomeAnalysisTK.jar -T UnifiedGenotyper -l INFO \
    -R $path2SeqIndex $bam_args \
    -o $name.vcf \
    --dbsnp:vcf $path2Dbsnp \
    -stand_call_conf 10 -stand_emit_conf 10 \
    -rf BadCigar -glm BOTH \
    --intervals:bed $intfile \
    --pedigree $ped --pedigreeValidationType SILENT \
    -dcov 250

Would appreciate your thoughts - thank you.
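To see which calls are behind the heterozygous haploid warnings, the flagged genotypes can be listed straight from the VCF. PLINK reports these when a sample recorded as male has a heterozygous call on a chromosome it treats as haploid. The sketch below is a rough check only: the function name, the chromosome labels, and the set of male sample IDs are all assumptions supplied by the caller, and it does not exclude the pseudo-autosomal regions.

```python
def het_haploid_calls(vcf_lines, male_samples, haploid_chroms=("chrX", "chrY", "X", "Y")):
    """List heterozygous diploid calls in male samples on sex chromosomes.

    ASSUMPTIONS: `male_samples` comes from the pedigree/sample sheet, and the
    pseudo-autosomal regions are NOT excluded here.
    """
    samples = []
    hits = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, fmt = fields[0], fields[1], fields[8]
        if chrom not in haploid_chroms:
            continue
        gt_index = fmt.split(":").index("GT")
        for name, call in zip(samples, fields[9:]):
            if name not in male_samples:
                continue
            gt = call.split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")
            if len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]:
                hits.append((chrom, int(pos), name, gt))
    return hits
```

If most of the hits cluster in a few samples, the recorded sex for those samples may be worth checking; if they are spread across sites, the calls themselves may be the issue.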