Tagged with #UnifiedGenotyper
4 documentation articles | 1 announcement | 94 forum discussions



Created 2013-08-23 21:34:04 | Updated 2014-10-24 16:21:11 | Tags: unifiedgenotyper official haplotypecaller ploidy
Comments (1)

Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.

As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.

Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

  • If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
  • If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
  • If you want to analyze more than 100 samples at a time (for performance reasons) (versions 2.x)

Created 2013-06-17 21:39:48 | Updated 2015-05-08 22:01:39 | Tags: unifiedgenotyper variant-discovery locus-based
Comments (5)

Note: the UnifiedGenotyper has been replaced by HaplotypeCaller, which is a much better tool. UG is still available but you should really consider using HC instead.

Objective

Call variants on a haploid genome with the UnifiedGenotyper, producing a raw (unfiltered) VCF.

Prerequisites

  • TBD

Steps

  1. Determine the basic parameters of the analysis
  2. Call variants in your sequence data

1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

  • Ploidy (-ploidy)

In its basic use, this is the ploidy (number of chromosomes) per sample. By default it is set to 2, to process diploid organisms' genomes, but it can be set to any other desired value, starting at 1 for haploid organisms, and up for polyploids. This argument can also be used to handle pooled data. For that purpose, you'll need to set -ploidy to Number of samples in each pool * Sample Ploidy. There is no fixed upper limit, but keep in mind that high-level ploidy will increase processing times since the calculations involved are more complex. For full details on how to process pooled data, see Eran et al. (in preparation).

  • Genotype likelihood model (-glm)

This is the model that the program will use to calculate the genotype likelihoods. By default, it is set to SNP, but it can also be set to INDEL or BOTH. If set to BOTH, both SNPs and Indels will be called in the same run and be output to the same variants file.

  • Emission confidence threshold (–stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

  • Calling confidence threshold (–stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.


2. Call variants in your sequence data

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T UnifiedGenotyper \ 
    -R haploid_reference.fa \ 
    -I haploid_reads.bam \ 
    -L 20 \ 
    --glm BOTH \ 
    --stand_call_conf 30 \ 
    --stand_emit_conf 10 \ 
    -o raw_ug_variants.vcf 

This creates a VCF file called raw_ug_variants.vcf, containing all the sites that the UnifiedGenotyper evaluated to be potentially variant.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.


Created 2012-07-30 17:37:12 | Updated 2015-04-26 02:30:27 | Tags: unifiedgenotyper haplotypecaller snp bamout debugging
Comments (34)

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.

3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.

5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site, you are probably seeing false positives.


Created 2012-07-26 14:50:55 | Updated 2014-10-24 00:55:34 | Tags: unifiedgenotyper official ploidy haploid analyst intermediate
Comments (14)

In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.

Ploidy-related functionalities

As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.

For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the -ploidy argument.

Cases where ploidy needs to be specified

  1. Native variant calling in haploid or polyploid organisms.
  2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
  3. Pooled validation/genotyping at known sites.

For normal organism ploidy, you just set the -ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).

Important limitations

Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF, and Genotype annotations such as PL, AD, GT, etc.

You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.


Created 2013-03-14 12:29:25 | Updated 2014-12-01 17:34:07 | Tags: unifiedgenotyper official haplotypecaller printreads bug
Comments (36)

As reported here:

  • http://gatkforums.broadinstitute.org/discussion/2342/unifiedgenotyper-causes-arrayindexoutofboundsexception-3-how-to-fix-it
  • http://gatkforums.broadinstitute.org/discussion/2343/printreads-yet-another-arrayindexoutofboundsexception
  • http://gatkforums.broadinstitute.org/discussion/2345/arrayindexoutofboundsexception-in-haplotypecaller

If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.

Thank you for your patience while we work to fix this issue.

Latest update: we found that the three tools (PrintRead, HaplotypeCaller and UnifiedGenotyper) had different issues that only produced the same symptom (the ArrayIndexOutOfBoundsException error). The underlying issues have all been fixed in version 2.5. If you encounter an ArrayIndexOutOfBoundsException in later versions, it is certainly caused by a different problem, so please open a new thread to report the issue.


Created 2015-08-24 09:35:24 | Updated | Tags: unifiedgenotyper vqsr bqsr haplotypecaller
Comments (1)

I'm a bit uncertain as to the optimal pipeline for calling variants. I've sequenced a population sample of ~200 at high coverage ~30X, with no prior information on nucleotide variation.

The most rigorous pipeline would seem to be: 1. Call variants with UG on 'raw' (realigned) bams. 2. Extract out high-confidence variants (high QUAL, high DP, not near indels or repeats, high MAF) 3. Perform BQSR using the high-confidence variants. 4. Call variants with HaplotypeCaller on recalibrated bams. 5. Perform VQSR using high-confidence variants. 6. Any other hard filters.

Is this excessive? Does using HaplotypeCaller negate the use of *QSR? Is it worthwhile performing VQSR if BQSR hasn't been done? Otherwise I'm just running HaplotyperCaller on un-recalibrated bams, and then hard-filtering.


Created 2015-08-18 11:55:19 | Updated 2015-08-18 11:57:08 | Tags: unifiedgenotyper vcf lowqual emit-all-confident-sites emit-all-sites
Comments (14)

Hi,

I've run UnifiedGenotyper on a BAM file with both EMIT_ALL_SITES and EMIT_ALL_CONFIDENT_SITES. I've noticed some of the calls that have been omitted with the EMIT_ALL_SITES option seem to be of comparable quality to others that have been emitted. For example, sites like HE670865 368605 are removed as non-confident sites while the site 368591 has not been.

I understand why the positions flagged as "LowQual" are not present when using EMIT_ALL_CONFIDENT_SITES. But I can't see why other positions (such as HE670865 368605 and HE670865 368600) are not being emitted. Of particular importance is the fact that some of these seemingly 'good sites' that have been removed are SNPs that might be of biological importance.

These are the two commands used together with some relevant lines from the resulting vcfs:

java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -nt 4 -out_mode EMIT_ALL_SITES -baq CALCULATE_AS_NECESSARY -hets 0.015 -R genome_ref.fasta -o output_all.vcf -I input_divergentscaffs.realign.bam &> divergentscaffs_out.GATKlog &

java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -nt 4 -out_mode EMIT_ALL_CONFIDENT_SITES -baq CALCULATE_AS_NECESSARY -hets 0.015 -R genome_ref.fasta -o output_confident.vcf -I input_divergentscaffs.realign.bam &> divergentscaffs_confident.GATKlog &

These are a sample of lines from the outputted VCF file when using EMIT_ALL_SITES:

HE670865 368589 . A . 46.89 . AN=16;DP=441;MQ=43.26;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:58 0/0:38 0/0:75 0/0:64 0/0:64 0/0:47 HE670865 368590 . A . 50.15 . AN=16;DP=447;MQ=43.41;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:60 0/0:39 0/0:76 0/0:66 0/0:65 0/0:47 HE670865 368591 . A . 39.12 . AN=16;DP=448;MQ=43.32;MQ0=12 GT:DP 0/0:36 0/0:45 0/0:61 0/0:39 0/0:77 0/0:66 0/0:66 0/0:46 HE670865 368592 . C T 5437.28 . AC=16;AF=1.00;AN=16;BaseQRankSum=1.036;DP=445;Dels=0.06;FS=0.000;HaplotypeScore=99.1839;MLEAC=16;MLEAF=1.00;MQ=43.48;MQ0=9;MQRankSum=4.214;QD=12.22;ReadPosRankSum=6.263 GT:AD:DP:GQ:PL 1/1:1,27:29:27:296,27,0 1/1:0,45:45:30:344,30,0 1/1:6,43:52:33:327,33,0 1/1:1,38:39:48:506,48,0 1/1:5,68:76:99:1100,102,0 1/1:4,61:65:68:1068,68,0 1/1:0,65:66:99:1035,99,0 1/1:0,46:46:69:774,69,0 HE670865 368593 . A . 50.76 . AN=16;DP=448;MQ=43.48;MQ0=9 GT:DP 0/0:29 0/0:45 0/0:53 0/0:41 0/0:76 0/0:65 0/0:66 0/0:47 HE670865 368594 . G . 69.75 . AN=16;DP=449;MQ=43.48;MQ0=9 GT:DP 0/0:28 0/0:46 0/0:51 0/0:41 0/0:76 0/0:66 0/0:66 0/0:47 HE670865 368595 . A . 66.75 . AN=16;DP=447;MQ=43.43;MQ0=9 GT:DP 0/0:30 0/0:46 0/0:50 0/0:41 0/0:75 0/0:66 0/0:66 0/0:47 HE670865 368596 . A . 72.75 . AN=16;DP=445;MQ=43.54;MQ0=9 GT:DP 0/0:31 0/0:46 0/0:48 0/0:41 0/0:76 0/0:66 0/0:67 0/0:47 HE670865 368597 . G . 66.75 . AN=16;DP=442;MQ=43.48;MQ0=9 GT:DP 0/0:25 0/0:46 0/0:41 0/0:40 0/0:74 0/0:67 0/0:67 0/0:47 HE670865 368598 . A . 75.75 . AN=16;DP=444;MQ=43.44;MQ0=9 GT:DP 0/0:25 0/0:46 0/0:41 0/0:39 0/0:75 0/0:69 0/0:67 0/0:47 HE670865 368599 . A . 72.75 . AN=16;DP=445;MQ=43.49;MQ0=9 GT:DP 0/0:24 0/0:45 0/0:39 0/0:39 0/0:76 0/0:69 0/0:68 0/0:48 HE670865 368600 . T A 4405.18 . AC=8;AF=0.500;AN=16;BaseQRankSum=2.278;DP=441;Dels=0.08;FS=6.510;HaplotypeScore=100.3222;MLEAC=8;MLEAF=0.500;MQ=43.26;MQ0=9;MQRankSum=0.498;QD=24.75;ReadPosRankSum=1.103 GT:AD:DP:GQ:PL 1/1:3,19:22:48:570,48,0 1/1:2,45:47:99:1386,114,0 1/1:4,33:37:96:1196,96,0 1/1:0,38:38:99:1307,105,0 0/0:76,2:78:99:0,178,2196 0/0:65,0:65:99:0,159,2002 0/0:70,0:70:99:0,168,2042 0/0:49,0:49:99:0,129,1549 HE670865 368601 . T . 63.75 . AN=16;DP=442;MQ=43.37;MQ0=9 GT:DP 0/0:19 0/0:47 0/0:37 0/0:38 0/0:77 0/0:66 0/0:72 0/0:48 HE670865 368602 . A . 66.75 . AN=16;DP=444;MQ=43.44;MQ0=9 GT:DP 0/0:19 0/0:47 0/0:38 0/0:38 0/0:78 0/0:65 0/0:74 0/0:48 HE670865 368603 . C . 66.75 . AN=16;DP=441;MQ=43.50;MQ0=11 GT:DP 0/0:20 0/0:47 0/0:39 0/0:37 0/0:76 0/0:64 0/0:73 0/0:48 HE670865 368604 . A . 63.75 . AN=16;DP=441;MQ=43.24;MQ0=11 GT:DP 0/0:21 0/0:47 0/0:39 0/0:38 0/0:74 0/0:64 0/0:74 0/0:48 HE670865 368605 . G . 60.75 . AN=16;DP=436;MQ=43.17;MQ0=12 GT:DP 0/0:20 0/0:47 0/0:40 0/0:38 0/0:73 0/0:63 0/0:73 0/0:47 HE670865 368606 . A . 60.75 . AN=16;DP=436;MQ=43.05;MQ0=12 GT:DP 0/0:22 0/0:45 0/0:40 0/0:39 0/0:73 0/0:63 0/0:75 0/0:46 HE670865 368607 . T C 3984.32 . AC=15;AF=0.938;AN=16;BaseQRankSum=0.225;DP=433;Dels=0.06;FS=11.987;HaplotypeScore=229.2107;MLEAC=15;MLEAF=0.938;MQ=42.94;MQ0=12;MQRankSum=1.624;QD=9.20;ReadPosRankSum=-0.463 GT:AD:DP:GQ:PL 1/1:7,18:26:12:109,12,0 1/1:2,39:44:57:475,57,0 1/1:6,36:44:51:426,51,0 1/1:1,36:39:39:328,39,0 1/1:6,67:73:44:603,44,0 1/1:0,63:63:99:1016,105,0 1/1:4,69:74:90:817,90,0 0/1:22,23:46:99:232,0,373 HE670865 368608 . A . 24.45 LowQual AN=16;DP=438;MQ=42.86;MQ0=12 GT:DP 0/0:34 0/0:45 0/0:55 0/0:39 0/0:74 0/0:63 0/0:74 0/0:47 HE670865 368609 . A . 29.97 LowQual AN=14;DP=434;MQ=42.67;MQ0=12 GT:DP ./. 0/0:44 0/0:53 0/0:39 0/0:67 0/0:61 0/0:74 0/0:47 HE670865 368610 . A . 20.62 LowQual AN=16;DP=432;MQ=42.47;MQ0=12 GT:DP 0/0:33 0/0:45 0/0:54 0/0:39 0/0:67 0/0:59 0/0:71 0/0:47 HE670865 368611 . C A 434.85 . AC=13;AF=0.813;AN=16;BaseQRankSum=0.922;DP=420;Dels=0.17;FS=0.883;HaplotypeScore=330.1362;MLEAC=14;MLEAF=0.875;MQ=42.34;MQ0=12;MQRankSum=0.871;QD=1.13;ReadPosRankSum=-2.717 GT:AD:DP:GQ:PL 0/0:17,11:28:3:0,3,23 1/1:27,10:38:3:32,3,0 1/1:23,23:46:6:64,6,0 1/1:12,16:29:3:32,3,0 1/1:22,30:53:9:73,9,0 1/1:18,35:53:6:56,6,0 1/1:26,34:60:18:188,18,0 0/1:23,17:40:16:16,0,61 HE670865 368612 . A . 6.62 LowQual AN=16;DP=409;MQ=42.14;MQ0=12 GT:DP 0/0:27 0/0:35 0/0:36 0/0:26 0/0:49 0/0:47 0/0:55 0/0:31 HE670865 368613 . G A 325 . AC=11;AF=0.917;AN=12;BaseQRankSum=3.200;DP=404;Dels=0.46;FS=1.094;HaplotypeScore=347.5139;MLEAC=11;MLEAF=0.917;MQ=41.95;MQ0=12;MQRankSum=1.300;QD=1.02;ReadPosRankSum=1.782 GT:AD:DP:GQ:PL 1/1:4,16:21:3:22,3,0 ./. 1/1:0,25:26:3:23,3,0 1/1:1,15:21:3:26,3,0 0/1:4,28:36:10:62,0,10 1/1:1,35:36:6:52,6,0 1/1:2,35:39:15:156,15,0 ./. HE670865 368614 . A . 17.30 LowQual AN=12;DP=416;MQ=41.75;MQ0=12 GT:DP ./. 0/0:25 0/0:36 ./. 0/0:47 0/0:41 0/0:50 0/0:25 HE670865 368615 . A . 16.42 LowQual AN=12;DP=411;MQ=41.66;MQ0=11 GT:DP ./. 0/0:25 0/0:35 ./. 0/0:45 0/0:40 0/0:49 0/0:22 HE670865 368616 . A . 17.42 LowQual AN=10;DP=406;MQ=41.71;MQ0=10 GT:DP ./. 0/0:24 ./. ./. 0/0:40 0/0:40 0/0:47 0/0:22 HE670865 368617 . T C 94.41 . AC=4;AF=0.667;AN=6;BaseQRankSum=0.262;DP=385;Dels=0.38;FS=7.489;HaplotypeScore=362.4384;MLEAC=5;MLEAF=0.833;MQ=41.72;MQ0=9;MQRankSum=1.775;QD=0.84;ReadPosRankSum=4.850 GT:AD:DP:GQ:PL ./. 0/0:15,1:20:3:0,3,32 ./. ./. ./. 1/1:5,31:36:9:85,9,0 1/1:14,28:42:3:26,3,0 ./. HE670865 368618 . C . 17.27 LowQual AN=6;DP=370;MQ=41.65;MQ0=8 GT:DP ./. 0/0:10 ./. ./. ./. 0/0:34 0/0:37 ./. HE670865 368619 . G A 35.14 . AC=4;AF=0.500;AN=8;BaseQRankSum=-3.320;DP=365;Dels=0.54;FS=0.000;HaplotypeScore=399.7009;MLEAC=4;MLEAF=0.500;MQ=41.71;MQ0=8;MQRankSum=-1.237;QD=0.35;ReadPosRankSum=-4.879 GT:AD:DP:GQ:PL ./. 1/1:0,4:8:3:28,3,0 ./. ./. 0/0:17,3:22:3:0,3,23 0/0:31,3:34:3:0,3,22 1/1:28,8:36:3:25,3,0 ./. HE670865 368620 . T . 15.82 LowQual AN=2;DP=358;MQ=41.72;MQ0=8 GT:DP ./. 0/0:8 ./. ./. ./. ./. ./. ./. HE670865 368621 . T . . . DP=358;MQ=41.63;MQ0=8 GT ./. ./. ./. ./. ./. ./. ./. ./. HE670865 368622 . T . 14.11 LowQual AN=6;DP=358;MQ=41.53;MQ0=8 GT:DP ./. ./. ./. ./. ./. 0/0:11 0/0:19 0/0:17 HE670865 368623 . T . 14.11 LowQual AN=6;DP=365;MQ=41.46;MQ0=7 GT:DP ./. 0/0:24 ./. 0/0:16 ./. 0/0:20 ./. ./. HE670865 368624 . A . 13.85 LowQual AN=8;DP=366;MQ=41.48;MQ0=7 GT:DP 0/0:22 0/0:26 ./. 0/0:23 ./. 0/0:20 ./. ./. HE670865 368625 . T . 14.12 LowQual AN=6;DP=353;MQ=41.39;MQ0=7 GT:DP 0/0:20 0/0:26 ./. ./. ./. 0/0:18 ./. ./. HE670865 368626 . T . . . DP=348;MQ=41.31;MQ0=7 GT ./. ./. ./. ./. ./. ./. ./. ./. HE670865 368627 . T A 15.18 LowQual AC=2;AF=1.00;AN=2;BaseQRankSum=-1.221;DP=348;Dels=0.55;FS=0.000;HaplotypeScore=403.2395;MLEAC=2;MLEAF=1.00;MQ=41.29;MQ0=7;MQRankSum=-0.971;QD=0.23;ReadPosRankSum=-0.344 GT:AD:DP:GQ:PL ./. ./. ./. ./. 1/1:30,4:34:3:26,3,0 ./. ./. ./. HE670865 368628 . T . 13.85 LowQual AN=8;DP=357;MQ=41.07;MQ0=8 GT:DP ./. 0/0:22 ./. 0/0:10 0/0:33 0/0:15 ./. ./. HE670865 368629 . A . 15.14 LowQual AN=10;DP=357;MQ=40.99;MQ0=8 GT:DP 0/0:14 0/0:22 ./. ./. 0/0:34 0/0:15 0/0:21 ./. HE670865 368630 . T . 18.75 LowQual AN=10;DP=358;MQ=40.89;MQ0=8 GT:DP 0/0:15 0/0:23 ./. 0/0:12 ./. 0/0:16 0/0:21 ./. HE670865 368631 . C . 18.33 LowQual AN=14;DP=359;MQ=40.70;MQ0=8 GT:DP 0/0:15 0/0:23 0/0:14 ./. 0/0:30 0/0:16 0/0:20 0/0:33 HE670865 368632 . G . 17.87 LowQual AN=16;DP=361;MQ=40.63;MQ0=9 GT:DP 0/0:15 0/0:23 0/0:15 0/0:12 0/0:29 0/0:16 0/0:21 0/0:33 HE670865 368633 . T . 18.90 LowQual AN=14;DP=364;MQ=40.51;MQ0=10 GT:DP 0/0:17 0/0:24 0/0:16 ./. 0/0:32 0/0:17 0/0:22 0/0:34 HE670865 368634 . A . 17 LowQual AN=14;DP=369;MQ=40.67;MQ0=10 GT:DP 0/0:18 0/0:24 0/0:18 ./. 0/0:36 0/0:19 0/0:22 0/0:34 HE670865 368635 . T . 17.22 LowQual AN=14;DP=364;MQ=40.66;MQ0=10 GT:DP 0/0:18 0/0:24 0/0:22 ./. 0/0:37 0/0:20 0/0:21 0/0:32 HE670865 368636 . C . 16.03 LowQual AN=14;DP=362;MQ=40.70;MQ0=10 GT:DP 0/0:16 ./. 0/0:15 0/0:14 0/0:36 0/0:20 0/0:22 0/0:29 HE670865 368637 . A . 0.02 LowQual AN=8;DP=356;MQ=40.91;MQ0=11 GT:DP 0/0:26 ./. 0/0:49 0/0:31 0/0:55 ./. ./. ./. HE670865 368638 . T G 24.10 LowQual AC=3;AF=0.250;AN=12;BaseQRankSum=-6.075;DP=355;Dels=0.14;FS=8.612;HaplotypeScore=349.9679;MLEAC=3;MLEAF=0.250;MQ=40.80;MQ0=12;MQRankSum=-2.113;QD=0.24;ReadPosRankSum=1.216 GT:AD:DP:GQ:PL ./. 0/0:22,3:25:6:0,6,44 0/0:46,3:49:6:0,6,44 0/0:29,1:30:6:0,6,49 0/0:46,8:54:3:0,3,22 1/1:40,2:42:3:29,3,0 0/1:47,5:52:17:18,0,17 ./. HE670865 368639 . A . 18.78 LowQual AN=16;DP=352;MQ=40.91;MQ0=13 GT:DP 0/0:25 0/0:26 0/0:48 0/0:30 0/0:52 0/0:42 0/0:50 0/0:26 HE670865 368640 . T . 20.94 LowQual AN=16;DP=351;MQ=40.93;MQ0=13 GT:DP 0/0:25 0/0:26 0/0:48 0/0:30 0/0:51 0/0:42 0/0:50 0/0:26 HE670865 368641 . C . 23.55 LowQual AN=16;DP=347;MQ=40.91;MQ0=13 GT:DP 0/0:24 0/0:25 0/0:49 0/0:26 0/0:45 0/0:39 0/0:47 0/0:18 HE670865 368642 . G T 24 LowQual AC=4;AF=0.250;AN=16;BaseQRankSum=-1.989;DP=349;Dels=0.21;FS=7.313;HaplotypeScore=259.1726;MLEAC=3;MLEAF=0.188;MQ=40.80;MQ0=13;MQRankSum=-2.006;QD=0.14;ReadPosRankSum=1.275 GT:AD:DP:GQ:PL 0/1:21,3:24:11:11,0,88 0/0:23,3:26:18:0,18,152 0/0:47,2:49:30:0,30,252 0/0:25,0:25:30:0,30,256 0/0:38,7:45:18:0,18,164 0/1:37,2:39:27:27,0,177 0/1:43,4:47:2:2,0,144 0/1:13,6:19:13:13,0,79 HE670865 368643 . T . 27.64 LowQual AN=16;DP=345;MQ=40.82;MQ0=13 GT:DP 0/0:24 0/0:26 0/0:49 0/0:26 0/0:44 0/0:38 0/0:46 0/0:20 HE670865 368644 . A . 26.97 LowQual AN=14;DP=343;MQ=40.74;MQ0=13 GT:DP ./. 0/0:22 0/0:47 0/0:26 0/0:36 0/0:37 0/0:42 0/0:17 HE670865 368645 . T . 20.86 LowQual AN=14;DP=340;MQ=40.85;MQ0=12 GT:DP ./. 0/0:21 0/0:47 0/0:26 0/0:36 0/0:36 0/0:41 0/0:17 HE670865 368646 . G C,T 237.20 . AC=5,7;AF=0.417,0.583;AN=12;BaseQRankSum=-1.775;DP=353;Dels=0.25;FS=0.000;HaplotypeScore=196.6140;MLEAC=5,6;MLEAF=0.417,0.500;MQ=40.93;MQ0=12;MQRankSum=-0.594;QD=0.89;ReadPosRankSum=1.392 GT:AD:DP:GQ:PL ./. 2/2:13,14,7:34:3:30,30,30,3,3,0 ./. 1/2:2,13,5:20:15:80,24,15,56,0,53 1/2:10,27,9:46:14:64,20,14,44,0,41 1/2:4,32,2:38:18:61,24,18,37,0,34 2/2:8,30,8:46:9:77,77,77,9,9,0 1/1:18,11,1:30:6:43,6,0,43,6,43 HE670865 368647 . T . 29.30 LowQual AN=16;DP=368;MQ=40.74;MQ0=12 GT:DP 0/0:31 0/0:37 0/0:54 0/0:36 0/0:52 0/0:42 0/0:50 0/0:34 HE670865 368648 . G . 51.60 . AN=16;DP=366;MQ=40.87;MQ0=12 GT:DP 0/0:34 0/0:40 0/0:57 0/0:37 0/0:59 0/0:44 0/0:53 0/0:42

These are the comparable lines from the outputted VCF when using EMIT_ALL_CONFIDENT_SITES:

HE670865 368589 . A . 46.89 . AN=16;DP=441;MQ=43.26;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:58 0/0:38 0/0:75 0/0:64 0/0:64 0/0:47 HE670865 368590 . A . 50.15 . AN=16;DP=447;MQ=43.41;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:60 0/0:39 0/0:76 0/0:66 0/0:65 0/0:47 HE670865 368591 . A . 39.12 . AN=16;DP=448;MQ=43.32;MQ0=12 GT:DP 0/0:36 0/0:45 0/0:61 0/0:39 0/0:77 0/0:66 0/0:66 0/0:46 HE670865 368648 . G . 51.60 . AN=16;DP=366;MQ=40.87;MQ0=12 GT:DP 0/0:34 0/0:40 0/0:57 0/0:37 0/0:59 0/0:44 0/0:53 0/0:42 HE670865 368649 . T . 75.72 . AN=16;DP=368;MQ=40.89;MQ0=12 GT:DP 0/0:34 0/0:41 0/0:57 0/0:38 0/0:59 0/0:44 0/0:53 0/0:42 HE670865 368650 . T . 72.74 . AN=16;DP=369;MQ=40.87;MQ0=12 GT:DP 0/0:34 0/0:41 0/0:57 0/0:38 0/0:59 0/0:44 0/0:53 0/0:43 HE670865 368651 . T . 87.71 . AN=16;DP=369;MQ=40.88;MQ0=12 GT:DP 0/0:33 0/0:42 0/0:58 0/0:37 0/0:59 0/0:44 0/0:53 0/0:43 HE670865 368652 . T . 93.72 . AN=16;DP=373;MQ=40.89;MQ0=12 GT:DP 0/0:32 0/0:43 0/0:59 0/0:37 0/0:59 0/0:45 0/0:54 0/0:43

Any help or insight would be greatly appreciated.

Comments (5)

Hello,

I used bwa to map my samples to a mitochondrial genome of a non-model organism. Afterwards I careated a merged .bam file from multiple (288) sample .bams (used samtools merge and re-assigned RG tags), but when I run UnifiedGenotyper on that file it gets stuck at 32.1% and never moves forward from there. I also wanted to run RealignerTargetCreator, but I always get a truncated realigned.bam file. Any suggestions for how to troubleshoot this?

Thanks you.


Created 2015-07-22 12:08:20 | Updated 2015-07-22 12:09:10 | Tags: unifiedgenotyper haplotypecaller snp-calling input-prior
Comments (3)

I'm using HaplotypeCaller (but it could be also possible to use this option with UnifiedGenotyper) for a very special experimental design in a no-human species, where we have an expectation for the prior probabilities of each genotype. I'm planning to call SNPs for single diploid individuals using HaplotypeCaller and afterwards for the whole dataset with GenotypeGVCFs.

Nevertheless, I'm confused about the structure of the prior probabilities command line. In the documentation, it says: "Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one. f) If user-defined values add to more than one, an error will be produced". So I'll require to provide two prior probabilities out of the 3 for each genotype (0/0, 0/1 and 1/1). My first guess is that the prior that I don't need to provide is for the reference homozygous (0/0) based on the Pr(AC=0) specified in the documentation. I would like to know if this idea is correct.

My second problem if is the two input_prior options are positional parameters. If so, and if my first guess for the Pr(AC=0) is correct, do they represent the probability of 0/1 and 1/1, that is, Pr(AC=1) and Pr(AC=2)?

More concretely, I'm going to provide an example where you don't expect any heterozygous call. In that case, is it correct that the argument will be --input_prior 0.5 --input_prior 0?

Thank you very much.


Created 2015-07-21 07:49:36 | Updated | Tags: unifiedgenotyper pooled-calls minimum-number-of-reads
Comments (4)

Hello,

I am using the UnifiedGenotyper for pooled samples and would like to set the minimum count (fraction) of reads supporting an alternate allele in order to distinguish low frequency alleles from sequencing errors. Is there a way to do it? I couldn't find a means to do that or the default threshold in the tool's documentation . Thank you for your support.


Created 2015-07-16 20:23:04 | Updated | Tags: unifiedgenotyper vcf
Comments (14)

Hello!

I am writing to seek your help on a curious result I received from running the Unified Genotyper. I ran Unified Genotyper to obtain all confident sites (both invariant and variant) on a suite of different bam files. For certain regions of the genome that overlap certain bam file reads, it appears that it did not output invariant sites, only variant sites. I do not know if this is due to differences among bam files, my call to GATK, or is a glitch.

Here is a little more background. * First, I am using Unified Genotyper to maintain continuity with some analyses I did over a year go. * My data: - I have 92 low coverage bam files from samples produced from illumina sequencing a RADtag type enzyme digestion library. The bams contain single-end reads, each 44-bp long. I will refer to these reads as ‘RADtags’. - I also have 18 high coverage bams produced from illumina sequencing whole genome paired-end libraries, with read lengths varying from 36 to 100. I will refer to these as WGS samples. - Previously I called SNPs using Unified Genotyper with mode set to EMIT_VARIANTS_ONLY, on all this data at once, to increase the chances of identifying quality SNPs in the low coverage RADtag data and to get genotypes from the high coverage WGS data at the same sites. The data seemed fine.

  • Current issue:
  • I ran the Unified Genotyper with the mode set to EMIT_ALL_CONFIDENT_SITES for all the same data. The vcf includes both invariant and variant calls. However, it seems I am not getting any invariant sites from regions that overlap the RADtag reads. It seems the program is not emitting invariant sites within RADtags, but is reporting invariant sites outside of RADtags and only for WGS samples.

I have attached a subset of my vcf file, and a snapshot image from viewing these samples in igv. In the igv view, the middle samples are the RADtag samples, the higher coverage samples at the top and bottom are the WGS samples. As you can see, there should be plenty of invariant sites in the regions overlapping the RADTAGs and there should be calls that include these data regions and samples.

Can you think of a reason that could be causing this? I have pasted my call to GATK below.

Thank you! Amanda

java -jar /usr/local/gatk/3.0-0/GenomeAnalysisTK.jar \ -R mg.new.ref.fa \ -T UnifiedGenotyper \ -I AHQT1G_q29.paired.nodup.realign.bam -I BOG10G_q29.paired.nodup.realign.bam -I CAC6G_q29.paired.nodup.realign.bam \ -I CAC9N_q29.paired.nodup.realign.bam -I DUNG_q29.paired.nodup.realign.bam -I IM62.JGI_q29.paired.nodup.realign.bam \ -I KOOTN_q29.paired.nodup.realign.bam -I LMC24G_q29.paired.nodup.realign.bam -I MAR3G_q29.paired.nodup.realign.bam \ -I MED84G_q29.paired.nodup.realign.bam -I MEN104_q29.paired.nodup.realign.bam -I NHN26_q29.paired.nodup.realign.bam \ -I PED5G_q29.paired.nodup.realign.bam -I REM8G_q29.paired.nodup.realign.bam -I SF5N_q29.paired.nodup.realign.bam \ -I SLP9G_q29.paired.nodup.realign.bam -I SWBG_q29.paired.nodup.realign.bam -I YJS6G_q29.paired.nodup.realign.bam \ -I CACid.277_index2.GCTCGG.sortedrealign.bam -I CACid.279b_index2.CTGATT.sortedrealign.bam -I CACid.247_index2.AATACT.sortedrealign.bam -I CACid.237_index2.CGCTGT.sortedrealign.bam \ -I CACid.240b_index2.GTCTCT.sortedrealign.bam -I CACid.272_index2.ATCTCC.sortedrealign.bam -I CACid.282b_index2.TCTGCT.sortedrealign.bam -I CACid.179_index2.TATAGT.sortedrealign.bam \ -I CACid.187_index2.CAGCAT.sortedrealign.bam -I CACid.189_index2.TAATCC.sortedrealign.bam -I CACid.192_index2.GCAGTT.sortedrealign.bam -I CACid.339_index2.CCTGAA.sortedrealign.bam \ -I CACid.235_index2.AAGCGA.sortedrealign.bam -I CACid.236_index2.AACTTA.sortedrealign.bam -I CACid.249_index2.GTTCCT.sortedrealign.bam -I CACid.370_index2.CTAATA.sortedrealign.bam \ -I CACid.372_index2.CCGAAT.sortedrealign.bam -I CACid.215_index2.CTTGTA.sortedrealign.bam -I CACid.218_index2.CCCTCG.sortedrealign.bam -I CACid.219_index2.ACTGAC.sortedrealign.bam \ -I CACid.221_index2.GTGACT.sortedrealign.bam -I CACid.225_index2.ACTAGC.sortedrealign.bam -I CACid.226_index2.CGACTA.sortedrealign.bam -I CACid.231_index2.GTGAGA.sortedrealign.bam \ -I CACid.232_index2.AATCTA.sortedrealign.bam -I CACid.233_index2.CAAGCT.sortedrealign.bam -I CACid.212_index2.CTCTCA.sortedrealign.bam -I CACid.211_index2.ACTCCT.sortedrealign.bam \ -I CACid.403_index2.GATCAA.sortedrealign.bam -I CACid.280_index2.GGCTTA.sortedrealign.bam -I CACid.283_index2.GTGGAA.sortedrealign.bam -I CACid.285_index2.AAATGA.sortedrealign.bam \ -I CACid.003_index2.GGGTAA.sortedrealign.bam -I CACid.007_index2.CGTCAA.sortedrealign.bam -I CACid.008_index2.GTTGCA.sortedrealign.bam -I CACid.055_index2.ATATAC.sortedrealign.bam \ -I CACid.056_index2.TTAGTA.sortedrealign.bam -I CACid.071_index2.TCGCTT.sortedrealign.bam -I CACid.313_index2.TCGTCT.sortedrealign.bam -I CACid.040_index2.CAATAT.sortedrealign.bam \ -I CACid.292_index2.TCATGG.sortedrealign.bam -I CACid.080_index2.CAGGAA.sortedrealign.bam -I CACid.088_index2.ATCGAC.sortedrealign.bam -I CACid.322_index2.TTGCGA.sortedrealign.bam \ -I CACid.114_index1.GCTCGG.sortedrealign.bam -I CACid.279a_index1.CTGATT.sortedrealign.bam -I CACid.181_index1.AATACT.sortedrealign.bam -I CACid.184_index1.CGCTGT.sortedrealign.bam \ -I CACid.185_index1.GTCTCT.sortedrealign.bam -I CACid.191_index1.ATCTCC.sortedrealign.bam -I CACid.194_index1.TCTGCT.sortedrealign.bam -I CACid.240a_index1.TATAGT.sortedrealign.bam \ -I CACid.238_index1.CAGCAT.sortedrealign.bam -I CACid.371_index1.TAATCC.sortedrealign.bam -I CACid.253_index1.GCAGTT.sortedrealign.bam -I CACid.260_index1.CCTGAA.sortedrealign.bam \ -I CACid.262_index1.AAGCGA.sortedrealign.bam -I CACid.257_index1.AACTTA.sortedrealign.bam -I CACid.258_index1.GTTCCT.sortedrealign.bam -I CACid.216_index1.CTAATA.sortedrealign.bam \ -I CACid.220_index1.CCGAAT.sortedrealign.bam -I CACid.223_index1.CTTGTA.sortedrealign.bam -I CACid.224_index1.CCCTCG.sortedrealign.bam -I CACid.228_index1.ACTGAC.sortedrealign.bam \ -I CACid.229_index1.GTGACT.sortedrealign.bam -I CACid.281_index1.ACTAGC.sortedrealign.bam -I CACid.282a_index1.CGACTA.sortedrealign.bam -I CACid.284_index1.GTGAGA.sortedrealign.bam \ -I CACid.286_index1.AATCTA.sortedrealign.bam -I CACid.287_index1.CAAGCT.sortedrealign.bam -I CACid.288_index1.CTCTCA.sortedrealign.bam -I CACid.072_index1.ACTCCT.sortedrealign.bam \ -I CACid.077_index1.GATCAA.sortedrealign.bam -I CACid.078_index1.GGCTTA.sortedrealign.bam -I CACid.084_index1.GTGGAA.sortedrealign.bam -I CACid.087a_index1.AAATGA.sortedrealign.bam \ -I CACid.087b_index1.GGGTAA.sortedrealign.bam -I CACid.091_index1.CGTCAA.sortedrealign.bam -I CACid.092_index1.GTTGCA.sortedrealign.bam -I CACid.103_index1.CTCACT.sortedrealign.bam \ -I CACid.095_index1.TCCATA.sortedrealign.bam -I CACid.100_index1.ACTCGA.sortedrealign.bam -I CACid.101_index1.GTTCGA.sortedrealign.bam -I CACid.106_index1.ATATAC.sortedrealign.bam \ -I CACid.108_index1.TTAGTA.sortedrealign.bam -I CACid.110_index1.TCGCTT.sortedrealign.bam -I CACid.112_index1.TCGTCT.sortedrealign.bam -I CACid.144_index1.CAATAT.sortedrealign.bam \ -I CACid.149_index1.TCATGG.sortedrealign.bam -I CACid.321_index1.CAGGAA.sortedrealign.bam -I CACid.323_index1.ATCGAC.sortedrealign.bam -I CACid.324_index1.TTGCGA.sortedrealign.bam \ -o CACid.MIM.sites.q40.6.14.15.vcf \ --metrics_file CACid.MIM.sites.q40.6.14.15.metrics.txt \ --genotype_likelihoods_model SNP \ --output_mode EMIT_ALL_CONFIDENT_SITES \ -stand_call_conf 40 \ -stand_emit_conf 40 \ -l INFO


Created 2015-05-11 13:57:37 | Updated | Tags: unifiedgenotyper depthofcoverage haplotypecaller mendelianviolations
Comments (1)

Hi,

Two questions, which relate to Unified Genotyper or Haplocaller:

  1. Detecting and determining the exact genotype of an individual seems to be inaccurate at read depths less than 10X. This is because there aren't enough reads to accurately say whether it could be heterozygous. Is there an option in either Unified Genotyper or Haplocaller, that excludes sites in individuals that have below 10X? Alternatively how can these sites be removed from the vcf - or set to missing ?

  2. I'm sure I've read somewhere that pedigree information can be supplied to a variant caller (such as Unified Genotyper or Haplocaller), and used to improve the calling accuracy/speed/efficiency. I am calling variants on one parent and multiple offspring.

Apologies if these questions are answered in the user manual. I regularly look it but have not found much to answer my questions.

Cheers,

Blue


Created 2015-05-10 19:31:44 | Updated | Tags: unifiedgenotyper duplicatereadfilter
Comments (2)

Hi All,

My understanding of DuplicateReadFilter is that is removes PCR duplicates (ie: identical reads that map to the same location in the genome). Is there a way to prevent UnifiedGenotyper from using this filter?

Many of my 'duplicate' reads are not really duplicates. I have 50bp single ended yeast RNA-seq data, and 10 samples/lane results in highly over-sampled data. Removing duplicates results in an undercounting of reads at the most highly expressed genes, and therefore an increased number of sub-clonal variants at these genes (because the reads from the major clonal population are discarded, but reads of sub-clonal populations are kept because they have a different sequence). At least I think that is what is happening.

-Lucas


Created 2015-05-06 14:48:11 | Updated | Tags: unifiedgenotyper strand-bias
Comments (8)

Hi there, I have been struggling with the interpretation of the SOR annotation. I do get that a higher value is the sign of a strand bias and that this value is never negative.

I did read the SOR annotation documention but still cannot figure out how you calculate the SOR.

Here is one of my example where a SB is present for sure REF + strand = 2185 reads REF - strand = 5 reads ALT + strand = 7 reads ALT - strand = 2370 reads.

When I calculate "R" as indicated in the documentation, I obtain a very high value of 147 955. But for this variant in the VCF file, SOR = 11.382

How is the 11.382 obtained ? Could you please help me in understanding the computation ?

Thanks Manon


Created 2015-05-06 09:14:36 | Updated | Tags: unifiedgenotyper low-coverage
Comments (3)

Is it possible to call variants using unifiedgenotyper from data with less than 1 X coverage per sample? We have a total of 25 samples and each has a coverage of < 1 X. does changing dcov parameter to 50 or lower work?


Created 2015-04-30 12:15:29 | Updated | Tags: unifiedgenotyper multi-sample
Comments (2)

Dear GATK,

I've got an issue using the UnifiedGenotyper, it does not output all possible variants sharing the same start coordinate.

My command line: GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ReferenceGenome.fasta -I list_of_Samples.list -ploidy 1 -glm BOTH --heterozygosity 0.001 --indel_heterozygosity 0.005 -gt_mode DISCOVERY -dcov 1000 -o output_raw_variants.vcf

Using "Batch 1" , about 50 Plasmodium falciparum samples, I identified the following insertion in Sample1, which I know is real :

chrom14 2261798 . T TTA 1576.91 1:2,42:59:99:1:1.00:1609,0

However when I run the same command line with Batch1 + Batch2, total of 100 samples, I get the following result for Sample1:

chrom14 2261798 . T TTATATA 23812.75 0:39,5:59:99:0:0.00:0,775

Some samples from Batch2 have a longer insertion starting at the same coordinate, and now the original insertion in Sample1 does not appear in the VCF anymore... Why UnifiedGenotyper did not output multiple lines in this example? (for some other positions it did output multiple lines sharing the same coordinate) I did not change the parameter --max_alternate_alleles 6.

Thanks a lot in advance for your help,

Antoine


Created 2015-04-30 06:34:42 | Updated 2015-05-01 15:02:14 | Tags: unifiedgenotyper ploidy queue
Comments (7)

Hi, I have several questions about the function "UnifiedGenotyper" and its corresponding Qscript.

  1. I first test this function on a single chromosome of my chicken sample (bam file is about 800M) to call SNPs, and set the ploidy value to 30. It took me half a day to finish. Is it normal that it took so long for a single chromosome?

As a result of this, I tried to use the scatter-gather to parallel my job in order to get the full use of my multi-node server. I built my own script based on modifying the ExampleUnifiedGenotyper.scala. I defined "ploidy" as an argument in my Qscript, then I assign a value of 30 to ploidy in my bash command line.

java -Djava.io.tmpdir=$tmp_dir \
    -jar /sw/apps/bioinfo/GATK-Queue/3.2.2/Queue.jar \
    -S /path/to/ug.scala \
    -R $ref_file \
    -I /path/to/bamfile.bam \
    -o /path/to/my.vcf  \
    -ploidy 30 \
    -run

2.The script ran successfully, but the output VCF file was based sample_ploidy=2. It seems that it does not respond to the argument "ploidy". Then if I add one command line in the def script() part, which is" genotyper.sample_ploidy = qscript.sample_ploidy ". GATK will give warning of errors "sample_ploidy is not a member of TestUnifiedGenotyper.this.UnifiedGenotyperArguments". Could you please help point out where I did wrong?

Also, I have some questions about how to assign values to argument such as " this.memoryLimit" and "genotyper.scatterCount".

  1. Based on the explanation given in the example scala file, argument "this.memoryLimit" sets the memory limit for each command. So in my case, I have only one command in my script, so that I can assign as much memory as I have to it? I am running on a server that has multiple-node, each node with 16 cores, and each core has 8G RAM. So in my understanding, if I run this job on one node with 16 cores, can I assign "this.memoryLimit=128" (16*8)? Do I understand correctly?

  2. about the scatterCount, can I assign it with the number of cores I have in the server? That is, based on the example I listed above, 16.

Thank you very much for your help and time! Attached is my Qscript in relating with my question.

My Qscript is as follows:

package org.broadinstitute.gatk.queue.qscripts.examples

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class TestUnifiedGenotyper extends QScript {
      qscript =>
     @Input(doc="The reference file for the bam files.", shortName="R")
      var referenceFile: File = _ 

      @Input(doc="Bam file to genotype.", shortName="o")
     var outputFile: File = _

      @Input(doc="Bam file to genotype.", shortName="I")
      var bamFile: File = _

      @Input(doc="An optional file with dbSNP information.", shortName="D", required=false)
       var dbsnp: File = _

      @Argument(doc="Ploidy (number of chromosomes) per sample.", shortName="ploidy")
      var sample_ploidy: List[String] = Nil 

      trait UnifiedGenotyperArguments extends CommandLineGATK {
           this.reference_sequence = qscript.referenceFile
           this.memoryLimit = 2
      }

  def script() {
    val genotyper = new UnifiedGenotyper with UnifiedGenotyperArguments

    genotyper.scatterCount = 3
    genotyper.input_file :+= qscript.bamFile
   // genotyper.sample_ploidy = qscript.sample_ploidy
    genotyper.out = swapExt(outputFile, qscript.bamFile, "bam", "unfiltered.vcf")

    add(genotyper)
  }
}

Created 2015-04-28 08:17:21 | Updated | Tags: unifiedgenotyper haplotypecaller coverage bias
Comments (4)

Hi,

I am doing joint variant calling for Illumina paired end data of 150 monkeys. Coverage varies from 3-30 X with most individuals having around 4X coverage.

I was doing all the variant detection and hard-filtering (GATK Best Practices) process with both UnifiedGenotyper and Haplotype caller.

My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?

In particular, consider pairwise differences across individuals: The absolute values are lower for low coverage individuals than for high coverage, for both methods, since it is more difficult to make calls for them. However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals by substracting from the total reference length all the filtered sites and sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (If both pairs consist of members of similar populations).

However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.

Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would trade in more false positive if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity) which is the average cross individuals diversity and thus already at the higher-end for within individual heterozygosity. I would expect the problem to be even worse if I chose lower values.

Can you give me any recommendation, should I go back using UnifiedGenotyper or is there any way to solve this problem?

Many thanks in advance, Hannes


Created 2015-04-20 10:57:05 | Updated 2015-04-20 11:03:28 | Tags: unifiedgenotyper combinevariants haplotypecaller genotype-given-alleles
Comments (4)

Question: Is it possible to have CV merge like bcftools does it?

I get this warning, when running UG in GGA mode using an -alleles vcf generated with CV:

WARN  10:17:21,394 GenotypingGivenAllelesUtils - Multiple valid VCF records detected in the alleles input file at site 20:106089, only considering the first record 

I made this call with HC from 10 samples:

20  106089  .   CA  C

And this call with UG from 10 other samples:

20  106089  .   C   A

CV merges like this:

20  106089  .   C   A
20  106089  .   CA  C

bcftools merges like this:

20  106089  .   CA  AA,C

The UG recall from the CV generated -alleles vcf is incomplete:

20  106089  .   C   A

The UG recall from the bcftools generated -alleles vcf is complete:

20  106089  .   CA  AA,C

Is it possible to have CV merge like bcftools does it?

In another thread @Geraldine_VdAuwera said:

I'm really not sure. It's not a use case that UG was designed for (with UG we kept SNPs and indels separate until post-analysis), so I would recommend being cautious with it.

I checked the genotypes and UG seems to handle merged MNPs and indels just fine; see below. But I will do some additional testing. Or I might just take the safe path and do the recalling separately for SNPs and indels as suggested. The reason I have UG and HC calls in the first place is because I have low and high coverage data for different cohorts. I want to create a merged dataset.

Despite --interval_padding 100 helping to recall more sites with HC in GGA mode as per previous recommendation, some sites still fail to be called with HC in GGA mode. Hence I opted for UG.

UG calls on samples 1-10:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578
20  106089  .   C   A   16.19   .   AC=2;AF=0.125;AN=16;BaseQRankSum=-0.854;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=1.5282;MLEAC=2;MLEAF=0.125;MQ=58.74;MQ0=0;MQRankSum=-0.560;QD=2.70;ReadPosRankSum=-1.797;SOR=0.935;VariantType=SNP  GT:AD:DP:GQ:PL  0/0:3,0:3:6:0,6,76  0/0:4,2:6:9:0,9,115 0/1:3,1:4:24:24,0,80    0/0:6,0:6:12:0,12,130   0/1:1,1:2:29:30,0,29    ./. 0/0:7,0:7:15:0,15,188   0/0:3,1:4:6:0,6,74  ./. 0/0:5,0:5:12:0,12,142

HC calls on samples 11-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  585 590 622 625 628 640 655 668 687 693

20  106089  .   CA  C   47.95   .   AC=5;AF=0.250;AN=20;BaseQRankSum=0.925;DP=36;FS=1.850;InbreedingCoeff=0.0646;MLEAC=5;MLEAF=0.250;MQ=59.48;MQ0=0;MQRankSum=0.175;QD=3.00;ReadPosRankSum=-1.725;SOR=0.387 GT:AD:GQ:PL 0/0:2,0:6:0,6,49    0/0:2,0:6:0,6,49    0/0:3,0:12:0,12,130 0/0:5,0:15:0,15,122 0/0:2,0:6:0,6,46    0/1:2,1:14:14,0,39  0/1:2,1:15:15,0,38  0/0:4,0:12:0,12,93  0/1:3,1:12:12,0,46  1/1:0,3:9:67,9,0

UG GGA recalls on samples 1-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578 585 590 622 625 628 640 655 668 687 693
20  106089  .   CA  AA,C    110.56  .   AC=0,8;AF=0.00,0.222;AN=36;DP=81;FS=0.000;InbreedingCoeff=0.5076;MLEAC=0,6;MLEAF=0.00,0.167;MQ=58.56;MQ0=0;QD=3.45;SOR=0.859;VariantType=MULTIALLELIC_MIXED GT:AD:DP:GQ:PL:SB   0/0:0,0,0:3:0:0,0,0,6,6,52:0,0,0,0  0/2:0,0,1:6:0:5,5,5,0,0,109:0,0,1,0 0/2:0,0,1:4:0:12,12,12,0,0,47:0,0,1,0   0/0:0,0,0:6:0:0,0,0,17,17,123:0,0,0,0   0/0:0,0,0:2:0:0,0,0,3,3,10:0,0,0,0  ./. 0/0:0,0,0:7:0:0,0,0,9,9,60:0,0,0,0  0/2:0,0,1:4:0:12,12,12,0,0,61:0,0,0,1   ./. 0/0:0,0,1:5:0:0,0,0,4,4,30:0,0,0,1  0/0:0,0,0:3:0:0,0,0,6,6,49:0,0,0,0  0/0:0,0,0:3:0:0,0,0,9,9,76:0,0,0,0  0/0:0,0,1:4:0:0,0,0,1,1,22:0,0,1,0  0/0:0,0,0:7:0:0,0,0,18,18,149:0,0,0,0   0/0:0,0,0:4:0:0,0,0,11,11,76:0,0,0,0    0/2:0,0,1:5:0:9,9,9,0,0,65:0,0,0,1  0/2:0,0,1:4:0:12,12,12,0,0,60:0,0,0,1   0/0:0,0,0:5:0:0,0,0,15,15,116:0,0,0,0   0/2:0,0,1:6:0:12,12,12,0,0,47:0,0,0,1   2/2:0,0,3:3:9:67,67,67,9,9,0:0,0,3,0

This thread is related to the following threads on GGA:

http://gatkforums.broadinstitute.org/discussion/5249/overcalling-deletion-in-unifiedgenotyper-genotype-given-alleles-mode
http://gatkforums.broadinstitute.org/discussion/5018/ug-call-combined-snp-indel-sites-in-gga-mode
http://gatkforums.broadinstitute.org/discussion/4936/not-all-sites-emitted-with-genotype-given-alleles
http://gatkforums.broadinstitute.org/discussion/4024/genotype-and-validate-or-haplotype-caller-gga-what-am-i-doing-wrong

P.S. I might gate crash your Cambridge party this week despite not being invited :D The course was already fully booked, when you announced it. I don't have a time machine!


Created 2015-03-17 10:47:42 | Updated | Tags: unifiedgenotyper non-human
Comments (2)

Hi,

I've effectively got a sample of 200 individuals, that share one father but each have a different mother. I'm wondering how to best change the default parameters of Unified Genotyper to call variants without bias caused by assumptions of the algorithm.

Cheers,

Blue


Created 2015-03-12 16:15:59 | Updated | Tags: unifiedgenotyper snp error indel
Comments (3)

Hello,

when I run the UnifiedGenotyper to call INDELs, I get the following error (detailed command line output below) HAPLOTYPE_MAX_LENGTH must be > 0 but got 0

When I call only SNPs instead, it doe not occur. I have searched to find an answer to why this is happening, but cannot figure out the reason. Could this be a bug? I get the error no matter if I run the Realigner before or not.

Did you observe this problem before?

Command line output: java -Xmx10g -jar ~/work/tools/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10

INFO 15:50:47,987 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:47,990 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 15:50:47,990 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:50:47,990 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:50:47,996 HelpFormatter - Program Args: -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10 INFO 15:50:48,002 HelpFormatter - Executing as rebecca@rebecca-ThinkPad-T440s on Linux 3.13.0-44-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_65-b32. INFO 15:50:48,002 HelpFormatter - Date/Time: 2015/03/12 15:50:47 INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,132 GenomeAnalysisEngine - Strictness is SILENT INFO 15:50:48,402 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 15:50:48,409 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 15:50:48,447 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 INFO 15:50:48,594 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 15:50:48,622 GenomeAnalysisEngine - Done preparing for traversal INFO 15:50:48,622 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 15:50:48,622 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 15:50:48,623 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:51:06,585 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: HAPLOTYPE_MAX_LENGTH must be > 0 but got 0 at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:97) at org.broadinstitute.gatk.utils.pairhmm.N2MemoryPairHMM.initialize(N2MemoryPairHMM.java:60) at org.broadinstitute.gatk.utils.pairhmm.LoglessPairHMM.initialize(LoglessPairHMM.java:66) at org.broadinstitute.gatk.utils.pairhmm.PairHMM.computeLikelihoods(PairHMM.java:194) at org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:461) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:124) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:270) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoods(UnifiedGenotypingEngine.java:317) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotypingEngine.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:379) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:151) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: HAPLOTYPE_MAX_LENGTH must be > 0 but got 0
ERROR ------------------------------------------------------------------------------------------

Created 2015-03-04 17:41:22 | Updated | Tags: unifiedgenotyper commandlinegatk
Comments (2)

I'm encountering a problem similar to what I experienced with a mismatch between reference and cosmic files. Specifically, I created .bam files using human_g1k_v3.fasta, which reference coordinates as 1...22,X,Y etc. When I try to run UnifiedGenotyper with this reference file, the .bam file I created, and the dbsnp_137.b37.vcf, I get an error on account of the fact that snp coordinates are listed as chr1...chr22,chrX,chrY etc. Other than writing a script to remove all occurrences of "chr" is there another way to get around this problem, i.e. a dbsnp reference file that has the desired coordinates without the "chr"?

I resolved the problem with reference/cosmic by finding a cosmic file with consistent notation, but can't find a similar fix for this one. I'd appreciate any suggestions.


Created 2015-03-03 22:52:42 | Updated 2015-03-03 22:54:44 | Tags: unifiedgenotyper
Comments (1)

Hello GATK Team,

I've just run the following using GATK v3.1.

gatk -R ref.fa -T UnifiedGenotyper -I Q11_Final.sorted.rg.bam -I Q11D1.sorted.bam -I Q11D4.sorted.bam -I Q11D2.sorted.bam -I Q11D5.sorted.bam -o PX.raw.vcf -nt 5 -glm SNP -ploidy 2

I then trimmed SNPs within 10bp of indels (called using the same but -glm indel).

The samples therein represent a mother (Sample Q11) and her 4 haploid sons.

The problem I'm getting is that calls that should be heterozygotic in the mother are not being called as such (see attached image; .bam file of the mother is int he bottom and the resulting VCF file is loaded in the top panel. The green cmd screen shows the VCF file line for the site in question where the 14th element of the indicates the queen's results (1/1:68,86:154:99:1837,120,0)

I'm not sure what's going on here and was wondering if you could help out?


Created 2015-02-24 19:52:54 | Updated | Tags: unifiedgenotyper snp indels
Comments (1)

I am working with plants, and so cannot follow most of your human/cancer specific workflows etc. Also my question is probably not GATK specifc, but I would be very grateful for any kind of answer. I have used BWA to align a set of reads and the contigs assembled from these reads (de novo). I have used samtools mpileup and GATK unifiedGenotyper on the aligned reads. This results for both in a list of ONLY SNPs, the same ones (aside of sensitivity). I have also used the samtools steps on the aligned contigs, resulting in a list of ONLY indels. I can clearly see both SNPs and indels in IGV, both looking at the reads or the contigs. generally I see that reads are being used for variant calling, so is it "wrong" to use contigs? and why do I not get ANY indels in either tool's output when using reads? any ideas?


Created 2015-02-19 22:47:05 | Updated | Tags: unifiedgenotyper ploidy vcf
Comments (1)

Hi, I'm running on the Unified Genotyper (Version=3.3-0-g37228af) on a pooled sample. The ploidy is set to 32. I'm trying to get allele frequency information. I'm trying to filter sites based on depth of coverage but the DP flag is not consistent within a loci

groupIV 756 . C T 141.24 . AC=11;AF=0.344;AN=32;BaseQRankSum=-2.927;DP=37;Dels=0.65;FS=0.000;HaplotypeScore=166.3715;MLEAC=11;MLEAF=0.344;MQ=86.28;MQ0=0;MQRankSum=-0.696;QD=3.82;ReadPosRankSum=-2.927;SOR=1.022 GT:AD:DP:GQ:PL 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1/1/1/1:8,5:13:0:155,39,25,17,12,9,6,4,2,1,1,0,0,0,0,1,2,2,4,5,7,9,11,14,17,21,25,31,38,47,60,2147483647,2147483647

The DP at the start says there is a depth of 37, but at the end it says its 13. Judging from neighbouring sites, the actual depth seems to be 37, but I'm confused why it seems to only be using 13 of them. At other sites the scores match.


Created 2015-02-10 05:49:09 | Updated | Tags: unifiedgenotyper vcf
Comments (5)

Hi, How do I know based on the REF and ALT column of a VCF file the actual position where an indel event happened? I usually see an event that occur in the 2nd base. For example, REF ALT GGCGTGGCGT G,GCGCGTGGCGT --deletion; insertion of "C" ATTT A,ATT --deletions

But I saw a VCF format documentation(http://samtools.github.io/hts-specs/VCFv4.2.pdf) allowing a different case: GTC G,GTCT --deletion; insertion of "T" at the end

How is UnifiedGenotyper formatting the indels? Does it differ from the one used in the 1000Genome Project? Thank you!

Roven


Created 2015-02-04 04:17:31 | Updated 2015-02-04 04:18:06 | Tags: unifiedgenotyper vcf
Comments (5)

Hi

I'm wondering why some positions in VCF overlap. Can GATK skip emitting positions that are already part of an indel/concatenated ref? We need to use EMIT_ALL_SITES but it's quite confusing if a position is represented multiple times especially if indel is present. I understand that there is always a duplicated position preceding an indel, but the positions after the indel should ideally be not overlapping the neighboring positions. Please see some examples below:

chr02 28792823 . C . 34.23 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:12 chr02 28792823 . CAA . 10000037.27 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:AD:DP 0/0:12:12 chr02 28792824 . A . . . DP=12;MQ=57.31;MQ0=0 GT ./. chr02 28792825 . A . 31.22 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:7

chr02 314879 . T . 100.23 . AN=2;DP=27;MQ=55.82;MQ0=0 GT:DP 0/0:27 chr02 314879 . TAGAG T,TAG 647.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=27;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=55.82;MQ0=0;QD=23.97;RPA=8,6,7;RU=AG;STR GT:AD:DP:GQ:PL 1/2:0,9,11:26:99:989,445,424,395,0,335 chr02 314880 . A . 43.23 . AN=2;DP=26;MQ=55.66;MQ0=0 GT:DP 0/0:5 chr02 314881 . G . 40.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:4 chr02 314882 . A . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16 chr02 314883 . G . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16

This next example even calls a SNP for position 10221400 even if deletion occurs in 10221399.

chr02 10221399 . C . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17 chr02 10221399 . CA CCTA,C 143.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=8.42 GT:AD:DP:GQ:PL 1/2:0,8,6:17:99:527,131,112,185,0,143 chr02 10221400 . A C,T 287.29 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;Dels=0.29;FS=0.000;HaplotypeScore=6.7569;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=16.90 GT:AD:DP:GQ:PL 1/2:0,4,8:12:88:407,294,285,112,0,88 chr02 10221401 . A C 163.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.854;DP=17;Dels=0.00;FS=1.913;HaplotypeScore=9.7562;MLEAC=1;MLEAF=0.500;MQ=58.74;MQ0=0;MQRankSum=1.558;QD=9.63;ReadPosRankSum=2.764 GT:AD:DP:GQ:PL 0/1:11,6:17:99:192,0,392 chr02 10221402 . A . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17

Thanks in advance for your help. rfuentes


Created 2015-02-02 21:19:12 | Updated | Tags: unifiedgenotyper snp-calling input-prior
Comments (2)

I'm a bit confused about the instructions for --input_prior in UnifiedGenotyper but they seem quite useful. Is there any chance you could clarify or simplify ?

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php#--input_prior --input_prior / -inputPrior Input prior for calls By default, the prior specified with the argument --heterozygosity/-hets is used for variant discovery at a particular locus, using an infinite sites model, see e.g. Waterson (1975) or Tajima (1996). This model asserts that the probability of having a population of k variant sites in N chromosomes is proportional to theta/k, for 1=1:N There are instances where using this prior might not be desireable, e.g. for population studies where prior might not be appropriate, as for example when the ancestral status of the reference allele is not known. By using this argument, user can manually specify priors to be used for calling as a vector for doubles, with the following restriciotns: a) User must specify 2N values, where N is the number of samples. b) Only diploid calls supported. c) Probability values are specified in double format, in linear space. d) No negative values allowed. e) Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one. f) If user-defined values add to more than one, an error will be produced. If user wants completely flat priors, then user should specify the same value (=1/(2N+1)) 2N times,e.g. -inputPrior 0.33 -inputPrior 0.33 for the single-sample diploid case.

List[Double]


Created 2015-01-13 01:21:15 | Updated | Tags: unifiedgenotyper indels snps genotype-given-alleles
Comments (1)

I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools:

bcftools norm -m +any

I remember from another thread that UG treats SNPs and InDels separately:

http://gatkforums.broadinstitute.org/discussion/3375/combine-snp-and-indel-vcf

But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA?

I only had a minor issue despite tabix indexing my bgzipped known alleles file:

##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz

Feel free to delete the original post of this question.


Created 2015-01-12 06:03:20 | Updated | Tags: unifiedgenotyper haplotypecaller downsample
Comments (1)

Does UG and HC downsample before or after exclusion of marked duplicates? I guess code is king for the answer...


Created 2015-01-10 08:13:41 | Updated | Tags: unifiedgenotyper variantrecalibrator vqsr haplotypescore annotation
Comments (2)

The documentation on the HaplotypeScore annotation reads:

HaplotypeCaller does not output this annotation because it already evaluates haplotype segregation internally. This annotation is only informative (and available) for variants called by Unified Genotyper.

The annotation used to be part of the best practices:

http://gatkforums.broadinstitute.org/discussion/15/best-practice-variant-detection-with-the-gatk-v1-x-retired

I will include it in the VQSR model for UG calls from low coverage data. Is this an unwise decision? I guess this is for myself to evaluate. I thought I would ask, in case I have missed something obvious.


Created 2014-12-11 00:55:58 | Updated 2014-12-11 01:47:11 | Tags: unifiedgenotyper ug allsitespl
Comments (3)

Hi

I've been trying to use the --allSitePLs option with Unified Genotyper with GaTK v3.2-2-gec30cee.

My command is:

java -Xmx${MEM} -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} -T UnifiedGenotyper \
    --min_base_quality_score 30 \
    -I ${in_dir}/"34E_"${REGION}"_ddkbt_RS74408_recalibrated_3.bam" \
    -I ${in_dir}/"34E_"${REGION}"_ddber_RS86405_recalibrated_3.bam" \
    --intervals $sites_dir/$file \
    -o ${out_dir}/"38F_"${REGION}"_UG_allPL.vcf" \
    --output_mode EMIT_ALL_SITES \
   --allSitePLs 

I ran the same command with and without the --allSitePLs option and compared, and the output seems strange.

Specifically, with --allSitePLs - the FILTER column: 5053700 sites = lowqual, 19303 = . and 5059568 had QUAL < 30. By contrast WITHOUT the --allSitePLs 5067411 = lowqual, 5592 = . and 11460 had QUAL < 30. I don't understand why does adding this option changes the QUAL so much when I'm running UG on the exact same data but just requesting that I get a PL for all sites.

Lastly, how is the specific ALT allele selected? Is it random - because they all seem equally unlikely in the example below, where only one allele was seen in both my samples.

## 2 lines from the output
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ddber_RS86405   ddkbt_RS74408
chr01   1753201 .   C   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:655,51,0,655,51,655,655,51,655,655:17:51:0,51,655  0/0:26,0:1010,78,0,1010,78,1010,1010,78,1010,1010:26:78:0,78,1010
chr01   1753202 .   T   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:630,630,630,630,630,630,48,48,48,0:17:48:0,48,630  0/0:26,0:1027,1027,1027,1027,1027,1027,78,78,78,0:26:78:0,78,1027

Created 2014-12-10 08:57:09 | Updated | Tags: unifiedgenotyper combinevariants haplotypecaller combinegvcfs
Comments (6)

Hi GATK team, i would like to seek opinion from your team to find the best workflow that best fit my data. Previously i've been exploring both variant calling algorithms UnifiedGenotyper and HaplotypeCaller, and i would love to go for UnifiedGenotyper considering of the sensitivity and the analysis runtime. Due to my experimental cohort samples grows from time to time, so i've opt for single sample calling follow by joint-analysis using combineVariants instead of doing multiple-samples variant calling. However by doing so, i've experience few drawbacks from it (this issue was discussed at few forums). For a particular SNP loci, we wouldn't know whether the "./." reported for some of the samples are due to no reads covering that particular loci, or it doesn't pass certain criteria during variant calling performed previously, or it is a homo-reference base (which i concern this most and can't cope to lost this information).

Then, i found this "gvcf", and it is potentially to solve my problem (Thanks GATK team for always understand our researcher's need)!! Again, i'm insist of opt for unifiedGenotyper instead of haplotypeCaller to generate the gvcf, and reading from the forum at https://www.broadinstitute.org/gatk/guide/tagged?tag=gvcf, i would assume that as VCFs produced by "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" to be something alike with the gvcf file produced by HaplotyperCaller. However i couldn't joint them up using either "CombineVariants" or "CombineGVCFs", most probably i think "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" doesn't generate gvcf format.

Can you please give me some advice to BEST fit my need and with minimum runtime (UnifiedGenotyper would be my best choice), is there any method to joint the ALL_SITES vcf file produced by UnifiedGenotyper which i might probably missed out from the GATK page?


Created 2014-11-25 16:32:43 | Updated | Tags: unifiedgenotyper error
Comments (3)

When I run this command:

java -jar /storage/app/GATK_3.3/GenomeAnalysisTK.jar -R /storage/data/gabriel/ucsc/hg19/hg19.fasta -T UnifiedGenotyper -I SRR493939.valid.bam -o teste2.vcf

This error happens:

WARN 04:34:22,257 RestStorageService - Error Response: PUT '/DAXKr9W5oe2LWMJhFdSveQd50kAK6tex.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 393, Content-MD5: 33uF1RoVpa9w9uTWFUxP0g==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: df7b85d51a15a5af70f6e4d6154c4fd2, Date: Tue, 25 Nov 2014 06:34:20 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:agZvhjUckV8Jnth+aubQX9KViUo=, User-Agent: JetS3t/0.8.1 (Linux/3.13.0-30-generic; amd64; en; JVM 1.7.0_65), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: B6DD058B64A98870, x-amz-id-2: ADKbyes3Qb8ovjofT4tKpP3gV3yPuc9v5C157qzTkfII2q3U2ZwEpUgzy0eIOz+J, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Tue, 25 Nov 2014 14:39:19 GMT, Connection: close, Server: AmazonS3] WARN 04:34:22,846 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 29096 seconds. Retrying connection.

Can someone help me?


Created 2014-11-18 14:19:42 | Updated | Tags: unifiedgenotyper genotype-given-alleles
Comments (12)

I was running some samtools calls through UG with genotyping_mode GENOTYPE_GIVEN_ALLELES after reading this thread:

http://gatkforums.broadinstitute.org/discussion/1842/variant-annotator-not-annotating-mappingqualityranksumtest-and-readposranksumtest-for-indels

I got this error message:

##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Alignment 440969 | 36S26M1D6M1I2M1D1M3D27M1S

What did I do wrong? It works for most other fragments (chrom20:pos1-pos2). Is it possible to avoid this error message and just do ./. calls at coordinates not covered by reads? I'm not sure, why samtools called at a coordinate not covered by a read. I don't know at which coordinate this happened. I was at this coordinate, when the error happened:

INFO  10:21:06,036 ProgressMeter -       20:479014   2329564.0    55.0 m      23.6 m        0.6%     6.7 d       6.7 d 

I was using this command:

vcf=samtools.vcf.gz
$java -Djava.io.tmpdir=tmp -Xmx63900m \
 -jar $jar \
 --analysis_type UnifiedGenotyper \
 --reference_sequence $ref \
 --input_file bam.list \
 --dbsnp $dbsnp138 \
 --num_cpu_threads_per_data_thread 3 \
 --num_threads 4 \
 --out $out \
 --intervals $vcf \
 -stand_call_conf 10 \
 -stand_emit_conf 10 \
 --genotype_likelihoods_model BOTH \
 --output_mode EMIT_ALL_SITES \
 --annotation Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \
 --alleles $vcf \
 --genotyping_mode GENOTYPE_GIVEN_ALLELES \
 -L 20:$pos1-$pos2 \

Created 2014-10-20 03:14:27 | Updated | Tags: unifiedgenotyper
Comments (5)

We sequenced the exoms of about 50 genes. UnifiedGenotyper seems to call variants along the whole genome. How can I restrict the region of variant calling? Thanks


Created 2014-10-08 07:06:57 | Updated | Tags: unifiedgenotyper mappingqualityfilter
Comments (6)

Hi I have a vcf that was generated using unified genotyper using output-mode EMIT_ALL_SITES. Several positions in the vcf with ALT as "." have QUAL as "." which I understand as "Reference" with unknown Quality. Howrever, FILTER for these is set to PASS.I am wondering how this is possible? Does this mean that Unified Genotyper did not print a QUAL even though it was score high enough to get it to PASS?

I am pasting some parts of the vcf below. Any help is appreciated.

##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Filtered Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[x.bam] sample_metadata=[] read_buffer_size=n
ull phone_home=STANDARD read_filter=[] intervals=[x.bed] excludeIntervals=null reference_sequence=hg19.fasta rodBind=[dbsnp_132.hg19.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION nonDeterministicRandomSeed=false DBSNP=null downsampling_type=null downs
ample_to_fraction=null downsample_to_coverage=null baq=CALCULATE_AS_NECESSARY baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false defaultBaseQualities=-1 validation_strictness=SIL
ENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null processingTracker=null restartProcessingTracker=false processingTrackerStatusFile=null processingTrackerID=-1 allow_int
ervals_with_unindexed_bam=false disable_experimental_low_memory_sharding=false logging_level=INFO log_to_file=null help=false genotype_likelihoods_model=BOTH p_nonref_model=EXACT heterozygosity=0.0
010 pcr_error_rate=1.0E-4 genotyping_mode=DISCOVERY output_mode=EMIT_ALL_SITES standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=10.0 noSLOD=false as
sume_single_sample_reads=null abort_at_too_much_coverage=-1 min_base_quality_score=17 min_mapping_quality_score=20 max_deletion_fraction=0.05 min_indel_count_for_genotyping=5 indel_heterozygosity=1
.25E-4 indelGapContinuationPenalty=10.0 indelGapOpenPenalty=45.0 indelHaplotypeSize=80 doContextDependentGapPenalties=true getGapPenaltiesFromData=false indel_recal_file=indel.recal_data.csv indelD
ebug=false dovit=false GSA_PRODUCTION_ONLY=false exactCalculation=LINEAR_EXPERIMENTAL ignoreSNPAlleles=false output_all_callable_bases=false genotype=false out=org.broadinstitute.sting.gatk.io.stub
s.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub debug_file=null metrics_file=null annotation=[DepthOfC
overage, RMSMappingQuality]"
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  YH1
chr1    14468   .       G       .       .       PASS    DP=110;HaplotypeScore=0.0000;MQ=2.10;MQ0=104    GT      ./.
chr1    14469   .       C       .       .       PASS    DP=109;HaplotypeScore=0.0000;MQ=2.11;MQ0=103    GT      ./.
chr1    14470   .       G       .       .       PASS    DP=105;HaplotypeScore=0.0000;MQ=2.15;MQ0=99     GT      ./.
chr1    14471   .       C       .       .       PASS    DP=103;HaplotypeScore=0.0000;MQ=2.17;MQ0=97     GT      ./.
chr1    14472   .       A       .       .       PASS    DP=106;HaplotypeScore=0.0000;MQ=2.14;MQ0=100    GT      ./.
chr1    14473   .       G       .       .       PASS    DP=103;HaplotypeScore=0.0000;MQ=2.17;MQ0=97     GT      ./.
chr1    14474   .       G       .       .       PASS    DP=98;HaplotypeScore=0.0000;MQ=2.23;MQ0=92      GT      ./.
chr1    14553   .       C       .       .       PASS    DP=98;HaplotypeScore=0.0000;MQ=2.33;MQ0=94      GT      ./.
chr1    14554   .       G       .       .       PASS    DP=99;HaplotypeScore=0.0000;MQ=2.32;MQ0=95      GT      ./.
chr1    14555   .       C       .       .       PASS    DP=101;HaplotypeScore=0.0000;MQ=3.17;MQ0=96     GT      ./.
chr1    14556   .       T       .       32.99   LowQual AC=0;AF=0.00;AN=2;DP=101;MQ=3.17;MQ0=96 GT:DP:GQ:PL     0/0:101:3:0,3,27
chr1    14557   .       C       .       32.99   LowQual AC=0;AF=0.00;AN=2;DP=102;MQ=3.15;MQ0=97 GT:DP:GQ:PL     0/0:102:3:0,3,27
....
chr1    14587   .       T       .       35.99   LowQual AC=0;AF=0.00;AN=2;DP=100;MQ=4.41;MQ0=90 GT:DP:GQ:PL     0/0:100:6:0,6,51
chr1    14640   .       C       .       50.96   PASS    AC=0;AF=0.00;AN=2;DP=123;MQ=5.73;MQ0=107        GT:DP:GQ:PL     0/0:123:20.97:0,21,174
chr1    14641   .       A       .       50.96   PASS    AC=0;AF=0.00;AN=2;DP=123;MQ=5.84;MQ0=106        GT:DP:GQ:PL     0/0:123:20.97:0,21,174

Created 2014-10-07 08:30:05 | Updated | Tags: unifiedgenotyper
Comments (1)

Dear GATK team, I am encountering the following error when running UnifiedGenotyper on the LSF cluster. This error is not consistent and occurs in ~5% of the jobs. When I pick the jobs that failed and re-run them with the same command and input files, they successfully go through. I will greatly appreciate your help.

java -Xmx10g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R reference.fa -glm BOTH -o out.vcf -nct 2 -I input.bam

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException at java.lang.String.checkBounds(Unknown Source) at java.lang.String.(Unknown Source) at net.sf.samtools.util.StringUtil.bytesToString(StringUtil.java:301) at net.sf.samtools.BAMRecord.decodeReadName(BAMRecord.java:331) at net.sf.samtools.BAMRecord.getReadName(BAMRecord.java:220) at net.sf.samtools.BAMRecord.eagerDecode(BAMRecord.java:112) at net.sf.samtools.SAMRecord.hashCode(SAMRecord.java:1489) at org.broadinstitute.sting.utils.sam.GATKSAMRecord.hashCode(GATKSAMRecord.java:239) at java.util.HashMap.hash(Unknown Source) at java.util.HashMap.getEntry(Unknown Source) at java.util.HashMap.containsKey(Unknown Source) at org.broadinstitute.sting.utils.genotyper.PerReadAlleleLikelihoodMap.containsPileupElement(PerReadAlleleLikelihoodMap.java:158) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:296) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeDiploidReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:251) at org.broadinstitute.sting.gatk.walkers.genotyper.IndelGenotypeLikelihoodsCalculationModel.getLikelihoods(IndelGenotypeLikelihoodsCalculationModel.java:149) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)

Created 2014-10-06 19:20:53 | Updated | Tags: unifiedgenotyper gatk-runtime-error
Comments (6)

Command: /jre1.7.0/bin/java -Xmx80g -jar /next-gen_example_sim/bin/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -out_mode EMIT_ALL_CONFIDENT_SITES \ -glm SNP \ -R $REF_PFX \ -nt 16 \ followed by ~186 input bam files.

Error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 96 at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.getCache(DiploidSNPGenotypeLikelihoods.java:298) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.inCache(DiploidSNPGenotypeLikelihoods.java:262) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:229) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:190) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:176) at org.broadinstitute.sting.gatk.walkers.genotyper.SNPGenotypeLikelihoodsCalculationModel.getLikelihoods(SNPGenotypeLikelihoodsCalculationModel.java:114) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: 96

I found this documented somewhat for Haplotype caller, but the solution didn't look applicable here. These are pooled samples so we must use unified genotyper.


Created 2014-09-29 19:32:51 | Updated | Tags: unifiedgenotyper
Comments (2)

Hi Geraldine, I am confused when I read the paper and tutorial on GATK variant caller. First, the DePristo et al. 2011 paper says local re-aligner generates a list of candidate haplotypes using a) dbSNP info, b) presence of atleast one indel in a read and c) cluster of mismatches at a site. Then it finds the best alternative haplotype based on read haplotype likelihoods and uses log odds ratio to declare ref and alt haplotype pair. Second, the tutorial on GATK variant caller mentions Indel genotype likelihood calculation where UG estimates this via 1) first generate haplotypes from indels in the reads. 2) genotype likelihoods for all haplotype pairs 3) for each haplotype compute read haplotype likelihood using HMM

My question is do both the steps local realignment and UG do haplotype generation and then compute read haplotype likelihood. To me these two steps are redundant because if the best haplotype pairs are generated using local realignment then why is there a need to generate haplotypes for genotype likelihood estimation. I mean in theory, UG should just be able to call indels and assign them genotype by testing all the possible genotypes against the two haplotypes called by the local realigner. Please help me sort this out. I highly appreciate your concern.


Created 2014-09-18 20:32:00 | Updated | Tags: unifiedgenotyper haplotypecaller exome-sequencing
Comments (1)

Hello everyone, I have recently sequenced 15 sample exomes using the Ion proton system, which directly generates VCFs after sequencing runs. However, since I will be using GATK for downstream analysis, I have decided to recall the variants using UG or HC. I read a document stating that HC should not be used in the case of pooled samples. Can someone clarify what exactly this means? For Ion proton, three samples are barcoded and pooled together prior to sequencing. Also, is 15 samples enough to obtain good discovery power? I was thinking of using external, publicly available bam files of exome sequencing data if 15 samples is not sufficient for joint variant calling.

Thanks in advance, Ricky


Created 2014-09-15 21:28:59 | Updated | Tags: indelrealigner unifiedgenotyper indels gatk
Comments (3)

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.


Created 2014-09-11 13:38:40 | Updated | Tags: unifiedgenotyper ploidy multiallelic pooled-calls
Comments (4)

Hi,

Is GATK Unified genotyper able to call multi-allelic positions in a single pooled sample? Case is a pool of 13 samples, we use UG with ploidy set to 26. If I understand the supplementaries of the original publications correct, UG will never be able to call three alleles at a single position. in single sample calling. Or does this not hold for high ploidy analysis?

If needed, we can call multiple pools together, but this becomes computationally intensive.

In summary, we would like to call a 14xG,6xA,6xT call for example.

Also, how does UG take noise into account when genotyping (sequencing errors), when for example 3% of reads is aberrant at a position, this could correspond to ~ 1/26.

Thanks for any guidelines,

geert


Created 2014-09-10 18:50:45 | Updated | Tags: unifiedgenotyper joint-calling unifiedgenotyper-joint-calling
Comments (1)

Can someone tell me what version of the UnifiedGenotyper introduced joint sample calling? I'm specifically interested in whether 1.6 supported joint calling.

Thanks!


Created 2014-08-12 03:54:48 | Updated | Tags: unifiedgenotyper
Comments (9)

Hello, Is there any possibility that I can separate the reads in AD (Allele Depths by sample) into 5' and 3' strands? e.g. the current expression of AD is REF:ALT1:ALT2 ...etc, Can I get the expression of AD like REF_5':REF_3':ALT1_5':ALT1_3':ALT2_5':ALT2_3'


Created 2014-08-11 09:50:56 | Updated | Tags: unifiedgenotyper vcf qual
Comments (1)

I used the EMIT_ALL_SITES option with Unified Genotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. But for monomorphic sites, (at this site, we have an homozygote for the reference allele), what is the definition of the quality score ? and how is it computed ? Many thanks for your explanation.


Created 2014-07-20 04:19:36 | Updated 2014-07-20 04:26:30 | Tags: unifiedgenotyper parallelism nct nt multiple-vcf
Comments (1)

Hello,

It seems that while non-parallel UnifiedGenotyper (SNP mode) always generates the same output vcf, parallel version (either -nt and/or -nct >1) of UnifiedGenotyper generates not exactly the same vcf files for each run. For example, below is the GenotypeConcordance output of two runs with same input file (chr 21) and same parameters (-nt4 -nct4):

Sample Eval_Genotype Comp_Genotype Count
ALL HET HET 46341 ALL HET HOM_REF 0 ALL HET HOM_VAR 8 ALL HET MIXED 0 ALL HET NO_CALL 0 ALL HOM_REF HET 300 ALL HOM_REF HOM_REF 34306900 ALL HOM_REF HOM_VAR 0 ALL HOM_REF MIXED 0 ALL HOM_REF NO_CALL 18 ALL HOM_VAR HET 12 ALL HOM_VAR HOM_REF 0 ALL HOM_VAR HOM_VAR 23490 ALL HOM_VAR MIXED 0 ALL HOM_VAR NO_CALL 18 ALL Mismatching_Alleles Mismatching_Alleles 303 ALL NO_CALL HET 0 ALL NO_CALL HOM_REF 11 ALL NO_CALL HOM_VAR 24 ALL NO_CALL MIXED 0 ALL NO_CALL NO_CALL 729217

So I am wondering if there is a way to get the same output vcf. Or do I miss something here? Thanks.


Created 2014-07-16 09:46:17 | Updated | Tags: unifiedgenotyper snp-calling
Comments (2)

hi,

is there a way to output all possible variants (the alternate allele)? i have total reads of 2100 whereby 2045 reads are match reference and 55 reads are alternate allele. My case is very specific, i would like GATK UnifiedGenotyper to call out my variants although the proportion of alternate read is relative low (1-5%) . And regardless of the quality score of alternate reads. Is there a way to specify this settings during the SNP calling in GATK UnifiedGenotyper?

thank you very much Best J


Created 2014-07-10 20:18:01 | Updated | Tags: unifiedgenotyper indels
Comments (2)

I performed Variant Calling using UnifiedGenotyper and HaplotypeCaller for Whole Exome Sequenced samples.

GATK version used GATK 3.0

I had only 4 African samples, since did not have any controls I took 70 African BAMs from 1000Genomes. Used Agilent Sure Select WES Version 5 target bed file as interval file to limit the calls to exomes.

I couldot find a single INDEL in UnifiedGenotyper called data and hence couldnot run VQSR for indels on it. But HaplotypeCaller called 20,456 INDELs.

UnifiedGenotyper Called Data

Total Variants-278319

SNPs-278319

INDELs-0

HaplotypeCaller Called Data

Total Variants-291269

SNPs-270813

INDELs-20456

Is it possible that for variants called from 74 BAMs, UnifiedGenotyper could not find a single INDEL ??

Thanks, Tinu


Created 2014-07-10 10:59:18 | Updated | Tags: unifiedgenotyper multi-sample aneuploid
Comments (3)

Hi all,

I am working with a set of samples at the moment which have the following problems:

1) The organism is aneuploid 2) Ploidy for individual chromosomes varies between samples (we have a script that estimates this from the coverage so we can use this information to improve variant calling) 3) Coverage is thin (sequencing was done on a student's budget)

I am using the UnifiedGenotyper for variant discovery as it can account for aneuploidy.

I initially tried calling variants for each sample, grouping chromosomes by ploidy (i.e, for sample 1 I call all the diploid chromosomes, then all the triploid chromosomes etc). I also tried doing multi-sample variant calling across all the samples, setting the ploidy to 2 (The modal chromosome number is 2). Comparison of these two analyses shows that each is missing good SNPs that are called by the other. Multi-sample analyses is missing or mis-calling SNPs on aneuploid chromosomes, and individual analysis is missing SNPs in individual samples due to thin coverage (or more accurately, they are being called, but failing QD or quality filters due to thin coverage - these SNPs pass these filters in the multi-sample analysis).

I am thinking of trying to combine these approaches. The idea is to analyse each chromosome separately. For each chromosome, I would do a multi-sample call on all the samples which are diploid and a multi-sample call on all the samples which are triploid etc etc. I am hoping that this will give me the best of the two approaches above.

So my questions are:

1) Does this strategy makes sense? Is there any reason not to do it this way?

2) How should I proceed downstream? I know I will have to hard-filter regardless. Can I merge all my vcf files before filtering and annotation, or is there a reason to keep them separate?

Any input from the GATK team or the wider community is appreciated!

Thanks

Kathryn


Created 2014-07-06 16:26:42 | Updated | Tags: unifiedgenotyper snp-calling
Comments (5)

I'm seeing Mendelian errors in small but significant proportion of my unified genotyper calls. For many instances the SNP is called incorrectly as a homozygote even though there are some reads of the second allele. How can I set unified genotyper to be more sensitive ?


Created 2014-07-02 20:02:53 | Updated | Tags: unifiedgenotyper fastareference heapsize callvariants
Comments (11)

Hi all,

Do you have a recommendation to estimate how much heap memory (-Xmx) is necessary to cal variants using the Unified Genotyper. I think that with my project I might be facing a situation where I will run out of memory until there is not more left to increase. To give you an idea, I have 185 samples (that together are 8Gb) and the fasta reference that I am using has too many scaffolds (3 Million). I don't have the opportunity to improve the reference I have at the moment. I have been using -Xmx52G and -nt 10 (in GATK 3.1) but it gives an error at the same point.

INFO 14:45:41,790 HelpFormatter - -------------------------------------------------------------------------------- INFO 14:45:42,773 GenomeAnalysisEngine - Strictness is SILENT INFO 14:59:01,106 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 14:59:01,171 SAMDataSource$SAMReaders - Initializing SAMRecords in serial

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
ERROR ------------------------------------------------------------------------------------------

If you have a suggestion/advice of how to make the analysis work it would be very much appreciated. I know that increasing scaffolds length (reducing number of scaffolds) can improve the analysis so I am wondering if I am facing a situation where I can't do any analysis until the fasta reference is improved.

Many thanks,

Ximena


Created 2014-06-16 15:43:27 | Updated | Tags: unifiedgenotyper
Comments (6)

Dear All,

I am Calling SNPs using Unified Genotyper. I am using the Version GATK 2.5.2. Command Used: java -Djava.io.tmpdir=TMP -jar /GenomeAnalysisTK/2.5.2/GenomeAnalysisTK.jar -R Ref.fa -T UnifiedGenotyper -I input1.bam -I input2.bam -I input3.bam -o output.raw.vcf -stand_call_conf 50 -stand_emit_conf 20 -out_mode EMIT_ALL_CONFIDENT_SITES.

Usually I get A summary of callable base counts: Like INFO 00:23:29,795 UnifiedGenotyper - Visited bases 247249719 INFO 00:23:29,796 UnifiedGenotyper - Callable bases 219998386 INFO 00:23:29,796 UnifiedGenotyper - Confidently called bases 219936125 INFO 00:23:29,796 UnifiedGenotyper - % callable bases of all loci 88.978 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - Actual calls made 303126

But I did not get it. Can you tell me how can I calculate this from my output VCF file now?


Created 2014-06-16 07:02:28 | Updated | Tags: unifiedgenotyper exome target
Comments (4)

I am running some exome samples with UG. I tried this in 2 ways: 1. Run UG and then apply exome region filter. 2. Run UG with exome regions.

Is there any difference in these approaches if I am only concerned about the variants in exome region? I have ran in both ways and on comparison I see that some values are different for the same locus. For eg. PL, MQRankSum, BaseQRankSum. However, the different is not huge but does UG perform slightly differently for making calls with target regions vs without?

PS: I have ran this for UG with multiple samples


Created 2014-05-28 22:56:35 | Updated | Tags: unifiedgenotyper heterozygosity priors
Comments (3)

If I specify both -inputPrior and -hets, which one will UnifiedGenotyper use to determine the prior? Is --heterozygosity used for anything other than prior specification?


Created 2014-05-28 11:46:40 | Updated | Tags: unifiedgenotyper heterozygous snp-calling
Comments (3)

All of my whole-genome sequenced individuals are a cross of different outbred lines with one reference line - effectively identical to the reference genome used for read mapping and variant calling. Thus, effectively all SNPs are heterozygous relative to the reference genome.

Should I alter the settings on the Unified Genotyper to maximise detection of these kind of variants ?


Created 2014-05-23 15:26:18 | Updated 2014-05-23 15:32:12 | Tags: unifiedgenotyper haplotypecaller indels
Comments (13)

Hello,

I used UnifiedGenotyper and HaplotypeCaller on my sample to get out INDELS and SNPS. There is a 147bp deletion in one of the genes(working with yeast genome)(see attached file). But this deletion is not reported by UnifiedGenotyper and HaplotypeCaller. Since I am working with yeast genome(which is comparatively very small compared to human genome), I am only interested in INDELS ranging from 150bp-200bp maximum. Is there a way by changing parameters that this can happen. I have read some of the previous questions about how haplotypecaller is unable to call large INDELS(in kb's), but I guess in my case INDELS max I am looking for is 200bp. Can this be achieved??

Also UnifiedGenotyper has an option to provide ploidy level. I could not locate that option in HaplotypeCaller. Any way to have that on HaplotypeCaller ? I ask you this because the yeast genome samples I work with are eseentially haploid.

NOTE: My sample is paired end data with 100bp reads

Hope to hear from you soon

Regards Varun


Created 2014-05-21 07:42:51 | Updated | Tags: unifiedgenotyper problem-with-the-file-locking-support
Comments (3)

We've used GATK(v2.1-8) in a SGE environment for more than 1 years now. Without any changing in environment, we have had a few cases of UnifiedGenotyper getting hung up while trying to read our previously created .fai and .dict reference files. FSLockWithShared - WARNING: Unable to lock file xxx/ucsc.hg19.dict; ReferenceDataSource - Unable to create a lock on dictionary file. Then GATK was updated to 2.7-4 and 3.1-1. There was still an error. "ERROR MESSAGE: Timeout of 30000 milliseconds was reached while trying to acquire a lock on file xxx/dbsnp_135.hg19.vcf.idx. Since the GATK uses non-blocking lock acquisition calls that are not supposed to wait, this implies a problem with the file locking support in your operating system." Besides, there is no problem reported when we run the UnifiedGenotyper step on data store node. This problem happens if we run the UnifiedGenotyper step on compute nodes. We used NFS for shared data between nodes. Is it the way, one is access the data directly and the other one is by NFS shared, leading this problem? If so, how can I fix this problem? Hoping someone can answer this problem as soon as possible. Thank you!


Created 2014-05-07 08:07:34 | Updated | Tags: unifiedgenotyper trio-data
Comments (3)

Hi all,

I was wondering if there is a recommended way how to call variants in trio data (affected children, healthy parents) using the Unified Genotyper. As I see it, there are three possible ways:

  1. Call each sample individually
  2. Call trios together
  3. Call all samples together

But which one would you recommend?


Created 2014-04-29 12:21:43 | Updated | Tags: unifiedgenotyper bam
Comments (5)

Hi GATK team, I am using GATK to call random mutations caused by mutagen. I have more than 5 mutant lines sequenced and I want to call SNP/INDELs which are unique for each samples (since they are randomly introduced they must be unique for each samples). Majoirity of SNP will be shared between all mutant samples as this are the natural variations between line used for mutagenesis and the Reference sequence.

I am using UnifiedGenotyper (because I have sequenced pooled DNA for each mutant family) by giving all the 5 bam files from samples. I wonder, by calling SNP from all 5 mutant samples together may negatively influence the unique SNPs? Does GATK consider all the 5 samples as a population and give preference (or high score to) locus which are shared between majority of samples? In my case, would it be better to call SNP/INDEL separately for all 5 samples?

The command I am using is java -jar GenomeAnalysisTKLite-2.3-9-gdcdccbb/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -glm BOTH -R ref.fa -nt 10 -I results/1_sorted.bam_fixed_rmdup_realign.bam -I results/2_sorted.bam_fixed_rmdup_realign.bam -I results/3_sorted.bam_fixed_rmdup_realign.bam -I results/4_sorted.bam_fixed_rmdup_realign.bam -I results/5_sorted.bam_fixed_rmdup_realign.bam -o results/out.vcf

I hope I am clear, and looking forward to learn more,

Thank you in advance


Created 2014-04-28 10:34:08 | Updated 2014-04-28 10:36:08 | Tags: unifiedgenotyper allele-counts
Comments (2)

I found that the allele counts of a position by GATK is different from those by IGV acculumation. For example, IGV gives (A:0, C:13, G:1, T:7) counts. But, the vcf file generated by GATK UnfiedGenotyper contains the following: chr1 10180 . T C 107.93 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-0.278;DP=17;Dels=0.00;FS=3.522;HRun=2;Haplot ypeScore=7.9790;MQ=25.98;MQ0=4;MQRankSum=-0.597;QD=6.35;ReadPosRankSum=1.663;SB=-67.79 GT:AD:DP:GQ:PL 0/1:7,9:17:98.17:138,0,98

I thought that there might be some filtered out bases. Did I guess right? Then, could you tell me What are the default filtering options in UnifiedGenotyper?


Created 2014-04-16 14:11:37 | Updated | Tags: unifiedgenotyper haplotypecaller bug gatk
Comments (12)

HI, I'd like to report a weird result from HaplotypeCaller.

We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.

However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected)

Please see the VCF table:

There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.

Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype.

Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter.

Many Thanks

Betty


Created 2014-04-14 17:14:33 | Updated | Tags: unifiedgenotyper
Comments (4)

Hello,

I am running unified genotyper and for some reason in the log file it shows that all reads are being filtered out and no variants have been called in the vcf file. I get an empty vcf file with just the header. I could use some help regarding this.


Created 2014-04-02 09:09:08 | Updated | Tags: unifiedgenotyper
Comments (1)

Dear,

I want to output the number of all bases (ACGT) for each position using UnifiedGenotyper, in GATK2 we used the option -A BaseCounts. Now I am wondering what option I should use in GATK3,

Thanks, Bram


Created 2014-03-31 10:41:27 | Updated | Tags: unifiedgenotyper gatk
Comments (3)

Hi

I am hitting with the following error when I try to use UnifiedGenotyper. Wondering if its a bug or something wrong with my bam file? Thanks for your help in advance.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: Illegal base [K] seen in the allele at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:205) at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:213) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.fillQualsFromPileup(RankSumTest.java:159) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.annotate(RankSumTest.java:113) at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateGenotypes(UnifiedGenotyperEngine.java:560) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:234) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Illegal base [K] seen in the allele
ERROR ------------------------------------------------------------------------------------------

Cheers Scott..


Created 2014-03-28 13:58:31 | Updated | Tags: unifiedgenotyper multithreading multi-sample gatk unified-genotyper capacity
Comments (1)

Hi,

How many samples and of what file/genome size can the Unified Genotyper process simultaneously ? I will have about 250 samples, genome size 180Mb, bam sizes 3-5GB, reduced bam sizes ~1GB.

Also, how long would this take can it be multi-threaded ? Cheers,

Blue


Created 2014-03-19 05:40:04 | Updated | Tags: unifiedgenotyper input-output-error
Comments (3)

Dear GATK team

I am trying to produce a mutli-sample VCF file, this is the command line I used:

java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R /home/mohameam/Pchabaudi/ref.fasta -I S1.bam -I S2.bam -I S3.bam -ploidy 1 -o raw.S1.S2.S3.vcf

the program runs smoothly and the output file seems ok, however I get the following error msg in the end:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.0-0-g6bad1c6):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Input/output error
ERROR ------------------------------------------------------------------------------------------

please advise ? thank you Alyaa


Created 2014-03-12 08:28:47 | Updated | Tags: unifiedgenotyper gatk
Comments (7)

Hi, we observe an unexpected deletion call (GATK UnifiedGenotyper 3.0 and 2.5) in a setup with three samples called together. In the file call.txt you find the call. From the pileup (pileup.txt, position 19:57325555) we would expect, that the indel would only be called for the first sample. Is there another source of information, that makes GATK believe, that the deletion also occurs in the second and third sample?

Thanks in advance, Johannes


Created 2014-03-10 22:03:01 | Updated | Tags: unifiedgenotyper variantrecalibrator applyrecalibration snp indels
Comments (1)

Hi,

I am running Variant Recalibration on Indels,prior to this I completed Variant Recalibration and ApplyRecalibration on SNPs. So,the input file is the recalibrated VCF file from Apply Recalibration step of SNP's.

Below is the error I am getting.

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinsti

tute.org/discussion/49/using-variant-annotator

The command I am using for this is :

jre1.7.0_40/bin/java -Djava.io.tmpdir=./rb_2905_VCF/tmp -Xmx2g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T VariantRecalibrator -R dbdata/human_g1k_v37.fasta -input ${input_file} --maxGaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.vcf - resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.indels.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -recalFile $destdir/${input_file%recal.snps.vcf}recal.indel.recal -tranchesFile $destdir/${input_file%recal.snps.vcf}recal.indel.tranches -rscriptFile $destdir/${input_file%recal.snps.vcf}recal.indel.plots.R

If I remove the options -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum,then I get this error:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Argument with name '--use_annotation' (-an) is missing.
ERROR ------------------------------------------------------------------------------------------

Created 2014-02-27 15:25:19 | Updated 2014-02-27 22:40:50 | Tags: unifiedgenotyper genotype vcf-format gt
Comments (10)

generated with gatk 2.8-1-g932cd3a

Although it is rare I see Genotype Fields that are inconsistent with the AD values (Read as table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802

This shows that for example sample 5 has a AD value of '23,12' and a GT of 'A/A' aka homyzougous reference allele. I've included a screenshot wich shows low base quality and complete strand bias (Which I suspect to mis variants). So whats the prob? and how can i recalculate the GT's based on AD? because i cannot filter based on genotypes when they are buggy....


Created 2014-02-26 13:11:55 | Updated | Tags: unifiedgenotyper vcf genotyping
Comments (4)

Dear GATK team,

Could you please help me how to explain genotyping in my vcf file. I have Illumina data and vcf caller was GATK. My variant frequency (Alt variant freq) is 99.7%. DP = 4622 (AD = 16, 4606) - so I would expect that this sample is alternate homozygous. But when I check PL value, which is - PL = 1655,0,323 - after calculating my likelihood -

REF= G ALT= A

P(D|GG) = 10 ^ -165.5 = small
P(D|AG) = 10 ^ 0  = 1
P(D|AA) = 10 ^ 32.3 = small

we can see it is heterozygous. Can anybody help me how to interpret my result? How it is possible that likelihoods show me heterozygous and coverage and VF show me homozygous?

Here is part of my vcf file:

chr13 32899193 . G A 1625.01 PASS AC=1;AF=0.5;AN=2;DP=4622;QD=0.35;TI=NM_000059;GI=BRCA2;FC=Silent GT:AD:DP:GQ:PL:VF:GQX 0/1:16,4606:5000:99:1655,0,323:0.997:99

Thank you for any explanation.

Paul.


Created 2014-02-18 19:05:14 | Updated 2014-02-18 19:06:52 | Tags: unifiedgenotyper insertsizedistribution indels
Comments (2)

Hi, I have run gatk UnifiedGenotyper on ~45 human whole exomes. I have noticed that the maximum size for called indels in these exomes is 60bp. My question is, is 60 a default for the maximum indel size? I can't seem to find information regarding the threshold for indel size called by UnifiedGenotyper. Is the maximum indel size a parameter that can be adjusted?

In case this is of any use; This is the command I am using:

java -jar /usr/local/gatk/2.6-4/GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T UnifiedGenotyper -glm INDEL -I bams.list -L Targets.bed -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 1000

Thank you very much for your support! Dalia


Created 2014-02-14 08:56:11 | Updated 2014-02-14 09:06:51 | Tags: unifiedgenotyper indels
Comments (1)

Hi there,

I'm systematically comparing different read mappers right now. That is, I align reads and run MarkDup/IndelRealignment/Recalibration followed by the UnifiedGenotyper.

When using stampy, I only get SNP calls but no indels. I've extracted one example where the alignments clearly show a homozygous 14bp deletion that does not show up in the VCF. For comparison purposes, I've looked at GSNAP alignments at the same locus. There the deletion looks a bit less clear in the IGV but is nonetheless present in the VCF produced by UG (screenshot attached).

My call to UG (2.8-1-g932cd3a) is the following:

gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.stampy.bam -o example.stampy.vcf

gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.gsnap.bam -o example.gsnap.vcf

So my question boils down to: Why is the deletion called for GSNAP but not for STAMPY, although it looks much cleaner for STAMPY. I'm attaching example BAM files and UG output.

Your help is very much appreciated. Please let me know if there is any additional information I can provide.

Cheers, Tobi


Created 2014-02-13 17:30:45 | Updated | Tags: unifiedgenotyper haplotypecaller multi-sample pass
Comments (14)

Hi, I have run GATK Unified genotyper with multisample and I am getting unexpected output. (Hard filters used)

Here are examples of the output format I get in the vcf for a sample: 0/1:43,4:PASS:24:24,0,4368, ./.:.:PASS or 0/0:2,0:2:DPFilter:6:0,6,47 as well as 0/0:12,0:37:0,37,830. The latter is the format I would expect. What does this mean? Also I get the filter PASS even when all but two samples have ./.:.:PASS and the other two have 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0 and 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0.

Also I ran HalpotypeCaller multisample and get similar output all though more of the 0/0:12,0:37:0,37,830 format....


Created 2014-02-11 02:58:14 | Updated | Tags: unifiedgenotyper ploidy vcf
Comments (3)

I using the UnifiedGenotyper to call SNPs in a pooled sample of 30 diploid individuals (i.e., I am setting the ploidy to 60). Does this mean that if the coverage is < 60 at a given variant site, the vcf file will read "./." for all alleles at that site? In other words, does it require the coverage to be >= the ploidy or it won't produce a called variant at that site? I'm just trying to make sure I am interpreting the vcf file correctly, in that: if there is a called genotype at a given variant site that I can interpret that as the estimated allele frequency in that pool.

Thanks in advance for any advice.

Best, Jon


Created 2014-02-08 00:18:03 | Updated | Tags: unifiedgenotyper
Comments (1)

Dear all in GATK forums,

I am dealing with a slightly annoying feature of GATK, and I am wondering if there is something I can do to prevent it. I am using multisample calling for a large cohort and UG version 2.8.1 and I see in my VCF situations like this one: 5 96443177 . GC G 862.42 . 5 96443178 . CC C 859.42 .

The same individual is a carrier for both, and quite clearly this is the same variant (a deletion of a C) called in two different places. It is not a disaster but it messes up some downstream analyses (case control...). Would you have a suggestion to avoid that problem?

Thank you in advance.


Created 2014-02-06 08:15:40 | Updated | Tags: unifiedgenotyper vcf
Comments (2)

Hi, I'm trying to understand how do you usually calculate the GQ of a SNP. I understood the model to calculate the Likelihood of all the genotypes (AA,AC, AG,AT,CC,CG,CT,GG,GT,TT). Once all the likelihood have been calculated, correct me if I'm wrong, you normalize the likelihood of the best genotype to 1 and all the other likelihoods according to that scale. So the PL field in the VCF should be the Phred-scaled values of the normalized values. But is not clear to me how do you finally calculate the GQ value. What values do you use to calculate that quality (normalized or phred scaled)? And what's the right formula? I've tried to debug the code but it ends to be really tricky. I really hope that you would help me. I would be really thankfull for that.


Created 2014-02-05 11:33:30 | Updated | Tags: indelrealigner unifiedgenotyper java error
Comments (1)

Hi All We are running into some random weirdness when running jobs using SGE, GATK version 2.7-2-g6bda569, pretty much all GATK tools - but mostly IndelRealigner abd UnifiedGenotyper, we often get the following error:-

ERROR MESSAGE: Couldn't read file /scratch/project/pipelines/novorecal.bam because java.io.FileNotFoundException: /scratch/project/pipelines/novorecal.bam (No such file or directory)

This also happens for supplied reference genomes and vcf files. The GATK tool cant find them.

These "missing" files do exist, and have often even been created by the previous tool/step in the pipeline.

When we re-run the pipeline on a failed sample, it works. So we end up having to re-run our pipeline on the same set of samples multiple times and are beginning to find this very frustrating. These errors seem to be random, I cant find any pattern, and as I mentioned, when we re-run the pipeline on a failed run, it work without a hitch.

Has anyone experienced this? And if so, any recommendations?

Please help

Steve


Created 2014-02-04 21:03:21 | Updated | Tags: unifiedgenotyper indels
Comments (3)

Hi all,

I am running UnifiedGenotyper GLM=Indel Ploidy=1, which produces my raw indels, with no apparent problems. I then run BaseRecalibrator and PrintReads to generate a new BAM, which i am then recalling the indels. This time, I am getting no indels in the VCF (only headers). Curiously, this problem is only present in all 13/52 isolates that are most distant from the reference (expecting ~50K indels), while the rest have all worked (~6K indels). Any clues what might be causing this problem? Thanks, Rhys


Created 2014-01-31 13:44:10 | Updated | Tags: unifiedgenotyper
Comments (2)

Hi, I encounter a strange error when using UnifiedGenotyper, The command that I use is: /usr/bin/java -Xmx4g -jar /home/shengyu_ni/bin/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /mnt/genotyping/sendru/human_g1k_v37.fasta -I bqsr.bam -o haplogroup.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -alleles /mnt/genotyping/sendru/isogg.vcf -nt 4 -ploidy 1 -L /mnt/genotyping/sendru/comprey.interval_list

and the error mesage is:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -1 at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoods.subsetToAlleles(GeneralPloidyGenotypeLikelihoods.java:380) at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:294) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: -1
ERROR ------------------------------------------------------------------------------------------

I already notice that it can be some compatible problem with my file: isogg.vcf, because when I replace that file with another vcf file, it runs successfully, but I can not spot what makes the difference. so I list the header of the vcf file as the following,

additionally, this vcf file works fine with realignment

fileformat=VCFv4.1

fileDate=20140131

source=isogg

isogg_BUILD_ID=2014

reference=GRCh37.p10

phasing=partial

INFO=<ID=ISOGG,Number=0,Type=Flag,Description="Used as y chromosome haplogroup in ISOGG. ">

FILTER=<ID=PASS,Description="Only pass snp is kept, the other possibles are private and investigation. ">

contig=<ID=1,assembly=b37,length=249250621>

contig=<ID=10,assembly=b37,length=135534747>

contig=<ID=11,assembly=b37,length=135006516>

contig=<ID=12,assembly=b37,length=133851895>

contig=<ID=13,assembly=b37,length=115169878>

contig=<ID=14,assembly=b37,length=107349540>

contig=<ID=15,assembly=b37,length=102531392>

contig=<ID=16,assembly=b37,length=90354753>

contig=<ID=17,assembly=b37,length=81195210>

contig=<ID=18,assembly=b37,length=78077248>

contig=<ID=19,assembly=b37,length=59128983>

contig=<ID=2,assembly=b37,length=243199373>

contig=<ID=20,assembly=b37,length=63025520>

contig=<ID=21,assembly=b37,length=48129895>

contig=<ID=22,assembly=b37,length=51304566>

contig=<ID=3,assembly=b37,length=198022430>

contig=<ID=4,assembly=b37,length=191154276>

contig=<ID=5,assembly=b37,length=180915260>

contig=<ID=6,assembly=b37,length=171115067>

contig=<ID=7,assembly=b37,length=159138663>

contig=<ID=8,assembly=b37,length=146364022>

contig=<ID=9,assembly=b37,length=141213431>

contig=<ID=GL000191.1,assembly=b37,length=106433>

contig=<ID=GL000192.1,assembly=b37,length=547496>

contig=<ID=GL000193.1,assembly=b37,length=189789>

contig=<ID=GL000194.1,assembly=b37,length=191469>

contig=<ID=GL000195.1,assembly=b37,length=182896>

contig=<ID=GL000196.1,assembly=b37,length=38914>

contig=<ID=GL000197.1,assembly=b37,length=37175>

contig=<ID=GL000198.1,assembly=b37,length=90085>

contig=<ID=GL000199.1,assembly=b37,length=169874>

contig=<ID=GL000200.1,assembly=b37,length=187035>

contig=<ID=GL000201.1,assembly=b37,length=36148>

contig=<ID=GL000202.1,assembly=b37,length=40103>

contig=<ID=GL000203.1,assembly=b37,length=37498>

contig=<ID=GL000204.1,assembly=b37,length=81310>

contig=<ID=GL000205.1,assembly=b37,length=174588>

contig=<ID=GL000206.1,assembly=b37,length=41001>

contig=<ID=GL000207.1,assembly=b37,length=4262>

contig=<ID=GL000208.1,assembly=b37,length=92689>

contig=<ID=GL000209.1,assembly=b37,length=159169>

contig=<ID=GL000210.1,assembly=b37,length=27682>

contig=<ID=GL000211.1,assembly=b37,length=166566>

contig=<ID=GL000212.1,assembly=b37,length=186858>

contig=<ID=GL000213.1,assembly=b37,length=164239>

contig=<ID=GL000214.1,assembly=b37,length=137718>

contig=<ID=GL000215.1,assembly=b37,length=172545>

contig=<ID=GL000216.1,assembly=b37,length=172294>

contig=<ID=GL000217.1,assembly=b37,length=172149>

contig=<ID=GL000218.1,assembly=b37,length=161147>

contig=<ID=GL000219.1,assembly=b37,length=179198>

contig=<ID=GL000220.1,assembly=b37,length=161802>

contig=<ID=GL000221.1,assembly=b37,length=155397>

contig=<ID=GL000222.1,assembly=b37,length=186861>

contig=<ID=GL000223.1,assembly=b37,length=180455>

contig=<ID=GL000224.1,assembly=b37,length=179693>

contig=<ID=GL000225.1,assembly=b37,length=211173>

contig=<ID=GL000226.1,assembly=b37,length=15008>

contig=<ID=GL000227.1,assembly=b37,length=128374>

contig=<ID=GL000228.1,assembly=b37,length=129120>

contig=<ID=GL000229.1,assembly=b37,length=19913>

contig=<ID=GL000230.1,assembly=b37,length=43691>

contig=<ID=GL000231.1,assembly=b37,length=27386>

contig=<ID=GL000232.1,assembly=b37,length=40652>

contig=<ID=GL000233.1,assembly=b37,length=45941>

contig=<ID=GL000234.1,assembly=b37,length=40531>

contig=<ID=GL000235.1,assembly=b37,length=34474>

contig=<ID=GL000236.1,assembly=b37,length=41934>

contig=<ID=GL000237.1,assembly=b37,length=45867>

contig=<ID=GL000238.1,assembly=b37,length=39939>

contig=<ID=GL000239.1,assembly=b37,length=33824>

contig=<ID=GL000240.1,assembly=b37,length=41933>

contig=<ID=GL000241.1,assembly=b37,length=42152>

contig=<ID=GL000242.1,assembly=b37,length=43523>

contig=<ID=GL000243.1,assembly=b37,length=43341>

contig=<ID=GL000244.1,assembly=b37,length=39929>

contig=<ID=GL000245.1,assembly=b37,length=36651>

contig=<ID=GL000246.1,assembly=b37,length=38154>

contig=<ID=GL000247.1,assembly=b37,length=36422>

contig=<ID=GL000248.1,assembly=b37,length=39786>

contig=<ID=GL000249.1,assembly=b37,length=38502>

contig=<ID=MT,assembly=b37,length=16569>

contig=<ID=X,assembly=b37,length=155270560>

contig=<ID=Y,assembly=b37,length=59373566>

CHROM POS ID REF ALT QUAL FILTER INFO

Y 2649696 M236 G T . PASS ISOGG Y 2655180 M176 C G . PASS ISOGG Y 2656127 Z12426 G A . PASS ISOGG Y 2656959 L1233 G C . PASS ISOGG

Thank you.


Created 2014-01-29 14:38:17 | Updated 2014-01-29 15:22:02 | Tags: unifiedgenotyper qc
Comments (1)

I run UnifiedGenotyper for 84 samples using RNAseq data (Bam-files). I got the following information in the Log file. As it can bee seen, about 27% of my reads failed for many reasons. How can I decrease the percentage of reads which filtered? In other words how I can improve my QC of data before UnifiedGenotyper. Thanks!

INFO  14:15:23,762 MicroScheduler - 970952111 reads were filtered out during the traversal out of approximately 3587224222 total reads (27.07%) 
INFO  14:15:23,762 MicroScheduler -   -> 147143075 reads (4.10% of total) failing BadMateFilter 
INFO  14:15:23,762 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter 
INFO  14:15:23,762 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
INFO  14:15:23,763 MicroScheduler -   -> 566795160 reads (15.80% of total) failing MalformedReadFilter 
INFO  14:15:23,763 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
INFO  14:15:23,763 MicroScheduler -   -> 257013876 reads (7.16% of total) failing NotPrimaryAlignmentFilter 
INFO  14:15:23,763 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

Created 2014-01-28 18:29:03 | Updated | Tags: unifiedgenotyper ploidy scala
Comments (3)

Hi all, I am running a scala script, and would like to the include "-ploidy 1". Any advice on how can I do this?

Some information that may or may not be relevant (idk!):

I am attempting this using UnifiedGenotyper (v2.7-4). At the top of the script I have: package org.broadinstitute.sting.queue.qscripts.examples import org.broadinstitute.sting.queue.QScript import org.broadinstitute.sting.queue.extensions.gatk._

I got the glm both to work using: genotyper.glm = org.broadinstitute.sting.gatk.walkers.genotyper.GenotypeLikelihoodsCalculationModel.Model.BOTH

I was hoping for something similar for ploidy.

Thanks, Rhys


Created 2014-01-23 08:01:04 | Updated 2014-01-23 08:03:11 | Tags: unifiedgenotyper
Comments (3)

Hi, everyone.

from.. GATK Document

-out_mode,--output_mode specifies which sites to emit; possible values are EMIT_VARIANTS_ONLY (the default), EMIT_ALL_CONFIDENT_SITES (include confident reference sites), or EMIT_ALL_SITES (any callable site regardless of confidence).

I really want to know the meaning of confident reference site.

When I calling with the GATK UnifiedGenotyper EMIT_ALL_CONFIDENT_SITES option in each sample BAM file, Can I distinguish the genotype in each sample? (No call, Ref homo, Alt homo, Hetero)

In other words, I know the some site is no call or ref homo for this purpose.


Created 2014-01-17 10:16:07 | Updated | Tags: unifiedgenotyper haplotypecaller
Comments (6)

Hi, I'm updating my pipeline for exome sequencing analysis, so I'm experiencing the HaplotypeCaller capabilities! I have analyzed the same sample with the UnifiedGenotyper walker and the HC one and I have examined the differences between the two output vcf files and I had a very bad finding... HC failed to find a true novel variant!! I know that this is a true variants because I validated that with Sanger sequencing after the first calling with UG.

I have run UG using GATK version 1.6-11-g3b2fab9. This is the VCF line of the variant:

chr7 45123943 . A T 3436.17 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=2.043;DP=114;Dels=0.00;FS=2.678;HRun=1;HaplotypeScore=0.0000;MQ=42.18;MQ0=1;MQRankSum=2.152;QD=30.14;ReadPosRankSum=-0.781;SB=-1010.47 GT:AD:DP:GQ:PL 1/1:8,105:114:99:3436,253,0

I have run HC using GATK version 2.7-4-g6f46d11 both in a single- and in a multi-sample manner but not the shadow of this variant in the VCF output.. I also noticed that together with this novel variant, HC lost other two variants upstream the first; these are the VCF lines:

chr7 45123881 rs61740891 C T 654.25 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=0.205;DB;DP=43;DS;Dels=0.00;FS=65.862;HRun=0;HaplotypeScore=2.2312;MQ=30.27;MQ0=1;MQRankSum=3.254;QD=15.22;ReadPosRankSum=-3.921;SB=-3.02 GT:AD:DP:GQ:PL 0/1:18,25:43:99:684,0,176

chr7 45123888 . C T 161.90 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-2.293;DP=26;DS;Dels=0.00;FS=49.656;HRun=2;HaplotypeScore=0.0000;MQ=23.81;MQ0=1;MQRankSum=0.425;QD=6.23;ReadPosRankSum=-3.821;SB=-3.00 GT:AD:DP:GQ:PL 0/1:17,9:26:99:192,0,249

How is it possible?

Many thanks in advance

Best, Flavia


Created 2013-12-17 17:07:40 | Updated 2013-12-17 17:08:47 | Tags: unifiedgenotyper vcf empty
Comments (8)

Hey there,

I was trying to build an analysis pipeline for paired reads with BWA, Duplicate Removal Local Realignment and Base Quality Score Recalibration to finally use GATK's UnifiedGenotyper for SNP and Indel calling. However, for both SNPs and Indels, I receive no called variants no matter how low my used thresholds are. Quality values of the reads look ok, leaving out dbSNP does not change results. I have used the same reference throughout the whole pipeline. I use GATK 2.7, nevertheless a switch to GATK 1.6 did not change anything.

This is my shell command for SNP calling on chromosome X (GATK delivers no results for all chromosomes): java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o test.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf

Entries in my BAM file look like this: SRR389458.1885965 113 X 10092397 37 76M = 10092397 1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1885965 177 X 10092397 37 76M = 10092397 -1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1888837 113 X 14748343 37 76M = 14748343 1 TCGTGAAAGTCGTTTTAATTTTAGCGGTTATGGGATGGGTCACTGCCTCCAAGTGAAAGATGGAACAGCTGTCAAG 889999:9988;98:9::9;9986::::99:8:::::999988989:8;;9::989:999:9;9:;:99:98:999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:76 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 MQ:i:37 XT:A:U

And this is the output of the UnifiedGenotyper: INFO 17:57:00,575 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,578 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 INFO 17:57:00,578 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:57:00,578 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 17:57:00,582 HelpFormatter - Program Args: -T UnifiedGenotyper -R /hana/exchange/reference_genomes/hg19/Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o testX.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:00,583 HelpFormatter - Date/Time: 2013/12/17 17:57:00 INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,943 ArgumentTypeDescriptor - Dynamically determined type of /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf to be VCF INFO 17:57:01,579 GenomeAnalysisEngine - Strictness is SILENT INFO 17:57:02,228 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 17:57:02,237 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 17:57:02,364 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.13 INFO 17:57:02,594 RMDTrackBuilder - Loading Tribble index from disk for file /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:02,867 IntervalUtils - Processing 155270560 bp from intervals INFO 17:57:02,943 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 17:57:03,166 GenomeAnalysisEngine - Done preparing for traversal INFO 17:57:03,167 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:57:03,167 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 17:57:33,171 ProgressMeter - X:11779845 1.01e+07 30.0 s 2.0 s 7.6% 6.6 m 6.1 m INFO 17:58:03,173 ProgressMeter - X:24739805 1.89e+07 60.0 s 3.0 s 15.9% 6.3 m 5.3 m INFO 17:58:33,175 ProgressMeter - X:37330641 3.25e+07 90.0 s 2.0 s 24.0% 6.2 m 4.7 m INFO 17:59:03,177 ProgressMeter - X:49404077 4.94e+07 120.0 s 2.0 s 31.8% 6.3 m 4.3 m INFO 17:59:33,178 ProgressMeter - X:64377965 5.36e+07 2.5 m 2.0 s 41.5% 6.0 m 3.5 m INFO 18:00:03,180 ProgressMeter - X:75606869 7.54e+07 3.0 m 2.0 s 48.7% 6.2 m 3.2 m INFO 18:00:33,189 ProgressMeter - X:88250233 7.74e+07 3.5 m 2.0 s 56.8% 6.2 m 2.7 m INFO 18:01:03,190 ProgressMeter - X:100393213 9.94e+07 4.0 m 2.0 s 64.7% 6.2 m 2.2 m INFO 18:01:33,192 ProgressMeter - X:110535705 1.09e+08 4.5 m 2.0 s 71.2% 6.3 m 109.0 s INFO 18:02:03,193 ProgressMeter - X:121257489 1.20e+08 5.0 m 2.0 s 78.1% 6.4 m 84.0 s INFO 18:02:33,195 ProgressMeter - X:132533757 1.32e+08 5.5 m 2.0 s 85.4% 6.4 m 56.0 s INFO 18:03:03,197 ProgressMeter - X:144498909 1.41e+08 6.0 m 2.0 s 93.1% 6.4 m 26.0 s INFO 18:03:30,079 ProgressMeter - done 1.55e+08 6.4 m 2.0 s 100.0% 6.4 m 0.0 s INFO 18:03:30,079 ProgressMeter - Total runtime 386.91 secs, 6.45 min, 0.11 hours INFO 18:03:30,080 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 150 total reads (0.00%) INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing BadMateFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter INFO 18:03:32,167 GATKRunReport - Uploaded run statistics report to AWS S3

Do I miss anything here?

Best,

Cindy


Created 2013-12-11 09:59:05 | Updated | Tags: unifiedgenotyper
Comments (16)

Dear all,

all my GATK 2.8.1 UG jobs crash with the "Null alleles are not supported" error message. It works fine with 2.7.4 (more precisely the nightly release version as of November 28) but this error message kills all my jobs right now with the just released 2.8.1 version. Based on the traces, it looks like an indel calling issue.

It's multisample calling, with reduced reads format. The issue may come from the fact that a handful of BAM files have been reduced with older versions of GATK (something I am currently fixing, but it takes time). The fact that the issue is brand new to 2.8.1 suggests something different though but I can't be sure. Full error message below.

Any idea?

V

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: Null alleles are not supported at org.broadinstitute.variant.variantcontext.Allele.(Allele.java:117) at org.broadinstitute.sting.utils.haplotype.Haplotype.(Haplotype.java:61) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.trimHaplotypes(PairHMMIndelErrorModel.java:235) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:438) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeDiploidReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:251) at org.broadinstitute.sting.gatk.walkers.genotyper.IndelGenotypeLikelihoodsCalculationModel.getLikelihoods(IndelGenotypeLikelihoodsCalculationModel.java:149) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Null alleles are not supported
ERROR ------------------------------------------------------------------------------------------

Created 2013-11-30 15:32:34 | Updated | Tags: unifiedgenotyper haplotypecaller sensitivity
Comments (19)

Hi there,

I compared HaplotypeCaller and UnifiedGenotype of the latest version of GATK (2.7-4) on the same NA12878 trio. It seems HC missed some true variants (specifically, 7 out of the 49 experimentally validated de novo SNVs are missed). I guess this might be partly due to HCMappingQualityFilter, because the log says ~9% of the total reads have been filtered by this filter, and it looks there is not such a filter in UnifiedGenotyper. I wonder is there a safe way to relax HCMappingQualityFilter (default 20) to achieve a better sensitivity for HC?

Thanks in advance!

Aaron


Created 2013-11-07 20:10:30 | Updated | Tags: unifiedgenotyper
Comments (2)

Hi, I'm using unifiedgenotyper for the SNP calling in one lane illumina RNA-seq data, the bam file is ~15gb. The command I used is: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R assembly.fa -I seq.bam --out result.vcf -ploidy 48 -stand_call_conf 20 -stand_emit_conf 20.0 I run it with 2 cpus and 72gb memory. It has been run for 3 days (I also used -nt 12 to run it with 12 cpus at the same time but the speed didn't improve significantly) and haven't finished. In addition, there are only 6000 loci were write into the vcf file by now, but I analysed this data using other pipelines at the same time, all of them have been finished and reported ~1 million loci. So can anybody tell me usually how long it will take for unifiedgenotyper to analysis one lane illumina RNA-seq data? Thanks.


Created 2013-10-16 12:50:25 | Updated | Tags: unifiedgenotyper haplotypecaller dp
Comments (27)

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,

Sam


Created 2013-08-22 18:55:31 | Updated | Tags: unifiedgenotyper gatk
Comments (57)

Dear GATK Team,

I am receiving the following error while running GATK 1.6. Unfortunately, for project consistency I cannot update to a more recent version of GATK and would at least wish to understand the source of the error. I ran ValidateSamFile on the input bam files and they appear to be OK.

Any insight or advice would be greatly appreciated:

`##### ERROR ------------------------------------------------------------------------------------------

ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions? at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:425) at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:374) at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:370) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:445) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:176) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:196) at org.broadinstitute.sting.gatk.walkers.genotyper.IndelGenotypeLikelihoodsCalculationModel.getLikelihoods(IndelGenotypeLikelihoodsCalculationModel.java:212) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:235) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:164) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:302) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:115) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:78) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:63) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:248) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 1.6-22-g3ec78bd):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
ERROR
ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
ERROR ------------------------------------------------------------------------------------------`

Abbreviated commandline used:

GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -et NO_ET \ -R "Saccharomyces_cerevisiae/UCSC/sacCer2/Sequence/WholeGenomeFasta/genome.fa" \ -dcov 5000 -I "someFile.bam" --output_mode EMIT_ALL_SITES -gvcf -l OFF \ -stand_call_conf 1 -L chrIV:1-1531919


Created 2013-06-12 09:22:28 | Updated | Tags: unifiedgenotyper
Comments (31)

Dear GATK team, I have used the UG following the best practice GATK workflow to call snps and Indels from exomeseq data of 24 human samples. First I called snps and Indels separately for each bam file, and I obtained separate vcfs. Then I decided to try to call the snps and Indels all in one go. I noticed that the output was quite different and the number of inser/delitions was higher when I called variants in contemporary (starting from separate bam files: -I sample1.bam -I sample2.bam...ETC). I also noticed that the called indels mostly were adjacent to tricky areas such as repetitive elements (ATATATATATATAT) or next to polyAAAAA. These snps and Indels weren't called by the UG when I called the variants separately. Is it more error prone to call variants in contemporary?


Created 2013-04-11 21:30:55 | Updated 2013-04-11 21:32:22 | Tags: unifiedgenotyper haplotypecaller miseq
Comments (9)

Hey there,

I'm currently working on a project that uses the MiSeq system to perform very deep sequencing (5000x coverage) on a small set of genes related to a particular cancer type. I have had some odd calls with HaplotypeCaller and UnifiedGenotyper, especially near the edges of the sequenced regions. It seems that often Haplotype calls very large deletions and Unified calls SNPs, e.g. (from the VCFs):

Haplotype Caller:

chr2 212576988 . TATTTTTAATTGTA T 13.95 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.805;ClippingRankSum=-0.916;DP=1000;FS=8.285;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.836;QD=0.00;ReadPosRankSum=0.528 GT:AD:GQ:PL 0/1:308,53:42:42,0,8418

Unified Genotyper:

chr2 212576990 . T C 555.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.991;DP=1000;Dels=0.00;FS=0.000;HaplotypeScore=22.6027;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.073;QD=0.56;ReadPosRankSum=0.164 GT:AD:DP:GQ:PL 0/1:860,130:3743:99:584,0,32767

A screenshot from IGV showing some of the (few) BAM records in the area of these calls is attached.

Many of these false variant calls can be removed through hard filtering, but I was a little surprised to see them called in the first place, because our raw results had been quite good with lower-coverage data. Is this to be expected with high coverage data?

I am also wondering if there is a hard cap after which downsampling occurs. I set -dcov 10000 for all our samples, but the depth reported in the DP tag seemed max out at 1000. Is there another parameter I need to change? In some areas, due to the very small regions being sequenced, I have coverage in excess of 15,000 reads, so downsampling to 1000 might skew the results.

Any guidance is appreciated! Cheers, Natascha


Created 2013-03-26 11:34:41 | Updated | Tags: unifiedgenotyper snp snps
Comments (3)

I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ?

I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.

Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.


Created 2013-02-22 16:49:52 | Updated 2013-02-22 16:51:24 | Tags: unifiedgenotyper argument
Comments (5)

Previously I have been running a command like this:

  java -jar /path/GenomeAnalysisTK.jar \ 

  -T UnifiedGenotyper \

  -R /path/human_g1k_v37.fasta \

  -et NO_ET \

  -K /path/key \

  -out_mode EMIT_ALL_SITES \

  --input_file /path/bam \

  -L /path/intervals \

  -gt_mode GENOTYPE_GIVEN_ALLELES \

  --alleles /path/vcf \

  --dbsnp /path/dbsnp_135.b37.vcf \

  -o /path/my.vcf

But I was reading the documentation again and I read this statement: GENOTYPE_GIVEN_ALLELES only the alleles passed in from a VCF rod bound to the -alleles argument will be used for genotyping

Which lead me to believe that there wasn't a need to include the lines: --input_file /path/bam \ -L /path/intervals \

because it would be redundant. But when I try to run without those line I get back an error message: Walker requires reads but none were provided.

Can you give an explaination as to why both of those lines AND GENOTYPE_GIVEN_ALLELES would be needed?


Created 2012-11-29 17:51:33 | Updated | Tags: unifiedgenotyper depthperallelebysample
Comments (12)

Hello,

I found some strange entries for indels in my VCF file created by the Unified Genotyper. For example:

4 184513470 . TC T 4009 PASS AC=4;AF=0.250;AN=16;BaseQRankSum=1.972;DP=1315;DS;FS=3.466;HaplotypeScore=537.6937;MLEAC=4;MLEAF=0.250;MQ=52.55;MQ0=0;MQRankSum=-10.581;QD=4.55;ReadPosRankSum=-10.128;SB=-3.500e+01;set=variant2 GT:AD:DP:GQ:PL 0/1:230,0:239:99:282,0,5011 0/0:92,0:95:99:0,133,2435

The first sample has genotype 0/1 with a good GQ value. However, according the allele depth field, there is no read supporting the deletion. When I look at the reads using the IGV, I find some reads supporting the deletion for the first sample (and even some for the second one).

Moreover, when I looked at the AD values for SNPs, I noticed the the sum of all AD values is much less than the coverage shown in the IGV. I filtered duplicated reads in the IGV.

Can someone please give an explanation? This link http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_DepthPerAlleleBySample.html explains the difference between AD and DP, but does not help in my case.

Best greetings, Hans-Ulrich


Created 2012-10-11 21:54:39 | Updated 2012-10-18 01:33:56 | Tags: unifiedgenotyper badmatefilter mappingqualityfilter
Comments (13)

It was a bit unclear what the BadMateFilter is doing in the documentation. Any information would be appreciated!


Created 2012-09-10 10:50:14 | Updated 2012-09-10 15:30:14 | Tags: unifiedgenotyper haplotypecaller
Comments (44)

Hi,

how does the speed of the haplotypeCaller usually compare to that of the UnifiedGenotyper?

When I try to use it, it seems to be about 90x slower.

these are the command lines I use:

java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T UnifiedGenotyper -dcov 1000 -I file.bam -o file.vcf -L myTarget

java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T HaplotypeCaller -dcov 1000 -I file.bam -o file.vcf -L myTarget

thanks!


Created 2012-08-10 18:19:46 | Updated 2013-01-07 19:45:12 | Tags: unifiedgenotyper vcf indels
Comments (14)

Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.

Here are deletions:

d02 264482 . TT T 389.93 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.224;DP=25;FS=0.000;HaplotypeScore=58.0966;MLEAC=2;MLEAF=1.00;MQ=47.51;MQ0=0;MQRankSum=1.714;QD=15.60;ReadPosRankSum=0.075;SB=-1.112e+02 GT:AD:DP:GQ:PL 1/1:15,9:20:9:431,9,0

Ld04 26597 . CCC C 2923.96 . AC=2;AF=1.00;AN=2;BaseQRankSum=-2.217;DP=98;FS=2.954;HaplotypeScore=145.2957;MLEAC=2;MLEAF=1.00;MQ=59.82;MQ0=0;MQRankSum=0.283;QD=29.84;ReadPosRankSum=1.150;SB=-1.306e+03 GT:AD:DP:GQ:PL 1/1:58,31:98:99:2966,195,0

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Ld04 26880 . T TTATT 4973 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.552;DP=99;FS=2.935;HaplotypeScore=116.0082;MLEAC=2;MLEAF=1.00;MQ=59.46;MQ0=0;MQRankSum=0.192;QD=50.23;ReadPosRankSum=1.834;SB=-1.950e+03 GT:AD:DP:GQ:PL 1/1:49,38:98:99:4973,247,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0

Ld04 19963 . A G 3571.01 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.895;DP=96;Dels=0.00;FS=0.000;HaplotypeScore=2.4678;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=1.769;QD=37.20;ReadPosRankSum=-0.316;SB=-1.523e+03 GT:AD:DP:GQ:PL 1/1:3,93:96:99:3571,205,0

I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools.

PS I used this command line of V 2.0-38:

java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf

Thank you.

Hideo


Created 2012-08-01 16:50:30 | Updated 2012-08-01 16:55:22 | Tags: unifiedgenotyper fslockwithshared
Comments (22)

We have a SGE environment where we run our GATK jobs. Lately we have had a few cases of UnifiedGenotyper getting hung up while trying to read our previously created .fai and .dict reference files. On one occasion we had 6 jobs on different nodes all hung for about 5 days before finally giving the FSLockWithShared - WARNING messages and then continuing on just fine. We haven't been able to find out why this is happening. Do you know of anything that would cause UnifiedGenotyper to try and read those files before ultimately throwing the warning messages 5 days later?

We looked into the possibility of other users/jobs having an exclusive lock on those reference files, but we had other UnifiedGenotyper jobs run fine during that same time period without these problems.

We also haven't been able to find any hardware problems yet. The jobs with this problem were on multiple nodes, so I thought I would see if you have any ideas.

Thanks for your help!

Using strace we were able to find that all 6 of the UnifiedGenotyper processes were hung in the same spot:

$ strace -p 22924
Process 22924 attached - interrupt to quit
futex(0x41c319d0, FUTEX_WAIT, 22925, NULL <unfinished ...>

$ strace -p 22925
Process 22925 attached - interrupt to quit
write(1, "WARN  12:39:09,312 FSLockWithSha"..., 158) = 158
write(1, "INFO  12:39:09,364 ReferenceData"..., 116) = 116
write(1, "INFO  12:39:09,364 ReferenceData"..., 89) = 89
open("/reference/sequence/human/ncbi/37.1/allchr.fa.fai", O_RDONLY) = 16
fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0
fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0
fcntl(16, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=0, len=9223372036854775807} <unfinished ...>

Here is the log file for one of the jobs:

Thu Jul 19 09:42:53 CDT 2012
INFO  09:42:58,047 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:42:58,048 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.6-7-g2be5704, Compiled 2012/05/25 16:27:30
INFO  09:42:58,048 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  09:42:58,048 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO  09:42:58,048 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO  09:42:58,049 HelpFormatter - Program Args: -R /reference/sequence/human/ncbi/37.1/allchr.fa -et NO_ET -K /p
rojects/bsi/bictools/apps/alignment/GenomeAnalysisTK/1.6-7-g2be5704//key -T UnifiedGenotyper --output
_mode EMIT_ALL_SITES --min_base_quality_score 20 -nt 4 --max_alternate_alleles 5 -glm BOTH -L /data2/bsi/target.bed -I /data2/bsi/H-001.chr22-sorted.bam --out /data2/bsi/variants.chr22.raw.all.vcf
INFO  09:42:58,050 HelpFormatter - Date/Time: 2012/07/19 09:42:58
INFO  09:42:58,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:42:58,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:42:58,166 GenomeAnalysisEngine - Strictness is SILENT
WARN  12:39:09,312 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.dict: Protocol family not supported.
INFO  12:39:09,364 ReferenceDataSource - Unable to create a lock on dictionary file: Protocol family not supported
INFO  12:39:09,364 ReferenceDataSource - Treating existing dictionary file as complete.
WARN  12:39:12,527 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.fa.fai: Protocol family not supported.
INFO  12:39:12,528 ReferenceDataSource - Unable to create a lock on index file: Protocol family not supported
INFO  12:39:12,528 ReferenceDataSource - Treating existing index file as complete.
INFO  12:39:12,663 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:39:12,954 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.29
INFO  12:39:13,108 MicroScheduler - Running the GATK in parallel mode with 4 concurrent threads
WARN  12:39:13,864 UnifiedGenotyper - WARNING: note that the EMIT_ALL_SITES option is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode
INFO  12:39:14,361 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:39:14,568 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.21
INFO  12:39:14,569 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:39:14,627 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO  12:39:14,628 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:39:14,686 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO  12:39:15,123 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
...
Tue Jul 24 12:47:32 CDT 2012