# Tagged with #UnifiedGenotyper 4 documentation articles | 1 announcement | 94 forum discussions

Created 2013-08-23 21:34:04 | Updated 2014-10-24 16:21:11 | Tags: unifiedgenotyper official haplotypecaller ploidy

Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.

As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.

Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

• If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
• If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
• If you want to analyze more than 100 samples at a time (for performance reasons) (versions 2.x)

Created 2013-06-17 21:39:48 | Updated 2015-05-08 22:01:39 | Tags: unifiedgenotyper variant-discovery locus-based

### Note: the UnifiedGenotyper has been replaced by HaplotypeCaller, which is a much better tool. UG is still available but you should really consider using HC instead.

#### Objective

Call variants on a haploid genome with the UnifiedGenotyper, producing a raw (unfiltered) VCF.

• TBD

#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Ploidy (-ploidy)

In its basic use, this is the ploidy (number of chromosomes) per sample. By default it is set to 2, to process diploid organisms' genomes, but it can be set to any other desired value, starting at 1 for haploid organisms, and up for polyploids. This argument can also be used to handle pooled data. For that purpose, you'll need to set -ploidy to Number of samples in each pool * Sample Ploidy. There is no fixed upper limit, but keep in mind that high-level ploidy will increase processing times since the calculations involved are more complex. For full details on how to process pooled data, see Eran et al. (in preparation).

• Genotype likelihood model (-glm)

This is the model that the program will use to calculate the genotype likelihoods. By default, it is set to SNP, but it can also be set to INDEL or BOTH. If set to BOTH, both SNPs and Indels will be called in the same run and be output to the same variants file.

• Emission confidence threshold (–stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (–stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

### 2. Call variants in your sequence data

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R haploid_reference.fa \
-L 20 \
--glm BOTH \
--stand_call_conf 30 \
--stand_emit_conf 10 \
-o raw_ug_variants.vcf


This creates a VCF file called raw_ug_variants.vcf, containing all the sites that the UnifiedGenotyper evaluated to be potentially variant.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Created 2012-07-30 17:37:12 | Updated 2015-04-26 02:30:27 | Tags: unifiedgenotyper haplotypecaller snp bamout debugging

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

### 1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

### 2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.

### 3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

### 4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.

### 5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

### 6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site, you are probably seeing false positives.

Created 2012-07-26 14:50:55 | Updated 2014-10-24 00:55:34 | Tags: unifiedgenotyper official ploidy haploid analyst intermediate

In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.

### Ploidy-related functionalities

As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.

For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the -ploidy argument.

### Cases where ploidy needs to be specified

1. Native variant calling in haploid or polyploid organisms.
2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
3. Pooled validation/genotyping at known sites.

For normal organism ploidy, you just set the -ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).

## Important limitations

Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF, and Genotype annotations such as PL, AD, GT, etc.

You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.

Created 2013-03-14 12:29:25 | Updated 2014-12-01 17:34:07 | Tags: unifiedgenotyper official haplotypecaller printreads bug

As reported here:

If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.

Thank you for your patience while we work to fix this issue.

### Latest update: we found that the three tools (PrintRead, HaplotypeCaller and UnifiedGenotyper) had different issues that only produced the same symptom (the ArrayIndexOutOfBoundsException error). The underlying issues have all been fixed in version 2.5. If you encounter an ArrayIndexOutOfBoundsException in later versions, it is certainly caused by a different problem, so please open a new thread to report the issue.

Created 2015-05-11 13:57:37 | Updated | Tags: unifiedgenotyper depthofcoverage haplotypecaller mendelianviolations

Hi,

Two questions, which relate to Unified Genotyper or Haplocaller:

1. Detecting and determining the exact genotype of an individual seems to be inaccurate at read depths less than 10X. This is because there aren't enough reads to accurately say whether it could be heterozygous. Is there an option in either Unified Genotyper or Haplocaller, that excludes sites in individuals that have below 10X? Alternatively how can these sites be removed from the vcf - or set to missing ?

2. I'm sure I've read somewhere that pedigree information can be supplied to a variant caller (such as Unified Genotyper or Haplocaller), and used to improve the calling accuracy/speed/efficiency. I am calling variants on one parent and multiple offspring.

Apologies if these questions are answered in the user manual. I regularly look it but have not found much to answer my questions.

Cheers,

Blue

Created 2015-05-10 19:31:44 | Updated | Tags: unifiedgenotyper duplicatereadfilter

Hi All,

My understanding of DuplicateReadFilter is that is removes PCR duplicates (ie: identical reads that map to the same location in the genome). Is there a way to prevent UnifiedGenotyper from using this filter?

Many of my 'duplicate' reads are not really duplicates. I have 50bp single ended yeast RNA-seq data, and 10 samples/lane results in highly over-sampled data. Removing duplicates results in an undercounting of reads at the most highly expressed genes, and therefore an increased number of sub-clonal variants at these genes (because the reads from the major clonal population are discarded, but reads of sub-clonal populations are kept because they have a different sequence). At least I think that is what is happening.

-Lucas

Created 2015-05-06 14:48:11 | Updated | Tags: unifiedgenotyper strand-bias

Hi there, I have been struggling with the interpretation of the SOR annotation. I do get that a higher value is the sign of a strand bias and that this value is never negative.

I did read the SOR annotation documention but still cannot figure out how you calculate the SOR.

Here is one of my example where a SB is present for sure REF + strand = 2185 reads REF - strand = 5 reads ALT + strand = 7 reads ALT - strand = 2370 reads.

When I calculate "R" as indicated in the documentation, I obtain a very high value of 147 955. But for this variant in the VCF file, SOR = 11.382

Thanks Manon

Created 2015-05-06 09:14:36 | Updated | Tags: unifiedgenotyper low-coverage

Is it possible to call variants using unifiedgenotyper from data with less than 1 X coverage per sample? We have a total of 25 samples and each has a coverage of < 1 X. does changing dcov parameter to 50 or lower work?

Created 2015-04-30 12:15:29 | Updated | Tags: unifiedgenotyper multi-sample

Dear GATK,

I've got an issue using the UnifiedGenotyper, it does not output all possible variants sharing the same start coordinate.

My command line: GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ReferenceGenome.fasta -I list_of_Samples.list -ploidy 1 -glm BOTH --heterozygosity 0.001 --indel_heterozygosity 0.005 -gt_mode DISCOVERY -dcov 1000 -o output_raw_variants.vcf

Using "Batch 1" , about 50 Plasmodium falciparum samples, I identified the following insertion in Sample1, which I know is real :

chrom14 2261798 . T TTA 1576.91 1:2,42:59:99:1:1.00:1609,0

However when I run the same command line with Batch1 + Batch2, total of 100 samples, I get the following result for Sample1:

chrom14 2261798 . T TTATATA 23812.75 0:39,5:59:99:0:0.00:0,775

Some samples from Batch2 have a longer insertion starting at the same coordinate, and now the original insertion in Sample1 does not appear in the VCF anymore... Why UnifiedGenotyper did not output multiple lines in this example? (for some other positions it did output multiple lines sharing the same coordinate) I did not change the parameter --max_alternate_alleles 6.

Antoine

Created 2015-04-30 06:34:42 | Updated 2015-05-01 15:02:14 | Tags: unifiedgenotyper ploidy queue

Hi, I have several questions about the function "UnifiedGenotyper" and its corresponding Qscript.

1. I first test this function on a single chromosome of my chicken sample (bam file is about 800M) to call SNPs, and set the ploidy value to 30. It took me half a day to finish. Is it normal that it took so long for a single chromosome?

As a result of this, I tried to use the scatter-gather to parallel my job in order to get the full use of my multi-node server. I built my own script based on modifying the ExampleUnifiedGenotyper.scala. I defined "ploidy" as an argument in my Qscript, then I assign a value of 30 to ploidy in my bash command line.

java -Djava.io.tmpdir=$tmp_dir \ -jar /sw/apps/bioinfo/GATK-Queue/3.2.2/Queue.jar \ -S /path/to/ug.scala \ -R$ref_file \
-I /path/to/bamfile.bam \
-o /path/to/my.vcf  \
-ploidy 30 \
-run


2.The script ran successfully, but the output VCF file was based sample_ploidy=2. It seems that it does not respond to the argument "ploidy". Then if I add one command line in the def script() part, which is" genotyper.sample_ploidy = qscript.sample_ploidy ". GATK will give warning of errors "sample_ploidy is not a member of TestUnifiedGenotyper.this.UnifiedGenotyperArguments". Could you please help point out where I did wrong?

Also, I have some questions about how to assign values to argument such as " this.memoryLimit" and "genotyper.scatterCount".

1. Based on the explanation given in the example scala file, argument "this.memoryLimit" sets the memory limit for each command. So in my case, I have only one command in my script, so that I can assign as much memory as I have to it? I am running on a server that has multiple-node, each node with 16 cores, and each core has 8G RAM. So in my understanding, if I run this job on one node with 16 cores, can I assign "this.memoryLimit=128" (16*8)? Do I understand correctly?

2. about the scatterCount, can I assign it with the number of cores I have in the server? That is, based on the example I listed above, 16.

Thank you very much for your help and time! Attached is my Qscript in relating with my question.

My Qscript is as follows:

package org.broadinstitute.gatk.queue.qscripts.examples

class TestUnifiedGenotyper extends QScript {
qscript =>
@Input(doc="The reference file for the bam files.", shortName="R")
var referenceFile: File = _

@Input(doc="Bam file to genotype.", shortName="o")
var outputFile: File = _

@Input(doc="Bam file to genotype.", shortName="I")
var bamFile: File = _

@Input(doc="An optional file with dbSNP information.", shortName="D", required=false)
var dbsnp: File = _

@Argument(doc="Ploidy (number of chromosomes) per sample.", shortName="ploidy")
var sample_ploidy: List[String] = Nil

trait UnifiedGenotyperArguments extends CommandLineGATK {
this.reference_sequence = qscript.referenceFile
this.memoryLimit = 2
}

def script() {
val genotyper = new UnifiedGenotyper with UnifiedGenotyperArguments

genotyper.scatterCount = 3
genotyper.input_file :+= qscript.bamFile
// genotyper.sample_ploidy = qscript.sample_ploidy
genotyper.out = swapExt(outputFile, qscript.bamFile, "bam", "unfiltered.vcf")

}
}


Created 2015-04-28 08:17:21 | Updated | Tags: unifiedgenotyper haplotypecaller coverage bias

Hi,

I am doing joint variant calling for Illumina paired end data of 150 monkeys. Coverage varies from 3-30 X with most individuals having around 4X coverage.

I was doing all the variant detection and hard-filtering (GATK Best Practices) process with both UnifiedGenotyper and Haplotype caller.

My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?

In particular, consider pairwise differences across individuals: The absolute values are lower for low coverage individuals than for high coverage, for both methods, since it is more difficult to make calls for them. However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals by substracting from the total reference length all the filtered sites and sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (If both pairs consist of members of similar populations).

However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.

Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would trade in more false positive if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity) which is the average cross individuals diversity and thus already at the higher-end for within individual heterozygosity. I would expect the problem to be even worse if I chose lower values.

Can you give me any recommendation, should I go back using UnifiedGenotyper or is there any way to solve this problem?

Created 2015-04-20 10:57:05 | Updated 2015-04-20 11:03:28 | Tags: unifiedgenotyper combinevariants haplotypecaller genotype-given-alleles

Question: Is it possible to have CV merge like bcftools does it?

I get this warning, when running UG in GGA mode using an -alleles vcf generated with CV:

WARN  10:17:21,394 GenotypingGivenAllelesUtils - Multiple valid VCF records detected in the alleles input file at site 20:106089, only considering the first record


I made this call with HC from 10 samples:

20  106089  .   CA  C


And this call with UG from 10 other samples:

20  106089  .   C   A


CV merges like this:

20  106089  .   C   A
20  106089  .   CA  C


bcftools merges like this:

20  106089  .   CA  AA,C


The UG recall from the CV generated -alleles vcf is incomplete:

20  106089  .   C   A


The UG recall from the bcftools generated -alleles vcf is complete:

20  106089  .   CA  AA,C


Is it possible to have CV merge like bcftools does it?

I'm really not sure. It's not a use case that UG was designed for (with UG we kept SNPs and indels separate until post-analysis), so I would recommend being cautious with it.

I checked the genotypes and UG seems to handle merged MNPs and indels just fine; see below. But I will do some additional testing. Or I might just take the safe path and do the recalling separately for SNPs and indels as suggested. The reason I have UG and HC calls in the first place is because I have low and high coverage data for different cohorts. I want to create a merged dataset.

Despite --interval_padding 100 helping to recall more sites with HC in GGA mode as per previous recommendation, some sites still fail to be called with HC in GGA mode. Hence I opted for UG.

UG calls on samples 1-10:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578
20  106089  .   C   A   16.19   .   AC=2;AF=0.125;AN=16;BaseQRankSum=-0.854;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=1.5282;MLEAC=2;MLEAF=0.125;MQ=58.74;MQ0=0;MQRankSum=-0.560;QD=2.70;ReadPosRankSum=-1.797;SOR=0.935;VariantType=SNP  GT:AD:DP:GQ:PL  0/0:3,0:3:6:0,6,76  0/0:4,2:6:9:0,9,115 0/1:3,1:4:24:24,0,80    0/0:6,0:6:12:0,12,130   0/1:1,1:2:29:30,0,29    ./. 0/0:7,0:7:15:0,15,188   0/0:3,1:4:6:0,6,74  ./. 0/0:5,0:5:12:0,12,142


HC calls on samples 11-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  585 590 622 625 628 640 655 668 687 693

20  106089  .   CA  C   47.95   .   AC=5;AF=0.250;AN=20;BaseQRankSum=0.925;DP=36;FS=1.850;InbreedingCoeff=0.0646;MLEAC=5;MLEAF=0.250;MQ=59.48;MQ0=0;MQRankSum=0.175;QD=3.00;ReadPosRankSum=-1.725;SOR=0.387 GT:AD:GQ:PL 0/0:2,0:6:0,6,49    0/0:2,0:6:0,6,49    0/0:3,0:12:0,12,130 0/0:5,0:15:0,15,122 0/0:2,0:6:0,6,46    0/1:2,1:14:14,0,39  0/1:2,1:15:15,0,38  0/0:4,0:12:0,12,93  0/1:3,1:12:12,0,46  1/1:0,3:9:67,9,0


UG GGA recalls on samples 1-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578 585 590 622 625 628 640 655 668 687 693
20  106089  .   CA  AA,C    110.56  .   AC=0,8;AF=0.00,0.222;AN=36;DP=81;FS=0.000;InbreedingCoeff=0.5076;MLEAC=0,6;MLEAF=0.00,0.167;MQ=58.56;MQ0=0;QD=3.45;SOR=0.859;VariantType=MULTIALLELIC_MIXED GT:AD:DP:GQ:PL:SB   0/0:0,0,0:3:0:0,0,0,6,6,52:0,0,0,0  0/2:0,0,1:6:0:5,5,5,0,0,109:0,0,1,0 0/2:0,0,1:4:0:12,12,12,0,0,47:0,0,1,0   0/0:0,0,0:6:0:0,0,0,17,17,123:0,0,0,0   0/0:0,0,0:2:0:0,0,0,3,3,10:0,0,0,0  ./. 0/0:0,0,0:7:0:0,0,0,9,9,60:0,0,0,0  0/2:0,0,1:4:0:12,12,12,0,0,61:0,0,0,1   ./. 0/0:0,0,1:5:0:0,0,0,4,4,30:0,0,0,1  0/0:0,0,0:3:0:0,0,0,6,6,49:0,0,0,0  0/0:0,0,0:3:0:0,0,0,9,9,76:0,0,0,0  0/0:0,0,1:4:0:0,0,0,1,1,22:0,0,1,0  0/0:0,0,0:7:0:0,0,0,18,18,149:0,0,0,0   0/0:0,0,0:4:0:0,0,0,11,11,76:0,0,0,0    0/2:0,0,1:5:0:9,9,9,0,0,65:0,0,0,1  0/2:0,0,1:4:0:12,12,12,0,0,60:0,0,0,1   0/0:0,0,0:5:0:0,0,0,15,15,116:0,0,0,0   0/2:0,0,1:6:0:12,12,12,0,0,47:0,0,0,1   2/2:0,0,3:3:9:67,67,67,9,9,0:0,0,3,0


http://gatkforums.broadinstitute.org/discussion/5249/overcalling-deletion-in-unifiedgenotyper-genotype-given-alleles-mode


P.S. I might gate crash your Cambridge party this week despite not being invited :D The course was already fully booked, when you announced it. I don't have a time machine!

Created 2015-03-17 10:47:42 | Updated | Tags: unifiedgenotyper non-human

Hi,

I've effectively got a sample of 200 individuals, that share one father but each have a different mother. I'm wondering how to best change the default parameters of Unified Genotyper to call variants without bias caused by assumptions of the algorithm.

Cheers,

Blue

Created 2015-03-12 16:15:59 | Updated | Tags: unifiedgenotyper snp error indel

Hello,

when I run the UnifiedGenotyper to call INDELs, I get the following error (detailed command line output below) HAPLOTYPE_MAX_LENGTH must be > 0 but got 0

When I call only SNPs instead, it doe not occur. I have searched to find an answer to why this is happening, but cannot figure out the reason. Could this be a bug? I get the error no matter if I run the Realigner before or not.

Did you observe this problem before?

Command line output: java -Xmx10g -jar ~/work/tools/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10

INFO 15:50:47,987 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:47,990 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 15:50:47,990 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:50:47,990 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:50:47,996 HelpFormatter - Program Args: -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10 INFO 15:50:48,002 HelpFormatter - Executing as rebecca@rebecca-ThinkPad-T440s on Linux 3.13.0-44-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_65-b32. INFO 15:50:48,002 HelpFormatter - Date/Time: 2015/03/12 15:50:47 INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,132 GenomeAnalysisEngine - Strictness is SILENT INFO 15:50:48,402 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 15:50:48,409 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 15:50:48,447 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 INFO 15:50:48,594 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 15:50:48,622 GenomeAnalysisEngine - Done preparing for traversal INFO 15:50:48,622 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 15:50:48,622 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 15:50:48,623 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:51:06,585 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace

java.lang.IllegalArgumentException: HAPLOTYPE_MAX_LENGTH must be > 0 but got 0 at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:97) at org.broadinstitute.gatk.utils.pairhmm.N2MemoryPairHMM.initialize(N2MemoryPairHMM.java:60) at org.broadinstitute.gatk.utils.pairhmm.LoglessPairHMM.initialize(LoglessPairHMM.java:66) at org.broadinstitute.gatk.utils.pairhmm.PairHMM.computeLikelihoods(PairHMM.java:194) at org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:461) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:124) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:270) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoods(UnifiedGenotypingEngine.java:317) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotypingEngine.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:379) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:151) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

##### ERROR ------------------------------------------------------------------------------------------

Created 2015-03-04 17:41:22 | Updated | Tags: unifiedgenotyper commandlinegatk

I'm encountering a problem similar to what I experienced with a mismatch between reference and cosmic files. Specifically, I created .bam files using human_g1k_v3.fasta, which reference coordinates as 1...22,X,Y etc. When I try to run UnifiedGenotyper with this reference file, the .bam file I created, and the dbsnp_137.b37.vcf, I get an error on account of the fact that snp coordinates are listed as chr1...chr22,chrX,chrY etc. Other than writing a script to remove all occurrences of "chr" is there another way to get around this problem, i.e. a dbsnp reference file that has the desired coordinates without the "chr"?

I resolved the problem with reference/cosmic by finding a cosmic file with consistent notation, but can't find a similar fix for this one. I'd appreciate any suggestions.

Created 2015-03-03 22:52:42 | Updated 2015-03-03 22:54:44 | Tags: unifiedgenotyper

Hello GATK Team,

I've just run the following using GATK v3.1.

gatk -R ref.fa -T UnifiedGenotyper -I Q11_Final.sorted.rg.bam -I Q11D1.sorted.bam -I Q11D4.sorted.bam -I Q11D2.sorted.bam -I Q11D5.sorted.bam -o PX.raw.vcf -nt 5 -glm SNP -ploidy 2

I then trimmed SNPs within 10bp of indels (called using the same but -glm indel).

The samples therein represent a mother (Sample Q11) and her 4 haploid sons.

The problem I'm getting is that calls that should be heterozygotic in the mother are not being called as such (see attached image; .bam file of the mother is int he bottom and the resulting VCF file is loaded in the top panel. The green cmd screen shows the VCF file line for the site in question where the 14th element of the indicates the queen's results (1/1:68,86:154:99:1837,120,0)

I'm not sure what's going on here and was wondering if you could help out?

Created 2015-02-24 19:52:54 | Updated | Tags: unifiedgenotyper snp indels

I am working with plants, and so cannot follow most of your human/cancer specific workflows etc. Also my question is probably not GATK specifc, but I would be very grateful for any kind of answer. I have used BWA to align a set of reads and the contigs assembled from these reads (de novo). I have used samtools mpileup and GATK unifiedGenotyper on the aligned reads. This results for both in a list of ONLY SNPs, the same ones (aside of sensitivity). I have also used the samtools steps on the aligned contigs, resulting in a list of ONLY indels. I can clearly see both SNPs and indels in IGV, both looking at the reads or the contigs. generally I see that reads are being used for variant calling, so is it "wrong" to use contigs? and why do I not get ANY indels in either tool's output when using reads? any ideas?

Created 2015-02-19 22:47:05 | Updated | Tags: unifiedgenotyper ploidy vcf

Hi, I'm running on the Unified Genotyper (Version=3.3-0-g37228af) on a pooled sample. The ploidy is set to 32. I'm trying to get allele frequency information. I'm trying to filter sites based on depth of coverage but the DP flag is not consistent within a loci

The DP at the start says there is a depth of 37, but at the end it says its 13. Judging from neighbouring sites, the actual depth seems to be 37, but I'm confused why it seems to only be using 13 of them. At other sites the scores match.

Created 2015-02-10 05:49:09 | Updated | Tags: unifiedgenotyper vcf

Hi, How do I know based on the REF and ALT column of a VCF file the actual position where an indel event happened? I usually see an event that occur in the 2nd base. For example, REF ALT GGCGTGGCGT G,GCGCGTGGCGT --deletion; insertion of "C" ATTT A,ATT --deletions

But I saw a VCF format documentation(http://samtools.github.io/hts-specs/VCFv4.2.pdf) allowing a different case: GTC G,GTCT --deletion; insertion of "T" at the end

How is UnifiedGenotyper formatting the indels? Does it differ from the one used in the 1000Genome Project? Thank you!

Roven

Created 2015-02-04 04:17:31 | Updated 2015-02-04 04:18:06 | Tags: unifiedgenotyper vcf

Hi

I'm wondering why some positions in VCF overlap. Can GATK skip emitting positions that are already part of an indel/concatenated ref? We need to use EMIT_ALL_SITES but it's quite confusing if a position is represented multiple times especially if indel is present. I understand that there is always a duplicated position preceding an indel, but the positions after the indel should ideally be not overlapping the neighboring positions. Please see some examples below:

chr02 28792823 . C . 34.23 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:12 chr02 28792823 . CAA . 10000037.27 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:AD:DP 0/0:12:12 chr02 28792824 . A . . . DP=12;MQ=57.31;MQ0=0 GT ./. chr02 28792825 . A . 31.22 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:7

chr02 314879 . T . 100.23 . AN=2;DP=27;MQ=55.82;MQ0=0 GT:DP 0/0:27 chr02 314879 . TAGAG T,TAG 647.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=27;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=55.82;MQ0=0;QD=23.97;RPA=8,6,7;RU=AG;STR GT:AD:DP:GQ:PL 1/2:0,9,11:26:99:989,445,424,395,0,335 chr02 314880 . A . 43.23 . AN=2;DP=26;MQ=55.66;MQ0=0 GT:DP 0/0:5 chr02 314881 . G . 40.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:4 chr02 314882 . A . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16 chr02 314883 . G . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16

This next example even calls a SNP for position 10221400 even if deletion occurs in 10221399.

chr02 10221399 . C . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17 chr02 10221399 . CA CCTA,C 143.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=8.42 GT:AD:DP:GQ:PL 1/2:0,8,6:17:99:527,131,112,185,0,143 chr02 10221400 . A C,T 287.29 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;Dels=0.29;FS=0.000;HaplotypeScore=6.7569;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=16.90 GT:AD:DP:GQ:PL 1/2:0,4,8:12:88:407,294,285,112,0,88 chr02 10221401 . A C 163.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.854;DP=17;Dels=0.00;FS=1.913;HaplotypeScore=9.7562;MLEAC=1;MLEAF=0.500;MQ=58.74;MQ0=0;MQRankSum=1.558;QD=9.63;ReadPosRankSum=2.764 GT:AD:DP:GQ:PL 0/1:11,6:17:99:192,0,392 chr02 10221402 . A . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17

Created 2015-02-02 21:19:12 | Updated | Tags: unifiedgenotyper snp-calling input-prior

I'm a bit confused about the instructions for --input_prior in UnifiedGenotyper but they seem quite useful. Is there any chance you could clarify or simplify ?

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php#--input_prior --input_prior / -inputPrior Input prior for calls By default, the prior specified with the argument --heterozygosity/-hets is used for variant discovery at a particular locus, using an infinite sites model, see e.g. Waterson (1975) or Tajima (1996). This model asserts that the probability of having a population of k variant sites in N chromosomes is proportional to theta/k, for 1=1:N There are instances where using this prior might not be desireable, e.g. for population studies where prior might not be appropriate, as for example when the ancestral status of the reference allele is not known. By using this argument, user can manually specify priors to be used for calling as a vector for doubles, with the following restriciotns: a) User must specify 2N values, where N is the number of samples. b) Only diploid calls supported. c) Probability values are specified in double format, in linear space. d) No negative values allowed. e) Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one. f) If user-defined values add to more than one, an error will be produced. If user wants completely flat priors, then user should specify the same value (=1/(2N+1)) 2N times,e.g. -inputPrior 0.33 -inputPrior 0.33 for the single-sample diploid case.

List[Double]

Created 2015-01-13 01:21:15 | Updated | Tags: unifiedgenotyper indels snps genotype-given-alleles

I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools:

bcftools norm -m +any


I remember from another thread that UG treats SNPs and InDels separately:

http://gatkforums.broadinstitute.org/discussion/3375/combine-snp-and-indel-vcf


But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA?

I only had a minor issue despite tabix indexing my bgzipped known alleles file:

##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz


Feel free to delete the original post of this question.

Created 2015-01-12 06:03:20 | Updated | Tags: unifiedgenotyper haplotypecaller downsample

Does UG and HC downsample before or after exclusion of marked duplicates? I guess code is king for the answer...

Created 2015-01-10 08:13:41 | Updated | Tags: unifiedgenotyper variantrecalibrator vqsr haplotypescore annotation

The documentation on the HaplotypeScore annotation reads:

HaplotypeCaller does not output this annotation because it already evaluates haplotype segregation internally. This annotation is only informative (and available) for variants called by Unified Genotyper.

The annotation used to be part of the best practices:

http://gatkforums.broadinstitute.org/discussion/15/best-practice-variant-detection-with-the-gatk-v1-x-retired


I will include it in the VQSR model for UG calls from low coverage data. Is this an unwise decision? I guess this is for myself to evaluate. I thought I would ask, in case I have missed something obvious.

Created 2014-12-11 00:55:58 | Updated 2014-12-11 01:47:11 | Tags: unifiedgenotyper ug allsitespl

Hi

I've been trying to use the --allSitePLs option with Unified Genotyper with GaTK v3.2-2-gec30cee.

My command is:

java -Xmx${MEM} -jar${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} -T UnifiedGenotyper \
--min_base_quality_score 30 \
-I ${in_dir}/"34E_"${REGION}"_ddkbt_RS74408_recalibrated_3.bam" \
-I ${in_dir}/"34E_"${REGION}"_ddber_RS86405_recalibrated_3.bam" \
--intervals $sites_dir/$file \
-o ${out_dir}/"38F_"${REGION}"_UG_allPL.vcf" \
--output_mode EMIT_ALL_SITES \
--allSitePLs


I ran the same command with and without the --allSitePLs option and compared, and the output seems strange.

Specifically, with --allSitePLs - the FILTER column: 5053700 sites = lowqual, 19303 = . and 5059568 had QUAL < 30. By contrast WITHOUT the --allSitePLs 5067411 = lowqual, 5592 = . and 11460 had QUAL < 30. I don't understand why does adding this option changes the QUAL so much when I'm running UG on the exact same data but just requesting that I get a PL for all sites.

Lastly, how is the specific ALT allele selected? Is it random - because they all seem equally unlikely in the example below, where only one allele was seen in both my samples.

## 2 lines from the output
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ddber_RS86405   ddkbt_RS74408
chr01   1753201 .   C   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:655,51,0,655,51,655,655,51,655,655:17:51:0,51,655  0/0:26,0:1010,78,0,1010,78,1010,1010,78,1010,1010:26:78:0,78,1010
chr01   1753202 .   T   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:630,630,630,630,630,630,48,48,48,0:17:48:0,48,630  0/0:26,0:1027,1027,1027,1027,1027,1027,78,78,78,0:26:78:0,78,1027


Created 2014-12-10 08:57:09 | Updated | Tags: unifiedgenotyper combinevariants haplotypecaller combinegvcfs

Hi GATK team, i would like to seek opinion from your team to find the best workflow that best fit my data. Previously i've been exploring both variant calling algorithms UnifiedGenotyper and HaplotypeCaller, and i would love to go for UnifiedGenotyper considering of the sensitivity and the analysis runtime. Due to my experimental cohort samples grows from time to time, so i've opt for single sample calling follow by joint-analysis using combineVariants instead of doing multiple-samples variant calling. However by doing so, i've experience few drawbacks from it (this issue was discussed at few forums). For a particular SNP loci, we wouldn't know whether the "./." reported for some of the samples are due to no reads covering that particular loci, or it doesn't pass certain criteria during variant calling performed previously, or it is a homo-reference base (which i concern this most and can't cope to lost this information).

Then, i found this "gvcf", and it is potentially to solve my problem (Thanks GATK team for always understand our researcher's need)!! Again, i'm insist of opt for unifiedGenotyper instead of haplotypeCaller to generate the gvcf, and reading from the forum at https://www.broadinstitute.org/gatk/guide/tagged?tag=gvcf, i would assume that as VCFs produced by "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" to be something alike with the gvcf file produced by HaplotyperCaller. However i couldn't joint them up using either "CombineVariants" or "CombineGVCFs", most probably i think "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" doesn't generate gvcf format.

Can you please give me some advice to BEST fit my need and with minimum runtime (UnifiedGenotyper would be my best choice), is there any method to joint the ALL_SITES vcf file produced by UnifiedGenotyper which i might probably missed out from the GATK page?

Created 2014-11-25 16:32:43 | Updated | Tags: unifiedgenotyper error

When I run this command:

java -jar /storage/app/GATK_3.3/GenomeAnalysisTK.jar -R /storage/data/gabriel/ucsc/hg19/hg19.fasta -T UnifiedGenotyper -I SRR493939.valid.bam -o teste2.vcf

This error happens:

WARN 04:34:22,257 RestStorageService - Error Response: PUT '/DAXKr9W5oe2LWMJhFdSveQd50kAK6tex.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 393, Content-MD5: 33uF1RoVpa9w9uTWFUxP0g==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: df7b85d51a15a5af70f6e4d6154c4fd2, Date: Tue, 25 Nov 2014 06:34:20 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:agZvhjUckV8Jnth+aubQX9KViUo=, User-Agent: JetS3t/0.8.1 (Linux/3.13.0-30-generic; amd64; en; JVM 1.7.0_65), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: B6DD058B64A98870, x-amz-id-2: ADKbyes3Qb8ovjofT4tKpP3gV3yPuc9v5C157qzTkfII2q3U2ZwEpUgzy0eIOz+J, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Tue, 25 Nov 2014 14:39:19 GMT, Connection: close, Server: AmazonS3] WARN 04:34:22,846 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 29096 seconds. Retrying connection.

Can someone help me?

Created 2014-11-18 14:19:42 | Updated | Tags: unifiedgenotyper genotype-given-alleles

I was running some samtools calls through UG with genotyping_mode GENOTYPE_GIVEN_ALLELES after reading this thread:

http://gatkforums.broadinstitute.org/discussion/1842/variant-annotator-not-annotating-mappingqualityranksumtest-and-readposranksumtest-for-indels


I got this error message:

##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Alignment 440969 | 36S26M1D6M1I2M1D1M3D27M1S


What did I do wrong? It works for most other fragments (chrom20:pos1-pos2). Is it possible to avoid this error message and just do ./. calls at coordinates not covered by reads? I'm not sure, why samtools called at a coordinate not covered by a read. I don't know at which coordinate this happened. I was at this coordinate, when the error happened:

INFO  10:21:06,036 ProgressMeter -       20:479014   2329564.0    55.0 m      23.6 m        0.6%     6.7 d       6.7 d


I was using this command:

vcf=samtools.vcf.gz
$java -Djava.io.tmpdir=tmp -Xmx63900m \ -jar$jar \
--analysis_type UnifiedGenotyper \
--reference_sequence $ref \ --input_file bam.list \ --dbsnp$dbsnp138 \
--out $out \ --intervals$vcf \
-stand_call_conf 10 \
-stand_emit_conf 10 \
--genotype_likelihoods_model BOTH \
--output_mode EMIT_ALL_SITES \
--annotation Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \

Error:

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 96 at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.getCache(DiploidSNPGenotypeLikelihoods.java:298) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.inCache(DiploidSNPGenotypeLikelihoods.java:262) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:229) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:190) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:176) at org.broadinstitute.sting.gatk.walkers.genotyper.SNPGenotypeLikelihoodsCalculationModel.getLikelihoods(SNPGenotypeLikelihoodsCalculationModel.java:114) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

##### ERROR MESSAGE: 96

I found this documented somewhat for Haplotype caller, but the solution didn't look applicable here. These are pooled samples so we must use unified genotyper.

Created 2014-09-29 19:32:51 | Updated | Tags: unifiedgenotyper

Hi Geraldine, I am confused when I read the paper and tutorial on GATK variant caller. First, the DePristo et al. 2011 paper says local re-aligner generates a list of candidate haplotypes using a) dbSNP info, b) presence of atleast one indel in a read and c) cluster of mismatches at a site. Then it finds the best alternative haplotype based on read haplotype likelihoods and uses log odds ratio to declare ref and alt haplotype pair. Second, the tutorial on GATK variant caller mentions Indel genotype likelihood calculation where UG estimates this via 1) first generate haplotypes from indels in the reads. 2) genotype likelihoods for all haplotype pairs 3) for each haplotype compute read haplotype likelihood using HMM

My question is do both the steps local realignment and UG do haplotype generation and then compute read haplotype likelihood. To me these two steps are redundant because if the best haplotype pairs are generated using local realignment then why is there a need to generate haplotypes for genotype likelihood estimation. I mean in theory, UG should just be able to call indels and assign them genotype by testing all the possible genotypes against the two haplotypes called by the local realigner. Please help me sort this out. I highly appreciate your concern.

Created 2014-09-18 20:32:00 | Updated | Tags: unifiedgenotyper haplotypecaller exome-sequencing

Hello everyone, I have recently sequenced 15 sample exomes using the Ion proton system, which directly generates VCFs after sequencing runs. However, since I will be using GATK for downstream analysis, I have decided to recall the variants using UG or HC. I read a document stating that HC should not be used in the case of pooled samples. Can someone clarify what exactly this means? For Ion proton, three samples are barcoded and pooled together prior to sequencing. Also, is 15 samples enough to obtain good discovery power? I was thinking of using external, publicly available bam files of exome sequencing data if 15 samples is not sufficient for joint variant calling.

Created 2014-09-15 21:28:59 | Updated | Tags: indelrealigner unifiedgenotyper indels gatk

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.

Created 2014-09-11 13:38:40 | Updated | Tags: unifiedgenotyper ploidy multiallelic pooled-calls

Hi,

Is GATK Unified genotyper able to call multi-allelic positions in a single pooled sample? Case is a pool of 13 samples, we use UG with ploidy set to 26. If I understand the supplementaries of the original publications correct, UG will never be able to call three alleles at a single position. in single sample calling. Or does this not hold for high ploidy analysis?

If needed, we can call multiple pools together, but this becomes computationally intensive.

In summary, we would like to call a 14xG,6xA,6xT call for example.

Also, how does UG take noise into account when genotyping (sequencing errors), when for example 3% of reads is aberrant at a position, this could correspond to ~ 1/26.

Thanks for any guidelines,

geert

Created 2014-09-10 18:50:45 | Updated | Tags: unifiedgenotyper joint-calling unifiedgenotyper-joint-calling

Can someone tell me what version of the UnifiedGenotyper introduced joint sample calling? I'm specifically interested in whether 1.6 supported joint calling.

Thanks!

Created 2014-08-12 03:54:48 | Updated | Tags: unifiedgenotyper

Hello, Is there any possibility that I can separate the reads in AD (Allele Depths by sample) into 5' and 3' strands? e.g. the current expression of AD is REF:ALT1:ALT2 ...etc, Can I get the expression of AD like REF_5':REF_3':ALT1_5':ALT1_3':ALT2_5':ALT2_3'

Created 2014-08-11 09:50:56 | Updated | Tags: unifiedgenotyper vcf qual

I used the EMIT_ALL_SITES option with Unified Genotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. But for monomorphic sites, (at this site, we have an homozygote for the reference allele), what is the definition of the quality score ? and how is it computed ? Many thanks for your explanation.

Created 2014-07-20 04:19:36 | Updated 2014-07-20 04:26:30 | Tags: unifiedgenotyper parallelism nct nt multiple-vcf

Hello,

It seems that while non-parallel UnifiedGenotyper (SNP mode) always generates the same output vcf, parallel version (either -nt and/or -nct >1) of UnifiedGenotyper generates not exactly the same vcf files for each run. For example, below is the GenotypeConcordance output of two runs with same input file (chr 21) and same parameters (-nt4 -nct4):

Sample Eval_Genotype Comp_Genotype Count
ALL HET HET 46341 ALL HET HOM_REF 0 ALL HET HOM_VAR 8 ALL HET MIXED 0 ALL HET NO_CALL 0 ALL HOM_REF HET 300 ALL HOM_REF HOM_REF 34306900 ALL HOM_REF HOM_VAR 0 ALL HOM_REF MIXED 0 ALL HOM_REF NO_CALL 18 ALL HOM_VAR HET 12 ALL HOM_VAR HOM_REF 0 ALL HOM_VAR HOM_VAR 23490 ALL HOM_VAR MIXED 0 ALL HOM_VAR NO_CALL 18 ALL Mismatching_Alleles Mismatching_Alleles 303 ALL NO_CALL HET 0 ALL NO_CALL HOM_REF 11 ALL NO_CALL HOM_VAR 24 ALL NO_CALL MIXED 0 ALL NO_CALL NO_CALL 729217

So I am wondering if there is a way to get the same output vcf. Or do I miss something here? Thanks.

Created 2014-07-16 09:46:17 | Updated | Tags: unifiedgenotyper snp-calling

hi,

is there a way to output all possible variants (the alternate allele)? i have total reads of 2100 whereby 2045 reads are match reference and 55 reads are alternate allele. My case is very specific, i would like GATK UnifiedGenotyper to call out my variants although the proportion of alternate read is relative low (1-5%) . And regardless of the quality score of alternate reads. Is there a way to specify this settings during the SNP calling in GATK UnifiedGenotyper?

thank you very much Best J

Created 2014-07-10 20:18:01 | Updated | Tags: unifiedgenotyper indels

I performed Variant Calling using UnifiedGenotyper and HaplotypeCaller for Whole Exome Sequenced samples.

GATK version used GATK 3.0

I had only 4 African samples, since did not have any controls I took 70 African BAMs from 1000Genomes. Used Agilent Sure Select WES Version 5 target bed file as interval file to limit the calls to exomes.

I couldot find a single INDEL in UnifiedGenotyper called data and hence couldnot run VQSR for indels on it. But HaplotypeCaller called 20,456 INDELs.

UnifiedGenotyper Called Data

# INDELs-0

HaplotypeCaller Called Data

# INDELs-20456

Is it possible that for variants called from 74 BAMs, UnifiedGenotyper could not find a single INDEL ??

Thanks, Tinu

Created 2014-07-10 10:59:18 | Updated | Tags: unifiedgenotyper multi-sample aneuploid

Hi all,

I am working with a set of samples at the moment which have the following problems:

1) The organism is aneuploid 2) Ploidy for individual chromosomes varies between samples (we have a script that estimates this from the coverage so we can use this information to improve variant calling) 3) Coverage is thin (sequencing was done on a student's budget)

I am using the UnifiedGenotyper for variant discovery as it can account for aneuploidy.

I initially tried calling variants for each sample, grouping chromosomes by ploidy (i.e, for sample 1 I call all the diploid chromosomes, then all the triploid chromosomes etc). I also tried doing multi-sample variant calling across all the samples, setting the ploidy to 2 (The modal chromosome number is 2). Comparison of these two analyses shows that each is missing good SNPs that are called by the other. Multi-sample analyses is missing or mis-calling SNPs on aneuploid chromosomes, and individual analysis is missing SNPs in individual samples due to thin coverage (or more accurately, they are being called, but failing QD or quality filters due to thin coverage - these SNPs pass these filters in the multi-sample analysis).

I am thinking of trying to combine these approaches. The idea is to analyse each chromosome separately. For each chromosome, I would do a multi-sample call on all the samples which are diploid and a multi-sample call on all the samples which are triploid etc etc. I am hoping that this will give me the best of the two approaches above.

So my questions are:

1) Does this strategy makes sense? Is there any reason not to do it this way?

2) How should I proceed downstream? I know I will have to hard-filter regardless. Can I merge all my vcf files before filtering and annotation, or is there a reason to keep them separate?

Any input from the GATK team or the wider community is appreciated!

Thanks

Kathryn

Created 2014-07-06 16:26:42 | Updated | Tags: unifiedgenotyper snp-calling

I'm seeing Mendelian errors in small but significant proportion of my unified genotyper calls. For many instances the SNP is called incorrectly as a homozygote even though there are some reads of the second allele. How can I set unified genotyper to be more sensitive ?

Created 2014-07-02 20:02:53 | Updated | Tags: unifiedgenotyper fastareference heapsize callvariants

Hi all,

Do you have a recommendation to estimate how much heap memory (-Xmx) is necessary to cal variants using the Unified Genotyper. I think that with my project I might be facing a situation where I will run out of memory until there is not more left to increase. To give you an idea, I have 185 samples (that together are 8Gb) and the fasta reference that I am using has too many scaffolds (3 Million). I don't have the opportunity to improve the reference I have at the moment. I have been using -Xmx52G and -nt 10 (in GATK 3.1) but it gives an error at the same point.

INFO 14:45:41,790 HelpFormatter - -------------------------------------------------------------------------------- INFO 14:45:42,773 GenomeAnalysisEngine - Strictness is SILENT INFO 14:59:01,106 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 14:59:01,171 SAMDataSourceSAMReaders - Initializing SAMRecords in serial ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java ##### ERROR ------------------------------------------------------------------------------------------ If you have a suggestion/advice of how to make the analysis work it would be very much appreciated. I know that increasing scaffolds length (reducing number of scaffolds) can improve the analysis so I am wondering if I am facing a situation where I can't do any analysis until the fasta reference is improved. Many thanks, Ximena Created 2014-06-16 15:43:27 | Updated | Tags: unifiedgenotyper Dear All, I am Calling SNPs using Unified Genotyper. I am using the Version GATK 2.5.2. Command Used: java -Djava.io.tmpdir=TMP -jar /GenomeAnalysisTK/2.5.2/GenomeAnalysisTK.jar -R Ref.fa -T UnifiedGenotyper -I input1.bam -I input2.bam -I input3.bam -o output.raw.vcf -stand_call_conf 50 -stand_emit_conf 20 -out_mode EMIT_ALL_CONFIDENT_SITES. Usually I get A summary of callable base counts: Like INFO 00:23:29,795 UnifiedGenotyper - Visited bases 247249719 INFO 00:23:29,796 UnifiedGenotyper - Callable bases 219998386 INFO 00:23:29,796 UnifiedGenotyper - Confidently called bases 219936125 INFO 00:23:29,796 UnifiedGenotyper - % callable bases of all loci 88.978 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - Actual calls made 303126 But I did not get it. Can you tell me how can I calculate this from my output VCF file now? Created 2014-06-16 07:02:28 | Updated | Tags: unifiedgenotyper exome target I am running some exome samples with UG. I tried this in 2 ways: 1. Run UG and then apply exome region filter. 2. Run UG with exome regions. Is there any difference in these approaches if I am only concerned about the variants in exome region? I have ran in both ways and on comparison I see that some values are different for the same locus. For eg. PL, MQRankSum, BaseQRankSum. However, the different is not huge but does UG perform slightly differently for making calls with target regions vs without? PS: I have ran this for UG with multiple samples Created 2014-05-28 22:56:35 | Updated | Tags: unifiedgenotyper heterozygosity priors If I specify both -inputPrior and -hets, which one will UnifiedGenotyper use to determine the prior? Is --heterozygosity used for anything other than prior specification? Created 2014-05-28 11:46:40 | Updated | Tags: unifiedgenotyper heterozygous snp-calling All of my whole-genome sequenced individuals are a cross of different outbred lines with one reference line - effectively identical to the reference genome used for read mapping and variant calling. Thus, effectively all SNPs are heterozygous relative to the reference genome. Should I alter the settings on the Unified Genotyper to maximise detection of these kind of variants ? Created 2014-05-23 15:26:18 | Updated 2014-05-23 15:32:12 | Tags: unifiedgenotyper haplotypecaller indels Hello, I used UnifiedGenotyper and HaplotypeCaller on my sample to get out INDELS and SNPS. There is a 147bp deletion in one of the genes(working with yeast genome)(see attached file). But this deletion is not reported by UnifiedGenotyper and HaplotypeCaller. Since I am working with yeast genome(which is comparatively very small compared to human genome), I am only interested in INDELS ranging from 150bp-200bp maximum. Is there a way by changing parameters that this can happen. I have read some of the previous questions about how haplotypecaller is unable to call large INDELS(in kb's), but I guess in my case INDELS max I am looking for is 200bp. Can this be achieved?? Also UnifiedGenotyper has an option to provide ploidy level. I could not locate that option in HaplotypeCaller. Any way to have that on HaplotypeCaller ? I ask you this because the yeast genome samples I work with are eseentially haploid. NOTE: My sample is paired end data with 100bp reads Hope to hear from you soon Regards Varun Created 2014-05-21 07:42:51 | Updated | Tags: unifiedgenotyper problem-with-the-file-locking-support We've used GATK(v2.1-8) in a SGE environment for more than 1 years now. Without any changing in environment, we have had a few cases of UnifiedGenotyper getting hung up while trying to read our previously created .fai and .dict reference files. FSLockWithShared - WARNING: Unable to lock file xxx/ucsc.hg19.dict; ReferenceDataSource - Unable to create a lock on dictionary file. Then GATK was updated to 2.7-4 and 3.1-1. There was still an error. "ERROR MESSAGE: Timeout of 30000 milliseconds was reached while trying to acquire a lock on file xxx/dbsnp_135.hg19.vcf.idx. Since the GATK uses non-blocking lock acquisition calls that are not supposed to wait, this implies a problem with the file locking support in your operating system." Besides, there is no problem reported when we run the UnifiedGenotyper step on data store node. This problem happens if we run the UnifiedGenotyper step on compute nodes. We used NFS for shared data between nodes. Is it the way, one is access the data directly and the other one is by NFS shared, leading this problem? If so, how can I fix this problem? Hoping someone can answer this problem as soon as possible. Thank you! Created 2014-05-07 08:07:34 | Updated | Tags: unifiedgenotyper trio-data Hi all, I was wondering if there is a recommended way how to call variants in trio data (affected children, healthy parents) using the Unified Genotyper. As I see it, there are three possible ways: 1. Call each sample individually 2. Call trios together 3. Call all samples together But which one would you recommend? Created 2014-04-29 12:21:43 | Updated | Tags: unifiedgenotyper bam Hi GATK team, I am using GATK to call random mutations caused by mutagen. I have more than 5 mutant lines sequenced and I want to call SNP/INDELs which are unique for each samples (since they are randomly introduced they must be unique for each samples). Majoirity of SNP will be shared between all mutant samples as this are the natural variations between line used for mutagenesis and the Reference sequence. I am using UnifiedGenotyper (because I have sequenced pooled DNA for each mutant family) by giving all the 5 bam files from samples. I wonder, by calling SNP from all 5 mutant samples together may negatively influence the unique SNPs? Does GATK consider all the 5 samples as a population and give preference (or high score to) locus which are shared between majority of samples? In my case, would it be better to call SNP/INDEL separately for all 5 samples? The command I am using is java -jar GenomeAnalysisTKLite-2.3-9-gdcdccbb/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -glm BOTH -R ref.fa -nt 10 -I results/1_sorted.bam_fixed_rmdup_realign.bam -I results/2_sorted.bam_fixed_rmdup_realign.bam -I results/3_sorted.bam_fixed_rmdup_realign.bam -I results/4_sorted.bam_fixed_rmdup_realign.bam -I results/5_sorted.bam_fixed_rmdup_realign.bam -o results/out.vcf I hope I am clear, and looking forward to learn more, Thank you in advance Created 2014-04-28 10:34:08 | Updated 2014-04-28 10:36:08 | Tags: unifiedgenotyper allele-counts I found that the allele counts of a position by GATK is different from those by IGV acculumation. For example, IGV gives (A:0, C:13, G:1, T:7) counts. But, the vcf file generated by GATK UnfiedGenotyper contains the following: chr1 10180 . T C 107.93 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-0.278;DP=17;Dels=0.00;FS=3.522;HRun=2;Haplot ypeScore=7.9790;MQ=25.98;MQ0=4;MQRankSum=-0.597;QD=6.35;ReadPosRankSum=1.663;SB=-67.79 GT:AD:DP:GQ:PL 0/1:7,9:17:98.17:138,0,98 I thought that there might be some filtered out bases. Did I guess right? Then, could you tell me What are the default filtering options in UnifiedGenotyper? Created 2014-04-16 14:11:37 | Updated | Tags: unifiedgenotyper haplotypecaller bug gatk HI, I'd like to report a weird result from HaplotypeCaller. We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B. However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected) Please see the VCF table: There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype. Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype. Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter. Many Thanks Betty Created 2014-04-14 17:14:33 | Updated | Tags: unifiedgenotyper Hello, I am running unified genotyper and for some reason in the log file it shows that all reads are being filtered out and no variants have been called in the vcf file. I get an empty vcf file with just the header. I could use some help regarding this. Created 2014-04-02 09:09:08 | Updated | Tags: unifiedgenotyper Dear, I want to output the number of all bases (ACGT) for each position using UnifiedGenotyper, in GATK2 we used the option -A BaseCounts. Now I am wondering what option I should use in GATK3, Thanks, Bram Created 2014-03-31 10:41:27 | Updated | Tags: unifiedgenotyper gatk Hi I am hitting with the following error when I try to use UnifiedGenotyper. Wondering if its a bug or something wrong with my bam file? Thanks for your help in advance. ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalArgumentException: Illegal base [K] seen in the allele at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:205) at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:213) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.fillQualsFromPileup(RankSumTest.java:159) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.annotate(RankSumTest.java:113) at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateGenotypes(UnifiedGenotyperEngine.java:560) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:234) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNanoTraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Illegal base [K] seen in the allele ##### ERROR ------------------------------------------------------------------------------------------ Cheers Scott.. Created 2014-03-28 13:58:31 | Updated | Tags: unifiedgenotyper multithreading multi-sample gatk unified-genotyper capacity Hi, How many samples and of what file/genome size can the Unified Genotyper process simultaneously ? I will have about 250 samples, genome size 180Mb, bam sizes 3-5GB, reduced bam sizes ~1GB. Also, how long would this take can it be multi-threaded ? Cheers, Blue Created 2014-03-19 05:40:04 | Updated | Tags: unifiedgenotyper input-output-error Dear GATK team I am trying to produce a mutli-sample VCF file, this is the command line I used: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R /home/mohameam/Pchabaudi/ref.fasta -I S1.bam -I S2.bam -I S3.bam -ploidy 1 -o raw.S1.S2.S3.vcf the program runs smoothly and the output file seems ok, however I get the following error msg in the end: ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.0-0-g6bad1c6): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Input/output error ##### ERROR ------------------------------------------------------------------------------------------ please advise ? thank you Alyaa Created 2014-03-12 08:28:47 | Updated | Tags: unifiedgenotyper gatk Hi, we observe an unexpected deletion call (GATK UnifiedGenotyper 3.0 and 2.5) in a setup with three samples called together. In the file call.txt you find the call. From the pileup (pileup.txt, position 19:57325555) we would expect, that the indel would only be called for the first sample. Is there another source of information, that makes GATK believe, that the deletion also occurs in the second and third sample? Thanks in advance, Johannes Created 2014-03-10 22:03:01 | Updated | Tags: unifiedgenotyper variantrecalibrator applyrecalibration snp indels Hi, I am running Variant Recalibration on Indels,prior to this I completed Variant Recalibration and ApplyRecalibration on SNPs. So,the input file is the recalibrated VCF file from Apply Recalibration step of SNP's. Below is the error I am getting. ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinsti tute.org/discussion/49/using-variant-annotator The command I am using for this is : jre1.7.0_40/bin/java -Djava.io.tmpdir=./rb_2905_VCF/tmp -Xmx2g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T VariantRecalibrator -R dbdata/human_g1k_v37.fasta -input${input_file} --maxGaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.vcf - resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.indels.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -recalFile $destdir/${input_file%recal.snps.vcf}recal.indel.recal -tranchesFile $destdir/${input_file%recal.snps.vcf}recal.indel.tranches -rscriptFile $destdir/${input_file%recal.snps.vcf}recal.indel.plots.R

If I remove the options -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum,then I get this error:

##### ERROR ------------------------------------------------------------------------------------------

Created 2014-02-27 15:25:19 | Updated 2014-02-27 22:40:50 | Tags: unifiedgenotyper genotype vcf-format gt

generated with gatk 2.8-1-g932cd3a

Although it is rare I see Genotype Fields that are inconsistent with the AD values (Read as table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802


This shows that for example sample 5 has a AD value of '23,12' and a GT of 'A/A' aka homyzougous reference allele. I've included a screenshot wich shows low base quality and complete strand bias (Which I suspect to mis variants). So whats the prob? and how can i recalculate the GT's based on AD? because i cannot filter based on genotypes when they are buggy....

Created 2014-02-26 13:11:55 | Updated | Tags: unifiedgenotyper vcf genotyping

Dear GATK team,

Could you please help me how to explain genotyping in my vcf file. I have Illumina data and vcf caller was GATK. My variant frequency (Alt variant freq) is 99.7%. DP = 4622 (AD = 16, 4606) - so I would expect that this sample is alternate homozygous. But when I check PL value, which is - PL = 1655,0,323 - after calculating my likelihood -

REF= G ALT= A

P(D|GG) = 10 ^ -165.5 = small
P(D|AG) = 10 ^ 0  = 1
P(D|AA) = 10 ^ 32.3 = small


we can see it is heterozygous. Can anybody help me how to interpret my result? How it is possible that likelihoods show me heterozygous and coverage and VF show me homozygous?

Here is part of my vcf file:

chr13 32899193 . G A 1625.01 PASS AC=1;AF=0.5;AN=2;DP=4622;QD=0.35;TI=NM_000059;GI=BRCA2;FC=Silent GT:AD:DP:GQ:PL:VF:GQX 0/1:16,4606:5000:99:1655,0,323:0.997:99

Thank you for any explanation.

Paul.

Created 2014-02-18 19:05:14 | Updated 2014-02-18 19:06:52 | Tags: unifiedgenotyper insertsizedistribution indels

Hi, I have run gatk UnifiedGenotyper on ~45 human whole exomes. I have noticed that the maximum size for called indels in these exomes is 60bp. My question is, is 60 a default for the maximum indel size? I can't seem to find information regarding the threshold for indel size called by UnifiedGenotyper. Is the maximum indel size a parameter that can be adjusted?

In case this is of any use; This is the command I am using:

java -jar /usr/local/gatk/2.6-4/GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T UnifiedGenotyper -glm INDEL -I bams.list -L Targets.bed -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 1000

Thank you very much for your support! Dalia

Created 2014-02-14 08:56:11 | Updated 2014-02-14 09:06:51 | Tags: unifiedgenotyper indels

Hi there,

I'm systematically comparing different read mappers right now. That is, I align reads and run MarkDup/IndelRealignment/Recalibration followed by the UnifiedGenotyper.

When using stampy, I only get SNP calls but no indels. I've extracted one example where the alignments clearly show a homozygous 14bp deletion that does not show up in the VCF. For comparison purposes, I've looked at GSNAP alignments at the same locus. There the deletion looks a bit less clear in the IGV but is nonetheless present in the VCF produced by UG (screenshot attached).

My call to UG (2.8-1-g932cd3a) is the following:

gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.stampy.bam -o example.stampy.vcf

gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.gsnap.bam -o example.gsnap.vcf


So my question boils down to: Why is the deletion called for GSNAP but not for STAMPY, although it looks much cleaner for STAMPY. I'm attaching example BAM files and UG output.

Your help is very much appreciated. Please let me know if there is any additional information I can provide.

Cheers, Tobi

Created 2014-02-13 17:30:45 | Updated | Tags: unifiedgenotyper haplotypecaller multi-sample pass

Hi, I have run GATK Unified genotyper with multisample and I am getting unexpected output. (Hard filters used)

Here are examples of the output format I get in the vcf for a sample: 0/1:43,4:PASS:24:24,0,4368, ./.:.:PASS or 0/0:2,0:2:DPFilter:6:0,6,47 as well as 0/0:12,0:37:0,37,830. The latter is the format I would expect. What does this mean? Also I get the filter PASS even when all but two samples have ./.:.:PASS and the other two have 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0 and 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0.

Also I ran HalpotypeCaller multisample and get similar output all though more of the 0/0:12,0:37:0,37,830 format....

Created 2014-02-11 02:58:14 | Updated | Tags: unifiedgenotyper ploidy vcf

I using the UnifiedGenotyper to call SNPs in a pooled sample of 30 diploid individuals (i.e., I am setting the ploidy to 60). Does this mean that if the coverage is < 60 at a given variant site, the vcf file will read "./." for all alleles at that site? In other words, does it require the coverage to be >= the ploidy or it won't produce a called variant at that site? I'm just trying to make sure I am interpreting the vcf file correctly, in that: if there is a called genotype at a given variant site that I can interpret that as the estimated allele frequency in that pool.

Best, Jon

Created 2014-02-08 00:18:03 | Updated | Tags: unifiedgenotyper

Dear all in GATK forums,

I am dealing with a slightly annoying feature of GATK, and I am wondering if there is something I can do to prevent it. I am using multisample calling for a large cohort and UG version 2.8.1 and I see in my VCF situations like this one: 5 96443177 . GC G 862.42 . 5 96443178 . CC C 859.42 .

The same individual is a carrier for both, and quite clearly this is the same variant (a deletion of a C) called in two different places. It is not a disaster but it messes up some downstream analyses (case control...). Would you have a suggestion to avoid that problem?

Created 2014-02-06 08:15:40 | Updated | Tags: unifiedgenotyper vcf

Hi, I'm trying to understand how do you usually calculate the GQ of a SNP. I understood the model to calculate the Likelihood of all the genotypes (AA,AC, AG,AT,CC,CG,CT,GG,GT,TT). Once all the likelihood have been calculated, correct me if I'm wrong, you normalize the likelihood of the best genotype to 1 and all the other likelihoods according to that scale. So the PL field in the VCF should be the Phred-scaled values of the normalized values. But is not clear to me how do you finally calculate the GQ value. What values do you use to calculate that quality (normalized or phred scaled)? And what's the right formula? I've tried to debug the code but it ends to be really tricky. I really hope that you would help me. I would be really thankfull for that.

Created 2014-02-05 11:33:30 | Updated | Tags: indelrealigner unifiedgenotyper java error

Hi All We are running into some random weirdness when running jobs using SGE, GATK version 2.7-2-g6bda569, pretty much all GATK tools - but mostly IndelRealigner abd UnifiedGenotyper, we often get the following error:-

##### ERROR MESSAGE: Couldn't read file /scratch/project/pipelines/novorecal.bam because java.io.FileNotFoundException: /scratch/project/pipelines/novorecal.bam (No such file or directory)

This also happens for supplied reference genomes and vcf files. The GATK tool cant find them.

These "missing" files do exist, and have often even been created by the previous tool/step in the pipeline.

When we re-run the pipeline on a failed sample, it works. So we end up having to re-run our pipeline on the same set of samples multiple times and are beginning to find this very frustrating. These errors seem to be random, I cant find any pattern, and as I mentioned, when we re-run the pipeline on a failed run, it work without a hitch.

Has anyone experienced this? And if so, any recommendations?

Steve

Created 2014-02-04 21:03:21 | Updated | Tags: unifiedgenotyper indels

Hi all,

I am running UnifiedGenotyper GLM=Indel Ploidy=1, which produces my raw indels, with no apparent problems. I then run BaseRecalibrator and PrintReads to generate a new BAM, which i am then recalling the indels. This time, I am getting no indels in the VCF (only headers). Curiously, this problem is only present in all 13/52 isolates that are most distant from the reference (expecting ~50K indels), while the rest have all worked (~6K indels). Any clues what might be causing this problem? Thanks, Rhys

Created 2014-01-31 13:44:10 | Updated | Tags: unifiedgenotyper

Hi, I encounter a strange error when using UnifiedGenotyper, The command that I use is: /usr/bin/java -Xmx4g -jar /home/shengyu_ni/bin/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /mnt/genotyping/sendru/human_g1k_v37.fasta -I bqsr.bam -o haplogroup.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -alleles /mnt/genotyping/sendru/isogg.vcf -nt 4 -ploidy 1 -L /mnt/genotyping/sendru/comprey.interval_list

and the error mesage is:

java.lang.ArrayIndexOutOfBoundsException: -1 at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoods.subsetToAlleles(GeneralPloidyGenotypeLikelihoods.java:380) at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:294) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutorWorker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: -1 ##### ERROR ------------------------------------------------------------------------------------------ I already notice that it can be some compatible problem with my file: isogg.vcf, because when I replace that file with another vcf file, it runs successfully, but I can not spot what makes the difference. so I list the header of the vcf file as the following, additionally, this vcf file works fine with realignment ## fileformat=VCFv4.1 ## fileDate=20140131 ## source=isogg ## isogg_BUILD_ID=2014 ## reference=GRCh37.p10 ## phasing=partial ## INFO=<ID=ISOGG,Number=0,Type=Flag,Description="Used as y chromosome haplogroup in ISOGG. "> ## FILTER=<ID=PASS,Description="Only pass snp is kept, the other possibles are private and investigation. "> ## contig=<ID=1,assembly=b37,length=249250621> ## contig=<ID=10,assembly=b37,length=135534747> ## contig=<ID=11,assembly=b37,length=135006516> ## contig=<ID=12,assembly=b37,length=133851895> ## contig=<ID=13,assembly=b37,length=115169878> ## contig=<ID=14,assembly=b37,length=107349540> ## contig=<ID=15,assembly=b37,length=102531392> ## contig=<ID=16,assembly=b37,length=90354753> ## contig=<ID=17,assembly=b37,length=81195210> ## contig=<ID=18,assembly=b37,length=78077248> ## contig=<ID=19,assembly=b37,length=59128983> ## contig=<ID=2,assembly=b37,length=243199373> ## contig=<ID=20,assembly=b37,length=63025520> ## contig=<ID=21,assembly=b37,length=48129895> ## contig=<ID=22,assembly=b37,length=51304566> ## contig=<ID=3,assembly=b37,length=198022430> ## contig=<ID=4,assembly=b37,length=191154276> ## contig=<ID=5,assembly=b37,length=180915260> ## contig=<ID=6,assembly=b37,length=171115067> ## contig=<ID=7,assembly=b37,length=159138663> ## contig=<ID=8,assembly=b37,length=146364022> ## contig=<ID=9,assembly=b37,length=141213431> ## contig=<ID=GL000191.1,assembly=b37,length=106433> ## contig=<ID=GL000192.1,assembly=b37,length=547496> ## contig=<ID=GL000193.1,assembly=b37,length=189789> ## contig=<ID=GL000194.1,assembly=b37,length=191469> ## contig=<ID=GL000195.1,assembly=b37,length=182896> ## contig=<ID=GL000196.1,assembly=b37,length=38914> ## contig=<ID=GL000197.1,assembly=b37,length=37175> ## contig=<ID=GL000198.1,assembly=b37,length=90085> ## contig=<ID=GL000199.1,assembly=b37,length=169874> ## contig=<ID=GL000200.1,assembly=b37,length=187035> ## contig=<ID=GL000201.1,assembly=b37,length=36148> ## contig=<ID=GL000202.1,assembly=b37,length=40103> ## contig=<ID=GL000203.1,assembly=b37,length=37498> ## contig=<ID=GL000204.1,assembly=b37,length=81310> ## contig=<ID=GL000205.1,assembly=b37,length=174588> ## contig=<ID=GL000206.1,assembly=b37,length=41001> ## contig=<ID=GL000207.1,assembly=b37,length=4262> ## contig=<ID=GL000208.1,assembly=b37,length=92689> ## contig=<ID=GL000209.1,assembly=b37,length=159169> ## contig=<ID=GL000210.1,assembly=b37,length=27682> ## contig=<ID=GL000211.1,assembly=b37,length=166566> ## contig=<ID=GL000212.1,assembly=b37,length=186858> ## contig=<ID=GL000213.1,assembly=b37,length=164239> ## contig=<ID=GL000214.1,assembly=b37,length=137718> ## contig=<ID=GL000215.1,assembly=b37,length=172545> ## contig=<ID=GL000216.1,assembly=b37,length=172294> ## contig=<ID=GL000217.1,assembly=b37,length=172149> ## contig=<ID=GL000218.1,assembly=b37,length=161147> ## contig=<ID=GL000219.1,assembly=b37,length=179198> ## contig=<ID=GL000220.1,assembly=b37,length=161802> ## contig=<ID=GL000221.1,assembly=b37,length=155397> ## contig=<ID=GL000222.1,assembly=b37,length=186861> ## contig=<ID=GL000223.1,assembly=b37,length=180455> ## contig=<ID=GL000224.1,assembly=b37,length=179693> ## contig=<ID=GL000225.1,assembly=b37,length=211173> ## contig=<ID=GL000226.1,assembly=b37,length=15008> ## contig=<ID=GL000227.1,assembly=b37,length=128374> ## contig=<ID=GL000228.1,assembly=b37,length=129120> ## contig=<ID=GL000229.1,assembly=b37,length=19913> ## contig=<ID=GL000230.1,assembly=b37,length=43691> ## contig=<ID=GL000231.1,assembly=b37,length=27386> ## contig=<ID=GL000232.1,assembly=b37,length=40652> ## contig=<ID=GL000233.1,assembly=b37,length=45941> ## contig=<ID=GL000234.1,assembly=b37,length=40531> ## contig=<ID=GL000235.1,assembly=b37,length=34474> ## contig=<ID=GL000236.1,assembly=b37,length=41934> ## contig=<ID=GL000237.1,assembly=b37,length=45867> ## contig=<ID=GL000238.1,assembly=b37,length=39939> ## contig=<ID=GL000239.1,assembly=b37,length=33824> ## contig=<ID=GL000240.1,assembly=b37,length=41933> ## contig=<ID=GL000241.1,assembly=b37,length=42152> ## contig=<ID=GL000242.1,assembly=b37,length=43523> ## contig=<ID=GL000243.1,assembly=b37,length=43341> ## contig=<ID=GL000244.1,assembly=b37,length=39929> ## contig=<ID=GL000245.1,assembly=b37,length=36651> ## contig=<ID=GL000246.1,assembly=b37,length=38154> ## contig=<ID=GL000247.1,assembly=b37,length=36422> ## contig=<ID=GL000248.1,assembly=b37,length=39786> ## contig=<ID=GL000249.1,assembly=b37,length=38502> ## contig=<ID=MT,assembly=b37,length=16569> ## contig=<ID=X,assembly=b37,length=155270560> ## contig=<ID=Y,assembly=b37,length=59373566> # CHROM POS ID REF ALT QUAL FILTER INFO Y 2649696 M236 G T . PASS ISOGG Y 2655180 M176 C G . PASS ISOGG Y 2656127 Z12426 G A . PASS ISOGG Y 2656959 L1233 G C . PASS ISOGG Thank you. Created 2014-01-29 14:38:17 | Updated 2014-01-29 15:22:02 | Tags: unifiedgenotyper qc I run UnifiedGenotyper for 84 samples using RNAseq data (Bam-files). I got the following information in the Log file. As it can bee seen, about 27% of my reads failed for many reasons. How can I decrease the percentage of reads which filtered? In other words how I can improve my QC of data before UnifiedGenotyper. Thanks! INFO 14:15:23,762 MicroScheduler - 970952111 reads were filtered out during the traversal out of approximately 3587224222 total reads (27.07%) INFO 14:15:23,762 MicroScheduler - -> 147143075 reads (4.10% of total) failing BadMateFilter INFO 14:15:23,762 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 14:15:23,762 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 14:15:23,763 MicroScheduler - -> 566795160 reads (15.80% of total) failing MalformedReadFilter INFO 14:15:23,763 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 14:15:23,763 MicroScheduler - -> 257013876 reads (7.16% of total) failing NotPrimaryAlignmentFilter INFO 14:15:23,763 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter  Created 2014-01-28 18:29:03 | Updated | Tags: unifiedgenotyper ploidy scala Hi all, I am running a scala script, and would like to the include "-ploidy 1". Any advice on how can I do this? Some information that may or may not be relevant (idk!): I am attempting this using UnifiedGenotyper (v2.7-4). At the top of the script I have: package org.broadinstitute.sting.queue.qscripts.examples import org.broadinstitute.sting.queue.QScript import org.broadinstitute.sting.queue.extensions.gatk._ I got the glm both to work using: genotyper.glm = org.broadinstitute.sting.gatk.walkers.genotyper.GenotypeLikelihoodsCalculationModel.Model.BOTH I was hoping for something similar for ploidy. Thanks, Rhys Created 2014-01-23 08:01:04 | Updated 2014-01-23 08:03:11 | Tags: unifiedgenotyper Hi, everyone. from.. GATK Document -out_mode,--output_mode specifies which sites to emit; possible values are EMIT_VARIANTS_ONLY (the default), EMIT_ALL_CONFIDENT_SITES (include confident reference sites), or EMIT_ALL_SITES (any callable site regardless of confidence). I really want to know the meaning of confident reference site. When I calling with the GATK UnifiedGenotyper EMIT_ALL_CONFIDENT_SITES option in each sample BAM file, Can I distinguish the genotype in each sample? (No call, Ref homo, Alt homo, Hetero) In other words, I know the some site is no call or ref homo for this purpose. Created 2014-01-21 07:50:41 | Updated | Tags: unifiedgenotyper combinevariants Hi sir, I used GATK combine variants to call SNP from my 7 vcf files. Now I am using unified genotyper to call all samples simultaneously for SNP. When I focused on my gene of interest (DLX6), in combine variants output, there are 5 "./." among 7 samples. but in case of multisample SNP calling, DLX6 gave 0/0 in those samples and its corresponding information. So, I want to know what is the meaning of "./." in combine variants results and which results should I take for analysis? Created 2014-01-17 20:34:26 | Updated | Tags: unifiedgenotyper vcf How I can remove SNPs with Low quality or out of Hardy-Weinberg in my VCF file. In column 7 of my VCF file I have only "LowQual" or ".". There is no "PASS" word in this column. Is there something wrong in my VCF or any problem with UnifiedGenotyper procedure? thanks Created 2014-01-17 10:16:07 | Updated | Tags: unifiedgenotyper haplotypecaller Hi, I'm updating my pipeline for exome sequencing analysis, so I'm experiencing the HaplotypeCaller capabilities! I have analyzed the same sample with the UnifiedGenotyper walker and the HC one and I have examined the differences between the two output vcf files and I had a very bad finding... HC failed to find a true novel variant!! I know that this is a true variants because I validated that with Sanger sequencing after the first calling with UG. I have run UG using GATK version 1.6-11-g3b2fab9. This is the VCF line of the variant: chr7 45123943 . A T 3436.17 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=2.043;DP=114;Dels=0.00;FS=2.678;HRun=1;HaplotypeScore=0.0000;MQ=42.18;MQ0=1;MQRankSum=2.152;QD=30.14;ReadPosRankSum=-0.781;SB=-1010.47 GT:AD:DP:GQ:PL 1/1:8,105:114:99:3436,253,0 I have run HC using GATK version 2.7-4-g6f46d11 both in a single- and in a multi-sample manner but not the shadow of this variant in the VCF output.. I also noticed that together with this novel variant, HC lost other two variants upstream the first; these are the VCF lines: chr7 45123881 rs61740891 C T 654.25 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=0.205;DB;DP=43;DS;Dels=0.00;FS=65.862;HRun=0;HaplotypeScore=2.2312;MQ=30.27;MQ0=1;MQRankSum=3.254;QD=15.22;ReadPosRankSum=-3.921;SB=-3.02 GT:AD:DP:GQ:PL 0/1:18,25:43:99:684,0,176 chr7 45123888 . C T 161.90 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-2.293;DP=26;DS;Dels=0.00;FS=49.656;HRun=2;HaplotypeScore=0.0000;MQ=23.81;MQ0=1;MQRankSum=0.425;QD=6.23;ReadPosRankSum=-3.821;SB=-3.00 GT:AD:DP:GQ:PL 0/1:17,9:26:99:192,0,249 How is it possible? Many thanks in advance Best, Flavia Created 2014-01-16 18:09:51 | Updated 2014-01-16 18:28:41 | Tags: unifiedgenotyper vcf Hi, I must be doing something silly because all of my VCF files are empty (header, no results) after calling with Unified Genotyper. I'm using GATK version 2.7-2-g6bda569 with Java 1.7.00_40. The FASTQs were aligned with Novoalign. java -Xmx4g -Djava.io.tmpdir=tmp -jar GenomeAnalysisTK.jar -R hg19_random.fa -D dbsnp_137.hg19.vcf -T UnifiedGenotyper -I data/gatk.realigned.recal.chr1-1-249250621.bam -o data/gatk.snps.raw.chr1-1-249250621.vcf -metrics data/gatk.snps.raw.chr1-1-249250621.vcf.metrics -L chr1:1-249250621 --computeSLOD -stand_call_conf 30 -stand_emit_conf 1 -nt 8 -K GATK_public.key -et NO_ET Here is a snippet from my BAM file: H801:144:C39TBACXX:7:2113:7016:95747 73 chr1 671881 20 101M = 671881 0 GCGGAGGCTGCCGTGACGTAGGGTATGGGCCTAAATAGGCCATTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTCGGCCTGGAGAG YYYZZZZZZZZZZZZYZ]ZZ]]]YYYZ]]]ZZ]]]Z]Z]Z]]]]]Z]]]YZZZ]]]]ZYZZZZZZZZZYYYZZZZYYZZYZYZZZYZYZZZZZZZZZWXWW BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] MD:Z:101 PG:Z:novoalign RG:Z:SWID:testing:0 BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] NM:i:0 UQ:i:0 AS:i:0 H801:144:C39TBACXX:7:2302:11261:7162 153 chr1 971492 70 101M = 971492 0 GCTGTCTGTGAGGGCTGTGCTGAGGCCTTCCTGACCAGCACATGGGGTGGGAAGGACGACCTGGGGAATCCTGAAGTGATCTGAAGACAGAGCCCTGGGCT WYWXYZZZZZYZZXZYZZZYZZZZZZYYZZZZZZZZZZZZZZY]]]]ZZYYYZZZYYZZZZ]]]ZZZYZXZ]ZZZZZZYZ]]]]ZZYZZZZYZZZZZZYYY BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] MD:Z:101 PG:Z:novoalign RG:Z:SWID:testing:0 BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] NM:i:0 UQ:i:0 AS:i:0 H801:144:C39TBACXX:7:2302:13699:15898 153 chr1 971492 70 101M = 971492 0 GCTGTCTGTGAGGGCTGTGCTGAGGCCTTCCTGACCAGCACATGGGGTGGGAAGGACGACCTGGGGAATCCTGAAGTGATCTGAAGACAGAGCCCTGGGCT XYXZZYZZZZYZZZYYXYZZYZZZZZYYZZZZYYZZYYZZZYY]]]]]ZZZ]]]ZZZYZ]]Z]]Z]ZZZ]]]]]]ZZYZZ]]ZZYYYZYZXZZZZZZZYYX BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] MD:Z:101 PG:Z:novoalign RG:Z:SWID:testing:0 BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] NM:i:0 UQ:i:0 AS:i:0 H801:144:C39TBACXX:7:2206:12702:96307 97 chr1 987905 70 101M = 988021 217 GTGTGCATATGGGTCCATGTATGTGTGTGTATATGAGGGAGACACGCAGGTGTGTGTCTGAGTGTGTGCGCACATGGGTCCATGTATGTGTGTGTATAGGT YXYZZZZZZZZZZZZ]]]]ZZZ]ZZZYYZZZZ]]]]]]]Z]ZZ]Z]]]]ZZZZZZZYZZ]ZZXYYYWYYZWXWXYWWWYZZYZYZZZYZZZZWXWWYZZYW BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] MD:Z:101 PG:Z:novoalign RG:Z:SWID:testing:0 BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] NM:i:0 MQ:i:70 UQ:i:0 AS:i:0 H801:144:C39TBACXX:7:2206:12702:96307 145 chr1 988021 70 101M = 987905 -217 GTGTGTGTCCGTGTGTGTGCATGGGTCCATGTGTGTATAGTGTGTACACATGGGTCCATGTATGTGTGTGTATATGAGGGAGACACGCAGGTGTGTGTCCG WWWWWZXZXXWWWXZYWWWYWWWWWWWXWXZYXYWWYYWX]]]ZYXZZYZZ]]]]]]]Z]Z]]]]]]]]]]]]]Z]]]]]]]]]Z]]]ZZZZZZZZZZYYY BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]  And the output on the command line when running INFO 12:44:55,654 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:44:55,656 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-2-g6bda569, Compiled 2013/08/28 16:30:29 INFO 12:44:55,656 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 12:44:55,656 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 12:44:55,660 HelpFormatter - Program Args: -R hg19_random.fa -D dbsnp_137.hg19.vcf -T UnifiedGenotyper -I data/gatk.realigned.recal.chr1-1-249250621.bam -o data/gatk.snps.raw.chr1-1-249250621.vcf -metrics data/gatk.snps.raw.chr1-1-249250621.vcf.metrics -et NO_ET -L chr1:1-249250621 --computeSLOD -stand_call_conf 30 -stand_emit_conf 1 -nt 8 -K GATK_public.key INFO 12:44:55,660 HelpFormatter - Date/Time: 2014/01/16 12:44:55 INFO 12:44:55,661 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:44:55,661 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:44:55,755 ArgumentTypeDescriptor - Dynamically determined type of dbsnp_137.hg19.vcf to be VCF INFO 12:44:56,282 GenomeAnalysisEngine - Strictness is SILENT INFO 12:44:56,370 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 12:44:56,377 SAMDataSourceSAMReaders - Initializing SAMRecords in serial
INFO  12:44:56,433 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05 INFO 12:44:56,490 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:56,736 IntervalUtils - Processing 249250621 bp from intervals INFO 12:44:56,749 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 24 processors available on this machine INFO 12:44:56,799 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 12:44:56,992 GenomeAnalysisEngine - Done preparing for traversal INFO 12:44:56,992 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 12:44:56,992 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 12:44:57,085 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,090 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00 INFO 12:44:57,095 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,101 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,105 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,108 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:44:57,112 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,114 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,124 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,129 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,140 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,168 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,177 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,186 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,191 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00 INFO 12:44:57,285 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:57,482 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:57,655 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:57,827 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:57,996 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:58,301 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:45:26,995 ProgressMeter - chr1:93237277 9.18e+07 30.0 s 0.0 s 37.4% 80.0 s 50.0 s INFO 12:45:56,999 ProgressMeter - chr1:200739385 1.99e+08 60.0 s 0.0 s 80.5% 74.0 s 14.0 s INFO 12:46:12,672 ProgressMeter - done 2.49e+08 75.0 s 0.0 s 100.0% 75.0 s 0.0 s INFO 12:46:12,673 ProgressMeter - Total runtime 75.68 secs, 1.26 min, 0.02 hours INFO 12:46:12,673 MicroScheduler - 5225 reads were filtered out during the traversal out of approximately 500463 total reads (1.04%) INFO 12:46:12,673 MicroScheduler - -> 5225 reads (1.04% of total) failing BadMateFilter INFO 12:46:12,673 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 12:46:12,673 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 12:46:12,674 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 12:46:12,674 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 12:46:12,674 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 12:46:12,674 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter  Can anyone tell me what might be wrong? It's reliably failing for every chromosome on reasonably-sized data (this BAM is 20M with ~500k reads). Thanks in advance Created 2014-01-15 21:12:18 | Updated | Tags: unifiedgenotyper Hi, I'm trying to run UnifiedGenotyper on some data, and I'm getting an error I can't make heads nor tails of. It seems to be complaining that I have an equal sign in the sample name, which as far as I can tell I do not. Bellow is the command I've executed, the stack trace, and part of the header from the sam file I'm trying to call data from. Command: java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../reference/Spolyrrhiza.current.genomic.fasta -I SRR072269_reorder_readgroup_sorted.bam --output_mode EMIT_ALL_SITES -o robertTest.vcf Stack: Spolyrrhiza.current.genomic.fasta -I SRR072269_reorder_readgroup_sorted.bam --output_mode EMIT_ALL_SITES -o Test.vcf INFO 13:02:22,959 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:02:22,964 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 INFO 13:02:22,964 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:02:22,964 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:02:22,973 HelpFormatter - Program Args: -T UnifiedGenotyper -R ../reference/Spolyrrhiza.current.genomic.fasta -I SRR072269_reorder_readgroup_sorted.bam --output_mode EMIT_ALL_SITES -o robertTest.vcf INFO 13:02:22,974 HelpFormatter - Date/Time: 2014/01/15 13:02:22 INFO 13:02:22,974 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:02:22,974 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:02:24,330 GenomeAnalysisEngine - Strictness is SILENT INFO 13:02:25,935 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 13:02:25,950 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 13:02:26,529 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.58 INFO 13:02:27,069 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 13:02:27,188 GenomeAnalysisEngine - Done preparing for traversal INFO 13:02:27,189 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 13:02:27,190 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 13:02:28,450 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalArgumentException: VCFHeaderLine: ID cannot contain an equals sign at org.broadinstitute.variant.vcf.VCFSimpleHeaderLine.initialize(VCFSimpleHeaderLine.java:80) at org.broadinstitute.variant.vcf.VCFSimpleHeaderLine.(VCFSimpleHeaderLine.java:71) at org.broadinstitute.variant.vcf.VCFContigHeaderLine.(VCFContigHeaderLine.java:52) at org.broadinstitute.variant.vcf.VCFUtils.makeContigHeaderLine(VCFUtils.java:165) at org.broadinstitute.variant.vcf.VCFUtils.makeContigHeaderLines(VCFUtils.java:156) at org.broadinstitute.variant.vcf.VCFUtils.withUpdatedContigsAsLines(VCFUtils.java:128) at org.broadinstitute.variant.vcf.VCFUtils.withUpdatedContigsAsLines(VCFUtils.java:114) at org.broadinstitute.variant.vcf.VCFUtils.withUpdatedContigs(VCFUtils.java:110) at org.broadinstitute.sting.utils.variant.GATKVCFUtils.withUpdatedContigs(GATKVCFUtils.java:175) at org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub.writeHeader(VariantContextWriterStub.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.initialize(UnifiedGenotyper.java:304) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: VCFHeaderLine: ID cannot contain an equals sign ##### ERROR ------------------------------------------------------------------------------------------ Sam header: @HD VN:1.4 SO:coordinate ... @RG ID:1 PL:illumina PU:flowcell_1 LB:library_1 SM:SRR072269$@PG ID:dvtgm PN:stampy VN:1.0.21_(r1713) CL:-g ../reference/spirodela -h ../reference/spirodela -M ../rawdata/SRR072269.sra.fastq@CO TM:Fri, 10 Jan 2014 17:43:03 EST WD:/data/home/learnspirodela/stampy_out HN:server.ca UN:williarj

Also, I've noticed that the reads in the files I'm using do have "=" in the read headers:

@SRR072258.sra.2 GL1KSOJ02I7J10 length=93 GACTGGACCTTCTGAGCCTTGCACCAGCGCATTGATGCGGACCTTGAACTCCTCATACTCCNCTGNACCGCCAGGCCAGCAGGGNTAGGGNNN +SRR072258.sra.2 GL1KSOJ02I7J10 length=93 BBBBAA:66669AAAAAABA@:655BBAA??<9:::441-/----411446<=?=>=<<44!34<!<;9722/..997774111!43,,,!!!

Could that be where my mystery equal sign problem originates?

Created 2014-01-10 22:01:49 | Updated | Tags: unifiedgenotyper vcf gatk

One of my samples has this entry "1/1:0,1:1:3:36,3,0" in the information field, and from my understanding, the genotype is homo variant, because it has 1/1. However, I do not understand why. Since it has 0 REF reads and has only 1 ALT read, how does GATK tell this variant is a homo variant??

Created 2013-12-20 16:00:53 | Updated | Tags: unifiedgenotyper

Hi all ! We have been using GATK for a few years now. Unified Genotyper was working perfectly fine (exome sequencing, TCGA bam files) for SNV and Indels using the -n option. The calls were taking around 15 to 20minutes.

Then it changed to -nct -nt for parallelization. We have assigned 6CPUs for each computation so we tried the different options -nt 6 -nct 1 : This fails very quickly with the "Too many files open" ERROR MESSAGE - nt 2 -nct 3 or -nt 3 -nct 2 : This runs for about 10minutes and then fails with the exact same message - nt 1 -nct 6 : It does run with no error but takes around 1hour instead of the 20minutes we had previously achieved.

We did set up the limit of open files to 65535. It is weird because when it crashes, I then go to the tmp directory that I put in the java command, and there are not even 100 .tmp files created so I'm wondering what could be the issue.

We have tried it with all versions from 6 to 8 and the same problem happens.

It's really creating a bottleneck for us right now so I'm wondering why, however the limit is not reached, GATK keeps crashing.

Thanks all for your help. Happy holidays. Manon

Created 2013-12-17 17:07:40 | Updated 2013-12-17 17:08:47 | Tags: unifiedgenotyper vcf empty

Hey there,

I was trying to build an analysis pipeline for paired reads with BWA, Duplicate Removal Local Realignment and Base Quality Score Recalibration to finally use GATK's UnifiedGenotyper for SNP and Indel calling. However, for both SNPs and Indels, I receive no called variants no matter how low my used thresholds are. Quality values of the reads look ok, leaving out dbSNP does not change results. I have used the same reference throughout the whole pipeline. I use GATK 2.7, nevertheless a switch to GATK 1.6 did not change anything.

This is my shell command for SNP calling on chromosome X (GATK delivers no results for all chromosomes): java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o test.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf

Entries in my BAM file look like this: SRR389458.1885965 113 X 10092397 37 76M = 10092397 1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1885965 177 X 10092397 37 76M = 10092397 -1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1888837 113 X 14748343 37 76M = 14748343 1 TCGTGAAAGTCGTTTTAATTTTAGCGGTTATGGGATGGGTCACTGCCTCCAAGTGAAAGATGGAACAGCTGTCAAG 889999:9988;98:9::9;9986::::99:8:::::999988989:8;;9::989:999:9;9:;:99:98:999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:76 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 MQ:i:37 XT:A:U

And this is the output of the UnifiedGenotyper: INFO 17:57:00,575 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,578 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 INFO 17:57:00,578 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:57:00,578 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 17:57:00,582 HelpFormatter - Program Args: -T UnifiedGenotyper -R /hana/exchange/reference_genomes/hg19/Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o testX.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:00,583 HelpFormatter - Date/Time: 2013/12/17 17:57:00 INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,943 ArgumentTypeDescriptor - Dynamically determined type of /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf to be VCF INFO 17:57:01,579 GenomeAnalysisEngine - Strictness is SILENT INFO 17:57:02,228 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 17:57:02,237 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 17:57:02,364 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.13 INFO 17:57:02,594 RMDTrackBuilder - Loading Tribble index from disk for file /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:02,867 IntervalUtils - Processing 155270560 bp from intervals INFO 17:57:02,943 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 17:57:03,166 GenomeAnalysisEngine - Done preparing for traversal INFO 17:57:03,167 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:57:03,167 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 17:57:33,171 ProgressMeter - X:11779845 1.01e+07 30.0 s 2.0 s 7.6% 6.6 m 6.1 m INFO 17:58:03,173 ProgressMeter - X:24739805 1.89e+07 60.0 s 3.0 s 15.9% 6.3 m 5.3 m INFO 17:58:33,175 ProgressMeter - X:37330641 3.25e+07 90.0 s 2.0 s 24.0% 6.2 m 4.7 m INFO 17:59:03,177 ProgressMeter - X:49404077 4.94e+07 120.0 s 2.0 s 31.8% 6.3 m 4.3 m INFO 17:59:33,178 ProgressMeter - X:64377965 5.36e+07 2.5 m 2.0 s 41.5% 6.0 m 3.5 m INFO 18:00:03,180 ProgressMeter - X:75606869 7.54e+07 3.0 m 2.0 s 48.7% 6.2 m 3.2 m INFO 18:00:33,189 ProgressMeter - X:88250233 7.74e+07 3.5 m 2.0 s 56.8% 6.2 m 2.7 m INFO 18:01:03,190 ProgressMeter - X:100393213 9.94e+07 4.0 m 2.0 s 64.7% 6.2 m 2.2 m INFO 18:01:33,192 ProgressMeter - X:110535705 1.09e+08 4.5 m 2.0 s 71.2% 6.3 m 109.0 s INFO 18:02:03,193 ProgressMeter - X:121257489 1.20e+08 5.0 m 2.0 s 78.1% 6.4 m 84.0 s INFO 18:02:33,195 ProgressMeter - X:132533757 1.32e+08 5.5 m 2.0 s 85.4% 6.4 m 56.0 s INFO 18:03:03,197 ProgressMeter - X:144498909 1.41e+08 6.0 m 2.0 s 93.1% 6.4 m 26.0 s INFO 18:03:30,079 ProgressMeter - done 1.55e+08 6.4 m 2.0 s 100.0% 6.4 m 0.0 s INFO 18:03:30,079 ProgressMeter - Total runtime 386.91 secs, 6.45 min, 0.11 hours INFO 18:03:30,080 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 150 total reads (0.00%) INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing BadMateFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter INFO 18:03:32,167 GATKRunReport - Uploaded run statistics report to AWS S3

Do I miss anything here?

Best,

Cindy

Created 2013-12-11 09:59:05 | Updated | Tags: unifiedgenotyper

Dear all,

all my GATK 2.8.1 UG jobs crash with the "Null alleles are not supported" error message. It works fine with 2.7.4 (more precisely the nightly release version as of November 28) but this error message kills all my jobs right now with the just released 2.8.1 version. Based on the traces, it looks like an indel calling issue.

It's multisample calling, with reduced reads format. The issue may come from the fact that a handful of BAM files have been reduced with older versions of GATK (something I am currently fixing, but it takes time). The fact that the issue is brand new to 2.8.1 suggests something different though but I can't be sure. Full error message below.

Any idea?

V

##### ERROR stack trace

java.lang.IllegalArgumentException: Null alleles are not supported at org.broadinstitute.variant.variantcontext.Allele.(Allele.java:117) at org.broadinstitute.sting.utils.haplotype.Haplotype.(Haplotype.java:61) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.trimHaplotypes(PairHMMIndelErrorModel.java:235) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:438) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeDiploidReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:251) at org.broadinstitute.sting.gatk.walkers.genotyper.IndelGenotypeLikelihoodsCalculationModel.getLikelihoods(IndelGenotypeLikelihoodsCalculationModel.java:149) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724)

##### ERROR ------------------------------------------------------------------------------------------

Created 2013-11-30 15:32:34 | Updated | Tags: unifiedgenotyper haplotypecaller sensitivity

Hi there,

I compared HaplotypeCaller and UnifiedGenotype of the latest version of GATK (2.7-4) on the same NA12878 trio. It seems HC missed some true variants (specifically, 7 out of the 49 experimentally validated de novo SNVs are missed). I guess this might be partly due to HCMappingQualityFilter, because the log says ~9% of the total reads have been filtered by this filter, and it looks there is not such a filter in UnifiedGenotyper. I wonder is there a safe way to relax HCMappingQualityFilter (default 20) to achieve a better sensitivity for HC?

Aaron

Created 2013-11-07 20:10:30 | Updated | Tags: unifiedgenotyper

Hi, I'm using unifiedgenotyper for the SNP calling in one lane illumina RNA-seq data, the bam file is ~15gb. The command I used is: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R assembly.fa -I seq.bam --out result.vcf -ploidy 48 -stand_call_conf 20 -stand_emit_conf 20.0 I run it with 2 cpus and 72gb memory. It has been run for 3 days (I also used -nt 12 to run it with 12 cpus at the same time but the speed didn't improve significantly) and haven't finished. In addition, there are only 6000 loci were write into the vcf file by now, but I analysed this data using other pipelines at the same time, all of them have been finished and reported ~1 million loci. So can anybody tell me usually how long it will take for unifiedgenotyper to analysis one lane illumina RNA-seq data? Thanks.

Created 2013-10-16 12:50:25 | Updated | Tags: unifiedgenotyper haplotypecaller dp

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8


Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH


Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" &&$4 > (41245466-100) && $4 < (41245466+100)) print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0


With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Sam

Created 2013-09-30 19:55:29 | Updated | Tags: unifiedgenotyper

I run UnifiedGenotyper for my 42 samples (bam files) to get VCF file. after almost 48 hours I got my vcf file but there was no 0/0 reference homozygous genotype in the vcf file. How I can achieve these genotype information? I need them for the eQTL analyses. I want to see all 3 class of genotype in the vcf file (0/0, 0/1, and 1/1). Thanks

Created 2013-08-22 18:55:31 | Updated | Tags: unifiedgenotyper gatk

Dear GATK Team,

I am receiving the following error while running GATK 1.6. Unfortunately, for project consistency I cannot update to a more recent version of GATK and would at least wish to understand the source of the error. I ran ValidateSamFile on the input bam files and they appear to be OK.

Any insight or advice would be greatly appreciated:

##### ERROR ------------------------------------------------------------------------------------------

##### ERROR ------------------------------------------------------------------------------------------

Abbreviated commandline used:

GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -et NO_ET \ -R "Saccharomyces_cerevisiae/UCSC/sacCer2/Sequence/WholeGenomeFasta/genome.fa" \ -dcov 5000 -I "someFile.bam" --output_mode EMIT_ALL_SITES -gvcf -l OFF \ -stand_call_conf 1 -L chrIV:1-1531919

Created 2013-06-12 09:22:28 | Updated | Tags: unifiedgenotyper

Dear GATK team, I have used the UG following the best practice GATK workflow to call snps and Indels from exomeseq data of 24 human samples. First I called snps and Indels separately for each bam file, and I obtained separate vcfs. Then I decided to try to call the snps and Indels all in one go. I noticed that the output was quite different and the number of inser/delitions was higher when I called variants in contemporary (starting from separate bam files: -I sample1.bam -I sample2.bam...ETC). I also noticed that the called indels mostly were adjacent to tricky areas such as repetitive elements (ATATATATATATAT) or next to polyAAAAA. These snps and Indels weren't called by the UG when I called the variants separately. Is it more error prone to call variants in contemporary?

Created 2013-04-11 21:30:55 | Updated 2013-04-11 21:32:22 | Tags: unifiedgenotyper haplotypecaller miseq

Hey there,

I'm currently working on a project that uses the MiSeq system to perform very deep sequencing (5000x coverage) on a small set of genes related to a particular cancer type. I have had some odd calls with HaplotypeCaller and UnifiedGenotyper, especially near the edges of the sequenced regions. It seems that often Haplotype calls very large deletions and Unified calls SNPs, e.g. (from the VCFs):

Haplotype Caller:

chr2 212576988 . TATTTTTAATTGTA T 13.95 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.805;ClippingRankSum=-0.916;DP=1000;FS=8.285;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.836;QD=0.00;ReadPosRankSum=0.528 GT:AD:GQ:PL 0/1:308,53:42:42,0,8418

Unified Genotyper:

chr2 212576990 . T C 555.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.991;DP=1000;Dels=0.00;FS=0.000;HaplotypeScore=22.6027;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.073;QD=0.56;ReadPosRankSum=0.164 GT:AD:DP:GQ:PL 0/1:860,130:3743:99:584,0,32767

A screenshot from IGV showing some of the (few) BAM records in the area of these calls is attached.

Many of these false variant calls can be removed through hard filtering, but I was a little surprised to see them called in the first place, because our raw results had been quite good with lower-coverage data. Is this to be expected with high coverage data?

I am also wondering if there is a hard cap after which downsampling occurs. I set -dcov 10000 for all our samples, but the depth reported in the DP tag seemed max out at 1000. Is there another parameter I need to change? In some areas, due to the very small regions being sequenced, I have coverage in excess of 15,000 reads, so downsampling to 1000 might skew the results.

Any guidance is appreciated! Cheers, Natascha

Created 2013-03-26 11:34:41 | Updated | Tags: unifiedgenotyper snp snps

I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ?

I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.

Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.

Created 2012-11-29 17:51:33 | Updated | Tags: unifiedgenotyper depthperallelebysample

Hello,

I found some strange entries for indels in my VCF file created by the Unified Genotyper. For example:

The first sample has genotype 0/1 with a good GQ value. However, according the allele depth field, there is no read supporting the deletion. When I look at the reads using the IGV, I find some reads supporting the deletion for the first sample (and even some for the second one).

Moreover, when I looked at the AD values for SNPs, I noticed the the sum of all AD values is much less than the coverage shown in the IGV. I filtered duplicated reads in the IGV.

Best greetings, Hans-Ulrich

Created 2012-10-11 21:54:39 | Updated 2012-10-18 01:33:56 | Tags: unifiedgenotyper badmatefilter mappingqualityfilter

It was a bit unclear what the BadMateFilter is doing in the documentation. Any information would be appreciated!

Created 2012-09-10 10:50:14 | Updated 2012-09-10 15:30:14 | Tags: unifiedgenotyper haplotypecaller

Hi,

how does the speed of the haplotypeCaller usually compare to that of the UnifiedGenotyper?

When I try to use it, it seems to be about 90x slower.

these are the command lines I use:

java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T UnifiedGenotyper -dcov 1000 -I file.bam -o file.vcf -L myTarget

java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T HaplotypeCaller -dcov 1000 -I file.bam -o file.vcf -L myTarget


thanks!

Created 2012-08-10 18:19:46 | Updated 2013-01-07 19:45:12 | Tags: unifiedgenotyper vcf indels

Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.

Here are deletions:

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0

I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools.

PS I used this command line of V 2.0-38:

java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf

Thank you.

Hideo

Created 2012-08-01 16:50:30 | Updated 2012-08-01 16:55:22 | Tags: unifiedgenotyper fslockwithshared

We have a SGE environment where we run our GATK jobs. Lately we have had a few cases of UnifiedGenotyper getting hung up while trying to read our previously created .fai and .dict reference files. On one occasion we had 6 jobs on different nodes all hung for about 5 days before finally giving the FSLockWithShared - WARNING messages and then continuing on just fine. We haven't been able to find out why this is happening. Do you know of anything that would cause UnifiedGenotyper to try and read those files before ultimately throwing the warning messages 5 days later?

We looked into the possibility of other users/jobs having an exclusive lock on those reference files, but we had other UnifiedGenotyper jobs run fine during that same time period without these problems.

We also haven't been able to find any hardware problems yet. The jobs with this problem were on multiple nodes, so I thought I would see if you have any ideas.

Using strace we were able to find that all 6 of the UnifiedGenotyper processes were hung in the same spot:

$strace -p 22924 Process 22924 attached - interrupt to quit futex(0x41c319d0, FUTEX_WAIT, 22925, NULL <unfinished ...>$ strace -p 22925
Process 22925 attached - interrupt to quit
write(1, "WARN  12:39:09,312 FSLockWithSha"..., 158) = 158
write(1, "INFO  12:39:09,364 ReferenceData"..., 116) = 116
write(1, "INFO  12:39:09,364 ReferenceData"..., 89) = 89
open("/reference/sequence/human/ncbi/37.1/allchr.fa.fai", O_RDONLY) = 16
fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0
fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0
fcntl(16, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=0, len=9223372036854775807} <unfinished ...>


Here is the log file for one of the jobs:

Thu Jul 19 09:42:53 CDT 2012
INFO  09:42:58,047 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:42:58,048 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.6-7-g2be5704, Compiled 2012/05/25 16:27:30
INFO  09:42:58,048 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO  09:42:58,049 HelpFormatter - Program Args: -R /reference/sequence/human/ncbi/37.1/allchr.fa -et NO_ET -K /p
rojects/bsi/bictools/apps/alignment/GenomeAnalysisTK/1.6-7-g2be5704//key -T UnifiedGenotyper --output
_mode EMIT_ALL_SITES --min_base_quality_score 20 -nt 4 --max_alternate_alleles 5 -glm BOTH -L /data2/bsi/target.bed -I /data2/bsi/H-001.chr22-sorted.bam --out /data2/bsi/variants.chr22.raw.all.vcf
INFO  09:42:58,050 HelpFormatter - Date/Time: 2012/07/19 09:42:58
INFO  09:42:58,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:42:58,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:42:58,166 GenomeAnalysisEngine - Strictness is SILENT
WARN  12:39:09,312 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.dict: Protocol family not supported.
INFO  12:39:09,364 ReferenceDataSource - Unable to create a lock on dictionary file: Protocol family not supported
INFO  12:39:09,364 ReferenceDataSource - Treating existing dictionary file as complete.
WARN  12:39:12,527 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.fa.fai: Protocol family not supported.
INFO  12:39:12,528 ReferenceDataSource - Unable to create a lock on index file: Protocol family not supported
INFO  12:39:12,528 ReferenceDataSource - Treating existing index file as complete.
INFO  12:39:12,663 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:39:12,954 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.29
INFO  12:39:13,108 MicroScheduler - Running the GATK in parallel mode with 4 concurrent threads
WARN  12:39:13,864 UnifiedGenotyper - WARNING: note that the EMIT_ALL_SITES option is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode
INFO  12:39:14,361 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:39:14,568 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.21
INFO  12:39:14,569 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:39:14,627 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO  12:39:14,628 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:39:14,686 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO  12:39:15,123 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
...
Tue Jul 24 12:47:32 CDT 2012