# Tagged with #unifiedgenotyper 5 documentation articles | 2 announcements | 92 forum discussions

Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.

As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.

Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

• If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
• If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
• If you want to analyze more than 100 samples at a time (for performance reasons) (versions 2.x)

#### Objective

Call variants on a haploid genome with the UnifiedGenotyper, producing a raw (unfiltered) VCF.

• TBD

#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Ploidy (-ploidy)

In its basic use, this is the ploidy (number of chromosomes) per sample. By default it is set to 2, to process diploid organisms' genomes, but it can be set to any other desired value, starting at 1 for haploid organisms, and up for polyploids. This argument can also be used to handle pooled data. For that purpose, you'll need to set -ploidy to Number of samples in each pool * Sample Ploidy. There is no fixed upper limit, but keep in mind that high-level ploidy will increase processing times since the calculations involved are more complex. For full details on how to process pooled data, see Eran et al. (in preparation).

• Genotype likelihood model (-glm)

This is the model that the program will use to calculate the genotype likelihoods. By default, it is set to SNP, but it can also be set to INDEL or BOTH. If set to BOTH, both SNPs and Indels will be called in the same run and be output to the same variants file.

• Emission confidence threshold (–stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (–stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

### 2. Call variants in your sequence data

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R haploid_reference.fa \
-L 20 \
-ploidy 1
--glm BOTH \
--stand\_call\_conf 30 \
--stand\_emit\_conf 10 \
-o raw_haploid_variants.vcf


This creates a VCF file called raw_haploid_variants.vcf, containing all the sites that the UnifiedGenotyper evaluated to be potentially variant.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

For a complete, detailed argument reference, refer to the technical documentation page.

### 1. Slides

The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see [Preparing the essential GATK input files: the reference genome] for more information on preparing FASTA reference sequences for use with the GATK.

### 2. Relatively Recent Changes

The Unified Genotyper now makes multi-allelic variant calls!

#### Fragment-based calling

The Unified Genotyper calls SNPs via a two-stage inference, first from the reads to the sequenced fragments, and then from these inferred fragments to the chromosomal sequence of the organism. This two-stage system properly handles the correlation of errors between read pairs when the sequenced fragments contains errors itself. See Fragment-based calling PDF for more details and analysis.

#### The Allele Frequency Calculation

The allele frequency calculation model used by the Unified Genotyper computes a mathematically precise estimation of the allele frequency at a site given the read data. The mathematical derivation is similar to the one used by Samtools' mpileup tool. Heng Li has graciously allowed us to post the mathematical calculations backing the EXACT model here. Note that the calculations in the provided document assume just a single alternate allele for simplicity, whereas the Unified Genotyper has been extended to handle genotyping multi-allelic events. A slide showing the mathematical details for multi-allelic calling is available here.

### 3. Indel Calling with the Unified Genotyper

While the indel calling capabilities of the Unified Genotyper are still under active development, they are now in a stable state and are supported for use by external users. Please note that, as with SNPs, the Unified Genotyper is fairly aggressive in making a call and, consequently, the false positive rate will be high in the raw call set. We expect users to properly filter these results as per our best practices (which will be changing continually).

Note also that it is critical for the correct operation of the indel calling that the BAM file to be called is previously indel-realigned (see the IndelRealigner section on details). We strongly recommend doing joint Smith-Waterman alignment and not only per-lane or per-sample alignment at known sites. This is important because the caller is only empowered to genotype indels which are already present in reads.

Finally, while many of the parameters are common between indel and SNP calling, some parameters have different meaning or operate differently. For example, --min_base_quality_score has a fixed, well defined operation for SNPs (bases at a particular location with base quality lower than this threshold are ignored). However, indel calling is by definition delocalized and haplotype-based, so this parameter does not make sense. Instead, the indel caller will clip both ends of the reads if their quality is below a certain threshold (Q20), up to the point where there is a base in the read exceeding this threshold.

### 4. Miscellaneous notes

Note that the Unified Genotyper will not call indels in 454 data!

It's common to want to operate only over a part of the genome and to output SNP calls to standard output, rather than a file. The -L option lets you specify the region to process. If you set -o to /dev/stdout (or leave it out completely), output will be sent to the standard output of the console.

You can turn off logging completely by setting -l OFF so that the GATK operates in silent mode.

By default the Unified Genotyper downsamples each sample's coverage to no more than 250x (so there will be at most 250 * number_of_samples reads at a site). Unless there is a good reason for wanting to change this value, we suggest using this default value especially for exome processing; allowing too much coverage will require a lot more memory to run. When running on projects with many samples at low coverage (e.g. 1000 Genomes with 4x coverage per sample) we usually lower this value to about 10 times the average coverage: -dcov 40.

The Unified Genotyper does not use reads with a mapping quality of 255 ("unknown quality" according to the SAM specification). This filtering is enforced because the genotyper caps a base's quality by the mapping quality of its read (since the probability of the base's being correct depends on both qualities). We rely on sensible values for the mapping quality and therefore using reads with a 255 mapping quality is dangerous.

• That being said, if you are working with a data type where alignment quality cannot be determined, there is a (completely unsupported) workaround: the ReassignMappingQuality filter enables you to reassign the mapping quality of all reads on the fly. For example, adding -rf ReassignMappingQuality -DMQ 60 to your command-line would change all mapping qualities in your bam to 60.

• Or, if you are working with data from a program like TopHat which uses MAPQ 255 to convey meaningful information, you can use the ReassignOneMappingQuality filter (new in 2.4) to assign a different MAPQ value to those reads so they won't be ignored by GATK tools. For example, adding -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 would change the mapping qualities of reads with MAPQ 255 in your bam to MAPQ 60.

### 5. Explanation of callable base counts

At the end of a GATK UG run, you should see if you have -l INFO enabled a report that looks like:

INFO  00:23:29,795 UnifiedGenotyper - Visited bases
247249719
INFO  00:23:29,796 UnifiedGenotyper - Callable bases
219998386
INFO  00:23:29,796 UnifiedGenotyper - Confidently called bases
219936125
INFO  00:23:29,796 UnifiedGenotyper - % callable bases of all loci
88.978
INFO  00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci
88.953
INFO  00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci
88.953
INFO  00:23:29,797 UnifiedGenotyper - Actual calls made
303126


This is what these lines mean:

• Visited bases

This the total number of reference bases that were visited.

• Callable bases

Visited bases minus reference Ns and places with no coverage, which we never try to call.

• Confidently called bases

Callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if T is the min confidence, this is the count of bases where QUAL > T for the site being reference in all samples and/or QUAL > T for the site being non-reference in at least one sample.

Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the percentage of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample.

Note also that confidently called bases will rise with additional data per sample, so if you don't dedup your reads, include lots of poorly mapped reads, the numbers will increase. Of course, just because you confidently call the site doesn't mean that the data processing resulted in high-quality output, just that we had sufficient statistical evident based on your input data to called ref / non-ref.

### 6. Calling sex chromosomes

The GATK can be used to call the sex (X and Y) chromosomes, without explicit knowledge of the gender of the samples. In an ideal world, with perfect upfront data processing, we would get perfect genotypes on the sex chromosomes without knowledge of who is diploid on X and has no Y, and who is hemizygous on both. However, misalignment and mismapping contributes especially to these chromosomes, as their reference sequence is clearly of lower quality than the autosomal regions of the genome. Nevertheless, it is possible to get reasonably good SNP calls, even with simple data processing and basic filtering. Results with proper, full data processing as per the best practices in the GATK should lead to very good calls. You can view a presentation "The GATK Unified Genotyper on chrX and chrY" in the GSA Public Drop Box.

Our general approach to calling on X and Y is to treat them just as we do the autosomes and then applying a gender-aware tools to correct the genotypes afterwards. It makes sense to filter out sites across all samples (outside PAR) that appear as confidently het in males, as well as sites on Y that appear confidently non-reference in females. Finally, it's possible to simply truncate the genotype likelihoods for males and females as appropriate from their diploid likelihoods -- AA, AB, and BB -- to their haploid equivalents -- AA and BB -- and adjust the genotype calls to reflect only these two options. We applied this approach in 1000G, but we only did it as the data went into imputation, so there's no simple tool to do this, unfortunately. The GATK team is quite interested in a general sex correction tool (analogous to the PhaseByTransmission tool for trios), so please do contact us if you are interested in contributing such a tool to the GATK codebase.

### 7. Related materials

• Explanation of the VCF Output

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

### 1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

### 2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.

### 3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

### 4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.

### 5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

### 6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site, you are probably seeing false positives.

In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.

### Ploidy-related functionalities

As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.

For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the -ploidy argument.

### Cases where ploidy needs to be specified

1. Native variant calling in haploid or polyploid organisms.
2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
3. Pooled validation/genotyping at known sites.

For normal organism ploidy, you just set the -ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).

## Important limitations

Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF, and Genotype annotations such as PL, AD, GT, etc.

You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.

## Unified Genotyper

• Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

## Haplotype Caller

• Improved the indexing scheme for gVCF outputs using the reference calculation model.
• The reference calculation model now works with reduced reads.
• Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
• Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

## Variant Recalibrator

• Disable tranche plots in INDEL mode.
• Various VQSR optimizations in both runtime and accuracy. Some particular details include: for very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build; annotations are ordered by the difference in means between known and novel instead of by their standard deviation; removed the training set quality score threshold; now uses 2 gaussians by default for the negative model; numBad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.

• Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

## Diagnose Targets

• Added calculation for GC content.
• Added an option to filter the bases based on their quality scores.

## Combine Variants

• Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

## Select Variants

• Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.

## Miscellaneous

• SplitSamFile now produces an index with the BAM.
• Length metric updates to QualifyMissingIntervals.
• Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
• Picard jar updated to version 1.104.1628.
• Tribble jar updated to version 1.104.1628.
• Variant jar updated to version 1.104.1628.

As reported here:

If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.

Thank you for your patience while we work to fix this issue.

### Latest update: we found that the three tools (PrintRead, HaplotypeCaller and UnifiedGenotyper) had different issues that only produced the same symptom (the ArrayIndexOutOfBoundsException error). The underlying issues have all been fixed in version 2.5. If you encounter an ArrayIndexOutOfBoundsException in later versions, it is certainly caused by a different problem, so please open a new thread to report the issue.

Question: Is it possible to have CV merge like bcftools does it?

I get this warning, when running UG in GGA mode using an -alleles vcf generated with CV:

WARN  10:17:21,394 GenotypingGivenAllelesUtils - Multiple valid VCF records detected in the alleles input file at site 20:106089, only considering the first record


I made this call with HC from 10 samples:

20  106089  .   CA  C


And this call with UG from 10 other samples:

20  106089  .   C   A


CV merges like this:

20  106089  .   C   A
20  106089  .   CA  C


bcftools merges like this:

20  106089  .   CA  AA,C


The UG recall from the CV generated -alleles vcf is incomplete:

20  106089  .   C   A


The UG recall from the bcftools generated -alleles vcf is complete:

20  106089  .   CA  AA,C


Is it possible to have CV merge like bcftools does it?

I'm really not sure. It's not a use case that UG was designed for (with UG we kept SNPs and indels separate until post-analysis), so I would recommend being cautious with it.

I checked the genotypes and UG seems to handle merged MNPs and indels just fine; see below. But I will do some additional testing. Or I might just take the safe path and do the recalling separately for SNPs and indels as suggested. The reason I have UG and HC calls in the first place is because I have low and high coverage data for different cohorts. I want to create a merged dataset.

Despite --interval_padding 100 helping to recall more sites with HC in GGA mode as per previous recommendation, some sites still fail to be called with HC in GGA mode. Hence I opted for UG.

UG calls on samples 1-10:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578
20  106089  .   C   A   16.19   .   AC=2;AF=0.125;AN=16;BaseQRankSum=-0.854;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=1.5282;MLEAC=2;MLEAF=0.125;MQ=58.74;MQ0=0;MQRankSum=-0.560;QD=2.70;ReadPosRankSum=-1.797;SOR=0.935;VariantType=SNP  GT:AD:DP:GQ:PL  0/0:3,0:3:6:0,6,76  0/0:4,2:6:9:0,9,115 0/1:3,1:4:24:24,0,80    0/0:6,0:6:12:0,12,130   0/1:1,1:2:29:30,0,29    ./. 0/0:7,0:7:15:0,15,188   0/0:3,1:4:6:0,6,74  ./. 0/0:5,0:5:12:0,12,142


HC calls on samples 11-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  585 590 622 625 628 640 655 668 687 693

20  106089  .   CA  C   47.95   .   AC=5;AF=0.250;AN=20;BaseQRankSum=0.925;DP=36;FS=1.850;InbreedingCoeff=0.0646;MLEAC=5;MLEAF=0.250;MQ=59.48;MQ0=0;MQRankSum=0.175;QD=3.00;ReadPosRankSum=-1.725;SOR=0.387 GT:AD:GQ:PL 0/0:2,0:6:0,6,49    0/0:2,0:6:0,6,49    0/0:3,0:12:0,12,130 0/0:5,0:15:0,15,122 0/0:2,0:6:0,6,46    0/1:2,1:14:14,0,39  0/1:2,1:15:15,0,38  0/0:4,0:12:0,12,93  0/1:3,1:12:12,0,46  1/1:0,3:9:67,9,0


UG GGA recalls on samples 1-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578 585 590 622 625 628 640 655 668 687 693
20  106089  .   CA  AA,C    110.56  .   AC=0,8;AF=0.00,0.222;AN=36;DP=81;FS=0.000;InbreedingCoeff=0.5076;MLEAC=0,6;MLEAF=0.00,0.167;MQ=58.56;MQ0=0;QD=3.45;SOR=0.859;VariantType=MULTIALLELIC_MIXED GT:AD:DP:GQ:PL:SB   0/0:0,0,0:3:0:0,0,0,6,6,52:0,0,0,0  0/2:0,0,1:6:0:5,5,5,0,0,109:0,0,1,0 0/2:0,0,1:4:0:12,12,12,0,0,47:0,0,1,0   0/0:0,0,0:6:0:0,0,0,17,17,123:0,0,0,0   0/0:0,0,0:2:0:0,0,0,3,3,10:0,0,0,0  ./. 0/0:0,0,0:7:0:0,0,0,9,9,60:0,0,0,0  0/2:0,0,1:4:0:12,12,12,0,0,61:0,0,0,1   ./. 0/0:0,0,1:5:0:0,0,0,4,4,30:0,0,0,1  0/0:0,0,0:3:0:0,0,0,6,6,49:0,0,0,0  0/0:0,0,0:3:0:0,0,0,9,9,76:0,0,0,0  0/0:0,0,1:4:0:0,0,0,1,1,22:0,0,1,0  0/0:0,0,0:7:0:0,0,0,18,18,149:0,0,0,0   0/0:0,0,0:4:0:0,0,0,11,11,76:0,0,0,0    0/2:0,0,1:5:0:9,9,9,0,0,65:0,0,0,1  0/2:0,0,1:4:0:12,12,12,0,0,60:0,0,0,1   0/0:0,0,0:5:0:0,0,0,15,15,116:0,0,0,0   0/2:0,0,1:6:0:12,12,12,0,0,47:0,0,0,1   2/2:0,0,3:3:9:67,67,67,9,9,0:0,0,3,0


http://gatkforums.broadinstitute.org/discussion/5249/overcalling-deletion-in-unifiedgenotyper-genotype-given-alleles-mode


P.S. I might gate crash your Cambridge party this week despite not being invited :D The course was already fully booked, when you announced it. I don't have a time machine!

Hi,

I've effectively got a sample of 200 individuals, that share one father but each have a different mother. I'm wondering how to best change the default parameters of Unified Genotyper to call variants without bias caused by assumptions of the algorithm.

Cheers,

Blue

Hello,

when I run the UnifiedGenotyper to call INDELs, I get the following error (detailed command line output below) HAPLOTYPE_MAX_LENGTH must be > 0 but got 0

When I call only SNPs instead, it doe not occur. I have searched to find an answer to why this is happening, but cannot figure out the reason. Could this be a bug? I get the error no matter if I run the Realigner before or not.

Did you observe this problem before?

Command line output: java -Xmx10g -jar ~/work/tools/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10

INFO 15:50:47,987 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:47,990 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 15:50:47,990 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:50:47,990 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:50:47,996 HelpFormatter - Program Args: -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10 INFO 15:50:48,002 HelpFormatter - Executing as rebecca@rebecca-ThinkPad-T440s on Linux 3.13.0-44-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_65-b32. INFO 15:50:48,002 HelpFormatter - Date/Time: 2015/03/12 15:50:47 INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,132 GenomeAnalysisEngine - Strictness is SILENT INFO 15:50:48,402 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 15:50:48,409 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 15:50:48,447 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 INFO 15:50:48,594 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 15:50:48,622 GenomeAnalysisEngine - Done preparing for traversal INFO 15:50:48,622 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 15:50:48,622 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 15:50:48,623 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:51:06,585 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace

java.lang.IllegalArgumentException: HAPLOTYPE_MAX_LENGTH must be > 0 but got 0 at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:97) at org.broadinstitute.gatk.utils.pairhmm.N2MemoryPairHMM.initialize(N2MemoryPairHMM.java:60) at org.broadinstitute.gatk.utils.pairhmm.LoglessPairHMM.initialize(LoglessPairHMM.java:66) at org.broadinstitute.gatk.utils.pairhmm.PairHMM.computeLikelihoods(PairHMM.java:194) at org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:461) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:124) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:270) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoods(UnifiedGenotypingEngine.java:317) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotypingEngine.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:379) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:151) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

##### ERROR ------------------------------------------------------------------------------------------

I'm encountering a problem similar to what I experienced with a mismatch between reference and cosmic files. Specifically, I created .bam files using human_g1k_v3.fasta, which reference coordinates as 1...22,X,Y etc. When I try to run UnifiedGenotyper with this reference file, the .bam file I created, and the dbsnp_137.b37.vcf, I get an error on account of the fact that snp coordinates are listed as chr1...chr22,chrX,chrY etc. Other than writing a script to remove all occurrences of "chr" is there another way to get around this problem, i.e. a dbsnp reference file that has the desired coordinates without the "chr"?

I resolved the problem with reference/cosmic by finding a cosmic file with consistent notation, but can't find a similar fix for this one. I'd appreciate any suggestions.

Hello GATK Team,

I've just run the following using GATK v3.1.

gatk -R ref.fa -T UnifiedGenotyper -I Q11_Final.sorted.rg.bam -I Q11D1.sorted.bam -I Q11D4.sorted.bam -I Q11D2.sorted.bam -I Q11D5.sorted.bam -o PX.raw.vcf -nt 5 -glm SNP -ploidy 2

I then trimmed SNPs within 10bp of indels (called using the same but -glm indel).

The samples therein represent a mother (Sample Q11) and her 4 haploid sons.

The problem I'm getting is that calls that should be heterozygotic in the mother are not being called as such (see attached image; .bam file of the mother is int he bottom and the resulting VCF file is loaded in the top panel. The green cmd screen shows the VCF file line for the site in question where the 14th element of the indicates the queen's results (1/1:68,86:154:99:1837,120,0)

I'm not sure what's going on here and was wondering if you could help out?

I am working with plants, and so cannot follow most of your human/cancer specific workflows etc. Also my question is probably not GATK specifc, but I would be very grateful for any kind of answer. I have used BWA to align a set of reads and the contigs assembled from these reads (de novo). I have used samtools mpileup and GATK unifiedGenotyper on the aligned reads. This results for both in a list of ONLY SNPs, the same ones (aside of sensitivity). I have also used the samtools steps on the aligned contigs, resulting in a list of ONLY indels. I can clearly see both SNPs and indels in IGV, both looking at the reads or the contigs. generally I see that reads are being used for variant calling, so is it "wrong" to use contigs? and why do I not get ANY indels in either tool's output when using reads? any ideas?

Hi, I'm running on the Unified Genotyper (Version=3.3-0-g37228af) on a pooled sample. The ploidy is set to 32. I'm trying to get allele frequency information. I'm trying to filter sites based on depth of coverage but the DP flag is not consistent within a loci

The DP at the start says there is a depth of 37, but at the end it says its 13. Judging from neighbouring sites, the actual depth seems to be 37, but I'm confused why it seems to only be using 13 of them. At other sites the scores match.

Hi, How do I know based on the REF and ALT column of a VCF file the actual position where an indel event happened? I usually see an event that occur in the 2nd base. For example, REF ALT GGCGTGGCGT G,GCGCGTGGCGT --deletion; insertion of "C" ATTT A,ATT --deletions

But I saw a VCF format documentation(http://samtools.github.io/hts-specs/VCFv4.2.pdf) allowing a different case: GTC G,GTCT --deletion; insertion of "T" at the end

How is UnifiedGenotyper formatting the indels? Does it differ from the one used in the 1000Genome Project? Thank you!

Roven

Hi

I'm wondering why some positions in VCF overlap. Can GATK skip emitting positions that are already part of an indel/concatenated ref? We need to use EMIT_ALL_SITES but it's quite confusing if a position is represented multiple times especially if indel is present. I understand that there is always a duplicated position preceding an indel, but the positions after the indel should ideally be not overlapping the neighboring positions. Please see some examples below:

chr02 28792823 . C . 34.23 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:12 chr02 28792823 . CAA . 10000037.27 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:AD:DP 0/0:12:12 chr02 28792824 . A . . . DP=12;MQ=57.31;MQ0=0 GT ./. chr02 28792825 . A . 31.22 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:7

chr02 314879 . T . 100.23 . AN=2;DP=27;MQ=55.82;MQ0=0 GT:DP 0/0:27 chr02 314879 . TAGAG T,TAG 647.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=27;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=55.82;MQ0=0;QD=23.97;RPA=8,6,7;RU=AG;STR GT:AD:DP:GQ:PL 1/2:0,9,11:26:99:989,445,424,395,0,335 chr02 314880 . A . 43.23 . AN=2;DP=26;MQ=55.66;MQ0=0 GT:DP 0/0:5 chr02 314881 . G . 40.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:4 chr02 314882 . A . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16 chr02 314883 . G . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16

This next example even calls a SNP for position 10221400 even if deletion occurs in 10221399.

chr02 10221399 . C . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17 chr02 10221399 . CA CCTA,C 143.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=8.42 GT:AD:DP:GQ:PL 1/2:0,8,6:17:99:527,131,112,185,0,143 chr02 10221400 . A C,T 287.29 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;Dels=0.29;FS=0.000;HaplotypeScore=6.7569;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=16.90 GT:AD:DP:GQ:PL 1/2:0,4,8:12:88:407,294,285,112,0,88 chr02 10221401 . A C 163.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.854;DP=17;Dels=0.00;FS=1.913;HaplotypeScore=9.7562;MLEAC=1;MLEAF=0.500;MQ=58.74;MQ0=0;MQRankSum=1.558;QD=9.63;ReadPosRankSum=2.764 GT:AD:DP:GQ:PL 0/1:11,6:17:99:192,0,392 chr02 10221402 . A . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17

I'm a bit confused about the instructions for --input_prior in UnifiedGenotyper but they seem quite useful. Is there any chance you could clarify or simplify ?

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php#--input_prior --input_prior / -inputPrior Input prior for calls By default, the prior specified with the argument --heterozygosity/-hets is used for variant discovery at a particular locus, using an infinite sites model, see e.g. Waterson (1975) or Tajima (1996). This model asserts that the probability of having a population of k variant sites in N chromosomes is proportional to theta/k, for 1=1:N There are instances where using this prior might not be desireable, e.g. for population studies where prior might not be appropriate, as for example when the ancestral status of the reference allele is not known. By using this argument, user can manually specify priors to be used for calling as a vector for doubles, with the following restriciotns: a) User must specify 2N values, where N is the number of samples. b) Only diploid calls supported. c) Probability values are specified in double format, in linear space. d) No negative values allowed. e) Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one. f) If user-defined values add to more than one, an error will be produced. If user wants completely flat priors, then user should specify the same value (=1/(2N+1)) 2N times,e.g. -inputPrior 0.33 -inputPrior 0.33 for the single-sample diploid case.

List[Double]

I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools:

bcftools norm -m +any


I remember from another thread that UG treats SNPs and InDels separately:

http://gatkforums.broadinstitute.org/discussion/3375/combine-snp-and-indel-vcf


But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA?

I only had a minor issue despite tabix indexing my bgzipped known alleles file:

##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz


Feel free to delete the original post of this question.

Does UG and HC downsample before or after exclusion of marked duplicates? I guess code is king for the answer...

The documentation on the HaplotypeScore annotation reads:

HaplotypeCaller does not output this annotation because it already evaluates haplotype segregation internally. This annotation is only informative (and available) for variants called by Unified Genotyper.

The annotation used to be part of the best practices:

http://gatkforums.broadinstitute.org/discussion/15/best-practice-variant-detection-with-the-gatk-v1-x-retired


I will include it in the VQSR model for UG calls from low coverage data. Is this an unwise decision? I guess this is for myself to evaluate. I thought I would ask, in case I have missed something obvious.

Hi

I've been trying to use the --allSitePLs option with Unified Genotyper with GaTK v3.2-2-gec30cee.

My command is:

java -Xmx${MEM} -jar${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} -T UnifiedGenotyper \
--min_base_quality_score 30 \
-I ${in_dir}/"34E_"${REGION}"_ddkbt_RS74408_recalibrated_3.bam" \
-I ${in_dir}/"34E_"${REGION}"_ddber_RS86405_recalibrated_3.bam" \
--intervals $sites_dir/$file \
-o ${out_dir}/"38F_"${REGION}"_UG_allPL.vcf" \
--output_mode EMIT_ALL_SITES \
--allSitePLs


I ran the same command with and without the --allSitePLs option and compared, and the output seems strange.

Specifically, with --allSitePLs - the FILTER column: 5053700 sites = lowqual, 19303 = . and 5059568 had QUAL < 30. By contrast WITHOUT the --allSitePLs 5067411 = lowqual, 5592 = . and 11460 had QUAL < 30. I don't understand why does adding this option changes the QUAL so much when I'm running UG on the exact same data but just requesting that I get a PL for all sites.

Lastly, how is the specific ALT allele selected? Is it random - because they all seem equally unlikely in the example below, where only one allele was seen in both my samples.

## 2 lines from the output
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ddber_RS86405   ddkbt_RS74408
chr01   1753201 .   C   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:655,51,0,655,51,655,655,51,655,655:17:51:0,51,655  0/0:26,0:1010,78,0,1010,78,1010,1010,78,1010,1010:26:78:0,78,1010
chr01   1753202 .   T   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:630,630,630,630,630,630,48,48,48,0:17:48:0,48,630  0/0:26,0:1027,1027,1027,1027,1027,1027,78,78,78,0:26:78:0,78,1027


Hi GATK team, i would like to seek opinion from your team to find the best workflow that best fit my data. Previously i've been exploring both variant calling algorithms UnifiedGenotyper and HaplotypeCaller, and i would love to go for UnifiedGenotyper considering of the sensitivity and the analysis runtime. Due to my experimental cohort samples grows from time to time, so i've opt for single sample calling follow by joint-analysis using combineVariants instead of doing multiple-samples variant calling. However by doing so, i've experience few drawbacks from it (this issue was discussed at few forums). For a particular SNP loci, we wouldn't know whether the "./." reported for some of the samples are due to no reads covering that particular loci, or it doesn't pass certain criteria during variant calling performed previously, or it is a homo-reference base (which i concern this most and can't cope to lost this information).

Then, i found this "gvcf", and it is potentially to solve my problem (Thanks GATK team for always understand our researcher's need)!! Again, i'm insist of opt for unifiedGenotyper instead of haplotypeCaller to generate the gvcf, and reading from the forum at https://www.broadinstitute.org/gatk/guide/tagged?tag=gvcf, i would assume that as VCFs produced by "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" to be something alike with the gvcf file produced by HaplotyperCaller. However i couldn't joint them up using either "CombineVariants" or "CombineGVCFs", most probably i think "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" doesn't generate gvcf format.

Can you please give me some advice to BEST fit my need and with minimum runtime (UnifiedGenotyper would be my best choice), is there any method to joint the ALL_SITES vcf file produced by UnifiedGenotyper which i might probably missed out from the GATK page?

When I run this command:

java -jar /storage/app/GATK_3.3/GenomeAnalysisTK.jar -R /storage/data/gabriel/ucsc/hg19/hg19.fasta -T UnifiedGenotyper -I SRR493939.valid.bam -o teste2.vcf

This error happens:

WARN 04:34:22,257 RestStorageService - Error Response: PUT '/DAXKr9W5oe2LWMJhFdSveQd50kAK6tex.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 393, Content-MD5: 33uF1RoVpa9w9uTWFUxP0g==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: df7b85d51a15a5af70f6e4d6154c4fd2, Date: Tue, 25 Nov 2014 06:34:20 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:agZvhjUckV8Jnth+aubQX9KViUo=, User-Agent: JetS3t/0.8.1 (Linux/3.13.0-30-generic; amd64; en; JVM 1.7.0_65), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: B6DD058B64A98870, x-amz-id-2: ADKbyes3Qb8ovjofT4tKpP3gV3yPuc9v5C157qzTkfII2q3U2ZwEpUgzy0eIOz+J, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Tue, 25 Nov 2014 14:39:19 GMT, Connection: close, Server: AmazonS3] WARN 04:34:22,846 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 29096 seconds. Retrying connection.

Can someone help me?

I was running some samtools calls through UG with genotyping_mode GENOTYPE_GIVEN_ALLELES after reading this thread:

http://gatkforums.broadinstitute.org/discussion/1842/variant-annotator-not-annotating-mappingqualityranksumtest-and-readposranksumtest-for-indels


I got this error message:

##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Alignment 440969 | 36S26M1D6M1I2M1D1M3D27M1S


What did I do wrong? It works for most other fragments (chrom20:pos1-pos2). Is it possible to avoid this error message and just do ./. calls at coordinates not covered by reads? I'm not sure, why samtools called at a coordinate not covered by a read. I don't know at which coordinate this happened. I was at this coordinate, when the error happened:

INFO  10:21:06,036 ProgressMeter -       20:479014   2329564.0    55.0 m      23.6 m        0.6%     6.7 d       6.7 d


I was using this command:

vcf=samtools.vcf.gz
$java -Djava.io.tmpdir=tmp -Xmx63900m \ -jar$jar \
--analysis_type UnifiedGenotyper \
--reference_sequence $ref \ --input_file bam.list \ --dbsnp$dbsnp138 \
--out $out \ --intervals$vcf \
-stand_call_conf 10 \
-stand_emit_conf 10 \
--genotype_likelihoods_model BOTH \
--output_mode EMIT_ALL_SITES \
--annotation Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \

Error:

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 96 at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.getCache(DiploidSNPGenotypeLikelihoods.java:298) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.inCache(DiploidSNPGenotypeLikelihoods.java:262) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:229) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:190) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:176) at org.broadinstitute.sting.gatk.walkers.genotyper.SNPGenotypeLikelihoodsCalculationModel.getLikelihoods(SNPGenotypeLikelihoodsCalculationModel.java:114) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

##### ERROR MESSAGE: 96

I found this documented somewhat for Haplotype caller, but the solution didn't look applicable here. These are pooled samples so we must use unified genotyper.

Hi Geraldine, I am confused when I read the paper and tutorial on GATK variant caller. First, the DePristo et al. 2011 paper says local re-aligner generates a list of candidate haplotypes using a) dbSNP info, b) presence of atleast one indel in a read and c) cluster of mismatches at a site. Then it finds the best alternative haplotype based on read haplotype likelihoods and uses log odds ratio to declare ref and alt haplotype pair. Second, the tutorial on GATK variant caller mentions Indel genotype likelihood calculation where UG estimates this via 1) first generate haplotypes from indels in the reads. 2) genotype likelihoods for all haplotype pairs 3) for each haplotype compute read haplotype likelihood using HMM

My question is do both the steps local realignment and UG do haplotype generation and then compute read haplotype likelihood. To me these two steps are redundant because if the best haplotype pairs are generated using local realignment then why is there a need to generate haplotypes for genotype likelihood estimation. I mean in theory, UG should just be able to call indels and assign them genotype by testing all the possible genotypes against the two haplotypes called by the local realigner. Please help me sort this out. I highly appreciate your concern.

Hello everyone, I have recently sequenced 15 sample exomes using the Ion proton system, which directly generates VCFs after sequencing runs. However, since I will be using GATK for downstream analysis, I have decided to recall the variants using UG or HC. I read a document stating that HC should not be used in the case of pooled samples. Can someone clarify what exactly this means? For Ion proton, three samples are barcoded and pooled together prior to sequencing. Also, is 15 samples enough to obtain good discovery power? I was thinking of using external, publicly available bam files of exome sequencing data if 15 samples is not sufficient for joint variant calling.

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.

Hi,

Is GATK Unified genotyper able to call multi-allelic positions in a single pooled sample? Case is a pool of 13 samples, we use UG with ploidy set to 26. If I understand the supplementaries of the original publications correct, UG will never be able to call three alleles at a single position. in single sample calling. Or does this not hold for high ploidy analysis?

If needed, we can call multiple pools together, but this becomes computationally intensive.

In summary, we would like to call a 14xG,6xA,6xT call for example.

Also, how does UG take noise into account when genotyping (sequencing errors), when for example 3% of reads is aberrant at a position, this could correspond to ~ 1/26.

Thanks for any guidelines,

geert

Can someone tell me what version of the UnifiedGenotyper introduced joint sample calling? I'm specifically interested in whether 1.6 supported joint calling.

Thanks!

Hello, Is there any possibility that I can separate the reads in AD (Allele Depths by sample) into 5' and 3' strands? e.g. the current expression of AD is REF:ALT1:ALT2 ...etc, Can I get the expression of AD like REF_5':REF_3':ALT1_5':ALT1_3':ALT2_5':ALT2_3'

I used the EMIT_ALL_SITES option with Unified Genotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. But for monomorphic sites, (at this site, we have an homozygote for the reference allele), what is the definition of the quality score ? and how is it computed ? Many thanks for your explanation.

Hello,

It seems that while non-parallel UnifiedGenotyper (SNP mode) always generates the same output vcf, parallel version (either -nt and/or -nct >1) of UnifiedGenotyper generates not exactly the same vcf files for each run. For example, below is the GenotypeConcordance output of two runs with same input file (chr 21) and same parameters (-nt4 -nct4):

Sample Eval_Genotype Comp_Genotype Count
ALL HET HET 46341 ALL HET HOM_REF 0 ALL HET HOM_VAR 8 ALL HET MIXED 0 ALL HET NO_CALL 0 ALL HOM_REF HET 300 ALL HOM_REF HOM_REF 34306900 ALL HOM_REF HOM_VAR 0 ALL HOM_REF MIXED 0 ALL HOM_REF NO_CALL 18 ALL HOM_VAR HET 12 ALL HOM_VAR HOM_REF 0 ALL HOM_VAR HOM_VAR 23490 ALL HOM_VAR MIXED 0 ALL HOM_VAR NO_CALL 18 ALL Mismatching_Alleles Mismatching_Alleles 303 ALL NO_CALL HET 0 ALL NO_CALL HOM_REF 11 ALL NO_CALL HOM_VAR 24 ALL NO_CALL MIXED 0 ALL NO_CALL NO_CALL 729217

So I am wondering if there is a way to get the same output vcf. Or do I miss something here? Thanks.

hi,

is there a way to output all possible variants (the alternate allele)? i have total reads of 2100 whereby 2045 reads are match reference and 55 reads are alternate allele. My case is very specific, i would like GATK UnifiedGenotyper to call out my variants although the proportion of alternate read is relative low (1-5%) . And regardless of the quality score of alternate reads. Is there a way to specify this settings during the SNP calling in GATK UnifiedGenotyper?

thank you very much Best J

I performed Variant Calling using UnifiedGenotyper and HaplotypeCaller for Whole Exome Sequenced samples.

GATK version used GATK 3.0

I had only 4 African samples, since did not have any controls I took 70 African BAMs from 1000Genomes. Used Agilent Sure Select WES Version 5 target bed file as interval file to limit the calls to exomes.

I couldot find a single INDEL in UnifiedGenotyper called data and hence couldnot run VQSR for indels on it. But HaplotypeCaller called 20,456 INDELs.

UnifiedGenotyper Called Data

# INDELs-0

HaplotypeCaller Called Data

# INDELs-20456

Is it possible that for variants called from 74 BAMs, UnifiedGenotyper could not find a single INDEL ??

Thanks, Tinu

Hi all,

I am working with a set of samples at the moment which have the following problems:

1) The organism is aneuploid 2) Ploidy for individual chromosomes varies between samples (we have a script that estimates this from the coverage so we can use this information to improve variant calling) 3) Coverage is thin (sequencing was done on a student's budget)

I am using the UnifiedGenotyper for variant discovery as it can account for aneuploidy.

I initially tried calling variants for each sample, grouping chromosomes by ploidy (i.e, for sample 1 I call all the diploid chromosomes, then all the triploid chromosomes etc). I also tried doing multi-sample variant calling across all the samples, setting the ploidy to 2 (The modal chromosome number is 2). Comparison of these two analyses shows that each is missing good SNPs that are called by the other. Multi-sample analyses is missing or mis-calling SNPs on aneuploid chromosomes, and individual analysis is missing SNPs in individual samples due to thin coverage (or more accurately, they are being called, but failing QD or quality filters due to thin coverage - these SNPs pass these filters in the multi-sample analysis).

I am thinking of trying to combine these approaches. The idea is to analyse each chromosome separately. For each chromosome, I would do a multi-sample call on all the samples which are diploid and a multi-sample call on all the samples which are triploid etc etc. I am hoping that this will give me the best of the two approaches above.

So my questions are:

1) Does this strategy makes sense? Is there any reason not to do it this way?

2) How should I proceed downstream? I know I will have to hard-filter regardless. Can I merge all my vcf files before filtering and annotation, or is there a reason to keep them separate?

Any input from the GATK team or the wider community is appreciated!

Thanks

Kathryn

I'm seeing Mendelian errors in small but significant proportion of my unified genotyper calls. For many instances the SNP is called incorrectly as a homozygote even though there are some reads of the second allele. How can I set unified genotyper to be more sensitive ?

Hi all,

Do you have a recommendation to estimate how much heap memory (-Xmx) is necessary to cal variants using the Unified Genotyper. I think that with my project I might be facing a situation where I will run out of memory until there is not more left to increase. To give you an idea, I have 185 samples (that together are 8Gb) and the fasta reference that I am using has too many scaffolds (3 Million). I don't have the opportunity to improve the reference I have at the moment. I have been using -Xmx52G and -nt 10 (in GATK 3.1) but it gives an error at the same point.

INFO 14:45:41,790 HelpFormatter - -------------------------------------------------------------------------------- INFO 14:45:42,773 GenomeAnalysisEngine - Strictness is SILENT INFO 14:59:01,106 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 14:59:01,171 SAMDataSourceSAMReaders - Initializing SAMRecords in serial ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java ##### ERROR ------------------------------------------------------------------------------------------ If you have a suggestion/advice of how to make the analysis work it would be very much appreciated. I know that increasing scaffolds length (reducing number of scaffolds) can improve the analysis so I am wondering if I am facing a situation where I can't do any analysis until the fasta reference is improved. Many thanks, Ximena Dear All, I am Calling SNPs using Unified Genotyper. I am using the Version GATK 2.5.2. Command Used: java -Djava.io.tmpdir=TMP -jar /GenomeAnalysisTK/2.5.2/GenomeAnalysisTK.jar -R Ref.fa -T UnifiedGenotyper -I input1.bam -I input2.bam -I input3.bam -o output.raw.vcf -stand_call_conf 50 -stand_emit_conf 20 -out_mode EMIT_ALL_CONFIDENT_SITES. Usually I get A summary of callable base counts: Like INFO 00:23:29,795 UnifiedGenotyper - Visited bases 247249719 INFO 00:23:29,796 UnifiedGenotyper - Callable bases 219998386 INFO 00:23:29,796 UnifiedGenotyper - Confidently called bases 219936125 INFO 00:23:29,796 UnifiedGenotyper - % callable bases of all loci 88.978 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - Actual calls made 303126 But I did not get it. Can you tell me how can I calculate this from my output VCF file now? I am running some exome samples with UG. I tried this in 2 ways: 1. Run UG and then apply exome region filter. 2. Run UG with exome regions. Is there any difference in these approaches if I am only concerned about the variants in exome region? I have ran in both ways and on comparison I see that some values are different for the same locus. For eg. PL, MQRankSum, BaseQRankSum. However, the different is not huge but does UG perform slightly differently for making calls with target regions vs without? PS: I have ran this for UG with multiple samples If I specify both -inputPrior and -hets, which one will UnifiedGenotyper use to determine the prior? Is --heterozygosity used for anything other than prior specification? All of my whole-genome sequenced individuals are a cross of different outbred lines with one reference line - effectively identical to the reference genome used for read mapping and variant calling. Thus, effectively all SNPs are heterozygous relative to the reference genome. Should I alter the settings on the Unified Genotyper to maximise detection of these kind of variants ? Hello, I used UnifiedGenotyper and HaplotypeCaller on my sample to get out INDELS and SNPS. There is a 147bp deletion in one of the genes(working with yeast genome)(see attached file). But this deletion is not reported by UnifiedGenotyper and HaplotypeCaller. Since I am working with yeast genome(which is comparatively very small compared to human genome), I am only interested in INDELS ranging from 150bp-200bp maximum. Is there a way by changing parameters that this can happen. I have read some of the previous questions about how haplotypecaller is unable to call large INDELS(in kb's), but I guess in my case INDELS max I am looking for is 200bp. Can this be achieved?? Also UnifiedGenotyper has an option to provide ploidy level. I could not locate that option in HaplotypeCaller. Any way to have that on HaplotypeCaller ? I ask you this because the yeast genome samples I work with are eseentially haploid. NOTE: My sample is paired end data with 100bp reads Hope to hear from you soon Regards Varun We've used GATK(v2.1-8) in a SGE environment for more than 1 years now. Without any changing in environment, we have had a few cases of UnifiedGenotyper getting hung up while trying to read our previously created .fai and .dict reference files. FSLockWithShared - WARNING: Unable to lock file xxx/ucsc.hg19.dict; ReferenceDataSource - Unable to create a lock on dictionary file. Then GATK was updated to 2.7-4 and 3.1-1. There was still an error. "ERROR MESSAGE: Timeout of 30000 milliseconds was reached while trying to acquire a lock on file xxx/dbsnp_135.hg19.vcf.idx. Since the GATK uses non-blocking lock acquisition calls that are not supposed to wait, this implies a problem with the file locking support in your operating system." Besides, there is no problem reported when we run the UnifiedGenotyper step on data store node. This problem happens if we run the UnifiedGenotyper step on compute nodes. We used NFS for shared data between nodes. Is it the way, one is access the data directly and the other one is by NFS shared, leading this problem? If so, how can I fix this problem? Hoping someone can answer this problem as soon as possible. Thank you! Hi all, I was wondering if there is a recommended way how to call variants in trio data (affected children, healthy parents) using the Unified Genotyper. As I see it, there are three possible ways: 1. Call each sample individually 2. Call trios together 3. Call all samples together But which one would you recommend? Hi GATK team, I am using GATK to call random mutations caused by mutagen. I have more than 5 mutant lines sequenced and I want to call SNP/INDELs which are unique for each samples (since they are randomly introduced they must be unique for each samples). Majoirity of SNP will be shared between all mutant samples as this are the natural variations between line used for mutagenesis and the Reference sequence. I am using UnifiedGenotyper (because I have sequenced pooled DNA for each mutant family) by giving all the 5 bam files from samples. I wonder, by calling SNP from all 5 mutant samples together may negatively influence the unique SNPs? Does GATK consider all the 5 samples as a population and give preference (or high score to) locus which are shared between majority of samples? In my case, would it be better to call SNP/INDEL separately for all 5 samples? The command I am using is java -jar GenomeAnalysisTKLite-2.3-9-gdcdccbb/GenomeAnalysisTKLite.jar -T UnifiedGenotyper -glm BOTH -R ref.fa -nt 10 -I results/1_sorted.bam_fixed_rmdup_realign.bam -I results/2_sorted.bam_fixed_rmdup_realign.bam -I results/3_sorted.bam_fixed_rmdup_realign.bam -I results/4_sorted.bam_fixed_rmdup_realign.bam -I results/5_sorted.bam_fixed_rmdup_realign.bam -o results/out.vcf I hope I am clear, and looking forward to learn more, Thank you in advance I found that the allele counts of a position by GATK is different from those by IGV acculumation. For example, IGV gives (A:0, C:13, G:1, T:7) counts. But, the vcf file generated by GATK UnfiedGenotyper contains the following: chr1 10180 . T C 107.93 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-0.278;DP=17;Dels=0.00;FS=3.522;HRun=2;Haplot ypeScore=7.9790;MQ=25.98;MQ0=4;MQRankSum=-0.597;QD=6.35;ReadPosRankSum=1.663;SB=-67.79 GT:AD:DP:GQ:PL 0/1:7,9:17:98.17:138,0,98 I thought that there might be some filtered out bases. Did I guess right? Then, could you tell me What are the default filtering options in UnifiedGenotyper? HI, I'd like to report a weird result from HaplotypeCaller. We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B. However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected) Please see the VCF table: There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype. Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype. Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter. Many Thanks Betty Hello, I am running unified genotyper and for some reason in the log file it shows that all reads are being filtered out and no variants have been called in the vcf file. I get an empty vcf file with just the header. I could use some help regarding this. Dear, I want to output the number of all bases (ACGT) for each position using UnifiedGenotyper, in GATK2 we used the option -A BaseCounts. Now I am wondering what option I should use in GATK3, Thanks, Bram Hi I am hitting with the following error when I try to use UnifiedGenotyper. Wondering if its a bug or something wrong with my bam file? Thanks for your help in advance. ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalArgumentException: Illegal base [K] seen in the allele at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:205) at org.broadinstitute.variant.variantcontext.Allele.create(Allele.java:213) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.fillQualsFromPileup(RankSumTest.java:159) at org.broadinstitute.sting.gatk.walkers.annotator.RankSumTest.annotate(RankSumTest.java:113) at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateGenotypes(UnifiedGenotyperEngine.java:560) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:234) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNanoTraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Illegal base [K] seen in the allele ##### ERROR ------------------------------------------------------------------------------------------ Cheers Scott.. Hi, How many samples and of what file/genome size can the Unified Genotyper process simultaneously ? I will have about 250 samples, genome size 180Mb, bam sizes 3-5GB, reduced bam sizes ~1GB. Also, how long would this take can it be multi-threaded ? Cheers, Blue Dear GATK team I am trying to produce a mutli-sample VCF file, this is the command line I used: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R /home/mohameam/Pchabaudi/ref.fasta -I S1.bam -I S2.bam -I S3.bam -ploidy 1 -o raw.S1.S2.S3.vcf the program runs smoothly and the output file seems ok, however I get the following error msg in the end: ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.0-0-g6bad1c6): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Input/output error ##### ERROR ------------------------------------------------------------------------------------------ please advise ? thank you Alyaa Hi, we observe an unexpected deletion call (GATK UnifiedGenotyper 3.0 and 2.5) in a setup with three samples called together. In the file call.txt you find the call. From the pileup (pileup.txt, position 19:57325555) we would expect, that the indel would only be called for the first sample. Is there another source of information, that makes GATK believe, that the deletion also occurs in the second and third sample? Thanks in advance, Johannes Hi, I am running Variant Recalibration on Indels,prior to this I completed Variant Recalibration and ApplyRecalibration on SNPs. So,the input file is the recalibrated VCF file from Apply Recalibration step of SNP's. Below is the error I am getting. ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinsti tute.org/discussion/49/using-variant-annotator The command I am using for this is : jre1.7.0_40/bin/java -Djava.io.tmpdir=./rb_2905_VCF/tmp -Xmx2g -jar GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T VariantRecalibrator -R dbdata/human_g1k_v37.fasta -input${input_file} --maxGaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.vcf - resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.indels.b37.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -recalFile $destdir/${input_file%recal.snps.vcf}recal.indel.recal -tranchesFile $destdir/${input_file%recal.snps.vcf}recal.indel.tranches -rscriptFile $destdir/${input_file%recal.snps.vcf}recal.indel.plots.R

If I remove the options -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum,then I get this error:

##### ERROR ------------------------------------------------------------------------------------------

generated with gatk 2.8-1-g932cd3a

Although it is rare I see Genotype Fields that are inconsistent with the AD values (Read as table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802


This shows that for example sample 5 has a AD value of '23,12' and a GT of 'A/A' aka homyzougous reference allele. I've included a screenshot wich shows low base quality and complete strand bias (Which I suspect to mis variants). So whats the prob? and how can i recalculate the GT's based on AD? because i cannot filter based on genotypes when they are buggy....

Dear GATK team,

Could you please help me how to explain genotyping in my vcf file. I have Illumina data and vcf caller was GATK. My variant frequency (Alt variant freq) is 99.7%. DP = 4622 (AD = 16, 4606) - so I would expect that this sample is alternate homozygous. But when I check PL value, which is - PL = 1655,0,323 - after calculating my likelihood -

REF= G ALT= A

P(D|GG) = 10 ^ -165.5 = small
P(D|AG) = 10 ^ 0  = 1
P(D|AA) = 10 ^ 32.3 = small


we can see it is heterozygous. Can anybody help me how to interpret my result? How it is possible that likelihoods show me heterozygous and coverage and VF show me homozygous?

Here is part of my vcf file:

chr13 32899193 . G A 1625.01 PASS AC=1;AF=0.5;AN=2;DP=4622;QD=0.35;TI=NM_000059;GI=BRCA2;FC=Silent GT:AD:DP:GQ:PL:VF:GQX 0/1:16,4606:5000:99:1655,0,323:0.997:99

Thank you for any explanation.

Paul.

Hi, I have run gatk UnifiedGenotyper on ~45 human whole exomes. I have noticed that the maximum size for called indels in these exomes is 60bp. My question is, is 60 a default for the maximum indel size? I can't seem to find information regarding the threshold for indel size called by UnifiedGenotyper. Is the maximum indel size a parameter that can be adjusted?

In case this is of any use; This is the command I am using:

java -jar /usr/local/gatk/2.6-4/GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T UnifiedGenotyper -glm INDEL -I bams.list -L Targets.bed -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 1000

Thank you very much for your support! Dalia

Hi there,

I'm systematically comparing different read mappers right now. That is, I align reads and run MarkDup/IndelRealignment/Recalibration followed by the UnifiedGenotyper.

When using stampy, I only get SNP calls but no indels. I've extracted one example where the alignments clearly show a homozygous 14bp deletion that does not show up in the VCF. For comparison purposes, I've looked at GSNAP alignments at the same locus. There the deletion looks a bit less clear in the IGV but is nonetheless present in the VCF produced by UG (screenshot attached).

My call to UG (2.8-1-g932cd3a) is the following:

gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.stampy.bam -o example.stampy.vcf

gatk -T UnifiedGenotyper -glm BOTH -R hg18/hg18.fasta -rf BadCigar -L chr1:4742500-4744500 -I example.gsnap.bam -o example.gsnap.vcf


So my question boils down to: Why is the deletion called for GSNAP but not for STAMPY, although it looks much cleaner for STAMPY. I'm attaching example BAM files and UG output.

Your help is very much appreciated. Please let me know if there is any additional information I can provide.

Cheers, Tobi

Hi, I have run GATK Unified genotyper with multisample and I am getting unexpected output. (Hard filters used)

Here are examples of the output format I get in the vcf for a sample: 0/1:43,4:PASS:24:24,0,4368, ./.:.:PASS or 0/0:2,0:2:DPFilter:6:0,6,47 as well as 0/0:12,0:37:0,37,830. The latter is the format I would expect. What does this mean? Also I get the filter PASS even when all but two samples have ./.:.:PASS and the other two have 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0 and 1/1:0,1:1:DPFilter;GQFilter:3:38,3,0.

Also I ran HalpotypeCaller multisample and get similar output all though more of the 0/0:12,0:37:0,37,830 format....

I using the UnifiedGenotyper to call SNPs in a pooled sample of 30 diploid individuals (i.e., I am setting the ploidy to 60). Does this mean that if the coverage is < 60 at a given variant site, the vcf file will read "./." for all alleles at that site? In other words, does it require the coverage to be >= the ploidy or it won't produce a called variant at that site? I'm just trying to make sure I am interpreting the vcf file correctly, in that: if there is a called genotype at a given variant site that I can interpret that as the estimated allele frequency in that pool.

Best, Jon

Dear all in GATK forums,

I am dealing with a slightly annoying feature of GATK, and I am wondering if there is something I can do to prevent it. I am using multisample calling for a large cohort and UG version 2.8.1 and I see in my VCF situations like this one: 5 96443177 . GC G 862.42 . 5 96443178 . CC C 859.42 .

The same individual is a carrier for both, and quite clearly this is the same variant (a deletion of a C) called in two different places. It is not a disaster but it messes up some downstream analyses (case control...). Would you have a suggestion to avoid that problem?

Hi, I'm trying to understand how do you usually calculate the GQ of a SNP. I understood the model to calculate the Likelihood of all the genotypes (AA,AC, AG,AT,CC,CG,CT,GG,GT,TT). Once all the likelihood have been calculated, correct me if I'm wrong, you normalize the likelihood of the best genotype to 1 and all the other likelihoods according to that scale. So the PL field in the VCF should be the Phred-scaled values of the normalized values. But is not clear to me how do you finally calculate the GQ value. What values do you use to calculate that quality (normalized or phred scaled)? And what's the right formula? I've tried to debug the code but it ends to be really tricky. I really hope that you would help me. I would be really thankfull for that.

Hi All We are running into some random weirdness when running jobs using SGE, GATK version 2.7-2-g6bda569, pretty much all GATK tools - but mostly IndelRealigner abd UnifiedGenotyper, we often get the following error:-

##### ERROR MESSAGE: Couldn't read file /scratch/project/pipelines/novorecal.bam because java.io.FileNotFoundException: /scratch/project/pipelines/novorecal.bam (No such file or directory)

This also happens for supplied reference genomes and vcf files. The GATK tool cant find them.

These "missing" files do exist, and have often even been created by the previous tool/step in the pipeline.

When we re-run the pipeline on a failed sample, it works. So we end up having to re-run our pipeline on the same set of samples multiple times and are beginning to find this very frustrating. These errors seem to be random, I cant find any pattern, and as I mentioned, when we re-run the pipeline on a failed run, it work without a hitch.

Has anyone experienced this? And if so, any recommendations?

Steve

Hi all,

I am running UnifiedGenotyper GLM=Indel Ploidy=1, which produces my raw indels, with no apparent problems. I then run BaseRecalibrator and PrintReads to generate a new BAM, which i am then recalling the indels. This time, I am getting no indels in the VCF (only headers). Curiously, this problem is only present in all 13/52 isolates that are most distant from the reference (expecting ~50K indels), while the rest have all worked (~6K indels). Any clues what might be causing this problem? Thanks, Rhys

Hi, I encounter a strange error when using UnifiedGenotyper, The command that I use is: /usr/bin/java -Xmx4g -jar /home/shengyu_ni/bin/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /mnt/genotyping/sendru/human_g1k_v37.fasta -I bqsr.bam -o haplogroup.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -alleles /mnt/genotyping/sendru/isogg.vcf -nt 4 -ploidy 1 -L /mnt/genotyping/sendru/comprey.interval_list

and the error mesage is:

java.lang.ArrayIndexOutOfBoundsException: -1 at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoods.subsetToAlleles(GeneralPloidyGenotypeLikelihoods.java:380) at org.broadinstitute.sting.gatk.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:294) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutorWorker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: -1 ##### ERROR ------------------------------------------------------------------------------------------ I already notice that it can be some compatible problem with my file: isogg.vcf, because when I replace that file with another vcf file, it runs successfully, but I can not spot what makes the difference. so I list the header of the vcf file as the following, additionally, this vcf file works fine with realignment ## fileformat=VCFv4.1 ## fileDate=20140131 ## source=isogg ## isogg_BUILD_ID=2014 ## reference=GRCh37.p10 ## phasing=partial ## INFO=<ID=ISOGG,Number=0,Type=Flag,Description="Used as y chromosome haplogroup in ISOGG. "> ## FILTER=<ID=PASS,Description="Only pass snp is kept, the other possibles are private and investigation. "> ## contig=<ID=1,assembly=b37,length=249250621> ## contig=<ID=10,assembly=b37,length=135534747> ## contig=<ID=11,assembly=b37,length=135006516> ## contig=<ID=12,assembly=b37,length=133851895> ## contig=<ID=13,assembly=b37,length=115169878> ## contig=<ID=14,assembly=b37,length=107349540> ## contig=<ID=15,assembly=b37,length=102531392> ## contig=<ID=16,assembly=b37,length=90354753> ## contig=<ID=17,assembly=b37,length=81195210> ## contig=<ID=18,assembly=b37,length=78077248> ## contig=<ID=19,assembly=b37,length=59128983> ## contig=<ID=2,assembly=b37,length=243199373> ## contig=<ID=20,assembly=b37,length=63025520> ## contig=<ID=21,assembly=b37,length=48129895> ## contig=<ID=22,assembly=b37,length=51304566> ## contig=<ID=3,assembly=b37,length=198022430> ## contig=<ID=4,assembly=b37,length=191154276> ## contig=<ID=5,assembly=b37,length=180915260> ## contig=<ID=6,assembly=b37,length=171115067> ## contig=<ID=7,assembly=b37,length=159138663> ## contig=<ID=8,assembly=b37,length=146364022> ## contig=<ID=9,assembly=b37,length=141213431> ## contig=<ID=GL000191.1,assembly=b37,length=106433> ## contig=<ID=GL000192.1,assembly=b37,length=547496> ## contig=<ID=GL000193.1,assembly=b37,length=189789> ## contig=<ID=GL000194.1,assembly=b37,length=191469> ## contig=<ID=GL000195.1,assembly=b37,length=182896> ## contig=<ID=GL000196.1,assembly=b37,length=38914> ## contig=<ID=GL000197.1,assembly=b37,length=37175> ## contig=<ID=GL000198.1,assembly=b37,length=90085> ## contig=<ID=GL000199.1,assembly=b37,length=169874> ## contig=<ID=GL000200.1,assembly=b37,length=187035> ## contig=<ID=GL000201.1,assembly=b37,length=36148> ## contig=<ID=GL000202.1,assembly=b37,length=40103> ## contig=<ID=GL000203.1,assembly=b37,length=37498> ## contig=<ID=GL000204.1,assembly=b37,length=81310> ## contig=<ID=GL000205.1,assembly=b37,length=174588> ## contig=<ID=GL000206.1,assembly=b37,length=41001> ## contig=<ID=GL000207.1,assembly=b37,length=4262> ## contig=<ID=GL000208.1,assembly=b37,length=92689> ## contig=<ID=GL000209.1,assembly=b37,length=159169> ## contig=<ID=GL000210.1,assembly=b37,length=27682> ## contig=<ID=GL000211.1,assembly=b37,length=166566> ## contig=<ID=GL000212.1,assembly=b37,length=186858> ## contig=<ID=GL000213.1,assembly=b37,length=164239> ## contig=<ID=GL000214.1,assembly=b37,length=137718> ## contig=<ID=GL000215.1,assembly=b37,length=172545> ## contig=<ID=GL000216.1,assembly=b37,length=172294> ## contig=<ID=GL000217.1,assembly=b37,length=172149> ## contig=<ID=GL000218.1,assembly=b37,length=161147> ## contig=<ID=GL000219.1,assembly=b37,length=179198> ## contig=<ID=GL000220.1,assembly=b37,length=161802> ## contig=<ID=GL000221.1,assembly=b37,length=155397> ## contig=<ID=GL000222.1,assembly=b37,length=186861> ## contig=<ID=GL000223.1,assembly=b37,length=180455> ## contig=<ID=GL000224.1,assembly=b37,length=179693> ## contig=<ID=GL000225.1,assembly=b37,length=211173> ## contig=<ID=GL000226.1,assembly=b37,length=15008> ## contig=<ID=GL000227.1,assembly=b37,length=128374> ## contig=<ID=GL000228.1,assembly=b37,length=129120> ## contig=<ID=GL000229.1,assembly=b37,length=19913> ## contig=<ID=GL000230.1,assembly=b37,length=43691> ## contig=<ID=GL000231.1,assembly=b37,length=27386> ## contig=<ID=GL000232.1,assembly=b37,length=40652> ## contig=<ID=GL000233.1,assembly=b37,length=45941> ## contig=<ID=GL000234.1,assembly=b37,length=40531> ## contig=<ID=GL000235.1,assembly=b37,length=34474> ## contig=<ID=GL000236.1,assembly=b37,length=41934> ## contig=<ID=GL000237.1,assembly=b37,length=45867> ## contig=<ID=GL000238.1,assembly=b37,length=39939> ## contig=<ID=GL000239.1,assembly=b37,length=33824> ## contig=<ID=GL000240.1,assembly=b37,length=41933> ## contig=<ID=GL000241.1,assembly=b37,length=42152> ## contig=<ID=GL000242.1,assembly=b37,length=43523> ## contig=<ID=GL000243.1,assembly=b37,length=43341> ## contig=<ID=GL000244.1,assembly=b37,length=39929> ## contig=<ID=GL000245.1,assembly=b37,length=36651> ## contig=<ID=GL000246.1,assembly=b37,length=38154> ## contig=<ID=GL000247.1,assembly=b37,length=36422> ## contig=<ID=GL000248.1,assembly=b37,length=39786> ## contig=<ID=GL000249.1,assembly=b37,length=38502> ## contig=<ID=MT,assembly=b37,length=16569> ## contig=<ID=X,assembly=b37,length=155270560> ## contig=<ID=Y,assembly=b37,length=59373566> # CHROM POS ID REF ALT QUAL FILTER INFO Y 2649696 M236 G T . PASS ISOGG Y 2655180 M176 C G . PASS ISOGG Y 2656127 Z12426 G A . PASS ISOGG Y 2656959 L1233 G C . PASS ISOGG Thank you. I run UnifiedGenotyper for 84 samples using RNAseq data (Bam-files). I got the following information in the Log file. As it can bee seen, about 27% of my reads failed for many reasons. How can I decrease the percentage of reads which filtered? In other words how I can improve my QC of data before UnifiedGenotyper. Thanks! INFO 14:15:23,762 MicroScheduler - 970952111 reads were filtered out during the traversal out of approximately 3587224222 total reads (27.07%) INFO 14:15:23,762 MicroScheduler - -> 147143075 reads (4.10% of total) failing BadMateFilter INFO 14:15:23,762 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 14:15:23,762 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 14:15:23,763 MicroScheduler - -> 566795160 reads (15.80% of total) failing MalformedReadFilter INFO 14:15:23,763 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 14:15:23,763 MicroScheduler - -> 257013876 reads (7.16% of total) failing NotPrimaryAlignmentFilter INFO 14:15:23,763 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter  Hi all, I am running a scala script, and would like to the include "-ploidy 1". Any advice on how can I do this? Some information that may or may not be relevant (idk!): I am attempting this using UnifiedGenotyper (v2.7-4). At the top of the script I have: package org.broadinstitute.sting.queue.qscripts.examples import org.broadinstitute.sting.queue.QScript import org.broadinstitute.sting.queue.extensions.gatk._ I got the glm both to work using: genotyper.glm = org.broadinstitute.sting.gatk.walkers.genotyper.GenotypeLikelihoodsCalculationModel.Model.BOTH I was hoping for something similar for ploidy. Thanks, Rhys Hi, everyone. from.. GATK Document -out_mode,--output_mode specifies which sites to emit; possible values are EMIT_VARIANTS_ONLY (the default), EMIT_ALL_CONFIDENT_SITES (include confident reference sites), or EMIT_ALL_SITES (any callable site regardless of confidence). I really want to know the meaning of confident reference site. When I calling with the GATK UnifiedGenotyper EMIT_ALL_CONFIDENT_SITES option in each sample BAM file, Can I distinguish the genotype in each sample? (No call, Ref homo, Alt homo, Hetero) In other words, I know the some site is no call or ref homo for this purpose. Hi sir, I used GATK combine variants to call SNP from my 7 vcf files. Now I am using unified genotyper to call all samples simultaneously for SNP. When I focused on my gene of interest (DLX6), in combine variants output, there are 5 "./." among 7 samples. but in case of multisample SNP calling, DLX6 gave 0/0 in those samples and its corresponding information. So, I want to know what is the meaning of "./." in combine variants results and which results should I take for analysis? How I can remove SNPs with Low quality or out of Hardy-Weinberg in my VCF file. In column 7 of my VCF file I have only "LowQual" or ".". There is no "PASS" word in this column. Is there something wrong in my VCF or any problem with UnifiedGenotyper procedure? thanks Hi, I'm updating my pipeline for exome sequencing analysis, so I'm experiencing the HaplotypeCaller capabilities! I have analyzed the same sample with the UnifiedGenotyper walker and the HC one and I have examined the differences between the two output vcf files and I had a very bad finding... HC failed to find a true novel variant!! I know that this is a true variants because I validated that with Sanger sequencing after the first calling with UG. I have run UG using GATK version 1.6-11-g3b2fab9. This is the VCF line of the variant: chr7 45123943 . A T 3436.17 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=2.043;DP=114;Dels=0.00;FS=2.678;HRun=1;HaplotypeScore=0.0000;MQ=42.18;MQ0=1;MQRankSum=2.152;QD=30.14;ReadPosRankSum=-0.781;SB=-1010.47 GT:AD:DP:GQ:PL 1/1:8,105:114:99:3436,253,0 I have run HC using GATK version 2.7-4-g6f46d11 both in a single- and in a multi-sample manner but not the shadow of this variant in the VCF output.. I also noticed that together with this novel variant, HC lost other two variants upstream the first; these are the VCF lines: chr7 45123881 rs61740891 C T 654.25 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=0.205;DB;DP=43;DS;Dels=0.00;FS=65.862;HRun=0;HaplotypeScore=2.2312;MQ=30.27;MQ0=1;MQRankSum=3.254;QD=15.22;ReadPosRankSum=-3.921;SB=-3.02 GT:AD:DP:GQ:PL 0/1:18,25:43:99:684,0,176 chr7 45123888 . C T 161.90 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-2.293;DP=26;DS;Dels=0.00;FS=49.656;HRun=2;HaplotypeScore=0.0000;MQ=23.81;MQ0=1;MQRankSum=0.425;QD=6.23;ReadPosRankSum=-3.821;SB=-3.00 GT:AD:DP:GQ:PL 0/1:17,9:26:99:192,0,249 How is it possible? Many thanks in advance Best, Flavia Hi, I must be doing something silly because all of my VCF files are empty (header, no results) after calling with Unified Genotyper. I'm using GATK version 2.7-2-g6bda569 with Java 1.7.00_40. The FASTQs were aligned with Novoalign. java -Xmx4g -Djava.io.tmpdir=tmp -jar GenomeAnalysisTK.jar -R hg19_random.fa -D dbsnp_137.hg19.vcf -T UnifiedGenotyper -I data/gatk.realigned.recal.chr1-1-249250621.bam -o data/gatk.snps.raw.chr1-1-249250621.vcf -metrics data/gatk.snps.raw.chr1-1-249250621.vcf.metrics -L chr1:1-249250621 --computeSLOD -stand_call_conf 30 -stand_emit_conf 1 -nt 8 -K GATK_public.key -et NO_ET Here is a snippet from my BAM file: H801:144:C39TBACXX:7:2113:7016:95747 73 chr1 671881 20 101M = 671881 0 GCGGAGGCTGCCGTGACGTAGGGTATGGGCCTAAATAGGCCATTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTCGGCCTGGAGAG YYYZZZZZZZZZZZZYZ]ZZ]]]YYYZ]]]ZZ]]]Z]Z]Z]]]]]Z]]]YZZZ]]]]ZYZZZZZZZZZYYYZZZZYYZZYZYZZZYZYZZZZZZZZZWXWW BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] MD:Z:101 PG:Z:novoalign RG:Z:SWID:testing:0 BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] NM:i:0 UQ:i:0 AS:i:0 H801:144:C39TBACXX:7:2302:11261:7162 153 chr1 971492 70 101M = 971492 0 GCTGTCTGTGAGGGCTGTGCTGAGGCCTTCCTGACCAGCACATGGGGTGGGAAGGACGACCTGGGGAATCCTGAAGTGATCTGAAGACAGAGCCCTGGGCT WYWXYZZZZZYZZXZYZZZYZZZZZZYYZZZZZZZZZZZZZZY]]]]ZZYYYZZZYYZZZZ]]]ZZZYZXZ]ZZZZZZYZ]]]]ZZYZZZZYZZZZZZYYY BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] MD:Z:101 PG:Z:novoalign RG:Z:SWID:testing:0 BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] NM:i:0 UQ:i:0 AS:i:0 H801:144:C39TBACXX:7:2302:13699:15898 153 chr1 971492 70 101M = 971492 0 GCTGTCTGTGAGGGCTGTGCTGAGGCCTTCCTGACCAGCACATGGGGTGGGAAGGACGACCTGGGGAATCCTGAAGTGATCTGAAGACAGAGCCCTGGGCT XYXZZYZZZZYZZZYYXYZZYZZZZZYYZZZZYYZZYYZZZYY]]]]]ZZZ]]]ZZZYZ]]Z]]Z]ZZZ]]]]]]ZZYZZ]]ZZYYYZYZXZZZZZZZYYX BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] MD:Z:101 PG:Z:novoalign RG:Z:SWID:testing:0 BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] NM:i:0 UQ:i:0 AS:i:0 H801:144:C39TBACXX:7:2206:12702:96307 97 chr1 987905 70 101M = 988021 217 GTGTGCATATGGGTCCATGTATGTGTGTGTATATGAGGGAGACACGCAGGTGTGTGTCTGAGTGTGTGCGCACATGGGTCCATGTATGTGTGTGTATAGGT YXYZZZZZZZZZZZZ]]]]ZZZ]ZZZYYZZZZ]]]]]]]Z]ZZ]Z]]]]ZZZZZZZYZZ]ZZXYYYWYYZWXWXYWWWYZZYZYZZZYZZZZWXWWYZZYW BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] MD:Z:101 PG:Z:novoalign RG:Z:SWID:testing:0 BI:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] NM:i:0 MQ:i:70 UQ:i:0 AS:i:0 H801:144:C39TBACXX:7:2206:12702:96307 145 chr1 988021 70 101M = 987905 -217 GTGTGTGTCCGTGTGTGTGCATGGGTCCATGTGTGTATAGTGTGTACACATGGGTCCATGTATGTGTGTGTATATGAGGGAGACACGCAGGTGTGTGTCCG WWWWWZXZXXWWWXZYWWWYWWWWWWWXWXZYXYWWYYWX]]]ZYXZZYZZ]]]]]]]Z]Z]]]]]]]]]]]]]Z]]]]]]]]]Z]]]ZZZZZZZZZZYYY BD:Z:]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]  And the output on the command line when running INFO 12:44:55,654 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:44:55,656 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-2-g6bda569, Compiled 2013/08/28 16:30:29 INFO 12:44:55,656 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 12:44:55,656 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 12:44:55,660 HelpFormatter - Program Args: -R hg19_random.fa -D dbsnp_137.hg19.vcf -T UnifiedGenotyper -I data/gatk.realigned.recal.chr1-1-249250621.bam -o data/gatk.snps.raw.chr1-1-249250621.vcf -metrics data/gatk.snps.raw.chr1-1-249250621.vcf.metrics -et NO_ET -L chr1:1-249250621 --computeSLOD -stand_call_conf 30 -stand_emit_conf 1 -nt 8 -K GATK_public.key INFO 12:44:55,660 HelpFormatter - Date/Time: 2014/01/16 12:44:55 INFO 12:44:55,661 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:44:55,661 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:44:55,755 ArgumentTypeDescriptor - Dynamically determined type of dbsnp_137.hg19.vcf to be VCF INFO 12:44:56,282 GenomeAnalysisEngine - Strictness is SILENT INFO 12:44:56,370 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 12:44:56,377 SAMDataSourceSAMReaders - Initializing SAMRecords in serial
INFO  12:44:56,433 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05 INFO 12:44:56,490 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:56,736 IntervalUtils - Processing 249250621 bp from intervals INFO 12:44:56,749 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 24 processors available on this machine INFO 12:44:56,799 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 12:44:56,992 GenomeAnalysisEngine - Done preparing for traversal INFO 12:44:56,992 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 12:44:56,992 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 12:44:57,085 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,090 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00 INFO 12:44:57,095 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,101 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,105 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,108 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf
INFO  12:44:57,112 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,114 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,124 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,129 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,140 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,168 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,177 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 12:44:57,186 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:44:57,191 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00 INFO 12:44:57,285 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:57,482 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:57,655 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:57,827 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:57,996 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:44:58,301 RMDTrackBuilder - Loading Tribble index from disk for file dbsnp_137.hg19.vcf INFO 12:45:26,995 ProgressMeter - chr1:93237277 9.18e+07 30.0 s 0.0 s 37.4% 80.0 s 50.0 s INFO 12:45:56,999 ProgressMeter - chr1:200739385 1.99e+08 60.0 s 0.0 s 80.5% 74.0 s 14.0 s INFO 12:46:12,672 ProgressMeter - done 2.49e+08 75.0 s 0.0 s 100.0% 75.0 s 0.0 s INFO 12:46:12,673 ProgressMeter - Total runtime 75.68 secs, 1.26 min, 0.02 hours INFO 12:46:12,673 MicroScheduler - 5225 reads were filtered out during the traversal out of approximately 500463 total reads (1.04%) INFO 12:46:12,673 MicroScheduler - -> 5225 reads (1.04% of total) failing BadMateFilter INFO 12:46:12,673 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 12:46:12,673 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 12:46:12,674 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 12:46:12,674 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 12:46:12,674 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 12:46:12,674 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter  Can anyone tell me what might be wrong? It's reliably failing for every chromosome on reasonably-sized data (this BAM is 20M with ~500k reads). Thanks in advance Hi, I'm trying to run UnifiedGenotyper on some data, and I'm getting an error I can't make heads nor tails of. It seems to be complaining that I have an equal sign in the sample name, which as far as I can tell I do not. Bellow is the command I've executed, the stack trace, and part of the header from the sam file I'm trying to call data from. Command: java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../reference/Spolyrrhiza.current.genomic.fasta -I SRR072269_reorder_readgroup_sorted.bam --output_mode EMIT_ALL_SITES -o robertTest.vcf Stack: Spolyrrhiza.current.genomic.fasta -I SRR072269_reorder_readgroup_sorted.bam --output_mode EMIT_ALL_SITES -o Test.vcf INFO 13:02:22,959 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:02:22,964 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 INFO 13:02:22,964 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:02:22,964 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:02:22,973 HelpFormatter - Program Args: -T UnifiedGenotyper -R ../reference/Spolyrrhiza.current.genomic.fasta -I SRR072269_reorder_readgroup_sorted.bam --output_mode EMIT_ALL_SITES -o robertTest.vcf INFO 13:02:22,974 HelpFormatter - Date/Time: 2014/01/15 13:02:22 INFO 13:02:22,974 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:02:22,974 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:02:24,330 GenomeAnalysisEngine - Strictness is SILENT INFO 13:02:25,935 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 13:02:25,950 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 13:02:26,529 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.58 INFO 13:02:27,069 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 13:02:27,188 GenomeAnalysisEngine - Done preparing for traversal INFO 13:02:27,189 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 13:02:27,190 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 13:02:28,450 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalArgumentException: VCFHeaderLine: ID cannot contain an equals sign at org.broadinstitute.variant.vcf.VCFSimpleHeaderLine.initialize(VCFSimpleHeaderLine.java:80) at org.broadinstitute.variant.vcf.VCFSimpleHeaderLine.(VCFSimpleHeaderLine.java:71) at org.broadinstitute.variant.vcf.VCFContigHeaderLine.(VCFContigHeaderLine.java:52) at org.broadinstitute.variant.vcf.VCFUtils.makeContigHeaderLine(VCFUtils.java:165) at org.broadinstitute.variant.vcf.VCFUtils.makeContigHeaderLines(VCFUtils.java:156) at org.broadinstitute.variant.vcf.VCFUtils.withUpdatedContigsAsLines(VCFUtils.java:128) at org.broadinstitute.variant.vcf.VCFUtils.withUpdatedContigsAsLines(VCFUtils.java:114) at org.broadinstitute.variant.vcf.VCFUtils.withUpdatedContigs(VCFUtils.java:110) at org.broadinstitute.sting.utils.variant.GATKVCFUtils.withUpdatedContigs(GATKVCFUtils.java:175) at org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub.writeHeader(VariantContextWriterStub.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.initialize(UnifiedGenotyper.java:304) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: VCFHeaderLine: ID cannot contain an equals sign ##### ERROR ------------------------------------------------------------------------------------------ Sam header: @HD VN:1.4 SO:coordinate ... @RG ID:1 PL:illumina PU:flowcell_1 LB:library_1 SM:SRR072269$@PG ID:dvtgm PN:stampy VN:1.0.21_(r1713) CL:-g ../reference/spirodela -h ../reference/spirodela -M ../rawdata/SRR072269.sra.fastq@CO TM:Fri, 10 Jan 2014 17:43:03 EST WD:/data/home/learnspirodela/stampy_out HN:server.ca UN:williarj

Also, I've noticed that the reads in the files I'm using do have "=" in the read headers:

@SRR072258.sra.2 GL1KSOJ02I7J10 length=93 GACTGGACCTTCTGAGCCTTGCACCAGCGCATTGATGCGGACCTTGAACTCCTCATACTCCNCTGNACCGCCAGGCCAGCAGGGNTAGGGNNN +SRR072258.sra.2 GL1KSOJ02I7J10 length=93 BBBBAA:66669AAAAAABA@:655BBAA??<9:::441-/----411446<=?=>=<<44!34<!<;9722/..997774111!43,,,!!!

Could that be where my mystery equal sign problem originates?

One of my samples has this entry "1/1:0,1:1:3:36,3,0" in the information field, and from my understanding, the genotype is homo variant, because it has 1/1. However, I do not understand why. Since it has 0 REF reads and has only 1 ALT read, how does GATK tell this variant is a homo variant??

Hi all ! We have been using GATK for a few years now. Unified Genotyper was working perfectly fine (exome sequencing, TCGA bam files) for SNV and Indels using the -n option. The calls were taking around 15 to 20minutes.

Then it changed to -nct -nt for parallelization. We have assigned 6CPUs for each computation so we tried the different options -nt 6 -nct 1 : This fails very quickly with the "Too many files open" ERROR MESSAGE - nt 2 -nct 3 or -nt 3 -nct 2 : This runs for about 10minutes and then fails with the exact same message - nt 1 -nct 6 : It does run with no error but takes around 1hour instead of the 20minutes we had previously achieved.

We did set up the limit of open files to 65535. It is weird because when it crashes, I then go to the tmp directory that I put in the java command, and there are not even 100 .tmp files created so I'm wondering what could be the issue.

We have tried it with all versions from 6 to 8 and the same problem happens.

It's really creating a bottleneck for us right now so I'm wondering why, however the limit is not reached, GATK keeps crashing.

Thanks all for your help. Happy holidays. Manon

Hey there,

I was trying to build an analysis pipeline for paired reads with BWA, Duplicate Removal Local Realignment and Base Quality Score Recalibration to finally use GATK's UnifiedGenotyper for SNP and Indel calling. However, for both SNPs and Indels, I receive no called variants no matter how low my used thresholds are. Quality values of the reads look ok, leaving out dbSNP does not change results. I have used the same reference throughout the whole pipeline. I use GATK 2.7, nevertheless a switch to GATK 1.6 did not change anything.

This is my shell command for SNP calling on chromosome X (GATK delivers no results for all chromosomes): java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o test.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf

Entries in my BAM file look like this: SRR389458.1885965 113 X 10092397 37 76M = 10092397 1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1885965 177 X 10092397 37 76M = 10092397 -1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1888837 113 X 14748343 37 76M = 14748343 1 TCGTGAAAGTCGTTTTAATTTTAGCGGTTATGGGATGGGTCACTGCCTCCAAGTGAAAGATGGAACAGCTGTCAAG 889999:9988;98:9::9;9986::::99:8:::::999988989:8;;9::989:999:9;9:;:99:98:999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:76 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 MQ:i:37 XT:A:U

And this is the output of the UnifiedGenotyper: INFO 17:57:00,575 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,578 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 INFO 17:57:00,578 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:57:00,578 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 17:57:00,582 HelpFormatter - Program Args: -T UnifiedGenotyper -R /hana/exchange/reference_genomes/hg19/Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o testX.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:00,583 HelpFormatter - Date/Time: 2013/12/17 17:57:00 INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,943 ArgumentTypeDescriptor - Dynamically determined type of /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf to be VCF INFO 17:57:01,579 GenomeAnalysisEngine - Strictness is SILENT INFO 17:57:02,228 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 17:57:02,237 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 17:57:02,364 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.13 INFO 17:57:02,594 RMDTrackBuilder - Loading Tribble index from disk for file /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:02,867 IntervalUtils - Processing 155270560 bp from intervals INFO 17:57:02,943 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 17:57:03,166 GenomeAnalysisEngine - Done preparing for traversal INFO 17:57:03,167 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:57:03,167 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 17:57:33,171 ProgressMeter - X:11779845 1.01e+07 30.0 s 2.0 s 7.6% 6.6 m 6.1 m INFO 17:58:03,173 ProgressMeter - X:24739805 1.89e+07 60.0 s 3.0 s 15.9% 6.3 m 5.3 m INFO 17:58:33,175 ProgressMeter - X:37330641 3.25e+07 90.0 s 2.0 s 24.0% 6.2 m 4.7 m INFO 17:59:03,177 ProgressMeter - X:49404077 4.94e+07 120.0 s 2.0 s 31.8% 6.3 m 4.3 m INFO 17:59:33,178 ProgressMeter - X:64377965 5.36e+07 2.5 m 2.0 s 41.5% 6.0 m 3.5 m INFO 18:00:03,180 ProgressMeter - X:75606869 7.54e+07 3.0 m 2.0 s 48.7% 6.2 m 3.2 m INFO 18:00:33,189 ProgressMeter - X:88250233 7.74e+07 3.5 m 2.0 s 56.8% 6.2 m 2.7 m INFO 18:01:03,190 ProgressMeter - X:100393213 9.94e+07 4.0 m 2.0 s 64.7% 6.2 m 2.2 m INFO 18:01:33,192 ProgressMeter - X:110535705 1.09e+08 4.5 m 2.0 s 71.2% 6.3 m 109.0 s INFO 18:02:03,193 ProgressMeter - X:121257489 1.20e+08 5.0 m 2.0 s 78.1% 6.4 m 84.0 s INFO 18:02:33,195 ProgressMeter - X:132533757 1.32e+08 5.5 m 2.0 s 85.4% 6.4 m 56.0 s INFO 18:03:03,197 ProgressMeter - X:144498909 1.41e+08 6.0 m 2.0 s 93.1% 6.4 m 26.0 s INFO 18:03:30,079 ProgressMeter - done 1.55e+08 6.4 m 2.0 s 100.0% 6.4 m 0.0 s INFO 18:03:30,079 ProgressMeter - Total runtime 386.91 secs, 6.45 min, 0.11 hours INFO 18:03:30,080 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 150 total reads (0.00%) INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing BadMateFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter INFO 18:03:32,167 GATKRunReport - Uploaded run statistics report to AWS S3

Do I miss anything here?

Best,

Cindy

Dear all,

all my GATK 2.8.1 UG jobs crash with the "Null alleles are not supported" error message. It works fine with 2.7.4 (more precisely the nightly release version as of November 28) but this error message kills all my jobs right now with the just released 2.8.1 version. Based on the traces, it looks like an indel calling issue.

It's multisample calling, with reduced reads format. The issue may come from the fact that a handful of BAM files have been reduced with older versions of GATK (something I am currently fixing, but it takes time). The fact that the issue is brand new to 2.8.1 suggests something different though but I can't be sure. Full error message below.

Any idea?

V

##### ERROR stack trace

java.lang.IllegalArgumentException: Null alleles are not supported at org.broadinstitute.variant.variantcontext.Allele.(Allele.java:117) at org.broadinstitute.sting.utils.haplotype.Haplotype.(Haplotype.java:61) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.trimHaplotypes(PairHMMIndelErrorModel.java:235) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:438) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeDiploidReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:251) at org.broadinstitute.sting.gatk.walkers.genotyper.IndelGenotypeLikelihoodsCalculationModel.getLikelihoods(IndelGenotypeLikelihoodsCalculationModel.java:149) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724)

##### ERROR ------------------------------------------------------------------------------------------

Dear Team, I was looking at a VCF file produced with UnifiedGenotyper (2.4.9). It is a multisample call and, for a limited number of calls, I have genotypes that are telling the exact opposite of AD field, as in this case

GT:AD:DP:GQ:PL  1/1:10,1:11:3:24,3,0


or

GT:AD:DP:GQ:PL  1/1:18,1:19:3:22,3,0


I have ten reads supporting the reference allele, 1 read supporting the alternate and the genotype is 1/1. This is happening in ~200 sites per sample in my dataset. I've checked the other way around and I found <100 sites in which the genotype is called 0/0 and the AD suggests 1/1 or (more frequently) 0/1. This seems to happen in sites in which the number of variant samples is low (no more than 3 samples in a set of ~50 samples) and it is puzzling me a lot. Can you give me a comment on why this is happening? Thanks

d

Is it possible to source and read bam files directly from ftp site into Unified Genotyper without downloading them first. This option works in samtools view -- but if I try the simple straightforward way in GATK (just replacing the -I inputfilename.bam with -I ftplinktobamfile.bam ) it does not work. Is there another way of doing this? This would save me a lot of diskspace if I could do this.

Hi there,

I compared HaplotypeCaller and UnifiedGenotype of the latest version of GATK (2.7-4) on the same NA12878 trio. It seems HC missed some true variants (specifically, 7 out of the 49 experimentally validated de novo SNVs are missed). I guess this might be partly due to HCMappingQualityFilter, because the log says ~9% of the total reads have been filtered by this filter, and it looks there is not such a filter in UnifiedGenotyper. I wonder is there a safe way to relax HCMappingQualityFilter (default 20) to achieve a better sensitivity for HC?

Aaron

Hi, I'm using unifiedgenotyper for the SNP calling in one lane illumina RNA-seq data, the bam file is ~15gb. The command I used is: java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R assembly.fa -I seq.bam --out result.vcf -ploidy 48 -stand_call_conf 20 -stand_emit_conf 20.0 I run it with 2 cpus and 72gb memory. It has been run for 3 days (I also used -nt 12 to run it with 12 cpus at the same time but the speed didn't improve significantly) and haven't finished. In addition, there are only 6000 loci were write into the vcf file by now, but I analysed this data using other pipelines at the same time, all of them have been finished and reported ~1 million loci. So can anybody tell me usually how long it will take for unifiedgenotyper to analysis one lane illumina RNA-seq data? Thanks.

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8


Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH


Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" &&$4 > (41245466-100) && $4 < (41245466+100)) print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0


With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Sam

I run UnifiedGenotyper for my 42 samples (bam files) to get VCF file. after almost 48 hours I got my vcf file but there was no 0/0 reference homozygous genotype in the vcf file. How I can achieve these genotype information? I need them for the eQTL analyses. I want to see all 3 class of genotype in the vcf file (0/0, 0/1, and 1/1). Thanks

Dear GATK Team,

I am receiving the following error while running GATK 1.6. Unfortunately, for project consistency I cannot update to a more recent version of GATK and would at least wish to understand the source of the error. I ran ValidateSamFile on the input bam files and they appear to be OK.

Any insight or advice would be greatly appreciated:

##### ERROR ------------------------------------------------------------------------------------------

##### ERROR ------------------------------------------------------------------------------------------

Abbreviated commandline used:

GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -et NO_ET \ -R "Saccharomyces_cerevisiae/UCSC/sacCer2/Sequence/WholeGenomeFasta/genome.fa" \ -dcov 5000 -I "someFile.bam" --output_mode EMIT_ALL_SITES -gvcf -l OFF \ -stand_call_conf 1 -L chrIV:1-1531919

Dear GATK team, I have used the UG following the best practice GATK workflow to call snps and Indels from exomeseq data of 24 human samples. First I called snps and Indels separately for each bam file, and I obtained separate vcfs. Then I decided to try to call the snps and Indels all in one go. I noticed that the output was quite different and the number of inser/delitions was higher when I called variants in contemporary (starting from separate bam files: -I sample1.bam -I sample2.bam...ETC). I also noticed that the called indels mostly were adjacent to tricky areas such as repetitive elements (ATATATATATATAT) or next to polyAAAAA. These snps and Indels weren't called by the UG when I called the variants separately. Is it more error prone to call variants in contemporary?

Hey there,

I'm currently working on a project that uses the MiSeq system to perform very deep sequencing (5000x coverage) on a small set of genes related to a particular cancer type. I have had some odd calls with HaplotypeCaller and UnifiedGenotyper, especially near the edges of the sequenced regions. It seems that often Haplotype calls very large deletions and Unified calls SNPs, e.g. (from the VCFs):

Haplotype Caller:

chr2 212576988 . TATTTTTAATTGTA T 13.95 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.805;ClippingRankSum=-0.916;DP=1000;FS=8.285;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.836;QD=0.00;ReadPosRankSum=0.528 GT:AD:GQ:PL 0/1:308,53:42:42,0,8418

Unified Genotyper:

chr2 212576990 . T C 555.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.991;DP=1000;Dels=0.00;FS=0.000;HaplotypeScore=22.6027;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.073;QD=0.56;ReadPosRankSum=0.164 GT:AD:DP:GQ:PL 0/1:860,130:3743:99:584,0,32767

A screenshot from IGV showing some of the (few) BAM records in the area of these calls is attached.

Many of these false variant calls can be removed through hard filtering, but I was a little surprised to see them called in the first place, because our raw results had been quite good with lower-coverage data. Is this to be expected with high coverage data?

I am also wondering if there is a hard cap after which downsampling occurs. I set -dcov 10000 for all our samples, but the depth reported in the DP tag seemed max out at 1000. Is there another parameter I need to change? In some areas, due to the very small regions being sequenced, I have coverage in excess of 15,000 reads, so downsampling to 1000 might skew the results.

Any guidance is appreciated! Cheers, Natascha

Aloha,

I am calling SNPs on an organism without a reference genome or database of known polymorphisms, so I'm trying to follow the advice posted here (and in the BaseRecalibrator documentation).

I've successfully called SNPs on the un-recalibrated .bam file, then used those SNPs to recalibrate, then called SNPs on the recalibrated .bam file. As expected, I got significantly fewer (and presumably more accurate) results.

I then used the new, reduced set of SNPs to recalibrate again. When I attempted to call SNPs on this "Round Two" recalibrated .bam file, I got the following error:

I attempted to use PicardTools ValidateSamFile and CleanSam but received the same message (as an IllegalArgumentException). I would definitely consider myself a novice in the field. Any advice you can give will be greatly appreciated.

I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ?

I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.

Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.

Hi

For GATK: GenomeAnalysisTK-2.4-7-g5e89f01

It would appear that the issue with the HaplotypeCaller making incorrect Het calls when it should be Hom has turned up again (if it ever actually went away). Note this appears to be the same issue I reported last time: http://gatkforums.broadinstitute.org/discussion/1805/haplotype-caller-incorrectly-calling-blocks-of-variants-heterozygous but considering the time since that report, and that this is from a different sample set and different versions of GATK I thought it best to create a new post. If your prefer to merge them please do so.

So in this occasion we've been looking at a single animal (~16x) using the HaplotypeCaller & UnifiedGenotyper and once again we are finding that the HaplotypeCaller is making Heterozygous calls where there is no support for them in the BAM.

Example Regions, BosTau6 reference: chr18:55,432,023-55,432,220 chr18:55,350,724-55,351,079

Attached you will find images showing this: For the first position & image chr18:55,432,023-55,432,220 the tracks in IGV are: HaplotypeCaller VCF, Population UnifiedGenotyper VCF, Sample BAM, Sample ReducedReads BAM. If we look at this call we have 9x depth, all reads with mapQ 60, Cigar 101M and BaseQ between 21 & 33, a good balance between forward and reverse and ALL 9 reads contain the Alternate allele. Which means the Site should have been called Alt/Alt not Alt/Ref, but for some reason even through there are no reference reads the HaplotypeCaller has called the site Ref/Alt. Note the UnifiedGenotyper correctly calls this site Alt/Alt.

For the second image, position chr18:55,350,724-55,351,079 the tracks in IGV are: Haplotype Caller VCF, UnifiedGenotyper VCF, UG Population VCF, Sample BAM (PCRdedup, IR, BQSR), Sample ReduceReads Bam In this example we have 4 Variants in a cluster (chr18:55,350,895-55,350,975) that the HaplotypeCaller has called as Het (Ref/Alt) when there is no support for this call in the Reads, secondly the UnifiedGenotyper has successfully called each site as Homozygous (Alt/Alt). The reads are a bit more complex at this site but in each case there are 11 reads all of which are Alt alleles and no Reference allele.

Note: HaplotypeCaller & UnifiedGenotyper were run on the full bams, not the ReducedRead bams.

I will upload the region of the BAM file and the VCF files as: Chr18-HC-Het-issues-ULG.tar.gz

Hello,

I found some strange entries for indels in my VCF file created by the Unified Genotyper. For example:

The first sample has genotype 0/1 with a good GQ value. However, according the allele depth field, there is no read supporting the deletion. When I look at the reads using the IGV, I find some reads supporting the deletion for the first sample (and even some for the second one).

Moreover, when I looked at the AD values for SNPs, I noticed the the sum of all AD values is much less than the coverage shown in the IGV. I filtered duplicated reads in the IGV.

Best greetings, Hans-Ulrich

It was a bit unclear what the BadMateFilter is doing in the documentation. Any information would be appreciated!

Hi,

how does the speed of the haplotypeCaller usually compare to that of the UnifiedGenotyper?

When I try to use it, it seems to be about 90x slower.

these are the command lines I use:

java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T UnifiedGenotyper -dcov 1000 -I file.bam -o file.vcf -L myTarget

java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T HaplotypeCaller -dcov 1000 -I file.bam -o file.vcf -L myTarget


thanks!

Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.

Here are deletions:

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0

I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools.

PS I used this command line of V 2.0-38:

java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf

Thank you.

Hideo

HI GATK - I am still using the GenomeAnalysisTK-1.6-5-g557da77 version for UnifiedGenotyper. This is probably a silly question, but is there a way to set a parameter for minimum mapping quality score for reads, in deciding whether to evaluate them for variant detection. I know there is a --min_base_quality_score parameter, but I don't see on for mapping quality. http://www.broadinstitute.org/gsa/gatkdocs/release/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html

We have a SGE environment where we run our GATK jobs. Lately we have had a few cases of UnifiedGenotyper getting hung up while trying to read our previously created .fai and .dict reference files. On one occasion we had 6 jobs on different nodes all hung for about 5 days before finally giving the FSLockWithShared - WARNING messages and then continuing on just fine. We haven't been able to find out why this is happening. Do you know of anything that would cause UnifiedGenotyper to try and read those files before ultimately throwing the warning messages 5 days later?

We looked into the possibility of other users/jobs having an exclusive lock on those reference files, but we had other UnifiedGenotyper jobs run fine during that same time period without these problems.

We also haven't been able to find any hardware problems yet. The jobs with this problem were on multiple nodes, so I thought I would see if you have any ideas.

Using strace we were able to find that all 6 of the UnifiedGenotyper processes were hung in the same spot:

$strace -p 22924 Process 22924 attached - interrupt to quit futex(0x41c319d0, FUTEX_WAIT, 22925, NULL <unfinished ...>$ strace -p 22925
Process 22925 attached - interrupt to quit
write(1, "WARN  12:39:09,312 FSLockWithSha"..., 158) = 158
write(1, "INFO  12:39:09,364 ReferenceData"..., 116) = 116
write(1, "INFO  12:39:09,364 ReferenceData"..., 89) = 89
open("/reference/sequence/human/ncbi/37.1/allchr.fa.fai", O_RDONLY) = 16
fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0
fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0
fcntl(16, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=0, len=9223372036854775807} <unfinished ...>


Here is the log file for one of the jobs:

Thu Jul 19 09:42:53 CDT 2012
INFO  09:42:58,047 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:42:58,048 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.6-7-g2be5704, Compiled 2012/05/25 16:27:30
INFO  09:42:58,048 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO  09:42:58,049 HelpFormatter - Program Args: -R /reference/sequence/human/ncbi/37.1/allchr.fa -et NO_ET -K /p
rojects/bsi/bictools/apps/alignment/GenomeAnalysisTK/1.6-7-g2be5704//key -T UnifiedGenotyper --output
_mode EMIT_ALL_SITES --min_base_quality_score 20 -nt 4 --max_alternate_alleles 5 -glm BOTH -L /data2/bsi/target.bed -I /data2/bsi/H-001.chr22-sorted.bam --out /data2/bsi/variants.chr22.raw.all.vcf
INFO  09:42:58,050 HelpFormatter - Date/Time: 2012/07/19 09:42:58
INFO  09:42:58,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:42:58,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:42:58,166 GenomeAnalysisEngine - Strictness is SILENT
WARN  12:39:09,312 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.dict: Protocol family not supported.
INFO  12:39:09,364 ReferenceDataSource - Unable to create a lock on dictionary file: Protocol family not supported
INFO  12:39:09,364 ReferenceDataSource - Treating existing dictionary file as complete.
WARN  12:39:12,527 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.fa.fai: Protocol family not supported.
INFO  12:39:12,528 ReferenceDataSource - Unable to create a lock on index file: Protocol family not supported
INFO  12:39:12,528 ReferenceDataSource - Treating existing index file as complete.
INFO  12:39:12,663 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:39:12,954 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.29
INFO  12:39:13,108 MicroScheduler - Running the GATK in parallel mode with 4 concurrent threads
WARN  12:39:13,864 UnifiedGenotyper - WARNING: note that the EMIT_ALL_SITES option is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode
INFO  12:39:14,361 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:39:14,568 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.21
INFO  12:39:14,569 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:39:14,627 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO  12:39:14,628 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:39:14,686 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO  12:39:15,123 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
...
Tue Jul 24 12:47:32 CDT 2012