For a complete, detailed argument reference, refer to the technical documentation page.
The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see [Preparing the essential GATK input files: the reference genome] for more information on preparing FASTA reference sequences for use with the GATK.


The Unified Genotyper now makes multi-allelic variant calls!
The Unified Genotyper calls SNPs via a two-stage inference, first from the reads to the sequenced fragments, and then from these inferred fragments to the chromosomal sequence of the organism. This two-stage system properly handles the correlation of errors between read pairs when the sequenced fragments contains errors itself. See Fragment-based calling PDF for more details and analysis.
The allele frequency calculation model used by the Unified Genotyper computes a mathematically precise estimation of the allele frequency at a site given the read data. The mathematical derivation is similar to the one used by Samtools' mpileup tool. Heng Li has graciously allowed us to post the mathematical calculations backing the EXACT model here. Note that the calculations in the provided document assume just a single alternate allele for simplicity, whereas the Unified Genotyper has been extended to handle genotyping multi-allelic events. A slide showing the mathematical details for multi-allelic calling is available here.
While the indel calling capabilities of the Unified Genotyper are still under active development, they are now in a stable state and are supported for use by external users. Please note that, as with SNPs, the Unified Genotyper is fairly aggressive in making a call and, consequently, the false positive rate will be high in the raw call set. We expect users to properly filter these results as per our best practices (which will be changing continually).
Note also that it is critical for the correct operation of the indel calling that the BAM file to be called is previously indel-realigned (see the IndelRealigner section on details). We strongly recommend doing joint Smith-Waterman alignment and not only per-lane or per-sample alignment at known sites. This is important because the caller is only empowered to genotype indels which are already present in reads.
Finally, while many of the parameters are common between indel and SNP calling, some parameters have different meaning or operate differently. For example, --min_base_quality_score has a fixed, well defined operation for SNPs (bases at a particular location with base quality lower than this threshold are ignored). However, indel calling is by definition delocalized and haplotype-based, so this parameter does not make sense. Instead, the indel caller will clip both ends of the reads if their quality is below a certain threshold (Q20), up to the point where there is a base in the read exceeding this threshold.
Note that the Unified Genotyper will not call indels in 454 data!
It's common to want to operate only over a part of the genome and to output SNP calls to standard output, rather than a file. The -L option lets you specify the region to process. If you set -o to /dev/stdout (or leave it out completely), output will be sent to the standard output of the console.
You can turn off logging completely by setting -l OFF so that the GATK operates in silent mode.
By default the Unified Genotyper downsamples each sample's coverage to no more than 250x (so there will be at most 250 * number_of_samples reads at a site). Unless there is a good reason for wanting to change this value, we suggest using this default value especially for exome processing; allowing too much coverage will require a lot more memory to run. When running on projects with many samples at low coverage (e.g. 1000 Genomes with 4x coverage per sample) we usually lower this value to about 10 times the average coverage: -dcov 40.
The Unified Genotyper does not use reads with a mapping quality of 255 ("unknown quality" according to the SAM specification). This filtering is enforced because the genotyper caps a base's quality by the mapping quality of its read (since the probability of the base's being correct depends on both qualities). We rely on sensible values for the mapping quality and therefore using reads with a 255 mapping quality is dangerous.
That being said, if you are working with a data type where alignment quality cannot be determined, there is a (completely unsupported) workaround: the ReassignMappingQuality filter enables you to reassign the mapping quality of all reads on the fly. For example, adding -rf ReassignMappingQuality -DMQ 60 to your command-line would change all mapping qualities in your bam to 60.
Or, if you are working with data from a program like TopHat which uses MAPQ 255 to convey meaningful information, you can use the ReassignOneMappingQuality filter (new in 2.4) to assign a different MAPQ value to those reads so they won't be ignored by GATK tools. For example, adding -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 would change the mapping qualities of reads with MAPQ 255 in your bam to MAPQ 60.
At the end of a GATK UG run, you should see if you have -l INFO enabled a report that looks like:
INFO 00:23:29,795 UnifiedGenotyper - Visited bases
247249719
INFO 00:23:29,796 UnifiedGenotyper - Callable bases
219998386
INFO 00:23:29,796 UnifiedGenotyper - Confidently called bases
219936125
INFO 00:23:29,796 UnifiedGenotyper - % callable bases of all loci
88.978
INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci
88.953
INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci
88.953
INFO 00:23:29,797 UnifiedGenotyper - Actual calls made
303126
This is what these lines mean:
This the total number of reference bases that were visited.
Visited bases minus reference Ns and places with no coverage, which we never try to call.
Callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if T is the min confidence, this is the count of bases where QUAL > T for the site being reference in all samples and/or QUAL > T for the site being non-reference in at least one sample.
Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the percentage of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample.
Note also that confidently called bases will rise with additional data per sample, so if you don't dedup your reads, include lots of poorly mapped reads, the numbers will increase. Of course, just because you confidently call the site doesn't mean that the data processing resulted in high-quality output, just that we had sufficient statistical evident based on your input data to called ref / non-ref.
The GATK can be used to call the sex (X and Y) chromosomes, without explicit knowledge of the gender of the samples. In an ideal world, with perfect upfront data processing, we would get perfect genotypes on the sex chromosomes without knowledge of who is diploid on X and has no Y, and who is hemizygous on both. However, misalignment and mismapping contributes especially to these chromosomes, as their reference sequence is clearly of lower quality than the autosomal regions of the genome. Nevertheless, it is possible to get reasonably good SNP calls, even with simple data processing and basic filtering. Results with proper, full data processing as per the best practices in the GATK should lead to very good calls. You can view a presentation "The GATK Unified Genotyper on chrX and chrY" in the GSA Public Drop Box.
Our general approach to calling on X and Y is to treat them just as we do the autosomes and then applying a gender-aware tools to correct the genotypes afterwards. It makes sense to filter out sites across all samples (outside PAR) that appear as confidently het in males, as well as sites on Y that appear confidently non-reference in females. Finally, it's possible to simply truncate the genotype likelihoods for males and females as appropriate from their diploid likelihoods -- AA, AB, and BB -- to their haploid equivalents -- AA and BB -- and adjust the genotype calls to reflect only these two options. We applied this approach in 1000G, but we only did it as the data went into imputation, so there's no simple tool to do this, unfortunately. The GATK team is quite interested in a general sex correction tool (analogous to the PhaseByTransmission tool for trios), so please do contact us if you are interested in contributing such a tool to the GATK codebase.
Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs), so before you post this issue in our support forum you will first need to do a little investigation on your own.
To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.
Here is a checklist of questions you should ask yourself:
The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.
Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.
Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.
A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!
Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.
SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site? If so, you are probably seeing false positives.
As reported here:
If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.
Thank you for your patience while we work to fix this issue.
Hello, I want to know how important it is to have the -L target-intervals.intervals option in UnifiedGenotyper, and if it is recommended in VariantCallRecalibrator too.
I have ran the Unified Genotyper tool with 1 input file at a time or the 2 files I want to compare at once. My command-lines are the following:
java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./1.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./1-gatk.vcf --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-1.intervals
java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./2.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./2-gatk.vcf --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-2.intervals
java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./1.bam -I ./2.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./1vs2-gatk.vcf --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-1.intervals -L ./intervals-2.intervals
Thanks, G.
Hello,
I called variants in chunks (100000 bp chunks) on chromosome 20 on 450 reduced BAMS using -T UnifiedGenotyper, which resulted in the generation of 631 VCFs.
Now I want to combine these 631 VCFs using -T CombineVariants, however there seems to be a problem with some of the VCFs and all I get is an error message.
I can combine the 5 first VCFs that were generated, but when I add a sixth one (chr20:400000_500000) I get the following error:
##### ERROR MESSAGE: For input string: "20"
If I exclude that file, I can combine about 30 VCFs until I encounter a similar error message for a different VCF file:
##### ERROR MESSAGE: For input string: "120"
I've looked through both of the "problematic" VCFs and can't see how they differ from the ones that I can combine.
Any idea of what may be going wrong and how I can solve this? At the moment I can only identify problematic VCF by trying to combine them using CombineVariants since I don't know what it is about these particular VCF files that is causing the problem.
Thanks for your help, Tota
We have a 15bp deletion in a sample that was run on an Illumina sequencer twice:
1) on GAII, 76bp paired-end reads, 10-sample pool;
2) HiSeq2000, 100bp paired-end reads, 24-sample pool.
Both times aligned with bwa and analysed with recommended GATK protocol using the latest available version. The deletion was called both times by GATK 1.0.5083 (76bp batch) and GATK 1.4 (100bp batch). However the deletion is not detected by GATK 2.0 or 2.1-8, either by UnifiedGenotyper or HaplotypeCaller. The newer GATK versions do generate fewer calls and substantially cut out false positives, but on the other hand the sensitivity seems to have dropped too much! Here is the variant called by the GATK 1.4, 100bp batch:
7 44190538 . CCCTCCACCCGGCCCA C 8793.94 PASS AC=1;AF=0.021;AN=48;BaseQRankSum=14.575;DP=7878;FS=10.160;HRun=0;HaplotypeScore=3737.9770;InbreedingCoeff=-0.0213;MQ=57.74;MQ0=0;MQRankSum=-10.741;QD=29.71;ReadPosRankSum=-0.056 GT:AD:DP:GQ:PL 0/1:216,80:296:99:8794,0,39333
And here it is called by GATK 1.0.5083, 76bp batch:
7 44190538 . CCCTCCACCCGGCCCA C 7765.65 PASS AC=1;AF=0.050;AN=20;DP=1907;Dels=0.02;HRun=1;HaplotypeScore=40.0818;MQ=59.30;MQ0=0;QD=49.78;SB=-1238.06;sumGLbyD=49.78 GT:DP:GQ:PL 0/1:118:99:7766,0,14618
Where is this loss in sensitivity likely to be occurring? Can we adjust any of the default GATK settings to be able to detect these slightly larger indels? Reads with the 15bp deletion inherently have slightly lower mapping qualities, but I would have thought this is taken into account for indel calling. (I should add that we have a sample with a 27bp homozygous deletion and this always gets called.)
I'd be happy to send the relevant part of the bam file if that would help.
Thanks, Hana
Hi to all,
Jaime
Hi,
I ran print reads command with -baq option as recalculate. But while running UnifiedGenotyper, i am getting following error:
"BAQ tag error: the BAQ value is larger than the base quality"
I am running UnifiedGenotyper with -baq option as CALCULATE_AS_NECESSARY.
Can you please let me know the reason for above error.
Regards
Gaurav
The best practice guide states to call variants across all samples simultaneously. Besides the ease of working with one multi-sample VCF, what advantages are there to calling the variants at the same time? Does GATK leverage information across all samples when making calls? If so, what assumptions is the UnifiedGenotyper making about the relationship of these samples to each other, and what are the effects on the variant calls?
thanks, Justin
I am running into a bug with an older version of GATK. Calls for 1bp insertions sometimes have bad genotypes, e.g. I see 11 reference reads and 0 insertions and the genotype is 1/1
GT:AD:DP:DP4:GQ:PL
1/1:11,0:10:7,4,0,0:18.03:160,18,0
In one recent project, 72 out of 4069 1bp insertions are of this pattern.
When I look in IGV at the realigned BAM's, I see that the allele counts are OK (did that on multiple variants and samples).. but that the genotype call is not. (note the likelyhoods match the genotype call).
Can someone run the following command on a recent multi-sample vcf? to see if the bug still exists?
The command prints 1bp insertions, limits to indels not near runs of single nucleotides (may be false positives), and looks for any homozygous calls.. finally the perl looks for the pattern above
gawk -F "\t" '$0 ~ /^#.*/{next}(length($5)==(length($4)+1)){print $0}' group_1.variants.filter.vcf | grep 'HRun=0' | gawk -F "\t" '{for(i=10;i<=NF;i++) {if($i ~ /1\/1/) {print $i}}}' | perl -ane 'if($_ =~ /1\/1\:([0-9])+\,0\:/){if($1>5) {print $_;}}'
I get something like
1/1:16,0:10:8,8,0,0:23.91:274,24,0
1/1:28,0:24:16,12,0,0:72.24:1033,72,0
1/1:17,0:14:6,11,0,0:42.14:598,42,0
1/1:18,0:18:9,9,0,0:54.15:776,54,0
1/1:19,0:16:10,9,0,0:48.16:713,48,0
1/1:9,0:2:7,2,0,0:6.02:100,6,0
1/1:18,0:18:8,10,0,0:38.82:133,39,0
1/1:28,0:28:12,16,0,0:39.54:117,40,0
...
Sorry, this is part of a large workflow and updating to a more GATK version is a large endeavour that I only want to undertake if the bug is fixed.
These were the main parameters of the calling.
genotype_likelihoods_model=BOTH
p_nonref_model=EXACT
genotyping_mode=DISCOVERY
output_mode=EMIT_VARIANTS_ONLY
Here is the header information
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[/..chr5.cleaned.bam] read_buffer_size=null phone_home=NO_ET gatk_key=/projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/1.7/Hossain.Asif_mayo.edu.key read_filter=[] intervals=[chr5.target.bed] excludeIntervals=null i
nterval_set_rule=UNION interval_merging=ALL reference_sequence=/data2/bsi/reference/sequence/human/ncbi/37.1/allchr.fa nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null d
ownsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=-1 defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=
4 num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false logging_leve
l=INFO log_to_file=null help=false genotype_likelihoods_model=BOTH p_nonref_model=EXACT heterozygosity=0.0010 pcr_error_rate=1.0E-4 genotyping_mode=DISCOVERY output_mode=EMIT_VARIANTS_ONLY standard_min_confide
nce_threshold_for_calling=30.0 standard_min_confidence_threshold_for_emitting=30.0 noSLOD=false annotateNDA=false alleles=(RodBinding name= source=UNBOUND) min_base_quality_score=17 max_deletion_fraction=0.05
max_alternate_alleles=5 min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indel_heterozygosity=1.25E-4 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 noBandedInde
l=false indelDebug=false ignoreSNPAlleles=false dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWrit
erStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_mismatching_base_and_quals=false"
##VariantAnnotator="analysis_type=VariantAnnotator input_file=[chr5.cleaned.bam] read_buffer_size=null phone_home=NO_ET gatk_key=/projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/1.7/Hossain.Asif_mayo.edu.key read_filter=[] intervals=[variants.chr5.raw.vcf] excludeIntervals=
null interval_set_rule=UNION interval_merging=ALL reference_sequence=/data2/bsi/reference/sequence/human/ncbi/37.1/allchr.fa nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=
null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null quantize_quals=-1 defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_t
hreads=1 num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false loggi
ng_level=INFO log_to_file=null help=false variant=(RodBinding name=variant source=variants.chr5.raw.vcf) snpEffFile=(RodBinding name= source=UNBOUND) dbsnp=(RodBinding name=dbsnp source=/data2/bsi/reference/annotation/dbSNP/hg19/dbsnp_135.hg19.
vcf.gz) comp=[] resource=[] out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterSt
ub annotation=[QualByDepth, MappingQualityRankSumTest, ReadPosRankSumTest, HaplotypeScore, DepthOfCoverage, MappingQualityZero, DepthPerAlleleBySample, RMSMappingQuality, FisherStrand, ForwardReverseAlleleCoun
ts] excludeAnnotation=[] group=[] expression=[] useAllAnnotations=false list=false alwaysAppendDbsnpId=false MendelViolationGenotypeQualityThreshold=0.0 requireStrictAlleleMatch=false filter_mismatching_base_a
nd_quals=false"
Hi,
For SNP calling, the documentation suggested to pool samples together to call Unified genotyper. My questions are:
I have samples for the same study done with exome-seq using Illumina platforms but some using GAII and some using HiSeq 2000 due to historical reasons. My question is: are they OK to be pooled together to call SNPs with Unified genotyper? How about the new HaplotyperCaller? Any concerns on that?
what about data from the same platform but using different exome-capture kits? My take-on for this is probably just the matter of where to look at the variants.
what about data from different platforms? e.g., some from Illumina, some from Ion torrent etc. Any concerns except for the needs of a common interval files for shared regions etc? Anybody tried before? Or just call SNPs for data from the same platform separately?
Thanks a lot for your help! Happy Thanksgiving!
Best
Mike
Hi,
I was using GATK UnifiedGenotyper for SNPs/indels. My command is as follows:
java -Xmx16g -jar $gatkPath/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R $humanRefSequence \ -I "$sampleID".recal.bam \ --dbsnp $humanDbsnpFile \ -stand_call_conf 30.0 \ -stand_emit_conf 10.0 \ -dcov 250 \ -l INFO -log "$sampleID".GATKvariants.log \ -o "$sampleID".GATKvariants.raw.vcf \ --output_mode EMIT_ALL_SITES
When I am selecting Indels from the vcf file, it's producing an empty vcf file (which is working fine with SNPs). My command for selecting Indels are as follows:
java -Xmx10g -jar $gatkPath/GenomeAnalysisTK.jar \ -T SelectVariants \ -R $humanRefSequence \ --variant "$sampleID".GATKvariants.raw.vcf \ -o "$sampleID".GATKindels.raw.vcf \ -selectType INDEL \ -log "$sampleID".SelectIndels.log
Do I need to use "-glm BOTH" in UnifiedGenotyper to output both SNPs and Indels? As from the documentation, I understand that the default value of -glm is set to SNP.
Many thanks in advance.
Hi Team,
I'm sequencing the genome of an organism which is a cross between the reference line (with no SNPs) and an individual from an outbred population (with many SNPs). Therefore all of the SNPs in my target organism will be heterozygous. So far I have sequenced three individuals which are crosses and one individual from our reference line.
I understand that the UnifiedGenotyper uses population genetic principles to ascertain genotype but I can't find more information about how this is performed. Thus, I am primarily worried that heterozygotes with strongly asymmetric allele counts in the reads will be called as homozygotes in order to fit in with, say Hard-Wienberg equilibrium.
Is there any chance you could enlighten me on this ? (or direct me to more detailed information on UG mechanism and settings).
Just to let you know the background, my study organism is Drosophila melanogaster. The whole genome of 164Mb is paired-end sequenced on an Illumina. I have so far sequenced one individual from our in-house reference line, and three individuals which are crosses of the reference line with a diverse, out-bred population. Average coverage is 30X. The 'crosses' are hemiclones in which recombination between the parental chromosomes is suppressed. I plan on sequencing 200 hemiclone individuals in which one haplotype will be shared between them (the reference gene) and the other haplotype will be diverse and unique to each line. As expected, I have identified a limited number of mutations in our in-house laboratory reference line compared to that of the assembly.
Any advice on how to best call genotypes in this unorthodox sample would be most appreciated.
Hi,
I have 15 affected samples. 2 are whole exome and 13 are whole genome. They have already been realigned on a single-sample level and had BQSR performed. I am contemplating running UnifiedGenotyper on all 15 samples together because we would like to compare the calls across the samples (especially in the coding regions). I am aware that there would be a large number of variant calls in the whole genome samples that would have little to no coverage in the exome samples. I haven't been able to find any posts that say you should or shouldn't run whole genome and exome samples through UnifiedGenotyper together. Are there any reasons why this should be discouraged?
Also, assuming I do perform multi-sample calling across all 15 samples, would it be ok to run that multi-sample VCF file through VQSR?
Thanks! Jared
Hi,
I am interested in the following scenario: 1. sequence tumor and control samples to separate fastq files. 2. Execute read alignment for the 2 samples separately. 3. Execute UnifiedGenotyper with the 2 BAM files (control and tumor) with the following command:
java -jar GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R ReferenceGenome.fasta \ -I contro.bam -I tumor.bam \ -D /usr/sap/GenomicsPlatform/dbSNP/dbsnp_137.b37.vcf \ -o output.vcf
Is this a proper usage for UnifiedGenotyper? How does UnifiedGenotyper refer to the 2 BAM files it recieves as input? What do I expect to see in the output.vcf file? are they somatic variants which describe the variation between control and tumor samples?
Thank you, Stas
I've noticed a change in the log files that now shows a period of "starting" before actually genotyping. This can be hours...or minutes... I looked in previous logs and saw that there aren't such line in the log, but the clock jumps a similar amount of time. I was wondering if this is a known phenomenon and what it the GATK doing during that time. Is there anything we can do to reduce that waiting period?
Here are examples:
old (calling with 4500 samples, glm BOTH):
INFO 12:21:37,515 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 1157.62 INFO 12:26:52,780 RMDTrackBuilder - Loading Tribble index from disk for file /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.dbsnp.vcf WARN 12:26:54,098 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A INFO 12:26:54,457 GenomeAnalysisEngine - Processing 43138 bp from intervals INFO 12:26:54,473 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 12:26:54,473 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 21:37:08,951 ProgressMeter - 17:26944207 0.00e+00 9.2 h 54587.4 w 0.0% Infinity w Infinity w INFO 21:40:40,002 ProgressMeter - 17:26945957 1.82e+02 9.2 h 301.8 w 0.9% 6.2 w 6.1 w INFO 21:41:53,697 ProgressMeter - 17:26946820 6.49e+02 9.2 h 84.8 w 1.5% 3.6 w 3.5 w
and here newer (calling 1500 sample, glm SNP):
INFO 18:42:42,025 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 24.55 INFO 18:42:47,648 RMDTrackBuilder - Loading Tribble index from disk for file /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.dbsnp.vcf WARN 18:42:47,972 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A INFO 18:42:48,192 GenomeAnalysisEngine - Processing 43138 bp from intervals INFO 18:42:48,206 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 18:42:48,206 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 18:43:18,301 ProgressMeter - starting 0.00e+00 30.1 s 49.8 w 100.0% 30.1 s 0.0 s INFO 18:43:48,310 ProgressMeter - starting 0.00e+00 60.1 s 99.4 w 100.0% 60.1 s 0.0 s INFO 18:44:18,344 ProgressMeter - starting 0.00e+00 90.1 s 149.0 w 100.0% 90.1 s 0.0 s INFO 18:44:48,353 ProgressMeter - starting 0.00e+00 2.0 m 198.7 w 100.0% 2.0 m 0.0 s INFO 18:45:18,363 ProgressMeter - starting 0.00e+00 2.5 m 248.3 w 100.0% 2.5 m 0.0 s INFO 18:45:48,373 ProgressMeter - starting 0.00e+00 3.0 m 297.9 w 100.0% 3.0 m 0.0 s INFO 18:46:18,382 ProgressMeter - starting 0.00e+00 3.5 m 347.5 w 100.0% 3.5 m 0.0 s INFO 18:46:48,391 ProgressMeter - starting 0.00e+00 4.0 m 397.1 w 100.0% 4.0 m 0.0 s INFO 18:47:18,401 ProgressMeter - starting 0.00e+00 4.5 m 446.7 w 100.0% 4.5 m 0.0 s INFO 18:47:48,411 ProgressMeter - starting 0.00e+00 5.0 m 496.4 w 100.0% 5.0 m 0.0 s INFO 18:48:18,426 ProgressMeter - starting 0.00e+00 5.5 m 546.0 w 100.0% 5.5 m 0.0 s INFO 18:48:48,436 ProgressMeter - starting 0.00e+00 6.0 m 595.6 w 100.0% 6.0 m 0.0 s INFO 18:49:18,446 ProgressMeter - starting 0.00e+00 6.5 m 645.2 w 100.0% 6.5 m 0.0 s INFO 18:49:48,455 ProgressMeter - starting 0.00e+00 7.0 m 694.9 w 100.0% 7.0 m 0.0 s INFO 18:50:18,465 ProgressMeter - starting 0.00e+00 7.5 m 744.5 w 100.0% 7.5 m 0.0 s INFO 18:50:48,475 ProgressMeter - starting 0.00e+00 8.0 m 794.1 w 100.0% 8.0 m 0.0 s INFO 18:51:18,484 ProgressMeter - starting 0.00e+00 8.5 m 843.7 w 100.0% 8.5 m 0.0 s INFO 18:51:48,510 ProgressMeter - starting 0.00e+00 9.0 m 893.4 w 100.0% 9.0 m 0.0 s INFO 18:52:18,520 ProgressMeter - starting 0.00e+00 9.5 m 943.0 w 100.0% 9.5 m 0.0 s INFO 18:52:48,530 ProgressMeter - starting 0.00e+00 10.0 m 992.6 w 100.0% 10.0 m 0.0 s INFO 18:53:18,541 ProgressMeter - starting 0.00e+00 10.5 m 1042.2 w 100.0% 10.5 m 0.0 s INFO 18:53:48,552 ProgressMeter - starting 0.00e+00 11.0 m 1091.8 w 100.0% 11.0 m 0.0 s INFO 18:54:18,561 ProgressMeter - starting 0.00e+00 11.5 m 1141.5 w 100.0% 11.5 m 0.0 s INFO 18:54:48,570 ProgressMeter - starting 0.00e+00 12.0 m 1191.1 w 100.0% 12.0 m 0.0 s INFO 18:55:18,579 ProgressMeter - starting 0.00e+00 12.5 m 1240.7 w 100.0% 12.5 m 0.0 s INFO 18:55:48,590 ProgressMeter - 18:12725582 1.00e+01 13.0 m 129.0 w 0.5% 45.4 h 45.2 h
I've also noticed that output seems to be quite wrong about how much work it has left to do...perhaps these two things are related?
I would be grateful for any information.
I have a question regarding the bayesian methods employed by the UnifiedGenotyper. I recently upgraded to v2.4-9-g532efad after running into a bug in an old version.
The context is sixty whole exome samples (human), and my pipeline performs mark-duplicate, then bq recalibration, then indel realigning, then finally UnifiedGenotyper. Each sample has some number of variants called (combined VCF output of SNP and INDEL for about 200K lines per sample), without reference calls. Later when looking at individual variants we need to know which samples were homozygous reference versus no-coverage, which is information not contained in my initial VCFs. To get the coverage at each variant I'm running a new big UnifiedGenotyper at the alleles specified by a vcf-merge of each file. This alleles file is about 4 million entries, where I have asked UG to call SNPs and reference bases too.
Now we have more complete data, and some strange things are found. Here's an excerpt from the ouput:
1 17538 . C A 710.37 .
AC=22;AF=0.183;AN=120;BaseQRankSum=-5.993;DP=317;Dels=0.00;FS=4.527;HaplotypeScore=0.2483;InbreedingCoeff=-0.1553;MLEAC=23;MLEAF=0.192;MQ=34.98;MQ0=65;MQRankSum=-0.773;QD=5.46;ReadPosRankSum=-0.115 GT:AD:DP:GQ:PL
0/1:5,1:6:27:27,0,96 0/0:5,1:6:12:0,12,156 0/1:5,1:6:26:26,0,113
The concern is that the same five reads reference, one read alternate is sometimes called Hom and sometimes Het.
At another site I've got
0/1:6,1:7:28:28,0,79 and 0/0:4,1:5:12:0,12,147 which shows a 6/1 being called hom and a 4/1 being called het!
Apologies for my lack of a well-posed question, I'm trying to get my head wrapped around this so that the next time I tell my lab a variant is discovered, we have more confidence.
Hi the GATK team;
I use the UnifiedGenotyper the following way:
java -jar GenomeAnalysisTK-2.1-13-g1706365/GenomeAnalysisTK.jar \
-R /human_g1k_v37.fasta \
-T UnifiedGenotyper \
-glm BOTH \
-S SILENT \
-L ../align/capture.bed \
-I myl.bam \
--dbsnp broadinstitute.org/bundle/1.5/b37/dbsnp_135.b37.vcf.gz \
-o output.vcf
When I look at the generated VCF , the variation 18:55997929 (CTTCT/C) is said to be rs4149608
18 55997929 rs4149608 CTTCT C (...)
but in the dbsnp_135.b37.vcf.gz, you can see that the right rs## should be rs144384654
$ gunzip -c broadinstitute.org/bundle/1.5/b37/dbsnp_135.b37.vcf.gz |grep -E -w '(rs4149608|rs144384654)' 18 55997929 rs4149608 CT C,CTTCT (...) 18 55997929 rs144384654 CTTCT C (...)
does UnifiedGenotyper uses the first rs## it finds at a given position ? Or should I use another method/tool to get the 'right' rs## ?
Thank you,
Pierre
Dear GATK team,
when I use the UnifiedGenotyper to call variants in chrX and/or chrY with ploidy=2, does the program call the variants and genotypes in the pseudoautosomal regions PAR1 and PAR2 correctly? More specifically, does the UG acknowledge that these regions are homologous on the X and Y chromosome, even though they differ in chromosomal position? If the UG does not take care of this by default, is there a way I can handle this? I read point 6 of this page, but I could not find the answer there: http://www.broadinstitute.org/gatk/guide/article?id=1237
Thanks alot Eva
When I use EMIT_ALL_CONFIDENT_SITES for SNPs, I get an expected very large list of genotypes regardless if the genotypes vary from the reference. When I use the same command line but I switch the model to Indels, I only get a VCF of variant sites. Is the EMIT_ALL_CONFIDENT_SITES option not compatible with Indel discovery?
I'm grateful for any clarification.
Hello, I have an alignment with 140 reference reads (Ref Base = C) and 10 variant Reads (Var Base = T) at locus: Chr17:7578406. When I use the "EMIT_ALL_SITES" mode, the UnifiedGenotyper generates the following output:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT z
17 7578406 rs28934578 C A 0 LowQual AC=0;AF=0.00;AN=2;DB;DP=150;Dels=0.00;FS=0.000;HaplotypeScore=11.1188;MLEAC=0;MLEAF=0.00;MQ=38.26;MQ0=0 GT:AD:DP:GQ:PL 0/0:140,0:150:99:0,361,4503
My questions are:
1) In ALT column, why is the base "A" being shown? Is it a randomly selected base when no SNP is identified at that position?
2) The ID column shows dbSNP record rs28934578 which is a C>T mutation (which is what my data has). Why is the dbSNP records for C>T mutation in the output when no variant is identified at this position (or C>A variant is shown?). Does this imply that ID column shows ALL dbSNP records at that position rather than a dbSNP record of the identified variant?
3) Is there a document that details the VCF output when EMIT_ALL_VARIANTS is used so I could understand the output vcf?
Hi,
I would like to use UnifiedGenotyper to produce a vcf file where all sites I specify via BED file will have a count of reads carrying the reference allele and a count of reads carrying an alternate allele, regardless of whether a variant has been called in the site. I've tried to use EMIT_ALL_SITES, but this appears not to provide this count breakdown for sites that are not called as variable. I would appreciate any pointers. I am using GATK 2.2.4.
Thank you very much.
(There was another question about a similar symptom, but the answer does not appear to apply to what I'm seeing.)
I get an empty VCF file that just contains the header lines. The input VCF file is 1.8Gb, and as far as I can tell the content is OK - it has MAPQ scores, the flags seem reasonable, etc. I've attached a copy of the console output, and the beginning of the input file in SAM format. Let me know if you have any suggestions. Thanks,
Ravi
Aloha,
I am calling SNPs on an organism without a reference genome or database of known polymorphisms, so I'm trying to follow the advice posted here (and in the BaseRecalibrator documentation).
I've successfully called SNPs on the un-recalibrated .bam file, then used those SNPs to recalibrate, then called SNPs on the recalibrated .bam file. As expected, I got significantly fewer (and presumably more accurate) results.
I then used the new, reduced set of SNPs to recalibrate again. When I attempted to call SNPs on this "Round Two" recalibrated .bam file, I got the following error:
SAM/BAM file recalibrated.2.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader!
I attempted to use PicardTools ValidateSamFile and CleanSam but received the same message (as an IllegalArgumentException). I would definitely consider myself a novice in the field. Any advice you can give will be greatly appreciated.
I am implementing a tool that uses reads to identify potential sites for cancer, but I need a reliable genotype call from non-cancerous tissue. Currently, I'm trying to use GATK to make those calls, filter out the bad, and then use the remaining calls when I look over the raw reads in the cancer tissue.
The problem I'm finding is that GATK doesn't provide a GQ for variants that are homozygous for the reference, and I don't know how to correlate the QUAL with the GQs from those homozygous non-ref. Is there a way that I'm overlooking to have GATK provide the GQ for homozygous reference non-variant sites?
I ran the same sample through a pipeline using GATK twice and received different variants. I am trying to understand the reason behind this. My samples are from a MiSeq/capture kit run and downsampling could be one reason (given in one scenario that variant is called and in other it isn't) the variant is called at 32% when looked into the .bam files.
As I understand the UnifiedGenotyper downsamples my dataset randomly to 250, so I played around with -dcov parameter
But setting -dt to NONE could be computationally exhaustive for a big sample set. Is there an identifiable reason to why this is happening..?
Curious..!
I have been using GATK (v2.2) UnifiedGenotyper to generate VCFs. I did a multisample realignment around indels which generated a multisample BAM of size ~500Gb. After looking at some of the SNP calls I decided to try removing duplicates. I used MarkDuplicates with "REMOVE_DUPLICATES=true" and although only 10% of reads were duplicates, the BAM reduced to ~75Gb. This did not seem to affect the depth of reads at a site more than the expected ~10% but now the AD field in the genotype columns is missing. ie GT:AD:GQ 0/1:.:30 When I run UnifiedGenotyper with the old BAM prior to MarkDuplicates the AD field is present.
I am currently running the MarkDuplicates on each sample prior to realignment - because I think this makes the most sense, but isn't clear why this should matter,
Any ideas on what is happening here?
Hi guys,
I have Googled my problem, with no luck, so I am asking you directly.
I am currently testing an established pipeline on BAMs from a new source, so I advance step by step, and the last step, the calling with UG, seems to have trouble.
My BAM endured, in order, Picard AddOrReplaceReadGroup, Picard MarkDup, GATK RealignerTargetCreator, GATK IndelRealigner, Picard FixMateInformation, GATK BaseRecalibrator, GATK PrintReads.
Arrived at UG, this is my (stuck) output (I removed file names because of privacy):
INFO 14:54:29,692 ArgumentTypeDescriptor - Dynamically determined type of /scratch/appli57_local_duplicates/reference/exome_target_intervals.bed to be BED
INFO 14:54:29,748 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:54:29,748 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-11-g13c0244, Compiled 2012/09/29 06:03:05
INFO 14:54:29,749 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:54:29,749 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 14:54:29,750 HelpFormatter - Program Args: -T UnifiedGenotyper -nt 6 -R /scratch/appli57_local_duplicates/reference/Homo_sapiens_assembly19.fasta -I /scratch/user/FILE.marked.realigned.fixed.recal.bam --dbsnp /scratch/appli57_local_duplicates/dbsnp/dbsnp_132.b37.vcf -L /scratch/appli57_local_duplicates/reference/exome_target_intervals.bed --metrics_file /scratch/user/FILE.snps.metrics -o /scratch/user/FILE.vcf
INFO 14:54:29,750 HelpFormatter - Date/Time: 2013/03/20 14:54:29
INFO 14:54:29,750 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:54:29,751 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:54:29,783 ArgumentTypeDescriptor - Dynamically determined type of /scratch/appli57_local_duplicates/dbsnp/dbsnp_132.b37.vcf to be VCF
INFO 14:54:29,799 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:54:29,906 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:54:29,943 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO 14:54:29,959 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/appli57_local_duplicates/dbsnp/dbsnp_132.b37.vcf
WARN 14:54:30,190 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A -- descriptions disagree; header has 'Allele Frequency' but standard is 'Allele Frequency, for each ALT allele, in the same order as listed'
INFO 14:54:32,484 MicroScheduler - Running the GATK in parallel mode with 6 concurrent threads
And it does not move from there. In my destination folder, a bamschedule.*.tmp file appears every 5 minutes or so, and in top, the program seems to be running:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29468 valleem 19 0 20.6g 3.9g 10m S 111.5 4.2 28:45.46 java
Can you help me?
hi,
can anyone help to advice if splitting the bam files Versus splitting the -L region, which one could speed up faster? asssume that i have 500 bam files, will splitting the bam files to 22 different chromosomes will increase the speed further as compare to splitting the -L region (intervals)? Can anyone help to advice or probably had experienced it before. thank you.
Dear GATK authors,
I have two little questions about the genotyping strategy for using UnifiedGenotyper.
1) Imaging I have a list of candidate sites in hand, then I just want to use UG to genotype my individuals at these sites. Considering the sequencing coverage for each individual is low or median, will there be any difference between genotyping them on individual level and on a cohort of population level?
2) I have a list of candidate sites, including SNPs and complex indels, which were discovered by HaplotypeCaller, Then can I use UG to genotype these sites across individuals to get the correct genotypes? I know HC used a local denovo assembly to discover the variants, but I'm not sure after the variants have been discovered, whether there is any difference in genotyping result between HC and UG.
Best, SK
HI
I am using the following set of commands on GATK2.1.13 to generate a VCF file
echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T RealignerTargetCreator -o my.intervals -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Realignment Done at `date`"
echo "Starting IndelRealigner at `date`"
echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T IndelRealigner -targetIntervals my.intervals -o myrealignedBam.bam -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Realignment done at `date`"
echo "Starting UnifiedGenotyper at `date`"
echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -l INFO -R human_g1k_v37.fasta -T UnifiedGenotyper -I myrealignedBam.bam -o mygatk_vcf.vcf --output_mode EMIT_ALL_SITES -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Gentoypxing complete at `date`"
When i do a 'mpileup' for B2_with_ReadGroup.ddup.sorted.bam , I get a devcent 10 MB VCF file. But on the last ste of the above pipeline, my " mygatk_vcf.vcf " is goinging into 81GBs !!
Do you know what is wrong ?
Hi,
I had a seemingly random RUNTIME ERROR previously (mentioned on this thread http://gatkforums.broadinstitute.org/discussion/1860/runtime-error-in-baserecalibrator-version-2-2-8-g99996f2), but I'm seeing this more and more for UnifiedGenotyper runs.
It seems that every 6-10 runs I get one or two that fail as below. Re-runinng seems to work correctly, but this is annoying to find that a 12hr job has died randomly. Please can this be investigated.
java.lang.NullPointerException at java.util.concurrent.locks.AbstractQueuedSynchronizer.hasQueuedPredecessors(AbstractQueuedSynchronizer.java:1453) at java.util.concurrent.locks.ReentrantLock$FairSync.tryAcquire(ReentrantLock.java:240) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1136) at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:229) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290) at java.util.concurrent.PriorityBlockingQueue.peek(PriorityBlockingQueue.java:286) at org.broadinstitute.sting.utils.nanoScheduler.Reducer.reduceNextValueInQueue(Reducer.java:89) at org.broadinstitute.sting.utils.nanoScheduler.Reducer.reduceAsMuchAsPossible(Reducer.java:120) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:510) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636)
My commandline is this: java -Xmx8g -jar GenomeAnalysisTKLite.jar -nct 4 -T UnifiedGenotyper --genotype_likelihoods_model BOTH --genotyping_mode DISCOVERY -R fastafile -I infile -o outfile --dbsnp dbsnp_137.b37.vcf -stand_call_conf 30 -stand_emit_conf 30
This usually coincides with errors is calling home code, although I don't know if it's cause or effect.
"INFO 16:41:33,533 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty INFO 16:41:33,534 HttpMethodDirector - Retrying request INFO 16:41:33,716 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty INFO 16:41:33,716 HttpMethodDirector - Retrying request INFO 16:41:33,897 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty INFO 16:41:33,897 HttpMethodDirector - Retrying request INFO 16:41:34,077 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty INFO 16:41:34,085 HttpMethodDirector - Retrying request INFO 16:41:34,267 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty INFO 16:41:34,267 HttpMethodDirector - Retrying request "
Dear GATK team,
I know that in the past GATK was not suitable for haploid genomes. I wanted to ask if this possibly changed since then - and whether it is possible to use GATK for haploid genomes.
Thanks a lot, Gilgi
Hi
I seem to have found a bit of an issue with the Haplotype caller. Looking at variants called with it I've come across a number of small blocks in the genome where the Haplotype caller has called every individual (50 individuals) either RA or RR, which seemed a bit odd considering the population.
Looking at the BAMs and VCFs from SAMtools and the Unified Genotyper these blocks of snps clearly contain all three states as I'd expect RR/RA/AA. Looking at the BAM the reads are of decent quality and have no nearby insertions or deletions to complicate things, and the variants have been called correctly by Samtools and UG.
Any idea what's causing this? Attached is an IGV image showing one of the regions in question, Top VCF is the Haplotype Caller (showing all calls as RA or RR, which is incorrect), Second is UG (showing a mix of RR/RA/AA which is correct). The First BAM shows one of the Animals HC is calling incorrectly as RA for the 5 SNPs shown, while the Second is an Animal that HC is calling RA correctly.
Note these incorrect calls from the HC also passed VQSR. I believe the version of GATK is one of the 2.1 releases.
Hi there,
I'm trying to understand the haplotype scoring algorithm in GATK 1.6.5. I fortunately got a printed page where I have a simple diagram that explains the algorithm, I can't find it anymore in the new web. The case is that the formula for calculating the haplotype score in this diagram has a variable that I'am missing what it is. This is the formula as it's written:
P(read | haplotype_j) = sum_bi (bi == hi ? ei : 1 - ei / 3) - sum_bi (ei)
I guess bi stands for base at position i at the current read and hi stands base at position i at haplotype_j, that makes sense for me. But, what is ei?? maybe I'm missing something... it looks like it should be a probability in the range (0, 1) for the haplotype score to make sense.
Thanks in advance! Pablo.
Hi,
I am adapting a two-year old pipeline, which includes UnifiedGenotyper to call variants. I would like to update this to the HaplotypeCaller. In the command for the UnifiedGenotyper there is an option --metrics_file. I cannot find reference to this in the GATK documentation - do you know what it is and whether I can use it with HaplotypeCaller as well?
Also, do I understand correctly that using HaplotypeCaller to call variants means you don't have to carry out the local realignment around indels step prior to this?
Thanks very much,
Kath
Hi
For GATK: GenomeAnalysisTK-2.4-7-g5e89f01
It would appear that the issue with the HaplotypeCaller making incorrect Het calls when it should be Hom has turned up again (if it ever actually went away). Note this appears to be the same issue I reported last time: http://gatkforums.broadinstitute.org/discussion/1805/haplotype-caller-incorrectly-calling-blocks-of-variants-heterozygous but considering the time since that report, and that this is from a different sample set and different versions of GATK I thought it best to create a new post. If your prefer to merge them please do so.
So in this occasion we've been looking at a single animal (~16x) using the HaplotypeCaller & UnifiedGenotyper and once again we are finding that the HaplotypeCaller is making Heterozygous calls where there is no support for them in the BAM.
Example Regions, BosTau6 reference: chr18:55,432,023-55,432,220 chr18:55,350,724-55,351,079
Attached you will find images showing this: For the first position & image chr18:55,432,023-55,432,220 the tracks in IGV are: HaplotypeCaller VCF, Population UnifiedGenotyper VCF, Sample BAM, Sample ReducedReads BAM. If we look at this call we have 9x depth, all reads with mapQ 60, Cigar 101M and BaseQ between 21 & 33, a good balance between forward and reverse and ALL 9 reads contain the Alternate allele. Which means the Site should have been called Alt/Alt not Alt/Ref, but for some reason even through there are no reference reads the HaplotypeCaller has called the site Ref/Alt. Note the UnifiedGenotyper correctly calls this site Alt/Alt.
For the second image, position chr18:55,350,724-55,351,079 the tracks in IGV are: Haplotype Caller VCF, UnifiedGenotyper VCF, UG Population VCF, Sample BAM (PCRdedup, IR, BQSR), Sample ReduceReads Bam In this example we have 4 Variants in a cluster (chr18:55,350,895-55,350,975) that the HaplotypeCaller has called as Het (Ref/Alt) when there is no support for this call in the Reads, secondly the UnifiedGenotyper has successfully called each site as Homozygous (Alt/Alt). The reads are a bit more complex at this site but in each case there are 11 reads all of which are Alt alleles and no Reference allele.
Note: HaplotypeCaller & UnifiedGenotyper were run on the full bams, not the ReducedRead bams.
I will upload the region of the BAM file and the VCF files as: Chr18-HC-Het-issues-ULG.tar.gz
Hi,
how does the speed of the haplotypeCaller usually compare to that of the UnifiedGenotyper?
When I try to use it, it seems to be about 90x slower.
these are the command lines I use:
java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T UnifiedGenotyper -dcov 1000 -I file.bam -o file.vcf -L myTarget
java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T HaplotypeCaller -dcov 1000 -I file.bam -o file.vcf -L myTarget
thanks!
How does GATK2 handle the variants called at homopolymeric regions in the genome? Is this feature enabled and used during the variant calling (with UG/HC) or should we do it separately with VariantFilteration. Is there any specific parameter to control this? Best Raj
Hi, I am calling indel in pooled samples using this command: java -jar -Xmx2g /PATH/2.1.13/GenomeAnalysisTK.jar -l INFO -T UnifiedGenotyper -I pool1.bam -I pool2.bam --out INDEL.vcf -R /reference.fa -glm INDEL
Currently i donot have any information of already known indels. 1.Do i need to first realign (RealignerTargetCreator and IndelRealigner) and then call indels even for pooled data? 2. How different will this be for calling indel on individual sample?
Looking forward for your suggesions. with thanks sasha
I had annotated raw indel file (given by UnifiedGenotyper), 1000G_omni2.5.b37.sites.vcf and hapmap_3.3.b37.sites.vcf with all possible annotations including QD (QualByDepth) using VariantAnnotator. However, i got an error when i tried to run VariantRecalibrator. It was complaing that QD has not been found in training variant. Is QD important annotation for indel filtering. Can it be ignored ?
P.S. - i did not use sample bam file while annotating training data set.
.
.
.
INFO 15:11:55,999 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK.vcf
INFO 15:12:21,650 TraversalEngine - chr1:128346793 1.98e+07 30.0 s 1.5 s 4.1% 12.1 m 11.6 m
INFO 15:12:51,650 TraversalEngine - chr9:130658800 5.26e+07 60.0 s 1.1 s 53.9% 111.2 s 51.2 s
INFO 15:13:13,618 VariantDataManager - QD: mean = NaN standard deviation = NaN
INFO 15:13:16,417 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.1-13-g1706365):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator
##### ERROR ------------------------------------------------------------------------------------------
I have used UnifiedGenotyper to call SNPS. I found some SNPs that has been reported from low quality reads in chromosome X and chromosome Y. Is it possible not to take low quality reads into account while calling SNPs using UnifiedGenotyper? Or, do I need to do quality filtering of BAM files before hand ?
Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.
Here are deletions:
d02 264482 . TT T 389.93 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.224;DP=25;FS=0.000;HaplotypeScore=58.0966;MLEAC=2;MLEAF=1.00;MQ=47.51;MQ0=0;MQRankSum=1.714;QD=15.60;ReadPosRankSum=0.075;SB=-1.112e+02 GT:AD:DP:GQ:PL 1/1:15,9:20:9:431,9,0
Ld04 26597 . CCC C 2923.96 . AC=2;AF=1.00;AN=2;BaseQRankSum=-2.217;DP=98;FS=2.954;HaplotypeScore=145.2957;MLEAC=2;MLEAF=1.00;MQ=59.82;MQ0=0;MQRankSum=0.283;QD=29.84;ReadPosRankSum=1.150;SB=-1.306e+03 GT:AD:DP:GQ:PL 1/1:58,31:98:99:2966,195,0
Here are insertions:
Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0
Ld04 26880 . T TTATT 4973 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.552;DP=99;FS=2.935;HaplotypeScore=116.0082;MLEAC=2;MLEAF=1.00;MQ=59.46;MQ0=0;MQRankSum=0.192;QD=50.23;ReadPosRankSum=1.834;SB=-1.950e+03 GT:AD:DP:GQ:PL 1/1:49,38:98:99:4973,247,0
Here are SNPs
Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0
Ld04 19963 . A G 3571.01 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.895;DP=96;Dels=0.00;FS=0.000;HaplotypeScore=2.4678;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=1.769;QD=37.20;ReadPosRankSum=-0.316;SB=-1.523e+03 GT:AD:DP:GQ:PL 1/1:3,93:96:99:3571,205,0
I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools.
PS I used this command line of V 2.0-38:
java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf
Thank you.
Hideo
Hi,
I have exome data for a few sets of parent-offspring trios, in which offspring have phenotypically related but probably genetically different diseases. Their parents are unaffected so I am particularly interested in identifying de novo mutations. My plan was to analyse each individual separately up to the variant calling phase and then to input three (mother, father, child) analysis-ready BAMs into the UnifiedGenotyper along with a ped file. My questions are:
1) Can you tell me whether the UnifiedGenotyper uses pedigree information in the ped file to call genotypes more accurately? In other words, is this better than calling variants jointly without supplying the ped file? If not, does GATK recommend any external tools for doing this step?
2) It is better to call variants jointly using all the trios (even though they are not related and probably don't share the same disease-causing mutations)?
Best wishes,
Kath
The Unified Genotyper calls SNPs relative to the specified, publicly-available reference assembly.
How can I call SNPs (in many samples, with UG) relative to an in-house individual, which I have sequenced at high-coverage?
My current solution is to perform a de novo assembly on the in-house reference individual using e.g. Velvet, and then simply use the fasta as the reference for UG.
Can the publicly-available reference assembly still be useful here for speeding up the mapping and filling-in missing parts ?
My organism is Drosophila melanogaster.
HI GATK - I am still using the GenomeAnalysisTK-1.6-5-g557da77 version for UnifiedGenotyper. This is probably a silly question, but is there a way to set a parameter for minimum mapping quality score for reads, in deciding whether to evaluate them for variant detection. I know there is a --min_base_quality_score parameter, but I don't see on for mapping quality. http://www.broadinstitute.org/gsa/gatkdocs/release/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html
Hi,
I did some tests with a "best practices"-like pipeline to check if results were deterministic and found that they are not. Some posts already mention that UnifiedGenotyper is non-deterministic when using multi-threading as different seeds are used for downsampling. But I think I'm missing something if single-thread UnifiedGenotyper is deterministic, why would it chose exactly the same reads for downsampling? Wouldn't it always be non-deterministic when downsampling reads? Anyway, the difference was only of 31 variants for an exome sample.
About the VariantRecalibrator I guess the filtering is non-deterministic, but I did not found any reference to this in the forum. The difference between runs is greater in this case. After filtering I had 301 variants only non-filtered in the first run; and 1684 variants only non-filtered in the second run; the non-filtered variants in both runs were 11328.
Thanks in advance! Pablo.
I'm building a variant calling qscript (it's available here), heavily based on the the MethodsDevelopmentCallingPipeline.scala. I cannot however run into trouble when setting the "this.scatterCount" of the GenotyperBase to more than 1 - in which case I get a NullPointerException (I include the full error message below).
I use the following command line:
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/NewVariantCalling.scala -i NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam -R /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta -res /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/ **-sg 2** -nt 8 -run -l DEBUG -startFromScratch
As you can see, I'm using the files from the gatk bundle, and I guess these should be alright for this purpose? Just to be clear I use the "-res" parameter to point to the directory where all the resource files are located, dbsnp, hapmap, etc. and the -sg parameter is what controls the scatter/gather count.
I've tried to search in the code for what might be causing this, and I can conclude that the org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc is called with str (its parameter) being an empty string, which is what causes contig to be null, which in turn creates the NullPointerException on line 408 when this line is executed:
stop = getContigInfo(contig).getSequenceLength();
This, I guess, is the obvious stuff, but this far I haven't been able to figure this out any further that this. I'm not sure if this is caused by a bug in my script, or by a bug in the GATK. Right now I'm thinking the latter of the two, since I have used the scatter/gather function in other scripts without any trouble.
Any ideas of where to continue from here, or confirmation that this is indeed something related to the GATK code would be much appreciated.
Cheers, Johan
ERROR 16:22:50,781 FunctionEdge - Error: LocusScatterFunction: List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf.idx, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam) > List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_1_of_2/scatter.intervals, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_2_of_2/scatter.intervals)
java.lang.NullPointerException
at org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:408)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:84)
at org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:97)
at org.broadinstitute.sting.utils.interval.IntervalUtils.loadIntervals(IntervalUtils.java:538)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindingsPair(IntervalUtils.java:520)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindings(IntervalUtils.java:499)
at org.broadinstitute.sting.queue.extensions.gatk.GATKIntervals.locs(GATKIntervals.scala:60)
at org.broadinstitute.sting.queue.extensions.gatk.LocusScatterFunction.run(LocusScatterFunction.scala:39)
at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:28)
at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:83)
at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:432)
at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:154)
at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:145)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
Basically I have an odd-looking distribution of my variant quality scores (see attached png), and was wondering how concerned should I be and how can I rectify it.
The input data from the graph is from UnifiedGenotyper vcf output file, QUAL values. The four samples in the vcf file are one Drosophila reference line, and three more which are outcrosses of the reference line and thus are heterozygous for the reference allele.
My fastq read-mapping pipeline includes adapter and low-quality base removal, and local re-alignment. I've also attached a pdf showing read quality distribution from one of the samples which also looks a bit odd.
Hi,
I have noticed, sometimes GATK UnifiedGenotyper calls overlapping INDELs even though there should be 3 non-overlapping (independent) calls.

I have attached 200bp region from my BAM, reference and vcf file, so you can reproduce this issue.
Cheers, Leszek
Hello,
I have use Unified Genotyper and Haplotype Caller from GATK to do the SNP calling from my RNA-seq data. Now, I have the .vcf files generated by these two tools and I need to process them. I have read documentations on Select Variant and Variant filteration, also the vcf tools but I am lost, I don't know what I should do first. I know there are lots of information out there on net but it would be great if you could give me some general outlines.
Thanks a lot in advance.
Hello, I have been successfully using the GATK pipeline and Unified Genotyper to call snps. However, I really need GQ scores for invariant sites as well as variant sites.
Currently at invariant sites, only GT and DP are outputted. Is it possible to also get GQ for these sites? I can't seem to find how to do this.
Thank you! Amanda
I'm running the UnifiedGenotyper on a large number of metagenomic samples (~110) that have been mapped by BWA against a database of 671 reference genomes. I know that the UG is meant only for diploid use, but I was hoping to use the vcf output to get at base frequencies at specific loci (as opposed to the probability of whatever diploid genotype is predicted).
I'm running the UG with a 63Gb memory allocation, using 12 threads. Here is the error message I received:
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: An error occurred during the traversal.
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.getTraversalError(HierarchicalMicroScheduler.java:356)
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:105)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:246)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern$Start.
I'm worried that this error might be caused by my use of the UG for such a non-traditional experiment (100+ metagenomic samples mapped to 671 reference genomes), but I wanted to check here to see if maybe it was a more mundane bug, and the UG should otherwise be working for my input (albeit with the genotypes & probabilities called under the false assumption that my data is diploid ... all I want is the base frequencies at snp loci)
I'm trying to call variants on metagenomic data using the UnifiedGenotyper. I know that the diploid genotype calls & likelihoods will not be valid since my data is not diploid, but I want to use the vcf output so sum up base frequencies at detected variant loci.
I mapped 100+ samples (each being ~2 Illumina GA2 lanes of data that after host filtering usually contain about 20-40 million reads per sample) against a database of 671 bacterial reference sequences (and each reference can be in multiple parts, so I probably have 10s of thousands of sequence records in my ref db, spanning the 671 reference genomes...around 2.2Gb in total size). I am then feeding the resulting 100+ bam files to the UnifiedGenotyper.
After some initial mistakes on my part (yes I have entered the future and am using GATK 2.2-5 now :) ) I've now started a run in proper fashion, but after a couple hours its dying with the message that the java application has run out of memory:
I had set -Xmx60g for that failed run, so now I'm wondering if its possible to estimate how much memory would be needed for this job I'm trying to run. Do you think a job of this size is even possible with the UG? Is it the number of references that is killing me here? Or the number of samples?
Sometimes, while running the UnifiedGenotyper with -nct 8, my job might die for any number of reasons unrelated to your fine software. Can I start my next attempt, where the last one left off, simply by using the -L flag to specify the last genotyped location from the vcf output (and the rest of the genome)? This of course works, but the deeper question is this - since I used the -nct flag, might some of the threads have been scanning ahead in the genome and therefore the last genotyped location in that vcf is too far ahead and I might miss some intermediate variants? My sincere apologies if this has already been asked, I did look around first.
Hi,
I am using RNA-seq data for SNP calling. I have used the Unified genotyper tool from GATK with this command:
java -Xmx22g -jar /bubo/sw/apps/bioinfo/GATK/2.3.6/GenomeAnalysisTK.jar -T UnifiedGenotyper -R path/to/myreference.fa -I path/to/mybamfile.bam -stand_call_conf 30 -stand_emit_conf 30 -o snps.raw.vcf
I have obtained SNPs mostly with very high phred scale but the filter field is "." for all my SNPs, there is no PASS or LowQual in the field. I read in the GATK forum that: " If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis. "
I was wondering about the filtering process, is it a parameter that I should add to my command or I must do it in another way?
Thanks a lot for your help.
Hi to all, I ran UnifiedGenotyper of three exome seq samples and phased with familial pedigree. During the manual filtration I saw several inconsistency. For example I get this output from UnifiedGenotyper and phasing:
chr2 38525498 rs76204302 T G 66.79 . AC=2;AF=0.333;AN=6;BaseQRankSum=-7.191;DB;DP=180;Dels=0.00;FS=208.951;HaplotypeScore=13.8978;MLEAC=2;MLEAF=0.333;MQ=60.00;MQ0=0;MQRankSum=2.030;QD=0.52;ReadPosRankSum=1.325;SB=-2.263e+00 GT:AD:DP:GQ:PL 0/0:30,22:52:39:0,39,729 0/1:9,7:16:5:5,0,280 0/1:55,57:112:98:98,0,1169
The order of samples are father, mother and son.How is possible that the father with, respectively 30 bases REF and 22 bases VAR is called 0/0? Thanks
Giuliano
Hello,
I am using the following command to generate SNPs and Indels from matching tumor normal pair bam files (GenomeAnalysisTK-2.4-9-g532efad)
java -jar GenomeAnalysisTK.jar -glm BOTH -R ucsc.hg19.fasta -T UnifiedGenotyper -I tumor.bam -I normal.bam -D dbsnp_135.hg19.vcf -o raw.snps.indels.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0.
I would like to know the specific command (in SelectVariants) to separate SNPs unique to tumor but not to normal sample
Hello,
I have two strains from small eukariote (say S1 and S2) plus and two reference genomes: G1 (closest to S1 and S2) and sister species G2. Using GATK I can call SNPs in S1 and S2 genomic data. Since G1 and G2 are quite close, I want to get G2 SNPs after performing G2 to G1 alignment. I got MAF file, reduced it (= removed weaker mappings of the same contig to the other part of the genome), then managed to create SAM and sorted BAM file. I used picard to add fake ReadGroups and "MarkDuplicates". In the end I am running:
java -Xmx240G -jar ~/soft/GATK_current/GenomeAnalysisTK.jar -T UnifiedGenotyper -R G1.fa \ -I G1_vs_G2.reduced.gr.dup.bam \ -o G1_vs_G2.reduced.gr.dup.gatk.vcf
I got no running errors, but apart from VCF header the file is empty.
Is there any way to pass some argument to UnifiedGenotyper so it will ignore coverage, and simply call every SNP it encounters?
Many thanks,
Darek Kedra
Hi all, I've somewhere in this site that before VQSR the FP rate is expected to be around 10% (I guess for UnifiedGenotyper). Are there some updated statistics for VQRS? For HaplotypeCaller? For Exome/WG data? Another thing: we apply VQRS on all our analysis, we are trying to collect some validation statistics. We suspect that most of the FP have some particular "culprits" in VQRS (especially QD and MQ). Do you have some data about this? Best
d
Hi,
I observed a significant difference of the variant call sets from the same exomes between v1.6 and v2.2(-10). In fact, I observed a significant decrease in the overall novel TiTv in the latter call sets from around 2.6 to 2.1 at TruthSensitivity threshold at 99.0. When I looked at a sample to compare variant sites using VariantEval, it showed that
Filter JexlExpression Novelty nTi nTv tiTvRatio
called Intersection known 14624 4563 3.2
called Intersection novel 856 312 2.74
called filterIngatk22-gatk16 known 264 132 2
called filterIngatk22-gatk16 novel 28 18 1.56
called gatk16 known 3 1 3
called gatk16 novel 1 1 1
called gatk22-filterIngatk16 known 258 94 2.74
called gatk22-filterIngatk16 novel 144 425 0.34
called gatk22 known 2 2 1
called gatk22 novel 17 30 0.57
filtered FilteredInAll known 1344 649 2.07
filtered FilteredInAll novel 1076 1642 0.66
The novel TiTv of new calls in v2.2 not found in v1.6 or called in v2.2 but filtered in v1.6 demonstrated novel TiTv around 0.5. So I suspect that VQSLOD scoring (or ranking) of SNPs was changed substantially in somewhat an unfavorable way.
The major updates in v2.2 affecting my result were BQSRv2, ReduceReads, UG and VariantAnnotation. (Too many things to pin-point the culprit...) The previous BAM processing and variant calls were made using v1.6. For the new call set, I used v2.1-9 (so after serious bug fix in ReduceReads, thank you for the fix) for BQSRv2 and ReduceReads and v2.2-10 for UG and VQSR.
As a first clue, I found that distribution of FS values changed dramatically from the v1.6 (please see attached plots). Although I recognized that FS value calculations were recently updated, the distribution of previous FS values (please see attached) makes more sense for me because the current FS values do not seem to provide us information to classify true positives and false positives.
Thanks in advance. Katsuhito
The title kind of explains the situation, but basically I've got a SNP that shows up in IGV that I would call homozygous that the Unified Genotyper has labeled as heterozygous. The total read depth is 35, 32 of which were called as a SNP (A-->T), 2 were called the reference base (A), and one read contained a G. I went through your article describing why a SNP visible in IGV might not get called, and none of those five questions explained this situation. I didn't alter the --hets option at all either. Any help you might be able to offer would be greatly appreciated.
Just in the process of updating our pipeline from v2.3-4-gb8f1308 Lite to v2.4-7-g5e89f01 Academic and have run into a small issue. The command line:
-T UnifiedGenotyper -glm SNP -R /lustre/scratch111/resources/ref/Homo_sapiens/1000Genomes_hs37d5/hs37d5.fa -I /lustre/scratch111/projects/helic/vcf-newpipe/lists/chr1-pooled.list --alleles /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.snps.vcf.gz -L 1:1-1000000 -U LENIENT_VCF_PROCESSING -baq CALCULATE_AS_NECESSARY -gt_mode GENOTYPE_GIVEN_ALLELES -out_mode EMIT_ALL_SITES --standard_min_confidence_threshold_for_calling 4.0 --standard_min_confidence_threshold_for_emitting 4.0 -l INFO -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A DepthOfCoverage -o /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.asnps.vcf.gz.part.vcf.gz
This worked in 2.3.4. But now gives:
Invalid command line: Argument annotation has a bad value: Class DepthOfCoverage is not found; please check that you have specified the class name correctly
I've looked at the release notes but it's not giving me a clue as to what has changed. Has the DepthOfCoverage annotation now been dropped? I've checked and I can reproduce this on the latest nightly (nightly-2013-03-11-g184e5ac)
Hi GATK,
I've seen this error before on the forum associated with the Haplotyper tool, but not with the UnifiedGenotyper. Any thoughts on how I can fix this?
Best,
Alex
java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: T
at org.broadinstitute.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1335)
at org.broadinstitute.variant.variantcontext.VariantContext.
I am using Unified Genotyper to call variants from multiple samples. I have used the emit_all_confident_sites flag. The output vcf file occasionally has two entries for one position. It is always a monomorphic site and the depth between the two entries is quite different. Usually one entry has very high depth & when I return to the original bam file, the depth does not match. Any idea what I am missing here?
Hello,
I found some strange entries for indels in my VCF file created by the Unified Genotyper. For example:
4 184513470 . TC T 4009 PASS AC=4;AF=0.250;AN=16;BaseQRankSum=1.972;DP=1315;DS;FS=3.466;HaplotypeScore=537.6937;MLEAC=4;MLEAF=0.250;MQ=52.55;MQ0=0;MQRankSum=-10.581;QD=4.55;ReadPosRankSum=-10.128;SB=-3.500e+01;set=variant2 GT:AD:DP:GQ:PL 0/1:230,0:239:99:282,0,5011 0/0:92,0:95:99:0,133,2435
The first sample has genotype 0/1 with a good GQ value. However, according the allele depth field, there is no read supporting the deletion. When I look at the reads using the IGV, I find some reads supporting the deletion for the first sample (and even some for the second one).
Moreover, when I looked at the AD values for SNPs, I noticed the the sum of all AD values is much less than the coverage shown in the IGV. I filtered duplicated reads in the IGV.
Can someone please give an explanation? This link http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_DepthPerAlleleBySample.html explains the difference between AD and DP, but does not help in my case.
Best greetings, Hans-Ulrich
Hi. I was hoping you could help track down the source of this error. I haven't seen this exact error anywhere here in the forum. I am trying to run unified genotyper on 48 bam files at once. Each bam file has been aligned using bwa and processed through the best practices as mentioned on the website. They all pass validation with no error (I had to run picard tools, cleanSamFile and fixMateInformation, along with removing one read from a single file). The data is targeted sequencing. I am running GATKLite 2.3-5-gd738181.
The error message is: Somehow the requested coordinate is not covered by the read. Alignment 189665677 | 48M103S
The command line I am using is: java -Xmx4g -jar pathToGATK.jar -R standardRefFileHG37 -I (48 bam files) -o OutputFile.vcf -nt 4 -nct 4 -T UnifiedGenotyper --dbsnp standardRef -glm Both -metrics metric.file -debug_file debug.txt -A DepthOfCoverage -A AlleleBalance
Is there any way to see if it is a particular bam file that is causing the error? In the error message, what Alignment is it referring to? I am doing the processing on a machine that is not connected to the internet so it is harder for me to cut/copy paste information to here. Is there anything else that you would need to help me track down the source of this error?
In the meantime I am going to try to run it with a .bed file of targeted areas to see if that helps. Should I also try to process each file individually? Thank you for your input.
Lisa
I've spent some time looking at the various documentation and have a few lingering questions for understanding my data a little better. I was hoping you could help to clarify a bit and make sure I am making correct assumptions.
I have a set of 96 samples that we ran targeted sequencing on. I have followed the best practices for bam file processing, etc, and ran Unified Genotyper on all 96 bam files at once using -EMIT ALL SITES. I have snp chip data on all of my samples as well so I have some "truth" to compare to. It looks like around 25% of the SNPs I expected to get (are present in dbSNP and I have chip data for them) are not present in my vcf file. Where did they go - is it normal to not get all of the dbSNP sites back? Is this because the variant site did not have good enough quality of bases to make it a variant? Does Unified Genotyper not emit sites for which all samples are homozygous reference even though it knows it's there because of the dbSNP file (like CASAVA)? I looked at the coverage at these positions and in some cases it was very high (over 100X) so I'm not sure that they could all be bad. I am somewhat new to this sort of data analysis - what are the best tools to use to look at the quality of the bases at a single site to see if this is the culprit? Any other ideas as to what could be causing this?
What are the internal things in Unified Genotyper that make a site not be called?
Would I be able to use the GENOTYPE GIVEN ALLELES option to make Unified Genotyper call all dbSNPs regardless of base quality or Phred Score? If so, do I use the same reference db snp .vcf file that I use for the rest of the steps?
For the variants I do have: If I am interested in the coverage at a site, the AD is the unfiltered depth and the DP is the filtered depth (the sum of the allele depths is the total depth), correct?
FInally, I have not run VSQR on my output. We did targeted sequencing and I'm not sure if the number of bases we did is enough got VSQR to be helpful. What is the minimum number of bases that VSQR is recommended with? I have 96 samples and I know that is enough. We are mostly interested in dbSNP values anyhow. In your opinion, is it reasonable to stop after Unified Genotyper since we are only interested in dbSNP calls or should we do the VSQR step anyhow?
Thank you for all your help and insights. I really appreciate the feedback allowed for in this forum.
Lisa
Hi,
I am running the unified genotyper in "EMIT_ALL_CONFIDENT_SITES" with the latest version of GATK and am getting some odd quality assignments of 10000030. It is clear that these sites are not very stellar, so I am wondering what is happening to these sites. Any thoughts?
Below is a snippet from one such region, starting with some seemingly "normal" rows and ending with this seemingly odd assessment of quality. Any thoughts? As some background info, this sample is quite different than reference, so not abnormal to have lots of missing regions when aligned against the reference that I used. It is bacterial. Here is the command line that I ran:
GenomeAnalysisTK.jar -T UnifiedGenotyper -dt NONE -I "$i" -R ../../Ref.fasta -baq RECALCULATE -nt 4 -ploidy 1 -out_mode EMIT_ALL_CONFIDENT_SITES -o "$j".vcf -stand_call_conf 100 -stand_emit_conf 100 -mbq 20
Ref 930 . C . 102 . AN=1;DP=2;MQ=150.00;MQ0=0 GT:DP:MLPSAC:MLPSAF 0:2
Ref 931 . C . 107 . AN=1;DP=2;MQ=150.00;MQ0=0 GT:DP:MLPSAC:MLPSAF 0:2
Ref 932 . T . 105 . AN=1;DP=2;MQ=150.00;MQ0=0 GT:DP:MLPSAC:MLPSAF 0:2
Ref 942 . A . 10000030 . DP=1;MQ=150.00;MQ0=0 GT .
Ref 1109 . G . 10000030 . DP=2;MQ=150.00;MQ0=0 GT .
Ref 1110 . A . 10000030 . DP=2;MQ=150.00;MQ0=0 GT .
Ref 1111 . T . 10000030 . DP=2;MQ=150.00;MQ0=0 GT .
Ref 1112 . T . 10000030 . DP=2;MQ=150.00;MQ0=0 GT .
Ref 1113 . T . 10000030 . DP=2;MQ=150.00;MQ0=0 GT .
Ref 1114 . A . 10000030 . DP=2;MQ=150.00;MQ0=0 GT .
Ref 1115 . C . 10000030 . DP=2;MQ=150.00;MQ0=0 GT .
Hello GATK team,
I recently ran both caller on the same bam file. However, I got significantly less calls using HC (2844), vs UG (9676 snp mode, 860 indel mode). The version of GATK is GenomeAnalysisTK-nightly-2013-04-23-ga4d6edb, and the command used are:
# ##Variant calling using Haplotypecaller##
# $JAVADIR/java -Xmx2G -jar $GATK2DIR/GenomeAnalysisTK.jar \
# -T HaplotypeCaller \
# -R $HG19REF \
# -I s_${1}.newRG.realigned.recal.bam \
# --dbsnp $SNP137 \
# -L $TARGETFILE \
# -stand_emit_conf 10.0 \
# -rf BadCigar \
# -o s_${1}.HC.vcf
#
#
# ##Variant calling using UnifiedGenotyper##
# $JAVADIR/java -Xmx2G -jar $GATK2DIR/GenomeAnalysisTK.jar \
# -T UnifiedGenotyper \
# -R $HG19REF \
# -I s_${1}.newRG.realigned.recal.bam \
# --dbsnp $SNP137 \
# -L $TARGETFILE \
# -stand_emit_conf 10.0 \
# -rf BadCigar \
# -glm SNP \
# -o s_${1}.UGSNP.vcf
#
#
#
# ##Variant calling using UnifiedGenotyper##
# $JAVADIR/java -Xmx2G -jar $GATK2DIR/GenomeAnalysisTK.jar \
# -T UnifiedGenotyper \
# -R $HG19REF \
# -I s_${1}.newRG.realigned.recal.bam \
# --dbsnp $SNP137 \
# -L $TARGETFILE \
# -stand_emit_conf 10.0 \
# -rf BadCigar \
# -glm INDEL \
# -o s_${1}.UGIndel.vcf
#
Can anyone explain it to me what could went wrong.
Thanks!!
Hi there,
I just had a look at the code of the UnifiedGenotyper and how its Variant Calling algorithm is implemented (very well documented by the way :) . But now I wonder how GATK reads in the SAM file and finds out where the differences to the reference (SNPs, Indels) are that are then examined in the Variant Calling. I only see that the UnifiedGenotyper gets a set of alleles, but not where the alleles are actually determined. I have also found out that GATK is using Samtools for parsing SAM files, but have not found the point where the actual reads are parsed and processed (e.g., by using the CIGAR string). Are you maybe doing a local realignment before the actual Variant Calling from which you get the alleles?
I would appreciate if you could guide me to the right direction where this happens in the code.
Best regards,
Cindy
Thank you for developing a great set of tools and adding in the ploidy option to the UnifiedGenotyper! My question is in regards to the ploidy argument when calling multiple pooled samples together. If I have pooled samples from the same population at different time points and want to call SNPs with multiple sample awareness, do I use the ploidy of one sample or that of all samples. For example, if I have pooled samples of approximately 25 haploid genomes each from three different time points from the same population should I use 25 or 75 as my ploidy?
Hi,
When I use UnifiedGenotyper with --genotype_likelihoods_model SNP --output_mode EMIT_ALL_CONFIDENT_SITES I get the reference SNP homozygote calls (or ./. if insufficient depth/quality etc). Great!
But when I use UnifiedGenotyper with --genotype_likelihoods_model INDEL --output_mode EMIT_ALL_CONFIDENT_SITES I only get non-reference calls, everything else (i.e. reference homozygotes, and anything uncallable) is ./.
I want to be able to select variants (SNPs and INDELs) on call rate across samples - as one would do for array genotype data. And avoid case-control bias due to differential missingness.
thanks,
david van heel
I have used UnifiedGenotyper with the EMIT_ALL_SITES option on selected sites of interest. However, in some cases a genotype is called (see below), but no probability is given. Is it possible to force UG to write those probabilities?
Here is a monomorphic site for which no probabilities are given: 1 100008719 . C .13.19 . AN=76;DP=70;MQ=55.57;MQ0=0 GT:DP ./. ./. 0/0:2 ./. 0/0:1 0/0:2 0/0:2 0/0:1 ./. 0/0:2 0/0:3 0/0:2 ./. 0/0:1 0/0:1 0/0:3 ./. 0/0:2 ./. ./. ./. 0/0:1 0/0:1 0/0:3 0/0:1 ./. 0/0:1 0/0:1 ./. 0/0:1 0/0:1 ./. 0/0:2 ./. ./. ./. 0/0:3 ./. 0/0:1 0/0:1 0/0:2 ./. ./. ./. 0/0:2 ./. 0/0:1 ./.0/0:1 ./. ./. 0/0:3 ./. 0/0:1 0/0:3 0/0:1 0/0:2 ./. 0/0:1 0/0:3 0/0:3 ./. 0/0:2 ./. ./. 0/0:1 ./.
Thank you.
I think I can solve my problem by using the GENOTYPE_GIVEN_ALLELES option in UG. I'm currently checking it.
Hallo,
I running a pipline for exome data. I have 7 exome samples. When the pipeline arrive at the UnifiedGenotyper i always get the error my bam files are malformed.
MESSAGE: SAM/BAM file /tmp/db99fcb54f49eea7ce46ff16a802f7ac/ISDBM261951.recal.bam is malformed: BGZF file has invalid uncompressedLength: -1792842732
All the bam files have are serveral GB big. So I do not get i why the uncompressed length is negative. Here is what is executed:
java -Xmx8g -jar /lib/gatk/GenomeAnalysisTK-2.1-13-g1706365/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /tmp/db99fcb54f49eea7ce46ff16a802f7ac/human_g1k_v37.fasta -nt 30 -glm BOTH -A AlleleBalance -A DepthOfCoverage -A FisherStrand -D /tmp/db99fcb54f49eea7ce46ff16a802f7ac/dbsnp_135.b37.vcf -o samples.variants.raw.vcf -I /tmp/db99fcb54f49eea7ce46ff16a802f7ac/ISDBM261951.recal.bam -I /tmp/db99fcb54f49eea7ce46ff16a802f7ac/ISDBM261953.recal.bam -I /tmp/db99fcb54f49eea7ce46ff16a802f7ac/ISDBM261952.recal.bam -I /tmp/db99fcb54f49eea7ce46ff16a802f7ac/ISDBM261954.recal.bam -I /tmp/db99fcb54f49eea7ce46ff16a802f7ac/ISDBM261950.recal.bam -I /tmp/db99fcb54f49eea7ce46ff16a802f7ac/ISDBM261955.recal.bam -I /tmp/db99fcb54f49eea7ce46ff16a802f7ac/ISDBM261956.recal.bam
Thanks,
Robin
Hi,
I trying to detect variants in a population of haploid genomes with UnifiedGenotyper, using the -sample_ploidy option to specify the number of haploid genomes in the pool. I would like to be able to specify other options, but so far the ones I have tried do not seem to work. I have tried: -out_mode EMIT_ALL_SITES --sample_ploidy 50 -minIndelFrac 0.05 --genotype_likelihoods_model BOTH -pnrm EXACT_GENERAL_PLOIDY. Could you please tell me which, if any, of the other commands are available when calling variants on non-diploid samples? Specifically, I would really like to be able to call indels too.
Thanks for your time and help.
Shelbi
We have a SGE environment where we run our GATK jobs. Lately we have had a few cases of UnifiedGenotyper getting hung up while trying to read our previously created .fai and .dict reference files. On one occasion we had 6 jobs on different nodes all hung for about 5 days before finally giving the FSLockWithShared - WARNING messages and then continuing on just fine. We haven't been able to find out why this is happening. Do you know of anything that would cause UnifiedGenotyper to try and read those files before ultimately throwing the warning messages 5 days later?
We looked into the possibility of other users/jobs having an exclusive lock on those reference files, but we had other UnifiedGenotyper jobs run fine during that same time period without these problems.
We also haven't been able to find any hardware problems yet. The jobs with this problem were on multiple nodes, so I thought I would see if you have any ideas.
Thanks for your help!
Using strace we were able to find that all 6 of the UnifiedGenotyper processes were hung in the same spot:
$ strace -p 22924
Process 22924 attached - interrupt to quit
futex(0x41c319d0, FUTEX_WAIT, 22925, NULL <unfinished ...>
$ strace -p 22925
Process 22925 attached - interrupt to quit
write(1, "WARN 12:39:09,312 FSLockWithSha"..., 158) = 158
write(1, "INFO 12:39:09,364 ReferenceData"..., 116) = 116
write(1, "INFO 12:39:09,364 ReferenceData"..., 89) = 89
open("/reference/sequence/human/ncbi/37.1/allchr.fa.fai", O_RDONLY) = 16
fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0
fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0
fcntl(16, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=0, len=9223372036854775807} <unfinished ...>
Here is the log file for one of the jobs:
Thu Jul 19 09:42:53 CDT 2012
INFO 09:42:58,047 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:42:58,048 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.6-7-g2be5704, Compiled 2012/05/25 16:27:30
INFO 09:42:58,048 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 09:42:58,048 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 09:42:58,048 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 09:42:58,049 HelpFormatter - Program Args: -R /reference/sequence/human/ncbi/37.1/allchr.fa -et NO_ET -K /p
rojects/bsi/bictools/apps/alignment/GenomeAnalysisTK/1.6-7-g2be5704//key -T UnifiedGenotyper --output
_mode EMIT_ALL_SITES --min_base_quality_score 20 -nt 4 --max_alternate_alleles 5 -glm BOTH -L /data2/bsi/target.bed -I /data2/bsi/H-001.chr22-sorted.bam --out /data2/bsi/variants.chr22.raw.all.vcf
INFO 09:42:58,050 HelpFormatter - Date/Time: 2012/07/19 09:42:58
INFO 09:42:58,050 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:42:58,050 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:42:58,166 GenomeAnalysisEngine - Strictness is SILENT
WARN 12:39:09,312 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.dict: Protocol family not supported.
INFO 12:39:09,364 ReferenceDataSource - Unable to create a lock on dictionary file: Protocol family not supported
INFO 12:39:09,364 ReferenceDataSource - Treating existing dictionary file as complete.
WARN 12:39:12,527 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.fa.fai: Protocol family not supported.
INFO 12:39:12,528 ReferenceDataSource - Unable to create a lock on index file: Protocol family not supported
INFO 12:39:12,528 ReferenceDataSource - Treating existing index file as complete.
INFO 12:39:12,663 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 12:39:12,954 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.29
INFO 12:39:13,108 MicroScheduler - Running the GATK in parallel mode with 4 concurrent threads
WARN 12:39:13,864 UnifiedGenotyper - WARNING: note that the EMIT_ALL_SITES option is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode
INFO 12:39:14,361 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 12:39:14,568 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.21
INFO 12:39:14,569 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 12:39:14,627 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO 12:39:14,628 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 12:39:14,686 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO 12:39:15,123 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
...
Tue Jul 24 12:47:32 CDT 2012
Hi, I'm using The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34
I was running the UnifiedGenotyper tools on a 100bp paired end read with the the command below:
Program Args: -T UnifiedGenotyper -R ref.fasta -I realignedBwa.bam -glm BOTH -log variantsCalling.log -o bwa_variants.vcf
I noticed that only SNPs were called and not a single indel was called. however, when i used a 150bp paired end read with the similar command above it worked fine, the SNPs and Indels were called.
To make sure that the realignedBWA.bam file that i used was not corrupt/malformed, i used two different other pileup program on this bam file and the SNPs and Indels were called nicely. Hope you can take a look at this. I attached together the log file just in case
thanks in advance
I have a VCF containing 7.4m SNPs over 70 individuals from an F2 intercross, called by the UnifiedGenotyper v2.3.6. I am trying to set appropriate thresholds for filtering these SNPs. The attached plots summarise the individual calls from this data set, with depth on the x-axis, genotype quality on the y-axis and frequency of particular DP+GQ combinations shown in greyscale. The first plot shows 0/1 (heterozygote) calls, the second shows 0/0 (homozyote) calls (the 1/1 plot looks similar to the 0/0 plot).
The homozygote plot shows a clear relationship between minimum depth and maximum GQ; it is impossible to get high GQs at low depth. However, this is not the case for heterozygotes. This makes intuitive sense to me - at low depth, one cannot be sure that a call really is homozygote; perhaps the other allele simply hasn't been sequenced. But we can have more confidence in a low depth heterozygote, as both alleles have been seen.
However, I am wondering what your recommendations for best practice are here; do you recommend using the same GQ thresholds for homozygote and heterozygote calls, or different thresholds? If the same thresholds, it seems like there will be a bias at low coverage; many (quite possibly real) homozygote calls will be excluded, which will make it appear that there is an excess of heterozygosity in low coverage individuals.
Also, there seems to be a periodicity in the homozygote (but not the heterozygote) GQ values; GQ values divisible by three have a different distribution to other GQ values. I assume this doesn't affect the results too much (after all, the scale is fairly arbitrary in the first place) but I'd be interested to know what causes this, if it is known.
Thanks for your help,
John Davey
Hello,
Im trying to call variants using UnifiedGenotyper on ca 450 reduced bams in 100000 bp chunks. It works fine for some of the chunks, but for others I get the following error message:
Can anyone explain to me why there is a problem with a specific bam file when I call on for example chunk chr20:25400000-25500000 but not when I call on chunk chr20:10000000-10100000?
Thank you, Tota
Hi, I'm running UnifiedGenotyper with different glm values (BOTH, INDEL, SNP). I expected that the set of variants from glm BOTH is the same as the union of variants from glm SNP and INDEL, but it wasn't. Althought the different was not big (less than 100 variants), I'm curious why there's such a difference and want to know which is better way to find variants (both snps & indels). Thank you.
I have run UnifiedGenotyper with the -glm options SNP and BOTH. These two approaches yield identical variants and identical genotype likelihoods (at least the first 100k variants I checked). However, a few of the annotations have different values: BaseQRankSum MQRankSum ReadPosRankSum
-glm SNP on the left and -glm BOTH on the right:
MQRankSum=-1.762 MQRankSum=-1.785
MQRankSum=-5.307 MQRankSum=-4.970
MQRankSum=0.262 MQRankSum=-0.022
MQRankSum=-0.680 MQRankSum=-0.710
MQRankSum=1.016 MQRankSum=0.231
MQRankSum=-0.693 MQRankSum=-0.681
MQRankSum=-0.839 MQRankSum=-0.830
MQRankSum=1.924 MQRankSum=1.889
MQRankSum=-0.991 MQRankSum=-0.665
MQRankSum=-0.459 MQRankSum=-0.958
BaseQRankSum=-1.803 BaseQRankSum=-1.881
BaseQRankSum=6.918 BaseQRankSum=6.894
BaseQRankSum=-2.512 BaseQRankSum=-2.524
BaseQRankSum=2.000 BaseQRankSum=2.020
BaseQRankSum=2.095 BaseQRankSum=2.006
BaseQRankSum=2.134 BaseQRankSum=2.223
BaseQRankSum=-3.622 BaseQRankSum=-3.547
BaseQRankSum=1.569 BaseQRankSum=1.586
BaseQRankSum=-3.416 BaseQRankSum=-3.733
BaseQRankSum=-1.745 BaseQRankSum=-1.769
ReadPosRankSum=-0.341 ReadPosRankSum=-0.280
ReadPosRankSum=4.207 ReadPosRankSum=4.190
ReadPosRankSum=-3.809 ReadPosRankSum=-3.832
ReadPosRankSum=-2.047 ReadPosRankSum=-2.060
ReadPosRankSum=-1.279 ReadPosRankSum=-1.232
ReadPosRankSum=-3.921 ReadPosRankSum=-3.955
ReadPosRankSum=-1.500 ReadPosRankSum=-1.486
ReadPosRankSum=-0.374 ReadPosRankSum=-0.403
ReadPosRankSum=3.209 ReadPosRankSum=3.188
ReadPosRankSum=1.889 ReadPosRankSum=1.868
Why is that?
I noticed another user got different variants, but I get the same variants and the same likelihoods: [http://gatkforums.broadinstitute.org/discussion/1782/unifiedgenotyper-different-glm-value-result-in-different-sets-of-variants]
I ran single threaded.
I use MQRankSum and ReadPosRankSum for VariantRecalibrator, so it affects my downstream results, if the annotations are -glm dependent. Hence I am asking my question. I hope you can illuminate me. Thank you.
I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ?
I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.
Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.
Hi to all
I began a variant analysis from 4 family related exome-seq samples in which a patology seems to be related to a polimorphism. I am just wondering which variant calling tools is better to use and if applying PhasebyTrasmission refinement is the correct way (in PhasebyTrasmission analysis does the read group that I assigned to bam file play a role in definition of the relation or I have to use just the ped file?).
Best
Giuliano
Hello,
I am trying to investigate why a variant (SNP) is not reported in the vcf file when I use the output_mode as "EMIT_VARIANTS_ONLY". However, when I use "EMIT_ALL_SITES" the site shows a very high QUAL score for the variant. Below is the output my region of interest:
17 7578393 . A . 11642 . AN=2;DP=3927;MQ=41.68;MQ0=0 GT:DP 0/0:3927
17 7578394 . T A 32767.01 . AC=1;AF=0.500;AN=2;BaseQRankSum=-15.336;DP=3927;Dels=0.10;FS=261.618;HaplotypeScore=6301.9653;MLEAC=1;MLEAF=0.500;MQ=41.68;MQ0=0;MQRankSum=-0.197;QD=8.34;ReadPosRankSum=0.049;SB=-5.709e+03 GT:AD:DP:GQ:PL 0/1:1588,1922:3526:99:32767,0,32767
17 7578395 . G A 984.01 . AC=1;AF=0.500;AN=2;BaseQRankSum=-18.165;DP=3927;Dels=0.08;FS=2960.660;HaplotypeScore=5221.3579;MLEAC=1;MLEAF=0.500;MQ=41.68;MQ0=0;MQRankSum=1.442;QD=0.25;ReadPosRankSum=1.547;SB=-6.519e-03 GT:AD:DP:GQ:PL 0/1:3218,401:3619:99:1014,0,32767
17 7578396 . G . 9145 . AN=2;DP=3927;MQ=41.68;MQ0=0 GT:DP 0/0:3910
17 7578397 . T . 11392 . AN=2;DP=3927;MQ=41.68;MQ0=0 GT:DP 0/0:3910
The questionable variant is T to A variant at Chr17:7578394 position.
The command I used is:
java -Xmx4g -jar /usr/local/lib/GenomeAnalysisTK-2.0-35-g2d70733/GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T UnifiedGenotyper -I CL255.ordered.sorted.realigned.bam -o CL255_SNP.vcf --dbsnp dbsnp_135.b37.vcf -stand_call_conf 30 -stand_emit_conf 0 -dcov 5000 -L "17:7,578,336-7,578,451" -out_mode EMIT_ALL_SITES_
I am unable to attach the data since uploading bam seems to be disallowed.
Any help is appreciated. Thanks
Hey there,
I'm currently working on a project that uses the MiSeq system to perform very deep sequencing (5000x coverage) on a small set of genes related to a particular cancer type. I have had some odd calls with HaplotypeCaller and UnifiedGenotyper, especially near the edges of the sequenced regions. It seems that often Haplotype calls very large deletions and Unified calls SNPs, e.g. (from the VCFs):
Haplotype Caller:
chr2 212576988 . TATTTTTAATTGTA T 13.95 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.805;ClippingRankSum=-0.916;DP=1000;FS=8.285;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.836;QD=0.00;ReadPosRankSum=0.528 GT:AD:GQ:PL 0/1:308,53:42:42,0,8418
Unified Genotyper:
chr2 212576990 . T C 555.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.991;DP=1000;Dels=0.00;FS=0.000;HaplotypeScore=22.6027;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.073;QD=0.56;ReadPosRankSum=0.164 GT:AD:DP:GQ:PL 0/1:860,130:3743:99:584,0,32767
A screenshot from IGV showing some of the (few) BAM records in the area of these calls is attached.
Many of these false variant calls can be removed through hard filtering, but I was a little surprised to see them called in the first place, because our raw results had been quite good with lower-coverage data. Is this to be expected with high coverage data?
I am also wondering if there is a hard cap after which downsampling occurs. I set -dcov 10000 for all our samples, but the depth reported in the DP tag seemed max out at 1000. Is there another parameter I need to change? In some areas, due to the very small regions being sequenced, I have coverage in excess of 15,000 reads, so downsampling to 1000 might skew the results.
Any guidance is appreciated! Cheers, Natascha
Hi,
I used the UnifiedGenotyper (GATK 1.6) on a multi-sample set to call variants, and for some of the positions I get multiple mutated alleles. The genotype entries in the combined VCF file look like (GT:AD:DP:GQ:PL):
0/1:94,11,0:124:22.18:22,0,2485,209,2500,2709
0/2:27,0,54:81:99:1651,1726,2695,0,968,836
so it's three AD values per entry. Running SelectVariants yields the following line for the second example from above:
GT:AD:DP:GQ 0/1:27,0,54:81:99
Although it changed the genotype from 0/2 to 0/1, it did not update the AD field. I checked the forums, but I could not really find anything discussing specifically the update of AD, except for the GATK 2.2 release notes where it says SelectVariants: "Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file."
I was wondering whether you could confirm if cases like the one above would benefit from the bugfix, or if the bug description applies to something else.
Thanks a lot for all your hard work, Markus
Hi,
This question was brought by my colleague who pursued the GATK called variants downstream. She used a program called ANNOVAR to annotate the variants derived form GATK calls from our exome-seq data. After that she saw many of variants are annotated to be at 5' or 3' UTR regions, intronic regions, or even intergenic regions. However, when I called the variants with GATK, I did use -L option at the Unified genotyper step with the bed file directly download from Agilent (I used the enrichment kit from Agilent), which supposed to restrict the variants only to the exons or exon-approximate regions. UTRs or intronic regions may be understandable, but the variants from intergenic regions are kind of odd, is it? Anybody has similar observation or just something wrong with ANNOVAR or our local setting of ANNOVAR? I hope GATK did not introduce these odds but faithfully call variants only at the target interval regions as defined to do. Unless the Agilent's target enrichment regions are spread out into intergenic regions?
Any comments or insights are appreciated!
Thanks
Mike
Previously I have been running a command like this:
java -jar /path/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R /path/human_g1k_v37.fasta \
-et NO_ET \
-K /path/key \
-out_mode EMIT_ALL_SITES \
--input_file /path/bam \
-L /path/intervals \
-gt_mode GENOTYPE_GIVEN_ALLELES \
--alleles /path/vcf \
--dbsnp /path/dbsnp_135.b37.vcf \
-o /path/my.vcf
But I was reading the documentation again and I read this statement: GENOTYPE_GIVEN_ALLELES only the alleles passed in from a VCF rod bound to the -alleles argument will be used for genotyping
Which lead me to believe that there wasn't a need to include the lines: --input_file /path/bam \ -L /path/intervals \
because it would be redundant. But when I try to run without those line I get back an error message: Walker requires reads but none were provided.
Can you give an explaination as to why both of those lines AND GENOTYPE_GIVEN_ALLELES would be needed?
We are using the unified genotyper to create a vcf file. In the vcf file we see the info tags;
but when we look at the exported annotations
ABHom=1.00;AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=40.00;MQ0=0;QD=12.21
we only see ABHom not ABHet.
Greetings!
First of all, thank you for a truly great toolkit! It is no doubt the best one out there.
Now, I have a question regarding visualization of a SNP that is not called by UG but looks convincing in IGV. Yes, I've looked at the FAQ page gatkforums.broadinstitute.org/discussion/1235/why-didnt-the-unified-genotyper-call-my-snp-i-can-see-it-right-there-in-igv but I'm still not completely convinced that this is a false positive.
The BAM files have gone through the Best Practices workflow prior to SNP calling. Calling was done using UG with subsequent recalibration steps, where I followed the guidelines under gatkforums.broadinstitute.org/discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project. SNP calling was done using GATK 2.4-9.
Below is a screenshot from IGV showing the SNP call:

Fullsize here: s24.postimg.org/sepow851v/igv_snp.png
The average mapping quality for the reads that include the SNP is 50 and the average base quality at the locus of the SNP is 28.7 (not including 4 positions where base quality is below 10). These values are calculated from the values shown by IGV
Are these values really too low to not confidently call this SNP? I mean a base quality of 28.7 means a probability of 99.87% that the base call is correct. Isn't that good enough?
Please help me understand this, and let me know if you need more information. :)
Hello,
Could anyone give me an example generic command for single-sample and multi-sample INDEL calling using the Unified Genotyper? I only found documentation for SNP calling. And when I tried something like that :
java -jar GenomeAnalysisTK.jar \
-R resources/Homo_sapiens_assembly19.fasta \
-T UnifiedGenotyper \
-I input.bam \
--dbsnp dbsnp_137.b37.vcf \
--genotype_likelihoods_model INDEL \
-o output.indels.raw.vcf \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0 \
-dcov 50
The number of indels detected and reported in my output vcf file is too low (only 400 indels for an exome).
Thank you for any help you can provide me, Geoffroy
I've tried to get genotype for all sites provided in interval file using haplotypeCaller. If using unifiedGenotyper, I can get the result by setting "output_mode EMIT_ALL_SITES". But haplotypeCaller doesn't report as expected by "output_mode EMIT_ALL_SITES". Even though I set "genotypeFullActiveRegion" or "fullHaplotype", haplotypeCaller doesn't seem to emit genotype at all sites. How to get desirable result using haplotypeCaller? Thanks!
I wonder if there is any limitation for the command length. I provide more than 90 bam files, including their directory address to GATK for variant calling using UnifiedGenotyper, and it seems it doesn't parse whole the command. I know It could be because of java, also. Have you ever experiences similar problem ? Is there any way to provide a directory path instead of all single bam files path as the input ? thanks,
Hi, I use bowtie to do the mapping. And then use samtools to convert the sam to bam file, then use Picard to add RG info, then use the GATK local indel tool, then use the GATK UnifiedGenotyper to call SNP.
java -Xmx4g -jar ~/fly/GenomeAnalysisTK-2.2-3/GenomeAnalysisTK.jar -T UnifiedGenotyper -I read_target.bam --dbsnp ../human_chr.vcf -R ~/fly/GenomeAnalysisTK-2.2-3/ucsc.hg19.fasta -o read.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000
And all the progress, there is no error. But I just got a vcf file as following.
##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
...
action_per_sample=0.25 indel_heterozygosity=1.25E-4 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 noBandedIndel=false indelDebug=false ignoreSNPAlleles=false allReadsSP=false ignoreLaneInfo=false reference_sample_calls=(RodBinding name= source=UNBOUND) reference_sample_name=null sample_ploidy=2 min_quality_score=1 max_quality_score=40 site_quality_prior=20 min_power_threshold_for_calling=0.95 min_reference_depth=100 exclude_filtered_reference_sites=false heterozygosity=0.001 genotyping_mode=DISCOVERY output_mode=EMIT_VARIANTS_ONLY standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=10.0 alleles=(RodBinding name= source=UNBOUND) max_alternate_alleles=3 dbsnp=(RodBinding name=dbsnp source=exampleSNP.vcf) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub bcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_mismatching_base_and_quals=false"
##contig=<ID=chr1,length=100000>
##reference=file:///home/fly//GenomeAnalysisTK-2.2-3/ucsc.hg19.fasta
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT read.bam
I don't know why, but I am sure it can't be right with no SNPs. Through the samtools pileup method, I can get about 100 thousand SNPs.
Hope to get your answer, thanks.
Hi,
I was wondering whether there was any way to obtain the ref/alt allele count based only on filtered reads in UnifiedGenotyper? The output format allows you to see how many reads were filtered in total based on the difference between INFO and FORMAT DP, but not how filtering affects the AD. I understand the logic behind this standard output format, but what approach could I take to get the post-filtering AD?
Many thanks.
Hello the team,
For some genotypes, it seems are wrong, I know it's model based, and base q, map q, etc are considered in the model. I also read this link: http://gatkforums.broadinstitute.org/discussion/1235/why-didnt-the-unified-genotyper-call-my-snp-i-can-see-it-right-there-in-igv#latest But my case are special, the format is (ref allele count)/(alternative allele count) genotype call: 22/24 0/0 109/125 0/0 85/109 0/0 26/32 0/0 40/161 0/0 195/6 1/1 239/5 1/1 83/6 1/1 46/28 1/1
In one case, the two variants are adjacent to each other. In some case, they are one base indels.
Thanks,
Jim
I am facing this error, when I try to validate the variants (ValidateVariants) of a vcf file which is produced through GATK just after UnifiedGenotyper. I am using GenomeAnalysisTK-2.3-6-gebbba25 and dbsnp_137.hg19.vcf. These variants are annotated by DepthOfCoverage, aplotypeScore, ,InbreedingCoeff and LowMQ ...
Basically, I generate two VCF files using UnifiedGenotyper separately, one for SNP and the other for INDEL.
the error for both is about the Allele Count (AC) tag: ##### ERROR MESSAGE: File F93.snp.vcf fails strict validation: the Allele Count (AC) tag is incorrect for the record at position chr1:1225579, 1 vs. 1
I appreciate your comments,