# Tagged with #unifiedgenotyper 4 documentation articles | 1 announcement | 94 forum discussions

Created 2013-08-23 21:34:04 | Updated 2014-10-24 16:21:11 | Tags: unifiedgenotyper haplotypecaller ploidy

Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.

As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.

Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

• If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
• If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
• If you want to analyze more than 100 samples at a time (for performance reasons) (versions 2.x)

Created 2013-06-17 21:39:48 | Updated 2016-02-11 10:52:35 | Tags: unifiedgenotyper variant-discovery

### Note: the UnifiedGenotyper has been replaced by HaplotypeCaller, which is a much better tool. UG is still available but you should really consider using HC instead.

#### Objective

Call variants on a haploid genome with the UnifiedGenotyper, producing a raw (unfiltered) VCF.

• TBD

#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Ploidy (-ploidy)

In its basic use, this is the ploidy (number of chromosomes) per sample. By default it is set to 2, to process diploid organisms' genomes, but it can be set to any other desired value, starting at 1 for haploid organisms, and up for polyploids. This argument can also be used to handle pooled data. For that purpose, you'll need to set -ploidy to Number of samples in each pool * Sample Ploidy. There is no fixed upper limit, but keep in mind that high-level ploidy will increase processing times since the calculations involved are more complex. For full details on how to process pooled data, see Eran et al. (in preparation).

• Genotype likelihood model (-glm)

This is the model that the program will use to calculate the genotype likelihoods. By default, it is set to SNP, but it can also be set to INDEL or BOTH. If set to BOTH, both SNPs and Indels will be called in the same run and be output to the same variants file.

• Emission confidence threshold (-stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (-stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

### 2. Call variants in your sequence data

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R haploid_reference.fa \
-L 20 \
-glm BOTH \
--stand_call_conf 30 \
--stand_emit_conf 10 \
-o raw_ug_variants.vcf 

This creates a VCF file called raw_ug_variants.vcf, containing all the sites that the UnifiedGenotyper evaluated to be potentially variant.

Note that -L specifies that we only want to run the command on a subset of the data (here, chromosome 20). This is useful for testing as well as other purposes. For example, when running on exome data, we use -L to specify a file containing the list of exome targets corresponding to the capture kit that was used to generate the exome libraries.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Created 2012-07-30 17:37:12 | Updated 2015-10-30 19:34:24 | Tags: unifiedgenotyper haplotypecaller snp bamout

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

### 1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!

### 2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.

### 3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

### 4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.

### 5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

### 6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site, you are probably seeing false positives.

Created 2012-07-26 14:50:55 | Updated 2014-10-24 00:55:34 | Tags: unifiedgenotyper ploidy haploid intermediate

In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.

### Ploidy-related functionalities

As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.

For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the -ploidy argument.

### Cases where ploidy needs to be specified

1. Native variant calling in haploid or polyploid organisms.
2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
3. Pooled validation/genotyping at known sites.

For normal organism ploidy, you just set the -ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).

## Important limitations

Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF, and Genotype annotations such as PL, AD, GT, etc.

You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.

Created 2013-03-14 12:29:25 | Updated 2014-12-01 17:34:07 | Tags: unifiedgenotyper haplotypecaller printreads bug

As reported here:

If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.

Thank you for your patience while we work to fix this issue.

### Latest update: we found that the three tools (PrintRead, HaplotypeCaller and UnifiedGenotyper) had different issues that only produced the same symptom (the ArrayIndexOutOfBoundsException error). The underlying issues have all been fixed in version 2.5. If you encounter an ArrayIndexOutOfBoundsException in later versions, it is certainly caused by a different problem, so please open a new thread to report the issue.

Created 2016-04-24 00:40:28 | Updated | Tags: unifiedgenotyper haplotypecaller

Hello , I recently saw a webinar by NCBI "Advanced Workshop on SRA and dbGaP Data Analysis" (ftp://ftp.ncbi.nlm.nih.gov/pub/education/public_webinars/2016/03Mar23_Advanced_Workshop/). They mentioned that they were able to run GATK directly on SRA files.

I downloaded GenomeAnalysisTK-3.5 jar file to my computer. I tried both these commands:

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R SRRFileName -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T SRRFileName -R SRR1718738 -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

For both these commands, I got this error: ERROR MESSAGE: Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension and lists of BAM/CRAM files with the .list extension, but the file SRR1718738 has neither extension. Please ensure that your BAM/CRAM file or list of BAM/CRAM files is in the correct format, update the extension, and try again.

I don't see any documentation here about this, so wanted to check with you or anyone else has had any experience with this.

Thanks K

Created 2016-04-06 21:06:40 | Updated | Tags: unifiedgenotyper haplotypecaller rnaseq splitncigarreads

Hi, I work on plant species. I am using GATK on variant discovery in RNAseq data. I am not able to decide whether I should use option --filter_reads_with_N_cigar or
-U ALLOW_N_CIGAR_READS. What do you suggest?

I Would like to count the number of reads affected by N's in the CIGAR? Could you please suggest any tool?

Secondly, I am using Haplotype Caller for variant discovery. It is running very slow (on 12 CPU).
Is it okay to use UnifiedGenotyper instead of Haplotype Caller on RNAseq data?

Thanks

Created 2016-04-05 10:08:24 | Updated | Tags: unifiedgenotyper multi-sample

Does this mean this will report SNPs between input samples?

Created 2016-03-31 13:34:56 | Updated | Tags: unifiedgenotyper

Hi, I have 3 sRNA samples. Each was aligned to a virus ref genome (bowtie2) and reads that align were filtered out., and I used unified genotyper on these filtered reads to find variants. Came up with 3 lists, no problems, used a cutoff score of 100. I then decided to check on some of the SNPs using IGV. That is when I noticed that the 3 samples behaved quite differently from one another as far as being able to verify the SNPs. In one sample only 8 out 80 SNPs were certain, of the rest maybe half might possibly be SNPs if the alignments scores were much higher in the lower frequency nt (these were cases where >50% of reads show the template nt, but the SNP nt was higher than 10%). The rest had 85-98% of reads with the original nt. in the other two samples I could verify just about every SNP and of course some variation between tools is expected. In one of those two samples there also seemed to be a large number of SNPs that were missed.

so: sample one : lots of false positives, sample two: lots of missed SNPs, sample 3 : just about right on target. sample 1 has a much higher read density than the other two. But other than that I can't think of any differences.

does this mean unified genotyper is not optimal for use with these very short reads (36nt and shorter)? Is there something else that affects this? Would samtools pileup etc be better? Thanks! Susanne

Created 2016-03-20 19:57:09 | Updated | Tags: unifiedgenotyper genotype

I know HC call genotype, but does unifiedgenotyper also call genotype or just variants?

Created 2016-03-20 19:55:44 | Updated | Tags: unifiedgenotyper haplotypecaller algorithm

Under every poster of GATK asking which is better, HC or UG, Geraldine always said HC.

So Is there any documents talking about the detail algorithms HC and UG are using, so that I can get a clear idea why HC is better?

Thanks.

Created 2016-03-17 14:42:53 | Updated | Tags: unifiedgenotyper haplotypecaller multi-sample gvcf

Hi, I'm using GATK ver 3.4 for SNP calling and I have some question about it. My data set has 500 samples, and I used genome data as reference for bowtie/GATK

1) I called SNP by sample (gvcf) with haplotype and then combined gvcf, however, the combination takes a long time, the GATK wants to recreate gvcf.idx files (4 of my gatk mission stuck at this step), one gatk combination finished after about 20 days calculation. I also try to use '-nct' to improve this, but it still stuck at preparing idx files.

2) For that finished gatk combination data set, I also used Unifiedgenotype with Gr.sorted.bam as input to call SNPs. The result is output with Gr.sorted.bam has 5 times more SNPs number than gvcf combination, and most missing SNPs could be found in individual gvcf files but missing in final result.

Could you help me with these? Thank you!

Created 2016-03-15 06:05:14 | Updated | Tags: unifiedgenotyper pooling

I've got a pool of 80 individuals sequencing data. Each individual doesn't have index, so that i have just one fastq file and bam file that i cannot sort any sample data from it.

I try to use GATK UnifiedGenotyper v3.3 including "-ploid" option There're some questions.

1. command : java -Xmx100g -jar /ruby/Tools/GATK/GenomeAnalysisTK-3.3/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ./Ref.fasta -I 1.bam -o 1_unified.vcf --sample_ploidy 160 -minIndelFrac 0.05 --genotype_likelihoods_model BOTH -pnrm EXACT_GENERAL_PLOIDY -nct 4 -nt 10

--sample_ploidy = 160 = 80 (pooling 80 individuals) * ploid ( diploid, 2 ) bamfile size = 3.4GB

SERVER SPEC : CPU core = 40x memory = 256G

I've started the process 4 days ago. The progress percent is 0.3%, and remain time is 176.9 weeks. I think it's too slow to complete. I just wonder how long takes time to process GATK UnifiedGenotyper on that data. Is there any recommendations to improve this job?

Created 2016-03-09 20:17:57 | Updated | Tags: unifiedgenotyper

I've been scouring the forums, but I fear that my question is so basic that I am alone: I have whole genome sequences of 6 samples (and so 6 .bam files) of a non-model organism and I am trying to compare SNPs for downstream population genetics analyses. I attempted this using UnifiedGenotyper (I realize that HaplotypeCaller is better, but UG finished first, while HC has been running for days at the time of this writing.) Here is what I entered:

java -jar GenomeAnalysisTK.jar -R reference.fasta -T UnifiedGenotyper -I sample01.bam -I sample02.bam -I sample03.bam -I sample04.bam -I sample05.bam -I sample06.bam -o output.raw.snps.indels.vcf

Unless I am mis-reading the output VCF file (first few lines are pasted below), it seems to contain only a single sample (based on the fact that there is only a single column for ALT, rather than one per sample). I tried to use this file in SNPHYLO, but it errors because "There are no SNPs", which seems to confirm this.

What am I doing wrong? Thanks, and apologies for any redundancies.

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TU114

Created 2016-03-02 09:34:18 | Updated | Tags: unifiedgenotyper user-error

hello, i tried to launch unified genotyper but it's seems doesnt worked. i show you the error msg:

[foulquie@node6 palmx_bam]$/usr/local/java/latest/bin/java -jar /home/sabotf/sources/GenomeAnalysisTK/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ./Eg_RNA_RefSeq_MyOPGP.fasta -nt \$NSLOTS -I ./merged_bams_file.bam -o ./merged_bams_file.vcf -stand_call_conf 30 -stand_emit_conf 10 -rf BadCigar

##### ERROR ------------------------------------------------------------------------------------------

What does it means ?

Created 2016-01-12 04:14:19 | Updated | Tags: unifiedgenotyper

Hello,

It says that the priors are calculated during multi-sample calculation. I, however, just has the sequence data from one sample with 80X depth coverage (I'm not sure if that's relevant but my understanding is that it does allow for calling of high confidence variants). So I was wondering how the priors would be calculated in this case. Thank you!

Created 2016-01-07 14:29:16 | Updated | Tags: unifiedgenotyper java

I've seen a few java.lang.IllegalArgumentException questions here, but I haven't been able to find any quit like mine! Let me know if this has in fact been answered somewhere else, or if you have any insight into the problem! Here is the output from my UnifiedGenotyper run:

INFO 16:37:28,331 HelpFormatter - --------------------------------------------------------------------------------- INFO 16:37:28,364 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 INFO 16:37:28,364 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 16:37:28,364 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 16:37:28,366 HelpFormatter - Program Args: -T UnifiedGenotyper -R /n/regal/mallet_lab/edelman/18Genomes/data/reference/Herato2/Herato2.fasta -I bam/h_erato_himera_hybrid_RG.bam -I bam/h_erato_pe_aln_sorted.bam -I bam/h_himera_pe_aln_sorted.bam -o snps.raw.vcf -stand_emit_conf 10 -stand_call_conf 50 INFO 16:37:28,487 HelpFormatter - Executing as nedelman@holyitc07.rc.fas.harvard.edu on Linux 2.6.32-431.17.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14. INFO 16:37:28,488 HelpFormatter - Date/Time: 2016/01/06 16:37:28 INFO 16:37:28,488 HelpFormatter - --------------------------------------------------------------------------------- INFO 16:37:28,488 HelpFormatter - --------------------------------------------------------------------------------- INFO 16:37:29,264 GenomeAnalysisEngine - Strictness is SILENT INFO 16:37:29,525 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 16:37:29,535 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 16:37:29,719 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.18 INFO 16:37:30,310 GenomeAnalysisEngine - Preparing for traversal over 3 BAM files INFO 16:37:30,360 GenomeAnalysisEngine - Done preparing for traversal INFO 16:37:30,360 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 16:37:30,361 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 16:37:30,361 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 16:37:30,394 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 16:37:30,394 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 16:38:00,381 ProgressMeter - Herato_chr11_8:315097 430691.0 30.0 s 69.0 s 0.1% 7.3 h 7.3 h INFO 16:38:07,739 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

Thanks, Nate

Created 2016-01-07 06:21:51 | Updated | Tags: unifiedgenotyper java gatk

Hi ,

I am using the UnifiedGenotyper command but got the following error. Can some one put a light what it could mean!!

INFO 17:17:23,650 HelpFormatter - ----------------------------------------------------------------------------------- INFO 17:17:23,653 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.0-6121-g8a78414, Compiled 2011/07/11 16:05:10 INFO 17:17:23,654 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:17:23,654 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki INFO 17:17:23,655 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa INFO 17:17:23,655 HelpFormatter - Program Args: -T UnifiedGenotyper -R /projects/u71/nandan/Sven_Uthicke_populationGenomics/analysis/SNP_SFG/seaurchin_topHit_Ids.fasta -I merged_realigned_woUnsafe_1.bam -stand_call_conf 20.0 -stand_emit_conf 20.0 -o raw_snps_indels_Q20_all.vcf -nt 12
INFO 17:17:23,655 HelpFormatter - Date/Time: 2016/01/07 17:17:23 INFO 17:17:23,656 HelpFormatter - ----------------------------------------------------------------------------------- INFO 17:17:23,656 HelpFormatter - ----------------------------------------------------------------------------------- INFO 17:17:23,699 GenomeAnalysisEngine - Strictness is SILENT INFO 17:17:26,575 MicroScheduler - Running the GATK in parallel mode with 12 concurrent threads WARN 17:17:29,044 RestStorageService - Error Response: PUT '/GATK_Run_Reports/Ar4GP2YBVmBmWH6kz3hZjIAP6ikSNlHP.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 1993, Content-MD5: ykLRiVk7ELWpIyvrZz9dyg==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: ca42d189593b10b5a9232beb673f5dca, Date: Thu, 07 Jan 2016 06:17:27 GMT, Authorization: AWS AKIAJXU7VIHBPDW4TDSQ:Ul22+ARecieW4VAUerP/AzNX3qQ=, User-Agent: JetS3t/0.8.0 (Linux/3.0.76-0.11-default; amd64; en; JVM 1.6.0_30), Host: s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 6FAAE2E0B5A437DF, x-amz-id-2: oJ9rYHwvwv7LZ2xpsni+V3YQF+6emAiSko8wGj5d09v7CX9nG7E+Duz0KEXFKj2u7jJNNFPOJ4Y=, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Thu, 07 Jan 2016 06:17:28 GMT, Connection: close, Server: AmazonS3]

##### ERROR ------------------------------------------------------------------------------------------

Regards,

Nandan

Created 2015-12-22 14:29:56 | Updated | Tags: unifiedgenotyper haplotypecaller trio pedigree

Greetings.

I've been exploring de novo mutation identification in the context of a pedigree of trios. I've run the UnfiedGenotyper (UG) given all the bam files for ~25 sets of trios and it appears to identify a set of de novo mutations. When I run the HaplotypeCaller (HC) pipeline, first generating gVCF files for each individual, and then using the merged gVCF files along with the pedigree for genotype refinement and de novo mutation calling, it also finds a number of de novo mutations annotated as hi-confidence de novo mutations. When I compare the UG de novo mutations to the high confidence HC list, there's very little overlap. Many of the UG hi-confidence de novo variants are called by HC, but listed as low-confidence de novo variants, and from looking at a few examples, it would appear that the HC calls have assigned lower genotype confidence levels for the parental (non-mutated, reference) genotypes. Could it be that because the gVCF files aren't storing position-specific information for the reference (non-mutated) positions in the genome, the pedigree-type de novo mutation calling is not as accurate as it could be? Should I be generating gVCFs that include position-specific information?

Many thanks for any insights. If it would help, I could post some examples.

Created 2015-12-17 19:38:46 | Updated | Tags: indelrealigner unifiedgenotyper haplotypecaller gatk

Does anyone know of an effective way to determine haplotypes or phasing data for SNPs and STRs? I understand that STRs are inherently difficult for aligners; however, I'm trying to determine haplotypes for a large number of STRs (including the flanking region information...SNPs) on a large number of samples. So, manual verification is not really an option. We've developed an in-house perl script that calls STRs accurately; however, it currently does not include flanking region information.

Any help is greatly appreciated.

Created 2015-12-02 23:31:39 | Updated | Tags: unifiedgenotyper ploidy gatk variant-calling

I am using the Gatk v3.1 UnifiedGenotyper for variant calling of some GBS type data in the plant genus Boechera. I have diploids as well as triploids, where I have the ploidy from a prior microsat run. The diploids work out well (I get some 150k variable loci), but I can't seem to get variants for the triploids.

Here's my initial call for the triploids:

Program Args: -T UnifiedGenotyper -R /home/A01768064/assembly/Bstricta_278_v1.fa -I mytripbam.list -o boechera_triploids.vcf -nt 22 -glm SNP -hets 0.001 -mbq 20 -ploidy 3 -pnrm EXACT_GENERAL_PLOIDY -stand_call_conf 50 -maxAltAlleles 2

I was able to call the triploids with the diploid option, but - obviously - this is not what I want. I additionally ran the following relaxed stringency options.

hets 0.01 mbq 10 hets 0.1 mbq 10

None of them return any variants.

The read group for one example file:

@RG ID:CR1308 LB:CR1308 SM:CR1308 PL:ILLUMINA (with unique values for each sample; note that in the output vcf, the samples are appearing in the header line)

Additionally, I attached the slurm log file for the last mentioned run.

I am at a complete loss, what am I missing here?

Created 2015-11-06 07:09:54 | Updated | Tags: unifiedgenotyper

We have a pooling of 64 samples and the organism is diploid, we are using ploidy of 128 (64x2=128). Whether we can use such a high ploidy with UnifiedGenotyper. The command line which we have used is below

java -jar -Xmx8g /home/prateek/Downloads/GenomeAnalysisTK.jar -T UnifiedGenotyper -R tilling.fa -I d12_S44.sorted.bam -o d12_S44.vcf -stand_call_conf 30.0 -stand_emit_conf 10.0 -glm BOTH -ploidy 128

whether we are using right command or what is the other alternative we can use.

What is maximum ploidy or pooling level we can use with UnifiedGenotyper.

Prateek

Created 2015-10-26 16:40:33 | Updated | Tags: unifiedgenotyper

While using UnifiedGenotyper for variant calling, i am getting the following output. In altered base sometimes instead of single altered base, i am getting three or two different bases. I am unable to understand what does it means. How can i infer the result from this. plz help, i am new to this. Thanks in advance

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

Created 2015-10-21 08:11:48 | Updated | Tags: unifiedgenotyper

We have sequencing information for a whole family (father, mother, 3 kids) and analyzed them with the gatk (more info below). And while looking through the results I found a SNP with these predicted genotypes and read frequencies: kid 1: 0/0, 168 vs. 26 reads mother: 0/0, 211 vs. 34 reads father: 0/0, 220 vs 18 reads kid 2: 1/1, 35 vs. 54 reads kid 3: 1/1, 20 vs. 42 reads

First it caught my eye because of the impossible heredity but than I also noticed the, at least to me, strange genotype calls for kid 2 and 3. Can you explain the effects that lead to such results?

Thank you Martin

exact vcf entry: 4 71347185 . T C 332.95 . AC=4;AF=0.400;AN=10;BaseQRankSum=3.959;DP=828;Dels=0.00;FS=0.000;HaplotypeScore=11.5830;MLEAC=4;MLEAF=0.400;MQ=29.13;MQ0=153;MQRankSum=-6.418;QD=2.20;ReadPosRankSum=-10.775;SOR=0.708 GT:AD:DP:GQ:PL 0/0:168,26:194:99:0,157,2194 0/0:211,34:245:93:0,93,2661 0/0:220,18:238:99:0,101,2422 1/1:35,54:89:15:131,15,0 1/1:20,42:62:27:250,27,0

pipeline (not all under my control):

• mapper is unknown, probably bwa
• Picard MarkDuplicates (Version 1.128)
• GATK IndelRealigner (Version unknown)
• GATK UnifiedGenotyper (Version 3.4-46), I know it is deprecated but we had to use it for a reason

Created 2015-10-20 17:37:29 | Updated | Tags: unifiedgenotyper baserecalibrator haplotype-caller bootstrap

Hello,

I have asked a question about this project before, but the problem in this post is less easy to define than the problem in my first question.

I am using GATK (GenomeAnalysisTK-3.3-0) to compare levels of heterozygosity between three separate diploid lizards.

I have around 18x coverage per animal, however that data is combined from three separate sequencing runs.

One of the three samples is suspected to be highly homozygous, while the other two should have much higher levels of heterozygosity.

The DNA from the homozygous animal was used to generate the genome assembly being used as the reference genome for all three animals.

To assess levels of heterozygosity, I had planned to use the bootstrapping method with BQSR and HaplotypeCaller to train for a set of known variants. Next perform the final round of recalibration with the set of known variants and variant calling in GVCF mode for each sample. Then I would be able to perform joint genotyping and only look at sites with sufficient data to make calls across all three samples (i.e. not "./." for all three samples). This way there would be no biased towards SNP detection for the animal used to make the reference.

I ran into my first difficulty when training for known variants with HaplotypeCaller, and I mention it here incase it may provide insight to a more experienced GATK user. After my second round of BSQR I was no longer detecting SNPs for either of my heterozygous animals, and very few for the homozygous animal. I have attached my training commands to this post.

If I examine the number total number of passing SNPs per each round: allSNPs0.recode.vcf 6694578 allSNPs1.recode.vcf 10660 allSNPs2.recode.vcf 107

Then look at the number of SNPs for each sample: Atig001.filtSNP.vcf <- heterozygous 4130988 Atig001.bsqr0.filtSNP.vcf 0 Atig001.bsqr1.filtSNP.vcf 0

Atig003.filtSNP.vcf <- heterozygous 4959403 Atig003.bsqr0.filtSNP.vcf 3635 Atig003.bsqr1.filtSNP.vcf 0

A_tigris8450.filtSNP.vcf <- homozygous animal, used for reference genome 272215 Atigris8450.bsqr0.filtSNP.vcf 8822 Atigris8450.bsqr1.filtSNP.vcf 125

I examined the recalibrated bam files and found that the average phred score had dropped below 25 for the two heterozygous animals across all sites. Instead of further trouble shooting the HaplotypeCaller I instead decided to try training with UnifiedGenotyper. I will attach the BSQR plot for Atig001.

Unsure of what to do next I decided to try and train for a set of known variants with UnifiedGenotyper. This seemed to work, I continued to detect SNPs after each round of recalibration. However I misunderstood the documentation, and expected the number of SNPs to converge, so I ended up running more recalibration steps than needed (my last question)...

Here are some of the numbers for UnifiedGenotyper training:

Atig001.filtSNP.vcf 5546061 Atig001.bsqr_0.filtSNP.vcf 950172 Atig001.bsqr_1.filtSNP.vcf 523390 Atig001.bsqr_2.filtSNP.vcf 574732 Atig001.bsqr_3.filtSNP.vcf 588325 Atig001.bsqr_4.filtSNP.vcf 590560 Atig001.bsqr_5.filtSNP.vcf 590773

Atig003.filtSNP.vcf 5757825 Atig003.bsqr_0.filtSNP.vcf 1489475 Atig003.bsqr_1.filtSNP.vcf 555349 Atig003.bsqr_2.filtSNP.vcf 600669 Atig003.bsqr_3.filtSNP.vcf 613971 Atig003.bsqr_4.filtSNP.vcf 616467 Atig003.bsqr_5.filtSNP.vcf 615607

A_tigris8450.filtSNP.vcf 355944 A_tigris8450.bsqr_0.filtSNP.vcf 161685 A_tigris8450.bsqr_1.filtSNP.vcf 92969 A_tigris8450.bsqr_2.filtSNP.vcf 100055 A_tigris8450.bsqr_3.filtSNP.vcf 105300 A_tigris8450.bsqr_4.filtSNP.vcf 110698 A_tigris8450.bsqr_5.filtSNP.vcf 104465

Anyone have any ideas as to why HaplotypeCaller is showing such different results? I have proceeded to final variant calling using my set of known variants from UG training, but I actually have another question to post on that as well.

Created 2015-10-06 15:30:10 | Updated | Tags: unifiedgenotyper non-human heterozygosity bootstrap

Hello,

I am currently using GATK (GenomeAnalysisTK-3.3-0) to assess the level of heterozygosity in individual lizards (diploid). There is no known set SNPs or INDELs, so I have been using the bootstrapping protocol recommended in the best practices. Based on the recalibration plots, it appears that the before and after quality scores have converged, however the total number of filtered SNPs continues to change for each iteration of recalibration.

Should we expect the number of filtered SNPs to stop changing as we continue to bootstrap? Or should I be concerned about my dataset...

Some basic information: We are using the unified genotyper for variant calling. We have now completed 10 rounds of recalibration. We are counting the number of SNPs that pass filtering

If I count the number of SNPs in unix: for file in *.recode.vcf; do echo $file; cat$file | grep -v "#" | wc -l; done

all_SNPs_4.recode.vcf 922016 all_SNPs_5.recode.vcf 929913 all_SNPs_6.recode.vcf 923369 all_SNPs_7.recode.vcf 922344 all_SNPs_8.recode.vcf 923250 all_SNPs_9.recode.vcf 924366

I can provide further information if it would be helpful.

Thanks, Duncan Tormey dut@stowers.org

Created 2015-09-15 01:47:02 | Updated | Tags: unifiedgenotyper haplotypecaller

I've read through many of your posts/responses regarding HaplotypeCaller not calling variants, and tried many of the suggestions you've made to others, but I'm still missing variants. My situation is a little different (I'm trying to identify variants from Sanger sequence reads) but I'm hoping you might have additional ideas or can see something I've overlooked. I hope I haven't given you too much information below, but I've seen it mentioned that too much info is better than not enough.

A while back, I generated a variant call set from Illumina Next Gen Sequencing data using UnifiedGenotyper (circa v2.7.4), identifying ~46,000 discordant variants between the genomes of two haploid strains of S. cerevisiae. Our subsequent experiments included Sanger sequencing ~95 kb of DNA across 17 different loci in these two strains. I don't think any of the SNP calls were false positives, and there were very, very few were false negatives.

Since then, we've constructed many strains by swapping variants at these loci between these two strains of yeast. To check if a strain was constructed successfully, we PCR the loci of interest, and Sanger sequencing the PCR product. I'm trying to use GATK (version 3.4-46) HaplotypeCaller (preferably, or alternatively UnifiedGenotyper) in a variant detection pipeline to confirm a properly constructed strain. I convert the .ab1 files to fastqs using EMBOSS seqret, map the Sanger reads using bwa mem ($bwa mem -R$RG $refFasta$i > ${outDir}/samFiles/${fileBaseName}.sam), merge the sam files for each individual, and then perform the variant calling separately for each individual. I do not dedup (I actually intentionally leave out the -M flag in bwa), nor do I realign around indels (I plan to in the future, but there aren't any indels in any of the regions we are currently looking at), or do any BQSR in this pipeline. Also, when I do the genotyping after HaplotypCaller, I don't do joint genotyping, each sample (individual) gets genotyped individually.

In general, this pipeline does identify many variants from the Sanger reads, but I'm still missing many variant calls that I can clearly see in the Sanger reads. Using a test set of 36 individuals, I examined the variant calls made from 364 Sanger reads that cover a total of 63 known variant sites across three ~5kb loci (40 SNPs in locus 08a-s02, 9 SNPs in locus 10a-s01, 14 SNPs in locus 12c-s02). Below are some example calls to HaplotypeCaller and UnifiedGenotyper, as well as a brief summary statement of general performance using the given command. I've also included some screenshots from IGV showing the alignments (original bam files and bamOut files) and SNP calls from the different commands.

Ideally, I'd like to use the HaplotypeCaller since not only can it give me a variant call with a confidence value, but it can also give me a reference call with a confidence value. And furthermore, I'd like to stay in DISCOVERY mode as opposed to Genotype Given Alleles, that way I can also assess whether any experimental manipulations we've performed might have possibly introduced new mutations.

Again, I'm hoping someone can advice me on how to make adjustments to reduce the number of missed calls.

Call 1: The first call to HaplotypeCaller I'm showing produced the least amount of variant calls at sites where I've checked the Sanger reads.

java -Xmx4g -jar $gatkJar \ -R$refFasta \
-T HaplotypeCaller \
-I $inBam \ -ploidy 1 \ -nct 1 \ -bamout${inBam%.bam}_hapcallRealigned.bam \
-forceActive \
-disableOptimizations \
-dontTrimActiveRegions \
--genotyping_mode DISCOVERY \
--emitRefConfidence BP_RESOLUTION \
--intervals $outDir/tmp.intervals.bed \ --min_base_quality_score 5 \ --standard_min_confidence_threshold_for_calling 0 \ --standard_min_confidence_threshold_for_emitting 0 \ -A VariantType \ -A SampleList \ -A AlleleBalance \ -A BaseCounts \ -A AlleleBalanceBySample \ -o$outDir/vcfFiles/${fileBaseName}_hc_bp_raw.g.vcf Call 2: I tried a number of different -kmerSize values [(-kmerSize 10 -kmerSize 25), (-kmerSize 9), (-kmerSize 10), (-kmerSize 12), (-kmerSize 19), (-kmerSize 12 -kmerSize 19), (maybe some others). I seemed to have the best luck when using -kmerSize 12 only; I picked up a few more SNPs (where I expected them), and only lost one SNP call as compared Call 1. java -Xmx4g -jar$gatkJar \
-R $refFasta \ -T HaplotypeCaller \ -I$inBam \
-ploidy 1 \
-nct 1 \
-bamout {inBam%.bam}_kmer_hapcallRealigned.bam \ -forceActive \ -disableOptimizations \ -dontTrimActiveRegions \ --genotyping_mode DISCOVERY \ --emitRefConfidence BP_RESOLUTION \ --interval_padding 500 \ --intervalsoutDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
-kmerSize 12 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_hc_bp_kmer_raw.g.vcf

Call 3: I tried adjusting --minPruning 1 and --minDanglingBranchLength 1, which helped more than playing with kmerSize. I picked up many more SNPs compared to both Call 1 and Call 2 (but not necessarily the same SNPs I gained in Call 2).

java -Xmx4g -jar $gatkJar \ -R$refFasta \
-T HaplotypeCaller \
-I $inBam \ -ploidy 1 \ -nct 1 \ -bamout${inBam%.bam}_adv_hapcallRealigned.bam \
-forceActive \
-disableOptimizations \
-dontTrimActiveRegions \
--genotyping_mode DISCOVERY \
--emitRefConfidence BP_RESOLUTION \
--intervals $outDir/tmp.intervals.bed \ --min_base_quality_score 5 \ --standard_min_confidence_threshold_for_calling 0 \ --standard_min_confidence_threshold_for_emitting 0 \ --minPruning 1 \ --minDanglingBranchLength 1 \ -A VariantType \ -A SampleList \ -A AlleleBalance \ -A BaseCounts \ -A AlleleBalanceBySample \ -o$outDir/vcfFiles/${fileBaseName}_hc_bp_adv_raw.g.vcf Call 4: I then tried adding both --minPruning 1 --minDanglingBranchLength 1 and -kmerSize 12 all at once, and I threw in a --min_mapping_quality_score 5. I maybe did slightly better... than in Calls 1-4. I did actually lose 1 SNP compared to Calls 1-4, but I got most of the additional SNPs I got from using Call 3, as well as some of the SNPs I got from using Call 2. java -Xmx4g -jar$gatkJar \
-R $refFasta \ -T HaplotypeCaller \ -I$inBam \
-ploidy 1 \
-nct 1 \
-bamout ${inBam%.bam}_hailMary_raw.bam \ -forceActive \ -disableOptimizations \ -dontTrimActiveRegions \ --genotyping_mode DISCOVERY \ --emitRefConfidence BP_RESOLUTION \ --interval_padding 500 \ --intervals$outDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--min_mapping_quality_score 10 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
--minPruning 1 \
--minDanglingBranchLength 1 \
-kmerSize 12 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_hailMary_raw.g.vcf

Call 5: As I mentioned above, I've experience better performance (or at least I've done a better job executing) with UnifiedGenotyper. I actually get the most SNPs called at the known SNP sites, in individuals where manual examination confirms a SNP.

java -Xmx4g -jar $gatkJar \ -R$refFasta \
-T UnifiedGenotyper \
-I $inBam \ -ploidy 1 \ --output_mode EMIT_ALL_SITES \ -glm BOTH \ -dt NONE -dcov 0 \ -nt 4 \ -nct 1 \ --intervals$outDir/tmp.intervals.bed \
--min_base_quality_score 5 \
--standard_min_confidence_threshold_for_calling 0 \
--standard_min_confidence_threshold_for_emitting 0 \
-minIndelCnt 1 \
-A VariantType \
-A SampleList \
-A AlleleBalance \
-A BaseCounts \
-A AlleleBalanceBySample \
-o $outDir/vcfFiles/${fileBaseName}_ug_emitAll_raw.vcf

I hope you're still with me :)

None of the above commands are calling all of the SNPs that I (maybe naively) would expect them to. "Examples 1-3" in the first attached screenshot are three individuals with reads (two reads each) showing the alternate allele. The map quality scores for each read are 60, and the base quality scores at this position for individual #11 are 36 and 38, and for the other individuals, the base quality scores are between 48-61. The reads are very clean upstream of this position, the next upstream SNP is about ~80bp away, and the downstream SNP at the position marked for "Examples 4-6" is ~160 bp away. Commands 1 and 2 do not elicit a SNP call for Examples 1-6, Command 3 get the calls at both positions for individual 10, Command 4 for gets the both calls for individuals 10 and the upstream SNP for individual 11. Command 5 (UnifiedGenotyper) gets the alt allele called in all 3 individuals at the upstream position, and the alt allele called for individuals 10 and 12 at the downstream position. Note that in individual 11, there is only one read covering the downstream variant position, where UnifiedGenotyper missed the call.

Here is the vcf output for those two positions from each command. Note that there are more samples in the per-sample breakdown for the FORMAT tags. The last three groups of FORMAT tags correspond to the three individuals I've shown in the screenshots.

Command 1 output

Examples 1-3    649036  .   G   .   .   .   AN=11;DP=22;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ    .   .   .   .   .   .   .   .   .   .   .:0:0:0 0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:84    0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:89    0:0:2:0 0:0:2:0 0:0:2:0
Examples 4-6    649160  .   C   .   .   .   AN=11;DP=21;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ    .   .   .   .   .   .   .   .   .   .   .:0:0:0 0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:0 0:0:2:0 0:2:2:71    0:0:2:0 0:2:2:44    0:0:2:0 0:0:1:0 0:0:2:0

Command 2 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hc_bp_kmer_raw.vcf  GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf   GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .       .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    1:0,0,1:.:37:37,0   0:2:2:.:.:44    1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0

Command 3 output

Examples 1-3    649036  .   G   A   36.01   .   ABHom=1.00;AC=3;AF=0.273;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_adv_raw.vcf    GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:66:66,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    0:0:2:.:.:0 0:2:2:.:.:89    0:0:2:.:.:0 0:0:2:.:.:0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   ABHom=1.00;AC=1;AF=0.091;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf    GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    0:0:2:.:.:0 0:2:2:.:.:44    0:0:2:.:.:0 0:0:1:.:.:0 0:0:2:.:.:0

Command 4 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hailMary_raw.vcf  GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    1:0,0,1:.:37:37,0   0:2:2:.:.:44    1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0

Command 5 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=7;AF=0.636;AN=11;DP=22;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=2.303;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. 1:0,2:2:56:56,0 0:.:2   1:0,2:2:99:117,0    0:.:2   1:0,2:2:99:122,0    0:.:2   1:0,2:2:67:67,0 0:.:2   1:0,2:2:99:110,0    1:0,2:2:84:84,0 1:0,2:2:99:127,0
Examples 4-6    649160  .   C   A   46  .   ABHom=1.00;AC=5;AF=0.455;AN=11;DP=21;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. 0:.:2   0:.:2   1:0,2:2:76:76,0 0:.:2   1:0,2:2:70:70,0 0:.:2   1:0,1:2:37:37,0 0:.:2   1:0,2:2:60:60,0 0:.:1   1:0,2:2:75:75,0

There are many more examples of missed SNP calls. When using the HaplotypeCaller, I'm missing ~23% of the SNP calls. So...what can I do to tweak my variant detection pipeline so that I don't miss so many SNP calls?

As I mentioned, I'm currently getting better results with the UnifiedGenotyper walker. I'm only missing about 2% of all Alt SNP calls. Also, about half of that 2% are improperly being genotyped as Ref by Command #5. It appears to me that most of the variant calls I'm missing using the UnifiedGenotyper are at positions where I only have a single Sanger read covering the base, and the base quality score starts to fall below 25 (such as in individual #11 in the first attached screen shot, base quality score was 20). Attached is a second IGV screenshot of a different locus where I've also missed SNP calls using Command 5 (Examples 7-9). I've also included the read details for those positions, as well as the VCF file output from Command 5. I have seen at least one instance where I had two Sanger reads reporting an alternate allele, however, UG did not call the variant. In that case though, the base quality scores in both reads were very low (8); mapping quality was 60 for both reads.

Does anyone have any suggestions as to how I might alter any of the parameters to reduce (hopefully eliminate) the missed SNP calls. I think I would accept false positives over false negatives in this case. Or does anyone have any other idea as to what my problem might be?

Thanks so much! Matt Maurer

Command 5 output for second screen shot file:

VCF output The samples shown in the second attached screen shot correspond to the 11th and 12th groupings in the per-sample breakdown of the FORMAT tags.

Examples 7-8    163422  .   G   C   173 .   ABHom=1.00;AC=2;AF=0.182;AN=11;DP=14;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf GT:AD:DP:GQ:PL  0:.:1   1:0,4:4:99:203,0    0:.:1   0:.:1   0:.:1   0:.:1   0:.:1   1:0,1:1:54:54,0 0:.:1   ./. 0:.:1   0:.:1   ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
Example 9   163476  .   A   G   173 .   ABHom=1.00;AC=2;AF=0.167;AN=12;DP=15;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  0:.:1   1:0,4:4:99:203,0    0:.:1   0:.:1   0:.:1   0:.:1   0:.:1   1:0,1:1:57:57,0 0:.:1   0:.:1   0:.:1   0:.:1   .   .   .   .   .   .   .   .   .   .   .

Also, why are the GT's sometimes "./." as they are for site163422, and sometimes "." as they are for site 163476?

Example#7

Read name = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
Sample = qHZT-08a-s02_r2657_p4094_dJ-011
----------------------
Location = 163,422
Alignment start = 163,293 (+)
Cigar = 34S833M1D72M1I50M1I9M
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = C
Base phred quality = 23
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
NM = 20
AS = 858
XS = 0
-------------------

Example 8

Read name = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
Sample = qHZT-08a-s02_r2657_p4094_dJ-011
----------------------
Location = 163,476
Alignment start = 163,293 (+)
Cigar = 34S833M1D72M1I50M1I9M
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = G
Base phred quality = 15
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
NM = 20
AS = 858
XS = 0
-------------------

Example #9

Read name = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12
Sample = qHZT-08a-s02_r2657_p4094_dJ-012
----------------------
Location = 163,422
Alignment start = 163,329 (+)
Cigar = 67S16M1D181M1D634M1D9M1I8M1I62M1D17M4S
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = C
Base phred quality = 18
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12
NM = 87
AS = 480
XS = 0
-------------------

Created 2015-08-24 09:35:24 | Updated | Tags: unifiedgenotyper vqsr bqsr haplotypecaller

I'm a bit uncertain as to the optimal pipeline for calling variants. I've sequenced a population sample of ~200 at high coverage ~30X, with no prior information on nucleotide variation.

The most rigorous pipeline would seem to be:

1. Call variants with UG on 'raw' (realigned) bams.
2. Extract out high-confidence variants (high QUAL, high DP, not near indels or repeats, high MAF)
3. Perform BQSR using the high-confidence variants.
4. Call variants with HaplotypeCaller on recalibrated bams.
5. Perform VQSR using high-confidence variants.
6. Any other hard filters.

Is this excessive? Does using HaplotypeCaller negate the use of *QSR? Is it worthwhile performing VQSR if BQSR hasn't been done? Otherwise I'm just running HaplotyperCaller on un-recalibrated bams, and then hard-filtering.

Created 2015-08-18 11:55:19 | Updated 2015-08-18 11:57:08 | Tags: unifiedgenotyper vcf lowqual emit-all-confident-sites emit-all-sites

Hi,

I've run UnifiedGenotyper on a BAM file with both EMIT_ALL_SITES and EMIT_ALL_CONFIDENT_SITES. I've noticed some of the calls that have been omitted with the EMIT_ALL_SITES option seem to be of comparable quality to others that have been emitted. For example, sites like HE670865 368605 are removed as non-confident sites while the site 368591 has not been.

I understand why the positions flagged as "LowQual" are not present when using EMIT_ALL_CONFIDENT_SITES. But I can't see why other positions (such as HE670865 368605 and HE670865 368600) are not being emitted. Of particular importance is the fact that some of these seemingly 'good sites' that have been removed are SNPs that might be of biological importance.

These are the two commands used together with some relevant lines from the resulting vcfs:

java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -nt 4 -out_mode EMIT_ALL_SITES -baq CALCULATE_AS_NECESSARY -hets 0.015 -R genome_ref.fasta -o output_all.vcf -I input_divergentscaffs.realign.bam &> divergentscaffs_out.GATKlog &

java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -nt 4 -out_mode EMIT_ALL_CONFIDENT_SITES -baq CALCULATE_AS_NECESSARY -hets 0.015 -R genome_ref.fasta -o output_confident.vcf -I input_divergentscaffs.realign.bam &> divergentscaffs_confident.GATKlog &

These are a sample of lines from the outputted VCF file when using EMIT_ALL_SITES:

These are the comparable lines from the outputted VCF when using EMIT_ALL_CONFIDENT_SITES:

HE670865 368589 . A . 46.89 . AN=16;DP=441;MQ=43.26;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:58 0/0:38 0/0:75 0/0:64 0/0:64 0/0:47 HE670865 368590 . A . 50.15 . AN=16;DP=447;MQ=43.41;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:60 0/0:39 0/0:76 0/0:66 0/0:65 0/0:47 HE670865 368591 . A . 39.12 . AN=16;DP=448;MQ=43.32;MQ0=12 GT:DP 0/0:36 0/0:45 0/0:61 0/0:39 0/0:77 0/0:66 0/0:66 0/0:46 HE670865 368648 . G . 51.60 . AN=16;DP=366;MQ=40.87;MQ0=12 GT:DP 0/0:34 0/0:40 0/0:57 0/0:37 0/0:59 0/0:44 0/0:53 0/0:42 HE670865 368649 . T . 75.72 . AN=16;DP=368;MQ=40.89;MQ0=12 GT:DP 0/0:34 0/0:41 0/0:57 0/0:38 0/0:59 0/0:44 0/0:53 0/0:42 HE670865 368650 . T . 72.74 . AN=16;DP=369;MQ=40.87;MQ0=12 GT:DP 0/0:34 0/0:41 0/0:57 0/0:38 0/0:59 0/0:44 0/0:53 0/0:43 HE670865 368651 . T . 87.71 . AN=16;DP=369;MQ=40.88;MQ0=12 GT:DP 0/0:33 0/0:42 0/0:58 0/0:37 0/0:59 0/0:44 0/0:53 0/0:43 HE670865 368652 . T . 93.72 . AN=16;DP=373;MQ=40.89;MQ0=12 GT:DP 0/0:32 0/0:43 0/0:59 0/0:37 0/0:59 0/0:45 0/0:54 0/0:43

Any help or insight would be greatly appreciated.

Created 2015-07-23 16:05:27 | Updated | Tags: realignertargetcreator unifiedgenotyper

Hello,

I used bwa to map my samples to a mitochondrial genome of a non-model organism. Afterwards I careated a merged .bam file from multiple (288) sample .bams (used samtools merge and re-assigned RG tags), but when I run UnifiedGenotyper on that file it gets stuck at 32.1% and never moves forward from there. I also wanted to run RealignerTargetCreator, but I always get a truncated realigned.bam file. Any suggestions for how to troubleshoot this?

Thanks you.

Created 2015-07-22 12:08:20 | Updated 2015-07-22 12:09:10 | Tags: unifiedgenotyper haplotypecaller snp-calling input-prior

I'm using HaplotypeCaller (but it could be also possible to use this option with UnifiedGenotyper) for a very special experimental design in a no-human species, where we have an expectation for the prior probabilities of each genotype. I'm planning to call SNPs for single diploid individuals using HaplotypeCaller and afterwards for the whole dataset with GenotypeGVCFs.

Nevertheless, I'm confused about the structure of the prior probabilities command line. In the documentation, it says: "Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one. f) If user-defined values add to more than one, an error will be produced". So I'll require to provide two prior probabilities out of the 3 for each genotype (0/0, 0/1 and 1/1). My first guess is that the prior that I don't need to provide is for the reference homozygous (0/0) based on the Pr(AC=0) specified in the documentation. I would like to know if this idea is correct.

My second problem if is the two input_prior options are positional parameters. If so, and if my first guess for the Pr(AC=0) is correct, do they represent the probability of 0/1 and 1/1, that is, Pr(AC=1) and Pr(AC=2)?

More concretely, I'm going to provide an example where you don't expect any heterozygous call. In that case, is it correct that the argument will be _--input_prior 0.5 --inputprior 0?

Thank you very much.

Created 2015-07-21 07:49:36 | Updated | Tags: unifiedgenotyper pooled-calls

Hello,

I am using the UnifiedGenotyper for pooled samples and would like to set the minimum count (fraction) of reads supporting an alternate allele in order to distinguish low frequency alleles from sequencing errors. Is there a way to do it? I couldn't find a means to do that or the default threshold in the tool's documentation . Thank you for your support.

Created 2015-07-16 20:23:04 | Updated | Tags: unifiedgenotyper vcf

Hello!

I am writing to seek your help on a curious result I received from running the Unified Genotyper. I ran Unified Genotyper to obtain all confident sites (both invariant and variant) on a suite of different bam files. For certain regions of the genome that overlap certain bam file reads, it appears that it did not output invariant sites, only variant sites. I do not know if this is due to differences among bam files, my call to GATK, or is a glitch.

Here is a little more background.

• First, I am using Unified Genotyper to maintain continuity with some analyses I did over a year go.
• My data:
• I have 92 low coverage bam files from samples produced from illumina sequencing a RADtag type enzyme digestion library. The bams contain single-end reads, each 44-bp long. I will refer to these reads as ‘RADtags’.
• I also have 18 high coverage bams produced from illumina sequencing whole genome paired-end libraries, with read lengths varying from 36 to 100. I will refer to these as WGS samples.
• Previously I called SNPs using Unified Genotyper with mode set to EMIT_VARIANTS_ONLY, on all this data at once, to increase the chances of identifying quality SNPs in the low coverage RADtag data and to get genotypes from the high coverage WGS data at the same sites. The data seemed fine.

• Current issue:
• I ran the Unified Genotyper with the mode set to EMIT_ALL_CONFIDENT_SITES for all the same data. The vcf includes both invariant and variant calls. However, it seems I am not getting any invariant sites from regions that overlap the RADtag reads. It seems the program is not emitting invariant sites within RADtags, but is reporting invariant sites outside of RADtags and only for WGS samples.

I have attached a subset of my vcf file, and a snapshot image from viewing these samples in igv. In the igv view, the middle samples are the RADtag samples, the higher coverage samples at the top and bottom are the WGS samples. As you can see, there should be plenty of invariant sites in the regions overlapping the RADTAGs and there should be calls that include these data regions and samples.

Can you think of a reason that could be causing this? I have pasted my call to GATK below.

Thank you! Amanda

java -jar /usr/local/gatk/3.0-0/GenomeAnalysisTK.jar \ -R mg.new.ref.fa \ -T UnifiedGenotyper \ -I AHQT1G_q29.paired.nodup.realign.bam -I BOG10G_q29.paired.nodup.realign.bam -I CAC6G_q29.paired.nodup.realign.bam \ -I CAC9N_q29.paired.nodup.realign.bam -I DUNG_q29.paired.nodup.realign.bam -I IM62.JGI_q29.paired.nodup.realign.bam \ -I KOOTN_q29.paired.nodup.realign.bam -I LMC24G_q29.paired.nodup.realign.bam -I MAR3G_q29.paired.nodup.realign.bam \ -I MED84G_q29.paired.nodup.realign.bam -I MEN104_q29.paired.nodup.realign.bam -I NHN26_q29.paired.nodup.realign.bam \ -I PED5G_q29.paired.nodup.realign.bam -I REM8G_q29.paired.nodup.realign.bam -I SF5N_q29.paired.nodup.realign.bam \ -I SLP9G_q29.paired.nodup.realign.bam -I SWBG_q29.paired.nodup.realign.bam -I YJS6G_q29.paired.nodup.realign.bam \ -I CACid.277_index2.GCTCGG.sortedrealign.bam -I CACid.279b_index2.CTGATT.sortedrealign.bam -I CACid.247_index2.AATACT.sortedrealign.bam -I CACid.237_index2.CGCTGT.sortedrealign.bam \ -I CACid.240b_index2.GTCTCT.sortedrealign.bam -I CACid.272_index2.ATCTCC.sortedrealign.bam -I CACid.282b_index2.TCTGCT.sortedrealign.bam -I CACid.179_index2.TATAGT.sortedrealign.bam \ -I CACid.187_index2.CAGCAT.sortedrealign.bam -I CACid.189_index2.TAATCC.sortedrealign.bam -I CACid.192_index2.GCAGTT.sortedrealign.bam -I CACid.339_index2.CCTGAA.sortedrealign.bam \ -I CACid.235_index2.AAGCGA.sortedrealign.bam -I CACid.236_index2.AACTTA.sortedrealign.bam -I CACid.249_index2.GTTCCT.sortedrealign.bam -I CACid.370_index2.CTAATA.sortedrealign.bam \ -I CACid.372_index2.CCGAAT.sortedrealign.bam -I CACid.215_index2.CTTGTA.sortedrealign.bam -I CACid.218_index2.CCCTCG.sortedrealign.bam -I CACid.219_index2.ACTGAC.sortedrealign.bam \ -I CACid.221_index2.GTGACT.sortedrealign.bam -I CACid.225_index2.ACTAGC.sortedrealign.bam -I CACid.226_index2.CGACTA.sortedrealign.bam -I CACid.231_index2.GTGAGA.sortedrealign.bam \ -I CACid.232_index2.AATCTA.sortedrealign.bam -I CACid.233_index2.CAAGCT.sortedrealign.bam -I CACid.212_index2.CTCTCA.sortedrealign.bam -I CACid.211_index2.ACTCCT.sortedrealign.bam \ -I CACid.403_index2.GATCAA.sortedrealign.bam -I CACid.280_index2.GGCTTA.sortedrealign.bam -I CACid.283_index2.GTGGAA.sortedrealign.bam -I CACid.285_index2.AAATGA.sortedrealign.bam \ -I CACid.003_index2.GGGTAA.sortedrealign.bam -I CACid.007_index2.CGTCAA.sortedrealign.bam -I CACid.008_index2.GTTGCA.sortedrealign.bam -I CACid.055_index2.ATATAC.sortedrealign.bam \ -I CACid.056_index2.TTAGTA.sortedrealign.bam -I CACid.071_index2.TCGCTT.sortedrealign.bam -I CACid.313_index2.TCGTCT.sortedrealign.bam -I CACid.040_index2.CAATAT.sortedrealign.bam \ -I CACid.292_index2.TCATGG.sortedrealign.bam -I CACid.080_index2.CAGGAA.sortedrealign.bam -I CACid.088_index2.ATCGAC.sortedrealign.bam -I CACid.322_index2.TTGCGA.sortedrealign.bam \ -I CACid.114_index1.GCTCGG.sortedrealign.bam -I CACid.279a_index1.CTGATT.sortedrealign.bam -I CACid.181_index1.AATACT.sortedrealign.bam -I CACid.184_index1.CGCTGT.sortedrealign.bam \ -I CACid.185_index1.GTCTCT.sortedrealign.bam -I CACid.191_index1.ATCTCC.sortedrealign.bam -I CACid.194_index1.TCTGCT.sortedrealign.bam -I CACid.240a_index1.TATAGT.sortedrealign.bam \ -I CACid.238_index1.CAGCAT.sortedrealign.bam -I CACid.371_index1.TAATCC.sortedrealign.bam -I CACid.253_index1.GCAGTT.sortedrealign.bam -I CACid.260_index1.CCTGAA.sortedrealign.bam \ -I CACid.262_index1.AAGCGA.sortedrealign.bam -I CACid.257_index1.AACTTA.sortedrealign.bam -I CACid.258_index1.GTTCCT.sortedrealign.bam -I CACid.216_index1.CTAATA.sortedrealign.bam \ -I CACid.220_index1.CCGAAT.sortedrealign.bam -I CACid.223_index1.CTTGTA.sortedrealign.bam -I CACid.224_index1.CCCTCG.sortedrealign.bam -I CACid.228_index1.ACTGAC.sortedrealign.bam \ -I CACid.229_index1.GTGACT.sortedrealign.bam -I CACid.281_index1.ACTAGC.sortedrealign.bam -I CACid.282a_index1.CGACTA.sortedrealign.bam -I CACid.284_index1.GTGAGA.sortedrealign.bam \ -I CACid.286_index1.AATCTA.sortedrealign.bam -I CACid.287_index1.CAAGCT.sortedrealign.bam -I CACid.288_index1.CTCTCA.sortedrealign.bam -I CACid.072_index1.ACTCCT.sortedrealign.bam \ -I CACid.077_index1.GATCAA.sortedrealign.bam -I CACid.078_index1.GGCTTA.sortedrealign.bam -I CACid.084_index1.GTGGAA.sortedrealign.bam -I CACid.087a_index1.AAATGA.sortedrealign.bam \ -I CACid.087b_index1.GGGTAA.sortedrealign.bam -I CACid.091_index1.CGTCAA.sortedrealign.bam -I CACid.092_index1.GTTGCA.sortedrealign.bam -I CACid.103_index1.CTCACT.sortedrealign.bam \ -I CACid.095_index1.TCCATA.sortedrealign.bam -I CACid.100_index1.ACTCGA.sortedrealign.bam -I CACid.101_index1.GTTCGA.sortedrealign.bam -I CACid.106_index1.ATATAC.sortedrealign.bam \ -I CACid.108_index1.TTAGTA.sortedrealign.bam -I CACid.110_index1.TCGCTT.sortedrealign.bam -I CACid.112_index1.TCGTCT.sortedrealign.bam -I CACid.144_index1.CAATAT.sortedrealign.bam \ -I CACid.149_index1.TCATGG.sortedrealign.bam -I CACid.321_index1.CAGGAA.sortedrealign.bam -I CACid.323_index1.ATCGAC.sortedrealign.bam -I CACid.324_index1.TTGCGA.sortedrealign.bam \ -o CACid.MIM.sites.q40.6.14.15.vcf \ --metrics_file CACid.MIM.sites.q40.6.14.15.metrics.txt \ --genotype_likelihoods_model SNP \ --output_mode EMIT_ALL_CONFIDENT_SITES \ -stand_call_conf 40 \ -stand_emit_conf 40 \ -l INFO

Created 2015-05-11 13:57:37 | Updated | Tags: unifiedgenotyper depthofcoverage haplotypecaller mendelianviolations

Hi,

Two questions, which relate to Unified Genotyper or Haplocaller:

1. Detecting and determining the exact genotype of an individual seems to be inaccurate at read depths less than 10X. This is because there aren't enough reads to accurately say whether it could be heterozygous. Is there an option in either Unified Genotyper or Haplocaller, that excludes sites in individuals that have below 10X? Alternatively how can these sites be removed from the vcf - or set to missing ?

2. I'm sure I've read somewhere that pedigree information can be supplied to a variant caller (such as Unified Genotyper or Haplocaller), and used to improve the calling accuracy/speed/efficiency. I am calling variants on one parent and multiple offspring.

Apologies if these questions are answered in the user manual. I regularly look it but have not found much to answer my questions.

Cheers,

Blue

Created 2015-05-10 19:31:44 | Updated | Tags: unifiedgenotyper duplicatereadfilter

Hi All,

My understanding of DuplicateReadFilter is that is removes PCR duplicates (ie: identical reads that map to the same location in the genome). Is there a way to prevent UnifiedGenotyper from using this filter?

Many of my 'duplicate' reads are not really duplicates. I have 50bp single ended yeast RNA-seq data, and 10 samples/lane results in highly over-sampled data. Removing duplicates results in an undercounting of reads at the most highly expressed genes, and therefore an increased number of sub-clonal variants at these genes (because the reads from the major clonal population are discarded, but reads of sub-clonal populations are kept because they have a different sequence). At least I think that is what is happening.

-Lucas

Created 2015-05-06 14:48:11 | Updated | Tags: unifiedgenotyper strand-bias

Hi there, I have been struggling with the interpretation of the SOR annotation. I do get that a higher value is the sign of a strand bias and that this value is never negative.

I did read the SOR annotation documention but still cannot figure out how you calculate the SOR.

Here is one of my example where a SB is present for sure REF + strand = 2185 reads REF - strand = 5 reads ALT + strand = 7 reads ALT - strand = 2370 reads.

When I calculate "R" as indicated in the documentation, I obtain a very high value of 147 955. But for this variant in the VCF file, SOR = 11.382

Thanks Manon

Created 2015-05-06 09:14:36 | Updated | Tags: unifiedgenotyper low-coverage

Is it possible to call variants using unifiedgenotyper from data with less than 1 X coverage per sample? We have a total of 25 samples and each has a coverage of < 1 X. does changing dcov parameter to 50 or lower work?

Created 2015-04-30 12:15:29 | Updated | Tags: unifiedgenotyper multi-sample

Dear GATK,

I've got an issue using the UnifiedGenotyper, it does not output all possible variants sharing the same start coordinate.

My command line: GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ReferenceGenome.fasta -I list_of_Samples.list -ploidy 1 -glm BOTH --heterozygosity 0.001 --indel_heterozygosity 0.005 -gt_mode DISCOVERY -dcov 1000 -o output_raw_variants.vcf

Using "Batch 1" , about 50 Plasmodium falciparum samples, I identified the following insertion in Sample1, which I know is real :

chrom14 2261798 . T TTA 1576.91 1:2,42:59:99:1:1.00:1609,0

However when I run the same command line with Batch1 + Batch2, total of 100 samples, I get the following result for Sample1:

chrom14 2261798 . T TTATATA 23812.75 0:39,5:59:99:0:0.00:0,775

Some samples from Batch2 have a longer insertion starting at the same coordinate, and now the original insertion in Sample1 does not appear in the VCF anymore... Why UnifiedGenotyper did not output multiple lines in this example? (for some other positions it did output multiple lines sharing the same coordinate) I did not change the parameter --max_alternate_alleles 6.

Antoine

Created 2015-04-30 06:34:42 | Updated 2015-05-01 15:02:14 | Tags: unifiedgenotyper ploidy queue

Hi, I have several questions about the function "UnifiedGenotyper" and its corresponding Qscript.

1. I first test this function on a single chromosome of my chicken sample (bam file is about 800M) to call SNPs, and set the ploidy value to 30. It took me half a day to finish. Is it normal that it took so long for a single chromosome?

As a result of this, I tried to use the scatter-gather to parallel my job in order to get the full use of my multi-node server. I built my own script based on modifying the ExampleUnifiedGenotyper.scala. I defined "ploidy" as an argument in my Qscript, then I assign a value of 30 to ploidy in my bash command line.

java -Djava.io.tmpdir=$tmp_dir \ -jar /sw/apps/bioinfo/GATK-Queue/3.2.2/Queue.jar \ -S /path/to/ug.scala \ -R$ref_file \
-I /path/to/bamfile.bam \
-o /path/to/my.vcf  \
-ploidy 30 \
-run

2.The script ran successfully, but the output VCF file was based sample_ploidy=2. It seems that it does not respond to the argument "ploidy". Then if I add one command line in the def script() part, which is" genotyper.sample_ploidy = qscript.sample_ploidy ". GATK will give warning of errors "sample_ploidy is not a member of TestUnifiedGenotyper.this.UnifiedGenotyperArguments". Could you please help point out where I did wrong?

Also, I have some questions about how to assign values to argument such as " this.memoryLimit" and "genotyper.scatterCount".

1. Based on the explanation given in the example scala file, argument "this.memoryLimit" sets the memory limit for each command. So in my case, I have only one command in my script, so that I can assign as much memory as I have to it? I am running on a server that has multiple-node, each node with 16 cores, and each core has 8G RAM. So in my understanding, if I run this job on one node with 16 cores, can I assign "this.memoryLimit=128" (16*8)? Do I understand correctly?

2. about the scatterCount, can I assign it with the number of cores I have in the server? That is, based on the example I listed above, 16.

Thank you very much for your help and time! Attached is my Qscript in relating with my question.

My Qscript is as follows:

package org.broadinstitute.gatk.queue.qscripts.examples

class TestUnifiedGenotyper extends QScript {
qscript =>
@Input(doc="The reference file for the bam files.", shortName="R")
var referenceFile: File = _

@Input(doc="Bam file to genotype.", shortName="o")
var outputFile: File = _

@Input(doc="Bam file to genotype.", shortName="I")
var bamFile: File = _

@Input(doc="An optional file with dbSNP information.", shortName="D", required=false)
var dbsnp: File = _

@Argument(doc="Ploidy (number of chromosomes) per sample.", shortName="ploidy")
var sample_ploidy: List[String] = Nil

trait UnifiedGenotyperArguments extends CommandLineGATK {
this.reference_sequence = qscript.referenceFile
this.memoryLimit = 2
}

def script() {
val genotyper = new UnifiedGenotyper with UnifiedGenotyperArguments

genotyper.scatterCount = 3
genotyper.input_file :+= qscript.bamFile
// genotyper.sample_ploidy = qscript.sample_ploidy
genotyper.out = swapExt(outputFile, qscript.bamFile, "bam", "unfiltered.vcf")

}
}

Created 2015-04-28 08:17:21 | Updated | Tags: unifiedgenotyper haplotypecaller coverage bias

Hi,

I am doing joint variant calling for Illumina paired end data of 150 monkeys. Coverage varies from 3-30 X with most individuals having around 4X coverage.

I was doing all the variant detection and hard-filtering (GATK Best Practices) process with both UnifiedGenotyper and Haplotype caller.

My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?

In particular, consider pairwise differences across individuals: The absolute values are lower for low coverage individuals than for high coverage, for both methods, since it is more difficult to make calls for them. However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals by substracting from the total reference length all the filtered sites and sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (If both pairs consist of members of similar populations).

However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.

Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would trade in more false positive if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity) which is the average cross individuals diversity and thus already at the higher-end for within individual heterozygosity. I would expect the problem to be even worse if I chose lower values.

Can you give me any recommendation, should I go back using UnifiedGenotyper or is there any way to solve this problem?

Created 2015-04-20 10:57:05 | Updated 2015-04-20 11:03:28 | Tags: unifiedgenotyper combinevariants haplotypecaller genotype-given-alleles

Question: Is it possible to have CV merge like bcftools does it?

I get this warning, when running UG in GGA mode using an -alleles vcf generated with CV:

WARN  10:17:21,394 GenotypingGivenAllelesUtils - Multiple valid VCF records detected in the alleles input file at site 20:106089, only considering the first record 

I made this call with HC from 10 samples:

20  106089  .   CA  C

And this call with UG from 10 other samples:

20  106089  .   C   A

CV merges like this:

20  106089  .   C   A
20  106089  .   CA  C

bcftools merges like this:

20  106089  .   CA  AA,C

The UG recall from the CV generated -alleles vcf is incomplete:

20  106089  .   C   A

The UG recall from the bcftools generated -alleles vcf is complete:

20  106089  .   CA  AA,C

Is it possible to have CV merge like bcftools does it?

I'm really not sure. It's not a use case that UG was designed for (with UG we kept SNPs and indels separate until post-analysis), so I would recommend being cautious with it.

I checked the genotypes and UG seems to handle merged MNPs and indels just fine; see below. But I will do some additional testing. Or I might just take the safe path and do the recalling separately for SNPs and indels as suggested. The reason I have UG and HC calls in the first place is because I have low and high coverage data for different cohorts. I want to create a merged dataset.

Despite --interval_padding 100 helping to recall more sites with HC in GGA mode as per previous recommendation, some sites still fail to be called with HC in GGA mode. Hence I opted for UG.

UG calls on samples 1-10:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578
20  106089  .   C   A   16.19   .   AC=2;AF=0.125;AN=16;BaseQRankSum=-0.854;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=1.5282;MLEAC=2;MLEAF=0.125;MQ=58.74;MQ0=0;MQRankSum=-0.560;QD=2.70;ReadPosRankSum=-1.797;SOR=0.935;VariantType=SNP  GT:AD:DP:GQ:PL  0/0:3,0:3:6:0,6,76  0/0:4,2:6:9:0,9,115 0/1:3,1:4:24:24,0,80    0/0:6,0:6:12:0,12,130   0/1:1,1:2:29:30,0,29    ./. 0/0:7,0:7:15:0,15,188   0/0:3,1:4:6:0,6,74  ./. 0/0:5,0:5:12:0,12,142

HC calls on samples 11-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  585 590 622 625 628 640 655 668 687 693

20  106089  .   CA  C   47.95   .   AC=5;AF=0.250;AN=20;BaseQRankSum=0.925;DP=36;FS=1.850;InbreedingCoeff=0.0646;MLEAC=5;MLEAF=0.250;MQ=59.48;MQ0=0;MQRankSum=0.175;QD=3.00;ReadPosRankSum=-1.725;SOR=0.387 GT:AD:GQ:PL 0/0:2,0:6:0,6,49    0/0:2,0:6:0,6,49    0/0:3,0:12:0,12,130 0/0:5,0:15:0,15,122 0/0:2,0:6:0,6,46    0/1:2,1:14:14,0,39  0/1:2,1:15:15,0,38  0/0:4,0:12:0,12,93  0/1:3,1:12:12,0,46  1/1:0,3:9:67,9,0

UG GGA recalls on samples 1-20:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  535 545 546 550 554 564 567 574 575 578 585 590 622 625 628 640 655 668 687 693
20  106089  .   CA  AA,C    110.56  .   AC=0,8;AF=0.00,0.222;AN=36;DP=81;FS=0.000;InbreedingCoeff=0.5076;MLEAC=0,6;MLEAF=0.00,0.167;MQ=58.56;MQ0=0;QD=3.45;SOR=0.859;VariantType=MULTIALLELIC_MIXED GT:AD:DP:GQ:PL:SB   0/0:0,0,0:3:0:0,0,0,6,6,52:0,0,0,0  0/2:0,0,1:6:0:5,5,5,0,0,109:0,0,1,0 0/2:0,0,1:4:0:12,12,12,0,0,47:0,0,1,0   0/0:0,0,0:6:0:0,0,0,17,17,123:0,0,0,0   0/0:0,0,0:2:0:0,0,0,3,3,10:0,0,0,0  ./. 0/0:0,0,0:7:0:0,0,0,9,9,60:0,0,0,0  0/2:0,0,1:4:0:12,12,12,0,0,61:0,0,0,1   ./. 0/0:0,0,1:5:0:0,0,0,4,4,30:0,0,0,1  0/0:0,0,0:3:0:0,0,0,6,6,49:0,0,0,0  0/0:0,0,0:3:0:0,0,0,9,9,76:0,0,0,0  0/0:0,0,1:4:0:0,0,0,1,1,22:0,0,1,0  0/0:0,0,0:7:0:0,0,0,18,18,149:0,0,0,0   0/0:0,0,0:4:0:0,0,0,11,11,76:0,0,0,0    0/2:0,0,1:5:0:9,9,9,0,0,65:0,0,0,1  0/2:0,0,1:4:0:12,12,12,0,0,60:0,0,0,1   0/0:0,0,0:5:0:0,0,0,15,15,116:0,0,0,0   0/2:0,0,1:6:0:12,12,12,0,0,47:0,0,0,1   2/2:0,0,3:3:9:67,67,67,9,9,0:0,0,3,0

http://gatkforums.broadinstitute.org/discussion/5249/overcalling-deletion-in-unifiedgenotyper-genotype-given-alleles-mode
http://gatkforums.broadinstitute.org/discussion/4024/genotype-and-validate-or-haplotype-caller-gga-what-am-i-doing-wrong

P.S. I might gate crash your Cambridge party this week despite not being invited :D The course was already fully booked, when you announced it. I don't have a time machine!

Created 2015-03-17 10:47:42 | Updated | Tags: unifiedgenotyper non-human

Hi,

I've effectively got a sample of 200 individuals, that share one father but each have a different mother. I'm wondering how to best change the default parameters of Unified Genotyper to call variants without bias caused by assumptions of the algorithm.

Cheers,

Blue

Created 2015-03-12 16:15:59 | Updated | Tags: unifiedgenotyper snp error indel

Hello,

when I run the UnifiedGenotyper to call INDELs, I get the following error (detailed command line output below) HAPLOTYPE_MAX_LENGTH must be > 0 but got 0

When I call only SNPs instead, it doe not occur. I have searched to find an answer to why this is happening, but cannot figure out the reason. Could this be a bug? I get the error no matter if I run the Realigner before or not.

Did you observe this problem before?

Command line output: java -Xmx10g -jar ~/work/tools/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10

INFO 15:50:47,987 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:47,990 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 15:50:47,990 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:50:47,990 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:50:47,996 HelpFormatter - Program Args: -T UnifiedGenotyper -R /media/rebecca/UUI/Work/BputSem/BputSem_gapfilled.final.fa -I realigned_A.bam -gt_mode DISCOVERY -glm INDEL -stand_call_conf 30 -stand_emit_conf 10 -o rawINDELS_q30_A.vcf -ploidy 10 INFO 15:50:48,002 HelpFormatter - Executing as rebecca@rebecca-ThinkPad-T440s on Linux 3.13.0-44-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_65-b32. INFO 15:50:48,002 HelpFormatter - Date/Time: 2015/03/12 15:50:47 INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,002 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:50:48,132 GenomeAnalysisEngine - Strictness is SILENT INFO 15:50:48,402 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 15:50:48,409 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 15:50:48,447 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 INFO 15:50:48,594 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 15:50:48,622 GenomeAnalysisEngine - Done preparing for traversal INFO 15:50:48,622 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 15:50:48,622 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 15:50:48,623 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:50:48,658 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values. INFO 15:51:06,585 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace

java.lang.IllegalArgumentException: HAPLOTYPE_MAX_LENGTH must be > 0 but got 0 at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:97) at org.broadinstitute.gatk.utils.pairhmm.N2MemoryPairHMM.initialize(N2MemoryPairHMM.java:60) at org.broadinstitute.gatk.utils.pairhmm.LoglessPairHMM.initialize(LoglessPairHMM.java:66) at org.broadinstitute.gatk.utils.pairhmm.PairHMM.computeLikelihoods(PairHMM.java:194) at org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:461) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyIndelGenotypeLikelihoods.add(GeneralPloidyIndelGenotypeLikelihoods.java:124) at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyGenotypeLikelihoodsCalculationModel.getLikelihoods(GeneralPloidyGenotypeLikelihoodsCalculationModel.java:270) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoods(UnifiedGenotypingEngine.java:317) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotypingEngine.java:201) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:379) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:151) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

##### ERROR ------------------------------------------------------------------------------------------

Created 2015-03-04 17:41:22 | Updated | Tags: unifiedgenotyper commandlinegatk

I'm encountering a problem similar to what I experienced with a mismatch between reference and cosmic files. Specifically, I created .bam files using human_g1k_v3.fasta, which reference coordinates as 1...22,X,Y etc. When I try to run UnifiedGenotyper with this reference file, the .bam file I created, and the dbsnp_137.b37.vcf, I get an error on account of the fact that snp coordinates are listed as chr1...chr22,chrX,chrY etc. Other than writing a script to remove all occurrences of "chr" is there another way to get around this problem, i.e. a dbsnp reference file that has the desired coordinates without the "chr"?

I resolved the problem with reference/cosmic by finding a cosmic file with consistent notation, but can't find a similar fix for this one. I'd appreciate any suggestions.

Created 2015-03-03 22:52:42 | Updated 2015-03-03 22:54:44 | Tags: unifiedgenotyper

Hello GATK Team,

I've just run the following using GATK v3.1.

gatk -R ref.fa -T UnifiedGenotyper -I Q11_Final.sorted.rg.bam -I Q11D1.sorted.bam -I Q11D4.sorted.bam -I Q11D2.sorted.bam -I Q11D5.sorted.bam -o PX.raw.vcf -nt 5 -glm SNP -ploidy 2

I then trimmed SNPs within 10bp of indels (called using the same but -glm indel).

The samples therein represent a mother (Sample Q11) and her 4 haploid sons.

The problem I'm getting is that calls that should be heterozygotic in the mother are not being called as such (see attached image; .bam file of the mother is int he bottom and the resulting VCF file is loaded in the top panel. The green cmd screen shows the VCF file line for the site in question where the 14th element of the indicates the queen's results (1/1:68,86:154:99:1837,120,0)

I'm not sure what's going on here and was wondering if you could help out?

Created 2015-02-25 15:56:38 | Updated | Tags: unifiedgenotyper

I am observing "overcalling" in a deletion when running UnifiedGenotyper in GENOTYPE_GIVEN_ALLELES mode. I am supplying the following variants (along with some additional variants):

7:117199640TATC>T 7:117199644ATCT>A

I am expecting a homozygous reference call for the former, and a heterozygous call for the latter. However, I am getting heterozygous calls for both. The output I get from UnifiedGenotyper is below (and the command header). I attached a screenshot of the pileup.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample
7       117199635       .       G       T       0       LowQual AC=0;AF=0.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=117.5772;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0;SOR=0.677        GT:AD:DP:GQ:PL  0/0:250,0:250:99:0,722,9772
7       117199641       .       A       G       0       LowQual AC=0;AF=0.00;AN=2;DP=249;Dels=0.00;FS=0.000;HaplotypeScore=315.9359;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0;SOR=0.539        GT:AD:DP:GQ:PL  0/0:249,0:249:99:0,725,9808
7       117199644       .       ATCT    A,GTCT  5999.74 .       AC=1,0;AF=0.500,0.00;AN=2;DP=250;FS=0.000;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;QD=24.00;SOR=1.049 GT:AD:DP:GQ:PL  0/1:0,109,0:250:0:6034,0,7823,6034,0,6034
7       117199648       .       T       G       0       LowQual AC=0;AF=0.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=173.9715;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0;SOR=0.507        GT:AD:DP:GQ:PL  0/0:250,0:250:99:0,725,9763
__7       117199652       .       TG      T       0       LowQual AC=0;AF=0.00;AN=2;DP=249;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0;SOR=0.595   GT:AD:DP:GQ:PL 0/0:249,0:249:99:0,750,11299

INFO 10:42:22,173 HelpFormatter - --------------------------------------------------------------------------------- INFO 10:42:22,176 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-82-ge3b4844, Compiled 2015/02/24 07:15:40 INFO 10:42:22,176 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 10:42:22,177 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 10:42:22,183 HelpFormatter - Program Args: -T UnifiedGenotyper -I results/r7-48.20150210XC_AJ933.dedup.bam -L 7:117199622-117199662 -glm BOTH -alleles ../../../forcecall/forcecalling_variants.b37.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -et NO_ET -K /hpc/packages/minerva-common/gatk/eugene.fluder_mssm.edu.key -R /sc/orga/projects/ngs/resources/gatk/2.8/human_g1k_v37.fasta INFO 10:42:22,188 HelpFormatter - Executing as lindem03@interactive1 on Linux 2.6.32-358.6.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_03-b04. INFO 10:42:22,189 HelpFormatter - Date/Time: 2015/02/25 10:42:22 INFO 10:42:22,189 HelpFormatter - ---------------------------------------------------------------------------------

Created 2015-02-24 19:52:54 | Updated | Tags: unifiedgenotyper snp indels

I am working with plants, and so cannot follow most of your human/cancer specific workflows etc. Also my question is probably not GATK specifc, but I would be very grateful for any kind of answer. I have used BWA to align a set of reads and the contigs assembled from these reads (de novo). I have used samtools mpileup and GATK unifiedGenotyper on the aligned reads. This results for both in a list of ONLY SNPs, the same ones (aside of sensitivity). I have also used the samtools steps on the aligned contigs, resulting in a list of ONLY indels. I can clearly see both SNPs and indels in IGV, both looking at the reads or the contigs. generally I see that reads are being used for variant calling, so is it "wrong" to use contigs? and why do I not get ANY indels in either tool's output when using reads? any ideas?

Created 2015-02-19 22:47:05 | Updated | Tags: unifiedgenotyper ploidy vcf

Hi, I'm running on the Unified Genotyper (Version=3.3-0-g37228af) on a pooled sample. The ploidy is set to 32. I'm trying to get allele frequency information. I'm trying to filter sites based on depth of coverage but the DP flag is not consistent within a loci

The DP at the start says there is a depth of 37, but at the end it says its 13. Judging from neighbouring sites, the actual depth seems to be 37, but I'm confused why it seems to only be using 13 of them. At other sites the scores match.

Created 2015-02-10 05:49:09 | Updated | Tags: unifiedgenotyper vcf

Hi, How do I know based on the REF and ALT column of a VCF file the actual position where an indel event happened? I usually see an event that occur in the 2nd base. For example, REF ALT GGCGTGGCGT G,GCGCGTGGCGT --deletion; insertion of "C" ATTT A,ATT --deletions

But I saw a VCF format documentation(http://samtools.github.io/hts-specs/VCFv4.2.pdf) allowing a different case: GTC G,GTCT --deletion; insertion of "T" at the end

How is UnifiedGenotyper formatting the indels? Does it differ from the one used in the 1000Genome Project? Thank you!

Roven

Created 2015-02-04 04:17:31 | Updated 2015-02-04 04:18:06 | Tags: unifiedgenotyper vcf

Hi

I'm wondering why some positions in VCF overlap. Can GATK skip emitting positions that are already part of an indel/concatenated ref? We need to use EMIT_ALL_SITES but it's quite confusing if a position is represented multiple times especially if indel is present. I understand that there is always a duplicated position preceding an indel, but the positions after the indel should ideally be not overlapping the neighboring positions. Please see some examples below:

chr02 28792823 . C . 34.23 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:12 chr02 28792823 . CAA . 10000037.27 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:AD:DP 0/0:12:12 chr02 28792824 . A . . . DP=12;MQ=57.31;MQ0=0 GT ./. chr02 28792825 . A . 31.22 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:7

chr02 314879 . T . 100.23 . AN=2;DP=27;MQ=55.82;MQ0=0 GT:DP 0/0:27 chr02 314879 . TAGAG T,TAG 647.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=27;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=55.82;MQ0=0;QD=23.97;RPA=8,6,7;RU=AG;STR GT:AD:DP:GQ:PL 1/2:0,9,11:26:99:989,445,424,395,0,335 chr02 314880 . A . 43.23 . AN=2;DP=26;MQ=55.66;MQ0=0 GT:DP 0/0:5 chr02 314881 . G . 40.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:4 chr02 314882 . A . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16 chr02 314883 . G . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16

This next example even calls a SNP for position 10221400 even if deletion occurs in 10221399.

chr02 10221399 . C . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17 chr02 10221399 . CA CCTA,C 143.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=8.42 GT:AD:DP:GQ:PL 1/2:0,8,6:17:99:527,131,112,185,0,143 chr02 10221400 . A C,T 287.29 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;Dels=0.29;FS=0.000;HaplotypeScore=6.7569;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=16.90 GT:AD:DP:GQ:PL 1/2:0,4,8:12:88:407,294,285,112,0,88 chr02 10221401 . A C 163.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.854;DP=17;Dels=0.00;FS=1.913;HaplotypeScore=9.7562;MLEAC=1;MLEAF=0.500;MQ=58.74;MQ0=0;MQRankSum=1.558;QD=9.63;ReadPosRankSum=2.764 GT:AD:DP:GQ:PL 0/1:11,6:17:99:192,0,392 chr02 10221402 . A . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17

Created 2015-02-02 21:19:12 | Updated | Tags: unifiedgenotyper snp-calling input-prior

I'm a bit confused about the instructions for --input_prior in UnifiedGenotyper but they seem quite useful. Is there any chance you could clarify or simplify ?

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php#--input_prior --input_prior / -inputPrior Input prior for calls By default, the prior specified with the argument --heterozygosity/-hets is used for variant discovery at a particular locus, using an infinite sites model, see e.g. Waterson (1975) or Tajima (1996). This model asserts that the probability of having a population of k variant sites in N chromosomes is proportional to theta/k, for 1=1:N There are instances where using this prior might not be desireable, e.g. for population studies where prior might not be appropriate, as for example when the ancestral status of the reference allele is not known. By using this argument, user can manually specify priors to be used for calling as a vector for doubles, with the following restriciotns: a) User must specify 2N values, where N is the number of samples. b) Only diploid calls supported. c) Probability values are specified in double format, in linear space. d) No negative values allowed. e) Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one. f) If user-defined values add to more than one, an error will be produced. If user wants completely flat priors, then user should specify the same value (=1/(2N+1)) 2N times,e.g. -inputPrior 0.33 -inputPrior 0.33 for the single-sample diploid case.

List[Double]

Created 2015-01-13 01:21:15 | Updated | Tags: unifiedgenotyper indels snps genotype-given-alleles

I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools:

bcftools norm -m +any

I remember from another thread that UG treats SNPs and InDels separately:

http://gatkforums.broadinstitute.org/discussion/3375/combine-snp-and-indel-vcf

But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA?

I only had a minor issue despite tabix indexing my bgzipped known alleles file:

##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz

Feel free to delete the original post of this question.

Created 2015-01-12 06:03:20 | Updated | Tags: unifiedgenotyper haplotypecaller downsample

Does UG and HC downsample before or after exclusion of marked duplicates? I guess code is king for the answer...

Created 2015-01-10 08:13:41 | Updated | Tags: unifiedgenotyper variantrecalibrator vqsr haplotypescore annotation

The documentation on the HaplotypeScore annotation reads:

HaplotypeCaller does not output this annotation because it already evaluates haplotype segregation internally. This annotation is only informative (and available) for variants called by Unified Genotyper.

The annotation used to be part of the best practices:

http://gatkforums.broadinstitute.org/discussion/15/best-practice-variant-detection-with-the-gatk-v1-x-retired

I will include it in the VQSR model for UG calls from low coverage data. Is this an unwise decision? I guess this is for myself to evaluate. I thought I would ask, in case I have missed something obvious.

Created 2014-12-11 00:55:58 | Updated 2014-12-11 01:47:11 | Tags: unifiedgenotyper ug

Hi

I've been trying to use the --allSitePLs option with Unified Genotyper with GaTK v3.2-2-gec30cee.

My command is:

java -Xmx${MEM} -jar${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} -T UnifiedGenotyper \
--min_base_quality_score 30 \
-I ${in_dir}/"34E_"${REGION}"_ddkbt_RS74408_recalibrated_3.bam" \
-I ${in_dir}/"34E_"${REGION}"_ddber_RS86405_recalibrated_3.bam" \
--intervals $sites_dir/$file \
-o ${out_dir}/"38F_"${REGION}"_UG_allPL.vcf" \
--output_mode EMIT_ALL_SITES \
--allSitePLs 

I ran the same command with and without the --allSitePLs option and compared, and the output seems strange.

Specifically, with --allSitePLs - the FILTER column: 5053700 sites = lowqual, 19303 = . and 5059568 had QUAL < 30. By contrast WITHOUT the --allSitePLs 5067411 = lowqual, 5592 = . and 11460 had QUAL < 30. I don't understand why does adding this option changes the QUAL so much when I'm running UG on the exact same data but just requesting that I get a PL for all sites.

Lastly, how is the specific ALT allele selected? Is it random - because they all seem equally unlikely in the example below, where only one allele was seen in both my samples.

## 2 lines from the output
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ddber_RS86405   ddkbt_RS74408
chr01   1753201 .   C   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:655,51,0,655,51,655,655,51,655,655:17:51:0,51,655  0/0:26,0:1010,78,0,1010,78,1010,1010,78,1010,1010:26:78:0,78,1010
chr01   1753202 .   T   A   0   LowQual AC=0;AF=0.00;AN=4;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQ0=0 GT:AD:APL:DP:GQ:PL   0/0:17,0:630,630,630,630,630,630,48,48,48,0:17:48:0,48,630  0/0:26,0:1027,1027,1027,1027,1027,1027,78,78,78,0:26:78:0,78,1027

Created 2014-12-10 08:57:09 | Updated | Tags: unifiedgenotyper combinevariants haplotypecaller combinegvcfs

Hi GATK team, i would like to seek opinion from your team to find the best workflow that best fit my data. Previously i've been exploring both variant calling algorithms UnifiedGenotyper and HaplotypeCaller, and i would love to go for UnifiedGenotyper considering of the sensitivity and the analysis runtime. Due to my experimental cohort samples grows from time to time, so i've opt for single sample calling follow by joint-analysis using combineVariants instead of doing multiple-samples variant calling. However by doing so, i've experience few drawbacks from it (this issue was discussed at few forums). For a particular SNP loci, we wouldn't know whether the "./." reported for some of the samples are due to no reads covering that particular loci, or it doesn't pass certain criteria during variant calling performed previously, or it is a homo-reference base (which i concern this most and can't cope to lost this information).

Then, i found this "gvcf", and it is potentially to solve my problem (Thanks GATK team for always understand our researcher's need)!! Again, i'm insist of opt for unifiedGenotyper instead of haplotypeCaller to generate the gvcf, and reading from the forum at https://www.broadinstitute.org/gatk/guide/tagged?tag=gvcf, i would assume that as VCFs produced by "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" to be something alike with the gvcf file produced by HaplotyperCaller. However i couldn't joint them up using either "CombineVariants" or "CombineGVCFs", most probably i think "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" doesn't generate gvcf format.

Can you please give me some advice to BEST fit my need and with minimum runtime (UnifiedGenotyper would be my best choice), is there any method to joint the ALL_SITES vcf file produced by UnifiedGenotyper which i might probably missed out from the GATK page?

Created 2014-11-25 16:32:43 | Updated | Tags: unifiedgenotyper error

When I run this command:

java -jar /storage/app/GATK_3.3/GenomeAnalysisTK.jar -R /storage/data/gabriel/ucsc/hg19/hg19.fasta -T UnifiedGenotyper -I SRR493939.valid.bam -o teste2.vcf

This error happens:

WARN 04:34:22,257 RestStorageService - Error Response: PUT '/DAXKr9W5oe2LWMJhFdSveQd50kAK6tex.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 393, Content-MD5: 33uF1RoVpa9w9uTWFUxP0g==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: df7b85d51a15a5af70f6e4d6154c4fd2, Date: Tue, 25 Nov 2014 06:34:20 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:agZvhjUckV8Jnth+aubQX9KViUo=, User-Agent: JetS3t/0.8.1 (Linux/3.13.0-30-generic; amd64; en; JVM 1.7.0_65), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: B6DD058B64A98870, x-amz-id-2: ADKbyes3Qb8ovjofT4tKpP3gV3yPuc9v5C157qzTkfII2q3U2ZwEpUgzy0eIOz+J, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Tue, 25 Nov 2014 14:39:19 GMT, Connection: close, Server: AmazonS3] WARN 04:34:22,846 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 29096 seconds. Retrying connection.

Can someone help me?

Created 2014-11-18 14:19:42 | Updated | Tags: unifiedgenotyper genotype-given-alleles

I was running some samtools calls through UG with genotyping_mode GENOTYPE_GIVEN_ALLELES after reading this thread:

http://gatkforums.broadinstitute.org/discussion/1842/variant-annotator-not-annotating-mappingqualityranksumtest-and-readposranksumtest-for-indels

I got this error message:

##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Alignment 440969 | 36S26M1D6M1I2M1D1M3D27M1S

What did I do wrong? It works for most other fragments (chrom20:pos1-pos2). Is it possible to avoid this error message and just do ./. calls at coordinates not covered by reads? I'm not sure, why samtools called at a coordinate not covered by a read. I don't know at which coordinate this happened. I was at this coordinate, when the error happened:

INFO  10:21:06,036 ProgressMeter -       20:479014   2329564.0    55.0 m      23.6 m        0.6%     6.7 d       6.7 d 

I was using this command:

vcf=samtools.vcf.gz
$java -Djava.io.tmpdir=tmp -Xmx63900m \ -jar$jar \
--analysis_type UnifiedGenotyper \
--reference_sequence $ref \ --input_file bam.list \ --dbsnp$dbsnp138 \
--out $out \ --intervals$vcf \
-stand_call_conf 10 \
-stand_emit_conf 10 \
--genotype_likelihoods_model BOTH \
--output_mode EMIT_ALL_SITES \
--annotation Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \

Error:

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 96 at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.getCache(DiploidSNPGenotypeLikelihoods.java:298) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.inCache(DiploidSNPGenotypeLikelihoods.java:262) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:229) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:190) at org.broadinstitute.sting.gatk.walkers.genotyper.DiploidSNPGenotypeLikelihoods.add(DiploidSNPGenotypeLikelihoods.java:176) at org.broadinstitute.sting.gatk.walkers.genotyper.SNPGenotypeLikelihoodsCalculationModel.getLikelihoods(SNPGenotypeLikelihoodsCalculationModel.java:114) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

##### ERROR MESSAGE: 96

I found this documented somewhat for Haplotype caller, but the solution didn't look applicable here. These are pooled samples so we must use unified genotyper.

Created 2014-09-29 19:32:51 | Updated | Tags: unifiedgenotyper

Hi Geraldine, I am confused when I read the paper and tutorial on GATK variant caller. First, the DePristo et al. 2011 paper says local re-aligner generates a list of candidate haplotypes using a) dbSNP info, b) presence of atleast one indel in a read and c) cluster of mismatches at a site. Then it finds the best alternative haplotype based on read haplotype likelihoods and uses log odds ratio to declare ref and alt haplotype pair. Second, the tutorial on GATK variant caller mentions Indel genotype likelihood calculation where UG estimates this via 1) first generate haplotypes from indels in the reads. 2) genotype likelihoods for all haplotype pairs 3) for each haplotype compute read haplotype likelihood using HMM

My question is do both the steps local realignment and UG do haplotype generation and then compute read haplotype likelihood. To me these two steps are redundant because if the best haplotype pairs are generated using local realignment then why is there a need to generate haplotypes for genotype likelihood estimation. I mean in theory, UG should just be able to call indels and assign them genotype by testing all the possible genotypes against the two haplotypes called by the local realigner. Please help me sort this out. I highly appreciate your concern.

Created 2014-09-23 22:46:46 | Updated | Tags: unifiedgenotyper ploidy

Now that GATK supports ploidy, how do people generate ChrX calls. My first guess is to have 2 separate genotyping runs, one for males with ploidy=1 and a second for females only with ploidy=2?

Created 2014-09-18 20:32:00 | Updated | Tags: unifiedgenotyper haplotypecaller

Hello everyone, I have recently sequenced 15 sample exomes using the Ion proton system, which directly generates VCFs after sequencing runs. However, since I will be using GATK for downstream analysis, I have decided to recall the variants using UG or HC. I read a document stating that HC should not be used in the case of pooled samples. Can someone clarify what exactly this means? For Ion proton, three samples are barcoded and pooled together prior to sequencing. Also, is 15 samples enough to obtain good discovery power? I was thinking of using external, publicly available bam files of exome sequencing data if 15 samples is not sufficient for joint variant calling.

Created 2014-09-15 21:28:59 | Updated | Tags: indelrealigner unifiedgenotyper indels gatk variant-calling

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.

Created 2014-09-11 13:38:40 | Updated | Tags: unifiedgenotyper ploidy multiallelic pooled-calls

Hi,

Is GATK Unified genotyper able to call multi-allelic positions in a single pooled sample? Case is a pool of 13 samples, we use UG with ploidy set to 26. If I understand the supplementaries of the original publications correct, UG will never be able to call three alleles at a single position. in single sample calling. Or does this not hold for high ploidy analysis?

If needed, we can call multiple pools together, but this becomes computationally intensive.

In summary, we would like to call a 14xG,6xA,6xT call for example.

Also, how does UG take noise into account when genotyping (sequencing errors), when for example 3% of reads is aberrant at a position, this could correspond to ~ 1/26.

Thanks for any guidelines,

geert

Created 2014-09-10 18:50:45 | Updated | Tags: unifiedgenotyper joint-calling

Can someone tell me what version of the UnifiedGenotyper introduced joint sample calling? I'm specifically interested in whether 1.6 supported joint calling.

Thanks!

Created 2014-08-19 22:04:32 | Updated | Tags: unifiedgenotyper indel qd

Dear support,

According to the definition of QD, it equals to the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

However, when I ran the Unified Genotyper (v2.8.1) on a pair of samples for indel calls, the QD score for some indels are much lower than expected. For example,

1) chrX 51075762 . GCAGCTGCGT G 142.15 . AC=1;AF=0.250;AN=4;BaseCounts=0,0,204,0;BaseQRankSum=-1.435;DP=204;FS=0.000;GC=70.82;LowMQ=0.0049,0.0049,204;MLEAC=1;MLEAF=0.250;MQ=55.32;MQ0=0;MQRankSum=-2.698;QD=0.15;ReadPosRankSum=-3.784;Samples=tumor_DNA_2C;VariantType=DELETION.NumRepetitions_1.EventLength_9 GT:AD:DP:GQ:PL 0/0:94,0:95:99:0,283,11503 0/1:101,7:108:99:181,0,11532

in this case, the expected QD = 142.15 / 108 = 1.32, but the reported QD = 0.15

2) chr21 45649571 . C CCTGGACCCGCCCTGGACACCCCACGGGGG 526.15 . AC=1;AF=0.250;AN=4;BaseCounts=0,78,0,0;BaseQRankSum=0.787;DP=78;FS=0.000;GC=65.84;LowMQ=0.0000,0.0000,78;MLEAC=1;MLEAF=0.250;MQ=60.33;MQ0=0;MQRankSum=1.453;QD=0.44;ReadPosRankSum=-3.875;Samples=tumor_DNA_2C;VariantType=INSERTION.NOVEL_10orMore GT:AD:DP:GQ:PL 0/0:31,1:37:86:0,86,3672 0/1:30,6:41:99:565,0,4204

the expected QD = 526.15 / 41 = 12.83, the reported QD 0.44 is much smaller.

Do you know what's happening? Should I filter those indel calls (since QD < 2.0)?

Thanks,

Tuo

Created 2014-08-12 03:54:48 | Updated | Tags: unifiedgenotyper

Hello, Is there any possibility that I can separate the reads in AD (Allele Depths by sample) into 5' and 3' strands? e.g. the current expression of AD is REF:ALT1:ALT2 ...etc, Can I get the expression of AD like REF_5':REF_3':ALT1_5':ALT1_3':ALT2_5':ALT2_3'

Created 2014-08-11 09:50:56 | Updated | Tags: unifiedgenotyper vcf qual variant-calling

I used the EMIT_ALL_SITES option with Unified Genotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. But for monomorphic sites, (at this site, we have an homozygote for the reference allele), what is the definition of the quality score ? and how is it computed ? Many thanks for your explanation.

Created 2014-07-20 04:19:36 | Updated 2014-07-20 04:26:30 | Tags: unifiedgenotyper parallelism nct nt

Hello,

It seems that while non-parallel UnifiedGenotyper (SNP mode) always generates the same output vcf, parallel version (either -nt and/or -nct >1) of UnifiedGenotyper generates not exactly the same vcf files for each run. For example, below is the GenotypeConcordance output of two runs with same input file (chr 21) and same parameters (-nt4 -nct4):

Sample Eval_Genotype Comp_Genotype Count
ALL HET HET 46341 ALL HET HOM_REF 0 ALL HET HOM_VAR 8 ALL HET MIXED 0 ALL HET NO_CALL 0 ALL HOM_REF HET 300 ALL HOM_REF HOM_REF 34306900 ALL HOM_REF HOM_VAR 0 ALL HOM_REF MIXED 0 ALL HOM_REF NO_CALL 18 ALL HOM_VAR HET 12 ALL HOM_VAR HOM_REF 0 ALL HOM_VAR HOM_VAR 23490 ALL HOM_VAR MIXED 0 ALL HOM_VAR NO_CALL 18 ALL Mismatching_Alleles Mismatching_Alleles 303 ALL NO_CALL HET 0 ALL NO_CALL HOM_REF 11 ALL NO_CALL HOM_VAR 24 ALL NO_CALL MIXED 0 ALL NO_CALL NO_CALL 729217

So I am wondering if there is a way to get the same output vcf. Or do I miss something here? Thanks.

Created 2014-07-16 09:46:17 | Updated | Tags: unifiedgenotyper snp-calling

hi,

is there a way to output all possible variants (the alternate allele)? i have total reads of 2100 whereby 2045 reads are match reference and 55 reads are alternate allele. My case is very specific, i would like GATK UnifiedGenotyper to call out my variants although the proportion of alternate read is relative low (1-5%) . And regardless of the quality score of alternate reads. Is there a way to specify this settings during the SNP calling in GATK UnifiedGenotyper?

thank you very much Best J

Created 2014-07-10 20:18:01 | Updated | Tags: unifiedgenotyper indels

I performed Variant Calling using UnifiedGenotyper and HaplotypeCaller for Whole Exome Sequenced samples.

GATK version used GATK 3.0

I had only 4 African samples, since did not have any controls I took 70 African BAMs from 1000Genomes. Used Agilent Sure Select WES Version 5 target bed file as interval file to limit the calls to exomes.

I couldot find a single INDEL in UnifiedGenotyper called data and hence couldnot run VQSR for indels on it. But HaplotypeCaller called 20,456 INDELs.

UnifiedGenotyper Called Data

# INDELs-0

HaplotypeCaller Called Data

# INDELs-20456

Is it possible that for variants called from 74 BAMs, UnifiedGenotyper could not find a single INDEL ??

Thanks, Tinu

Created 2014-07-10 10:59:18 | Updated | Tags: unifiedgenotyper multi-sample

Hi all,

I am working with a set of samples at the moment which have the following problems:

1) The organism is aneuploid 2) Ploidy for individual chromosomes varies between samples (we have a script that estimates this from the coverage so we can use this information to improve variant calling) 3) Coverage is thin (sequencing was done on a student's budget)

I am using the UnifiedGenotyper for variant discovery as it can account for aneuploidy.

I initially tried calling variants for each sample, grouping chromosomes by ploidy (i.e, for sample 1 I call all the diploid chromosomes, then all the triploid chromosomes etc). I also tried doing multi-sample variant calling across all the samples, setting the ploidy to 2 (The modal chromosome number is 2). Comparison of these two analyses shows that each is missing good SNPs that are called by the other. Multi-sample analyses is missing or mis-calling SNPs on aneuploid chromosomes, and individual analysis is missing SNPs in individual samples due to thin coverage (or more accurately, they are being called, but failing QD or quality filters due to thin coverage - these SNPs pass these filters in the multi-sample analysis).

I am thinking of trying to combine these approaches. The idea is to analyse each chromosome separately. For each chromosome, I would do a multi-sample call on all the samples which are diploid and a multi-sample call on all the samples which are triploid etc etc. I am hoping that this will give me the best of the two approaches above.

So my questions are:

1) Does this strategy makes sense? Is there any reason not to do it this way?

2) How should I proceed downstream? I know I will have to hard-filter regardless. Can I merge all my vcf files before filtering and annotation, or is there a reason to keep them separate?

Any input from the GATK team or the wider community is appreciated!

Thanks

Kathryn

Created 2014-07-06 16:26:42 | Updated | Tags: unifiedgenotyper snp-calling

I'm seeing Mendelian errors in small but significant proportion of my unified genotyper calls. For many instances the SNP is called incorrectly as a homozygote even though there are some reads of the second allele. How can I set unified genotyper to be more sensitive ?

Created 2014-07-02 20:02:53 | Updated | Tags: unifiedgenotyper fastareference heapsize callvariants

Hi all,

Do you have a recommendation to estimate how much heap memory (-Xmx) is necessary to cal variants using the Unified Genotyper. I think that with my project I might be facing a situation where I will run out of memory until there is not more left to increase. To give you an idea, I have 185 samples (that together are 8Gb) and the fasta reference that I am using has too many scaffolds (3 Million). I don't have the opportunity to improve the reference I have at the moment. I have been using -Xmx52G and -nt 10 (in GATK 3.1) but it gives an error at the same point.

INFO 14:45:41,790 HelpFormatter - -------------------------------------------------------------------------------- INFO 14:45:42,773 GenomeAnalysisEngine - Strictness is SILENT INFO 14:59:01,106 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 14:59:01,171 SAMDataSourceSAMReaders - Initializing SAMRecords in serial ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java ##### ERROR ------------------------------------------------------------------------------------------ If you have a suggestion/advice of how to make the analysis work it would be very much appreciated. I know that increasing scaffolds length (reducing number of scaffolds) can improve the analysis so I am wondering if I am facing a situation where I can't do any analysis until the fasta reference is improved. Many thanks, Ximena Created 2014-06-16 15:43:27 | Updated | Tags: unifiedgenotyper Dear All, I am Calling SNPs using Unified Genotyper. I am using the Version GATK 2.5.2. Command Used: java -Djava.io.tmpdir=TMP -jar /GenomeAnalysisTK/2.5.2/GenomeAnalysisTK.jar -R Ref.fa -T UnifiedGenotyper -I input1.bam -I input2.bam -I input3.bam -o output.raw.vcf -stand_call_conf 50 -stand_emit_conf 20 -out_mode EMIT_ALL_CONFIDENT_SITES. Usually I get A summary of callable base counts: Like INFO 00:23:29,795 UnifiedGenotyper - Visited bases 247249719 INFO 00:23:29,796 UnifiedGenotyper - Callable bases 219998386 INFO 00:23:29,796 UnifiedGenotyper - Confidently called bases 219936125 INFO 00:23:29,796 UnifiedGenotyper - % callable bases of all loci 88.978 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - Actual calls made 303126 But I did not get it. Can you tell me how can I calculate this from my output VCF file now? Created 2014-06-16 07:02:28 | Updated | Tags: unifiedgenotyper exome target I am running some exome samples with UG. I tried this in 2 ways: 1. Run UG and then apply exome region filter. 2. Run UG with exome regions. Is there any difference in these approaches if I am only concerned about the variants in exome region? I have ran in both ways and on comparison I see that some values are different for the same locus. For eg. PL, MQRankSum, BaseQRankSum. However, the different is not huge but does UG perform slightly differently for making calls with target regions vs without? PS: I have ran this for UG with multiple samples Created 2014-05-28 22:56:35 | Updated | Tags: unifiedgenotyper heterozygosity priors If I specify both -inputPrior and -hets, which one will UnifiedGenotyper use to determine the prior? Is --heterozygosity used for anything other than prior specification? Created 2014-05-28 11:46:40 | Updated | Tags: unifiedgenotyper heterozygous snp-calling All of my whole-genome sequenced individuals are a cross of different outbred lines with one reference line - effectively identical to the reference genome used for read mapping and variant calling. Thus, effectively all SNPs are heterozygous relative to the reference genome. Should I alter the settings on the Unified Genotyper to maximise detection of these kind of variants ? Created 2014-05-23 15:26:18 | Updated 2014-05-23 15:32:12 | Tags: unifiedgenotyper haplotypecaller indels Hello, I used UnifiedGenotyper and HaplotypeCaller on my sample to get out INDELS and SNPS. There is a 147bp deletion in one of the genes(working with yeast genome)(see attached file). But this deletion is not reported by UnifiedGenotyper and HaplotypeCaller. Since I am working with yeast genome(which is comparatively very small compared to human genome), I am only interested in INDELS ranging from 150bp-200bp maximum. Is there a way by changing parameters that this can happen. I have read some of the previous questions about how haplotypecaller is unable to call large INDELS(in kb's), but I guess in my case INDELS max I am looking for is 200bp. Can this be achieved?? Also UnifiedGenotyper has an option to provide ploidy level. I could not locate that option in HaplotypeCaller. Any way to have that on HaplotypeCaller ? I ask you this because the yeast genome samples I work with are eseentially haploid. NOTE: My sample is paired end data with 100bp reads Hope to hear from you soon Regards Varun Created 2014-05-21 07:42:51 | Updated | Tags: unifiedgenotyper We've used GATK(v2.1-8) in a SGE environment for more than 1 years now. Without any changing in environment, we have had a few cases of UnifiedGenotyper getting hung up while trying to read our previously created .fai and .dict reference files. FSLockWithShared - WARNING: Unable to lock file xxx/ucsc.hg19.dict; ReferenceDataSource - Unable to create a lock on dictionary file. Then GATK was updated to 2.7-4 and 3.1-1. There was still an error. "ERROR MESSAGE: Timeout of 30000 milliseconds was reached while trying to acquire a lock on file xxx/dbsnp_135.hg19.vcf.idx. Since the GATK uses non-blocking lock acquisition calls that are not supposed to wait, this implies a problem with the file locking support in your operating system." Besides, there is no problem reported when we run the UnifiedGenotyper step on data store node. This problem happens if we run the UnifiedGenotyper step on compute nodes. We used NFS for shared data between nodes. Is it the way, one is access the data directly and the other one is by NFS shared, leading this problem? If so, how can I fix this problem? Hoping someone can answer this problem as soon as possible. Thank you! Created 2014-04-16 14:11:37 | Updated | Tags: unifiedgenotyper haplotypecaller bug gatk variant-calling HI, I'd like to report a weird result from HaplotypeCaller. We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B. However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected) Please see the VCF table: There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype. Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype. Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter. Many Thanks Betty Created 2013-12-17 17:07:40 | Updated 2013-12-17 17:08:47 | Tags: unifiedgenotyper vcf empty Hey there, I was trying to build an analysis pipeline for paired reads with BWA, Duplicate Removal Local Realignment and Base Quality Score Recalibration to finally use GATK's UnifiedGenotyper for SNP and Indel calling. However, for both SNPs and Indels, I receive no called variants no matter how low my used thresholds are. Quality values of the reads look ok, leaving out dbSNP does not change results. I have used the same reference throughout the whole pipeline. I use GATK 2.7, nevertheless a switch to GATK 1.6 did not change anything. This is my shell command for SNP calling on chromosome X (GATK delivers no results for all chromosomes): java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o test.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf Entries in my BAM file look like this: SRR389458.1885965 113 X 10092397 37 76M = 10092397 1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1885965 177 X 10092397 37 76M = 10092397 -1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1888837 113 X 14748343 37 76M = 14748343 1 TCGTGAAAGTCGTTTTAATTTTAGCGGTTATGGGATGGGTCACTGCCTCCAAGTGAAAGATGGAACAGCTGTCAAG 889999:9988;98:9::9;9986::::99:8:::::999988989:8;;9::989:999:9;9:;:99:98:999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:76 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 MQ:i:37 XT:A:U And this is the output of the UnifiedGenotyper: INFO 17:57:00,575 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,578 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 INFO 17:57:00,578 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:57:00,578 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 17:57:00,582 HelpFormatter - Program Args: -T UnifiedGenotyper -R /hana/exchange/reference_genomes/hg19/Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o testX.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:00,583 HelpFormatter - Date/Time: 2013/12/17 17:57:00 INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,943 ArgumentTypeDescriptor - Dynamically determined type of /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf to be VCF INFO 17:57:01,579 GenomeAnalysisEngine - Strictness is SILENT INFO 17:57:02,228 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 17:57:02,237 SAMDataSourceSAMReaders - Initializing SAMRecords in serial INFO 17:57:02,364 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.13 INFO 17:57:02,594 RMDTrackBuilder - Loading Tribble index from disk for file /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:02,867 IntervalUtils - Processing 155270560 bp from intervals INFO 17:57:02,943 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 17:57:03,166 GenomeAnalysisEngine - Done preparing for traversal INFO 17:57:03,167 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:57:03,167 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 17:57:33,171 ProgressMeter - X:11779845 1.01e+07 30.0 s 2.0 s 7.6% 6.6 m 6.1 m INFO 17:58:03,173 ProgressMeter - X:24739805 1.89e+07 60.0 s 3.0 s 15.9% 6.3 m 5.3 m INFO 17:58:33,175 ProgressMeter - X:37330641 3.25e+07 90.0 s 2.0 s 24.0% 6.2 m 4.7 m INFO 17:59:03,177 ProgressMeter - X:49404077 4.94e+07 120.0 s 2.0 s 31.8% 6.3 m 4.3 m INFO 17:59:33,178 ProgressMeter - X:64377965 5.36e+07 2.5 m 2.0 s 41.5% 6.0 m 3.5 m INFO 18:00:03,180 ProgressMeter - X:75606869 7.54e+07 3.0 m 2.0 s 48.7% 6.2 m 3.2 m INFO 18:00:33,189 ProgressMeter - X:88250233 7.74e+07 3.5 m 2.0 s 56.8% 6.2 m 2.7 m INFO 18:01:03,190 ProgressMeter - X:100393213 9.94e+07 4.0 m 2.0 s 64.7% 6.2 m 2.2 m INFO 18:01:33,192 ProgressMeter - X:110535705 1.09e+08 4.5 m 2.0 s 71.2% 6.3 m 109.0 s INFO 18:02:03,193 ProgressMeter - X:121257489 1.20e+08 5.0 m 2.0 s 78.1% 6.4 m 84.0 s INFO 18:02:33,195 ProgressMeter - X:132533757 1.32e+08 5.5 m 2.0 s 85.4% 6.4 m 56.0 s INFO 18:03:03,197 ProgressMeter - X:144498909 1.41e+08 6.0 m 2.0 s 93.1% 6.4 m 26.0 s INFO 18:03:30,079 ProgressMeter - done 1.55e+08 6.4 m 2.0 s 100.0% 6.4 m 0.0 s INFO 18:03:30,079 ProgressMeter - Total runtime 386.91 secs, 6.45 min, 0.11 hours INFO 18:03:30,080 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 150 total reads (0.00%) INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing BadMateFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter INFO 18:03:32,167 GATKRunReport - Uploaded run statistics report to AWS S3 Do I miss anything here? Best, Cindy Created 2013-12-11 09:59:05 | Updated | Tags: unifiedgenotyper Dear all, all my GATK 2.8.1 UG jobs crash with the "Null alleles are not supported" error message. It works fine with 2.7.4 (more precisely the nightly release version as of November 28) but this error message kills all my jobs right now with the just released 2.8.1 version. Based on the traces, it looks like an indel calling issue. It's multisample calling, with reduced reads format. The issue may come from the fact that a handful of BAM files have been reduced with older versions of GATK (something I am currently fixing, but it takes time). The fact that the issue is brand new to 2.8.1 suggests something different though but I can't be sure. Full error message below. Any idea? V ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalArgumentException: Null alleles are not supported at org.broadinstitute.variant.variantcontext.Allele.(Allele.java:117) at org.broadinstitute.sting.utils.haplotype.Haplotype.(Haplotype.java:61) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.trimHaplotypes(PairHMMIndelErrorModel.java:235) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeGeneralReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:438) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeDiploidReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:251) at org.broadinstitute.sting.gatk.walkers.genotyper.IndelGenotypeLikelihoodsCalculationModel.getLikelihoods(IndelGenotypeLikelihoodsCalculationModel.java:149) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:331) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:232) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:367) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:143) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutorWorker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Null alleles are not supported ##### ERROR ------------------------------------------------------------------------------------------ Created 2013-10-16 12:50:25 | Updated | Tags: unifiedgenotyper haplotypecaller dp Dear GATK Team, I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper. I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x. My pipeline is: Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper. Here are the minimum commands that reproduce the discrepancy: java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ --dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \ -R /gatk_bundle/ucsc.hg19.fasta \ -I sample1.rg.bam \ -o sample1.HC.vcf \ -L ROI.bed \ -dt NONE \ -nct 8 Example variant from sample1.HC.vcf: chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0 ... In comparison to using UnifiedGenotyper with exactly the same alignment file: java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ --dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \ -R /gatk_bundle/ucsc.hg19.fasta \ -I sample1.rg.bam \ -o sample1.UG.vcf \ -L ROI.bed \ -nct 4 \ -dt NONE \ -glm BOTH Example variant from sample1.UG.vcf: chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0 I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good: awk '{if (3=="chr17" && $4 > (41245466-100) &&$4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[5]++} END {for(i in count) print count[i], i}' | sort -nr 8764 70 77 0 With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv. Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data? Many thanks in advance, Sam Created 2013-08-22 18:55:31 | Updated | Tags: unifiedgenotyper gatk Dear GATK Team, I am receiving the following error while running GATK 1.6. Unfortunately, for project consistency I cannot update to a more recent version of GATK and would at least wish to understand the source of the error. I ran ValidateSamFile on the input bam files and they appear to be OK. Any insight or advice would be greatly appreciated: ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions? at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:425) at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:374) at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:370) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:445) at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:176) at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:196) at org.broadinstitute.sting.gatk.walkers.genotyper.IndelGenotypeLikelihoodsCalculationModel.getLikelihoods(IndelGenotypeLikelihoodsCalculationModel.java:212) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:235) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:164) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:302) at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:115) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:78) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:63) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:248) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 1.6-22-g3ec78bd): ##### ERROR ##### ERROR Please visit the wiki to see if this is a known problem ##### ERROR If not, please post the error, with stack trace, to the GATK forum ##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki ##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa ##### ERROR ##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions? ##### ERROR ------------------------------------------------------------------------------------------ Abbreviated commandline used: GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -et NO_ET \ -R "Saccharomyces_cerevisiae/UCSC/sacCer2/Sequence/WholeGenomeFasta/genome.fa" \ -dcov 5000 -I "someFile.bam" --output_mode EMIT_ALL_SITES -gvcf -l OFF \ -stand_call_conf 1 -L chrIV:1-1531919 Created 2013-06-12 09:22:28 | Updated | Tags: unifiedgenotyper Dear GATK team, I have used the UG following the best practice GATK workflow to call snps and Indels from exomeseq data of 24 human samples. First I called snps and Indels separately for each bam file, and I obtained separate vcfs. Then I decided to try to call the snps and Indels all in one go. I noticed that the output was quite different and the number of inser/delitions was higher when I called variants in contemporary (starting from separate bam files: -I sample1.bam -I sample2.bam...ETC). I also noticed that the called indels mostly were adjacent to tricky areas such as repetitive elements (ATATATATATATAT) or next to polyAAAAA. These snps and Indels weren't called by the UG when I called the variants separately. Is it more error prone to call variants in contemporary? Created 2013-04-11 21:30:55 | Updated 2013-04-11 21:32:22 | Tags: unifiedgenotyper haplotypecaller miseq Hey there, I'm currently working on a project that uses the MiSeq system to perform very deep sequencing (5000x coverage) on a small set of genes related to a particular cancer type. I have had some odd calls with HaplotypeCaller and UnifiedGenotyper, especially near the edges of the sequenced regions. It seems that often Haplotype calls very large deletions and Unified calls SNPs, e.g. (from the VCFs): Haplotype Caller: chr2 212576988 . TATTTTTAATTGTA T 13.95 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.805;ClippingRankSum=-0.916;DP=1000;FS=8.285;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.836;QD=0.00;ReadPosRankSum=0.528 GT:AD:GQ:PL 0/1:308,53:42:42,0,8418 Unified Genotyper: chr2 212576990 . T C 555.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.991;DP=1000;Dels=0.00;FS=0.000;HaplotypeScore=22.6027;MLEAC=1;MLEAF=0.500;MQ=59.98;MQ0=0;MQRankSum=0.073;QD=0.56;ReadPosRankSum=0.164 GT:AD:DP:GQ:PL 0/1:860,130:3743:99:584,0,32767 A screenshot from IGV showing some of the (few) BAM records in the area of these calls is attached. Many of these false variant calls can be removed through hard filtering, but I was a little surprised to see them called in the first place, because our raw results had been quite good with lower-coverage data. Is this to be expected with high coverage data? I am also wondering if there is a hard cap after which downsampling occurs. I set -dcov 10000 for all our samples, but the depth reported in the DP tag seemed max out at 1000. Is there another parameter I need to change? In some areas, due to the very small regions being sequenced, I have coverage in excess of 15,000 reads, so downsampling to 1000 might skew the results. Any guidance is appreciated! Cheers, Natascha Created 2013-04-10 21:11:11 | Updated 2013-04-10 22:56:24 | Tags: unifiedgenotyper Hi, I am running the unified genotyper in "EMIT_ALL_CONFIDENT_SITES" with the latest version of GATK and am getting some odd quality assignments of 10000030. It is clear that these sites are not very stellar, so I am wondering what is happening to these sites. Any thoughts? Below is a snippet from one such region, starting with some seemingly "normal" rows and ending with this seemingly odd assessment of quality. Any thoughts? As some background info, this sample is quite different than reference, so not abnormal to have lots of missing regions when aligned against the reference that I used. It is bacterial. Here is the command line that I ran: GenomeAnalysisTK.jar -T UnifiedGenotyper -dt NONE -I "i" -R ../../Ref.fasta -baq RECALCULATE -nt 4 -ploidy 1 -out_mode EMIT_ALL_CONFIDENT_SITES -o "$j".vcf -stand_call_conf 100 -stand_emit_conf 100 -mbq 20 Ref 930 . C . 102 . AN=1;DP=2;MQ=150.00;MQ0=0 GT:DP:MLPSAC:MLPSAF 0:2 Ref 931 . C . 107 . AN=1;DP=2;MQ=150.00;MQ0=0 GT:DP:MLPSAC:MLPSAF 0:2 Ref 932 . T . 105 . AN=1;DP=2;MQ=150.00;MQ0=0 GT:DP:MLPSAC:MLPSAF 0:2 Ref 942 . A . 10000030 . DP=1;MQ=150.00;MQ0=0 GT . Ref 1109 . G . 10000030 . DP=2;MQ=150.00;MQ0=0 GT . Ref 1110 . A . 10000030 . DP=2;MQ=150.00;MQ0=0 GT . Ref 1111 . T . 10000030 . DP=2;MQ=150.00;MQ0=0 GT . Ref 1112 . T . 10000030 . DP=2;MQ=150.00;MQ0=0 GT . Ref 1113 . T . 10000030 . DP=2;MQ=150.00;MQ0=0 GT . Ref 1114 . A . 10000030 . DP=2;MQ=150.00;MQ0=0 GT . Ref 1115 . C . 10000030 . DP=2;MQ=150.00;MQ0=0 GT . Created 2013-03-26 11:34:41 | Updated | Tags: unifiedgenotyper snp snps I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ? I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it. Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest. Created 2013-02-22 16:49:52 | Updated 2013-02-22 16:51:24 | Tags: unifiedgenotyper argument Previously I have been running a command like this:  java -jar /path/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R /path/human_g1k_v37.fasta \ -et NO_ET \ -K /path/key \ -out_mode EMIT_ALL_SITES \ --input_file /path/bam \ -L /path/intervals \ -gt_mode GENOTYPE_GIVEN_ALLELES \ --alleles /path/vcf \ --dbsnp /path/dbsnp_135.b37.vcf \ -o /path/my.vcf But I was reading the documentation again and I read this statement: GENOTYPE_GIVEN_ALLELES only the alleles passed in from a VCF rod bound to the -alleles argument will be used for genotyping Which lead me to believe that there wasn't a need to include the lines: --input_file /path/bam \ -L /path/intervals \ because it would be redundant. But when I try to run without those line I get back an error message: Walker requires reads but none were provided. Can you give an explaination as to why both of those lines AND GENOTYPE_GIVEN_ALLELES would be needed? Created 2012-12-11 23:07:30 | Updated | Tags: unifiedgenotyper vcf (There was another question about a similar symptom, but the answer does not appear to apply to what I'm seeing.) I get an empty VCF file that just contains the header lines. The input VCF file is 1.8Gb, and as far as I can tell the content is OK - it has MAPQ scores, the flags seem reasonable, etc. I've attached a copy of the console output, and the beginning of the input file in SAM format. Let me know if you have any suggestions. Thanks, Ravi Created 2012-11-29 17:51:33 | Updated | Tags: unifiedgenotyper depthperallelebysample Hello, I found some strange entries for indels in my VCF file created by the Unified Genotyper. For example: 4 184513470 . TC T 4009 PASS AC=4;AF=0.250;AN=16;BaseQRankSum=1.972;DP=1315;DS;FS=3.466;HaplotypeScore=537.6937;MLEAC=4;MLEAF=0.250;MQ=52.55;MQ0=0;MQRankSum=-10.581;QD=4.55;ReadPosRankSum=-10.128;SB=-3.500e+01;set=variant2 GT:AD:DP:GQ:PL 0/1:230,0:239:99:282,0,5011 0/0:92,0:95:99:0,133,2435 The first sample has genotype 0/1 with a good GQ value. However, according the allele depth field, there is no read supporting the deletion. When I look at the reads using the IGV, I find some reads supporting the deletion for the first sample (and even some for the second one). Moreover, when I looked at the AD values for SNPs, I noticed the the sum of all AD values is much less than the coverage shown in the IGV. I filtered duplicated reads in the IGV. Can someone please give an explanation? This link http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_DepthPerAlleleBySample.html explains the difference between AD and DP, but does not help in my case. Best greetings, Hans-Ulrich Created 2012-09-10 10:50:14 | Updated 2012-09-10 15:30:14 | Tags: unifiedgenotyper haplotypecaller Hi, how does the speed of the haplotypeCaller usually compare to that of the UnifiedGenotyper? When I try to use it, it seems to be about 90x slower. these are the command lines I use: java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T UnifiedGenotyper -dcov 1000 -I file.bam -o file.vcf -L myTarget java -jar -Xmx32g GenomeAnalysisTK.jar -R hg19.fasta -T HaplotypeCaller -dcov 1000 -I file.bam -o file.vcf -L myTarget thanks! Created 2012-08-10 18:19:46 | Updated 2013-01-07 19:45:12 | Tags: unifiedgenotyper vcf indels Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type. Here are deletions: d02 264482 . TT T 389.93 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.224;DP=25;FS=0.000;HaplotypeScore=58.0966;MLEAC=2;MLEAF=1.00;MQ=47.51;MQ0=0;MQRankSum=1.714;QD=15.60;ReadPosRankSum=0.075;SB=-1.112e+02 GT:AD:DP:GQ:PL 1/1:15,9:20:9:431,9,0 Ld04 26597 . CCC C 2923.96 . AC=2;AF=1.00;AN=2;BaseQRankSum=-2.217;DP=98;FS=2.954;HaplotypeScore=145.2957;MLEAC=2;MLEAF=1.00;MQ=59.82;MQ0=0;MQRankSum=0.283;QD=29.84;ReadPosRankSum=1.150;SB=-1.306e+03 GT:AD:DP:GQ:PL 1/1:58,31:98:99:2966,195,0 Here are insertions: Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0 Ld04 26880 . T TTATT 4973 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.552;DP=99;FS=2.935;HaplotypeScore=116.0082;MLEAC=2;MLEAF=1.00;MQ=59.46;MQ0=0;MQRankSum=0.192;QD=50.23;ReadPosRankSum=1.834;SB=-1.950e+03 GT:AD:DP:GQ:PL 1/1:49,38:98:99:4973,247,0 Here are SNPs Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0 Ld04 19963 . A G 3571.01 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.895;DP=96;Dels=0.00;FS=0.000;HaplotypeScore=2.4678;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=1.769;QD=37.20;ReadPosRankSum=-0.316;SB=-1.523e+03 GT:AD:DP:GQ:PL 1/1:3,93:96:99:3571,205,0 I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools. PS I used this command line of V 2.0-38: java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf Thank you. Hideo Created 2012-08-01 16:50:30 | Updated 2012-08-01 16:55:22 | Tags: unifiedgenotyper fslockwithshared We have a SGE environment where we run our GATK jobs. Lately we have had a few cases of UnifiedGenotyper getting hung up while trying to read our previously created .fai and .dict reference files. On one occasion we had 6 jobs on different nodes all hung for about 5 days before finally giving the FSLockWithShared - WARNING messages and then continuing on just fine. We haven't been able to find out why this is happening. Do you know of anything that would cause UnifiedGenotyper to try and read those files before ultimately throwing the warning messages 5 days later? We looked into the possibility of other users/jobs having an exclusive lock on those reference files, but we had other UnifiedGenotyper jobs run fine during that same time period without these problems. We also haven't been able to find any hardware problems yet. The jobs with this problem were on multiple nodes, so I thought I would see if you have any ideas. Thanks for your help! Using strace we were able to find that all 6 of the UnifiedGenotyper processes were hung in the same spot: $ strace -p 22924
Process 22924 attached - interrupt to quit
futex(0x41c319d0, FUTEX_WAIT, 22925, NULL <unfinished ...>

strace -p 22925 Process 22925 attached - interrupt to quit write(1, "WARN 12:39:09,312 FSLockWithSha"..., 158) = 158 write(1, "INFO 12:39:09,364 ReferenceData"..., 116) = 116 write(1, "INFO 12:39:09,364 ReferenceData"..., 89) = 89 open("/reference/sequence/human/ncbi/37.1/allchr.fa.fai", O_RDONLY) = 16 fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0 fstat(16, {st_mode=S_IFREG|0770, st_size=2997, ...}) = 0 fcntl(16, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=0, len=9223372036854775807} <unfinished ...> Here is the log file for one of the jobs: Thu Jul 19 09:42:53 CDT 2012 INFO 09:42:58,047 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:42:58,048 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.6-7-g2be5704, Compiled 2012/05/25 16:27:30 INFO 09:42:58,048 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 09:42:58,048 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki INFO 09:42:58,048 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa INFO 09:42:58,049 HelpFormatter - Program Args: -R /reference/sequence/human/ncbi/37.1/allchr.fa -et NO_ET -K /p rojects/bsi/bictools/apps/alignment/GenomeAnalysisTK/1.6-7-g2be5704//key -T UnifiedGenotyper --output _mode EMIT_ALL_SITES --min_base_quality_score 20 -nt 4 --max_alternate_alleles 5 -glm BOTH -L /data2/bsi/target.bed -I /data2/bsi/H-001.chr22-sorted.bam --out /data2/bsi/variants.chr22.raw.all.vcf INFO 09:42:58,050 HelpFormatter - Date/Time: 2012/07/19 09:42:58 INFO 09:42:58,050 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:42:58,050 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:42:58,166 GenomeAnalysisEngine - Strictness is SILENT WARN 12:39:09,312 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.dict: Protocol family not supported. INFO 12:39:09,364 ReferenceDataSource - Unable to create a lock on dictionary file: Protocol family not supported INFO 12:39:09,364 ReferenceDataSource - Treating existing dictionary file as complete. WARN 12:39:12,527 FSLockWithShared - WARNING: Unable to lock file /reference/sequence/human/ncbi/37.1/allchr.fa.fai: Protocol family not supported. INFO 12:39:12,528 ReferenceDataSource - Unable to create a lock on index file: Protocol family not supported INFO 12:39:12,528 ReferenceDataSource - Treating existing index file as complete. INFO 12:39:12,663 SAMDataSourceSAMReaders - Initializing SAMRecords in serial
INFO  12:39:12,954 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.29 INFO 12:39:13,108 MicroScheduler - Running the GATK in parallel mode with 4 concurrent threads WARN 12:39:13,864 UnifiedGenotyper - WARNING: note that the EMIT_ALL_SITES option is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode INFO 12:39:14,361 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:39:14,568 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.21 INFO 12:39:14,569 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:39:14,627 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 INFO 12:39:14,628 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
Tue Jul 24 12:47:32 CDT 2012