# Tagged with #vcf 6 documentation articles | 0 announcements | 89 forum discussions

Created 2012-08-11 06:46:51 | Updated 2016-01-04 22:37:34 | Tags: vcf bam script reordersam sorting contig-order

This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order. The GATK can be particular about the ordering BAM and VCF files so it will fail with an error in this case.

So what do you do?

### For BAM files

You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:

java -jar picard.jar ReorderSam \
I=original.bam \
O=reordered.bam \
R=reference.fasta \
CREATE_INDEX=TRUE

Where reference.fasta is your genome reference, which must be accompanied by a valid *.dict dictionary file. The CREATE_INDEX argument is optional but useful if you plan to use the resulting file directly with GATK (otherwise you'll need to run another tool to create an index).

Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad or not, depending on what you want). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!

### For VCF files

You run Picard's SortVcf tool on your VCF file, using the reference genome dictionary as a template, like this:

java -jar picard.jar SortVcf \
I=original.vcf \
O=sorted.vcf \
SEQUENCE_DICTIONARY=reference.dict 

Where reference.dict is the sequence dictionary of your genome reference.

Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.

Created 2012-08-06 17:28:04 | Updated 2016-01-19 12:53:20 | Tags: vcf annotation

#### Contents

1. What is VCF?
2. Basic structure of a VCF file
3. Interpreting the VCF file header information
4. Structure of variant call records
5. How the genotype and other sample-level information is represented
6. How to extract information from a VCF in a sane, straightforward way

### 1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

• Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

• NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

• Don't write home-brewed VCF parsing scripts. It never ends well.

### 2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also include the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
1   873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1   974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.

### 3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta

We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

• VCF spec version

The first line:

##fileformat=VCFv4.1

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

• FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in its FILTER field (see how records are structured further below).

• FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.

• GATKCommandLine

The GATKCommandLine lines contain all the parameters that went used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

• Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

[todo: FAQ on genome builds]

### 4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

#### Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

• CHROM and POS : The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.

• ID: An optional identifier for the variant. Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.

• REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

• QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic. Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

• FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

• INFO: Various site-level annotations. The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by colons, ie ; as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

#### Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.

### 5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1   873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

Looking at that last column, here is what the tags mean:

• GT : The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

• 0/0 - the sample is homozygous reference
• 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
• 1/1 - the sample is homozygous alternate In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively. For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.

• PL : Normalized Phred-scaled likelihoods of the possible genotypes. For the typical case of a monomorphic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are normalized so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale (meaning its P = 1.0 in regular scale). The other values are scaled relative to this most likely genotype. Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "likelihood of the genotype", we mean it is "the probability that the genotype is not correct". That's why the smaller the value, the better it is. [todo: PL details doc]

• GQ : Quality of the assigned genotype. The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99. Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low. Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but be may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.

### 6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)

Created 2012-07-23 17:50:07 | Updated 2015-05-15 23:55:20 | Tags: selectvariants jexl vcf callset

### This document describes why you might want to extract a subset of variants from a callset and how you would achieve this.

Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). The GATK tool that we use the most for subsetting calls in various ways is SelectVariants; it enables easy and convenient subsetting of VCF files according to many criteria.

Select Variants operates on VCF files (also sometimes referred to as ROD in our documentation, for Reference Ordered Data) provided at the command line using the GATK's built in --variant option. You can provide multiple VCF files for Select Variants, but at least one must be named 'variant' and this will be the file (or set of files) from which variants will be selected. Other files can be used to modify the selection based on concordance or discordance between the callsets (see --discordance / --concordance arguments in the tool documentation).

There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name, or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency. This is done using JEXL expressions; make sure to read the linked document for details, especially the section on working with complex expressions.

Note that in the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset. See further below for an explanation of how that works.

### Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.

### Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

• isVariant => is there an ALT allele?

• isPolymorphic => is some sample non-ref in the samples?

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.

### How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50

In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30

The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.

with AN=2, AC=1, AF=0.5, and DP=20.

For information on how to construct regular expressions for use with this tool, see the method article on variant filtering with JEXL, or "Summary of regular-expression constructs" section here for more hardcore reading.

Created 2012-07-23 17:45:57 | Updated 2015-05-16 02:26:36 | Tags: combinevariants vcf callset

### Solutions for combining variant callsets depending on purpose

There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

1. The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or gVCF files) per-chromosome. For that case, you can use a tool called CatVariants to concatenate the files. There are a few important requirements (e.g. the files should contain all the same samples, and distinct intervals) which you can read about on the tool's documentation page.

2. The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION to call variants on a large cohort, producing many gVCF files. We recommend combining the output gVCF in batches of e.g. 200 before putting them through joint genotyping with GenotypeGVCFs (for performance reasons), which you can do using CombineGVCFs, which is specific for handling gVCF files.

3. The third case is when you want to combine variant calls that were produced from the same samples but using different methods, for comparison. For example, if you're evaluating variant calls produced by different variant callers, different workflows, or the same but using different parameters. This produces separate callsets for the same samples, which are then easier to compare if you combine them into a single file. For that purpose, you can use CombineVariants, which is capable of merging VCF records intelligently, treating the same samples as separate or not as desired, combining annotations as appropriate. This is the case that requires the most preparation and forethought because there are many options that may be used to adapt the behavior of the tool.

There is also one reason you might want to combine variants from different files into one, that we do not recommend following. That is, if you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.

### Merging records across VCFs with CombineVariants

Here we provide some more information and a worked out example to illustrate the third case because it is less straightforward than the other two.

A key point to understand is that CombineVariants will include a record at every site in all of your input VCF files, and annotate in which input callsets the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. Any part of the Venn of the N merged VCFs can then be extracted specifically using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.

#### Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the tool documentation page linked above.

#### Understanding the set attribute

The set property of the INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

• set=Intersection : occurred in both call sets, not filtered out

• set=NAME : occurred in the call set NAME only

• set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2

• set=filteredInAll : occurred in both call sets, but was filtered out of both

For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.

You specify the NAME of a callset is by using the following syntax in your command line: -V:omni 1000G_omni2.5.b37.sites.vcf.

#### Emitting minimal VCF output

You can add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. In that case, the only fields emitted will be GT:GQ for genotypes and the keySet for INFO.

An even more extreme output format is -sites_only (a general engine capability listed in the CommandLineGATK documentation) where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

#### Requiring sites to be present in a minimum number of callsets

Sometimes you may want to combine several data sets but you only keep sites that are present in at least 2 of them. To do so, simply add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. In our example, you would add -minN 2 to the command line.

#### Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

# combine the data
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf

# select the intersection
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection";' -o intersect.vcf

This results in two vcf files, which look like:

# contents of union.vcf
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

# contents of intersect.vcf
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection

Created 2012-07-23 17:36:42 | Updated 2013-06-10 18:24:01 | Tags: snpeff variantannotator vcf tooltips callset annotation

### IMPORTANT NOTE: This document is out of date and will be replaced soon. In the meantime, you can find accurate information on how to run SnpEff in a compatible way with GATK in the SnpEff documentation, and instructions on what steps are necessary in the presentation on Functional Annotation linked in the comments below.

Our testing has shown that not all combinations of snpEff/database versions produce high-quality results. Be sure to read this document completely to familiarize yourself with our recommended best practices BEFORE running snpEff.

### Introduction

Until recently we were using an in-house annotation tool for genomic annotation, but the burden of keeping the database current and our lack of ability to annotate indels has led us to employ the use of a third-party tool instead. After reviewing many external tools (including annoVar, VAT, and Oncotator), we decided that SnpEff best meets our needs as it accepts VCF files as input, can annotate a full exome callset (including indels) in seconds, and provides continually-updated transcript databases. We have implemented support in the GATK for parsing the output from the SnpEff tool and annotating VCFs with the information provided in it.

### SnpEff Setup and Usage

Download the SnpEff core program. If you want to be able to run VariantAnnotator on the SnpEff output, you'll need to download a version of SnpEff that VariantAnnotator supports from this page (currently supported versions are listed below). If you just want the most recent version of SnpEff and don't plan to run VariantAnnotator on its output, you can get it from here.

After unzipping the core program, open the file snpEff.config in a text editor, and change the "database_repository" line to the following:

database_repository = http://sourceforge.net/projects/snpeff/files/databases/

java -jar snpEff.jar download GRCh37.64

You can find a list of available databases here. The human genome databases have GRCh or hg in their names. You can also download the databases directly from the SnpEff website, if you prefer.

The download command by default puts the databases into a subdirectory called data within the directory containing the SnpEff jar file. If you want the databases in a different directory, you'll need to edit the data_dir entry in the file snpEff.config to point to the correct directory.

Run SnpEff on the file containing your variants, and redirect its output to a file. SnpEff supports many input file formats including VCF 4.1, BED, and SAM pileup. Full details and command-line options can be found on the SnpEff home page.

### Supported SnpEff Versions

If you want to take advantage of SnpEff integration in the GATK, you'll need to run SnpEff version *2.0.5. Note: newer versions are currently unsupported by the GATK, as we haven't yet had the reources to test it.

### Current Recommended Best Practices When Running SnpEff

These best practices are based on our analysis of various snpEff/database versions as described in detail in the Analysis of SnpEff Annotations Across Versions section below.

• We recommend using only the GRCh37.64 database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations due to a regression in the ENSEMBL Release 65 GTF file used to build the database. This regression has been acknowledged by ENSEMBL and is supposedly fixed as of 1-30-2012; however as we have not yet tested the fixed version of the database we continue to recommend using only GRCh37.64 for now.

• We recommend always running with -onlyCoding true with human databases (eg., the GRCh37. databases). Setting -onlyCoding false causes snpEff to report all transcripts as if they were coding (even if they're not), which can lead to nonsensical results. The -onlyCoding false option should only* be used with databases that lack protein coding information.

• Do not trust annotations from versions of snpEff prior to 2.0.4. Older versions of snpEff (such as 2.0.2) produced many incorrect annotations due to the presence of a certain number of nonsensical transcripts in the underlying ENSEMBL databases. Newer versions of snpEff filter out such transcripts.

### Analyses of SnpEff Annotations Across Versions

See our analysis of the SNP annotations produced by snpEff across various snpEff/database versions here.

• Both snpEff 2.0.2 + GRCh37.63 and snpEff 2.0.5 + GRCh37.65 produce an abnormally high Missense:Silent ratio, with elevated levels of Missense mutations across the entire spectrum of allele counts. They also have a relatively low (~70%) level of concordance with the 1000G Gencode annotations when it comes to Silent mutations. This suggests that these combinations of snpEff/database versions incorrectly annotate many Silent mutations as Missense.

• snpEff 2.0.4 RC3 + GRCh37.64 and snpEff 2.0.5 + GRCh37.64 produce a Missense:Silent ratio in line with expectations, and have a very high (~97%-99%) level of concordance with the 1000G Gencode annotations across all categories.

See our comparison of SNP annotations produced using the GRCh37.64 and GRCh37.65 databases with snpEff 2.0.5 here

• The GRCh37.64 database gives good results on the condition that you run snpEff with the -onlyCoding true option. The -onlyCoding false option causes snpEff to mark all transcripts as coding, and so produces many false-positive Missense annotations.

• The GRCh37.65 database gives results that are as poor as those you get with the -onlyCoding false option on the GRCh37.64 database. This is due to a regression in the ENSEMBL release 65 GTF file used to build snpEff's GRCh37.65 database. The regression has been acknowledged by ENSEMBL and is due to be fixed shortly.

See our analysis of the INDEL annotations produced by snpEff across snpEff/database versions here

• snpEff's indel annotations are highly concordant with those of a high-quality set of genomic annotations from the 1000 Genomes project. This is true across all snpEff/database versions tested.

### Example SnpEff Usage with a VCF Input File

Below is an example of how to run SnpEff version 2.0.5 with a VCF input file and have it write its output in VCF format as well. Notice that you need to explicitly specify the database you want to use (in this case, GRCh37.64). This database must be present in a directory of the same name within the data_dir as defined in snpEff.config.

java -Xmx4G -jar snpEff.jar eff -v -onlyCoding true -i vcf -o vcf GRCh37.64 1000G.exomes.vcf > snpEff_output.vcf

In this mode, SnpEff aggregates all effects associated with each variant record together into a single INFO field annotation with the key EFF. The general format is:

EFF=Effect1(Information about Effect1),Effect2(Information about Effect2),etc.

And here is the precise layout with all the subfields:

EFF=Effect1(Effect_Impact|Effect_Functional_Class|Codon_Change|Amino_Acid_Change|Gene_Name|Gene_BioType|Coding|Transcript_ID|Exon_ID),Effect2(etc...

It's also possible to get SnpEff to output in a (non-VCF) text format with one Effect per line. See the SnpEff home page for full details.

### Adding SnpEff Annotations using VariantAnnotator

Once you have a SnpEff output VCF file, you can use the VariantAnnotator walker to add SnpEff annotations based on that output to the input file you ran SnpEff on.

There are two different options for doing this:

#### Option 1: Annotate with only the highest-impact effect for each variant

NOTE: This option works only with supported SnpEff versions as explained above. VariantAnnotator run as described below will refuse to parse SnpEff output files produced by other versions of the tool, or which lack a SnpEff version number in their header.

The default behavior when you run VariantAnnotator on a SnpEff output file is to parse the complete set of effects resulting from the current variant, select the most biologically-significant effect, and add annotations for just that effect to the INFO field of the VCF record for the current variant. This is the mode we plan to use in our Production Data-Processing Pipeline.

When selecting the most biologically-significant effect associated with the current variant, VariantAnnotator does the following:

• Prioritizes the effects according to the categories (in order of decreasing precedence) "High-Impact", "Moderate-Impact", "Low-Impact", and "Modifier", and always selects one of the effects from the highest-priority category. For example, if there are three moderate-impact effects and two high-impact effects resulting from the current variant, the annotator will choose one of the high-impact effects and add annotations based on it. See below for a full list of the effects arranged by category.

• Within each category, ties are broken using the functional class of each effect (in order of precedence: NONSENSE, MISSENSE, SILENT, or NONE). For example, if there is both a NON_SYNONYMOUS_CODING (MODERATE-impact, MISSENSE) and a CODON_CHANGE (MODERATE-impact, NONE) effect associated with the current variant, the annotator will select the NON_SYNONYMOUS_CODING effect. This is to allow for more accurate counts of the total number of sites with NONSENSE/MISSENSE/SILENT mutations. See below for a description of the functional classes SnpEff associates with the various effects.

• Effects that are within a non-coding region are always considered lower-impact than effects that are within a coding region.

Example Usage:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-A SnpEff \
--variant 1000G.exomes.vcf \        (file to annotate)
--snpEffFile snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf

VariantAnnotator adds some or all of the following INFO field annotations to each variant record:

• SNPEFF_EFFECT - The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)
• SNPEFF_IMPACT - Impact of the highest-impact effect resulting from the current variant (HIGH, MODERATE, LOW, or MODIFIER)
• SNPEFF_FUNCTIONAL_CLASS - Functional class of the highest-impact effect resulting from the current variant (NONE, SILENT, MISSENSE, or NONSENSE)
• SNPEFF_CODON_CHANGE - Old/New codon for the highest-impact effect resulting from the current variant
• SNPEFF_AMINO_ACID_CHANGE - Old/New amino acid for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_NAME - Gene name for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_BIOTYPE - Gene biotype for the highest-impact effect resulting from the current variant
• SNPEFF_TRANSCRIPT_ID - Transcript ID for the highest-impact effect resulting from the current variant
• SNPEFF_EXON_ID - Exon ID for the highest-impact effect resulting from the current variant

Example VCF records annotated using SnpEff and VariantAnnotator:

1   874779  .   C   T   279.94  . AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.4493;InbreedingCoeff=-0.0045;
SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;
SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066

1   874816  .   C   CT  2527.52 .   AC=15;AF=0.0484;AN=310;BaseQRankSum=-11.876;DP=4718;FS=48.575;HRun=1;HaplotypeScore=91.9147;InbreedingCoeff=-0.0520;
SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000342066

#### Option 2: Annotate with all effects for each variant

VariantAnnotator also has the ability to take the EFF field from the SnpEff VCF output file containing all the effects aggregated together and copy it verbatim into the VCF to annotate.

Here's an example of how to do this:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-E resource.EFF \
--variant 1000G.exomes.vcf \      (file to annotate)
--resource snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf

Of course, in this case you can also use the VCF output by SnpEff directly, but if you are using VariantAnnotator for other purposes anyway the above might be useful.

### List of Genomic Effects

Below are the possible genomic effects recognized by SnpEff, grouped by biological impact. Full descriptions of each effect are available on this page.

#### High-Impact Effects

• SPLICE_SITE_ACCEPTOR
• SPLICE_SITE_DONOR
• START_LOST
• EXON_DELETED
• FRAME_SHIFT
• STOP_GAINED
• STOP_LOST

#### Moderate-Impact Effects

• NON_SYNONYMOUS_CODING
• CODON_CHANGE (note: this effect is used by SnpEff only for MNPs, not SNPs)
• CODON_INSERTION
• CODON_CHANGE_PLUS_CODON_INSERTION
• CODON_DELETION
• CODON_CHANGE_PLUS_CODON_DELETION
• UTR_5_DELETED
• UTR_3_DELETED

#### Low-Impact Effects

• SYNONYMOUS_START
• NON_SYNONYMOUS_START
• START_GAINED
• SYNONYMOUS_CODING
• SYNONYMOUS_STOP
• NON_SYNONYMOUS_STOP

#### Modifiers

• NONE
• CHROMOSOME
• CUSTOM
• CDS
• GENE
• TRANSCRIPT
• EXON
• INTRON_CONSERVED
• UTR_5_PRIME
• UTR_3_PRIME
• DOWNSTREAM
• INTRAGENIC
• INTERGENIC
• INTERGENIC_CONSERVED
• UPSTREAM
• REGULATION
• INTRON

### Functional Classes

SnpEff assigns a functional class to certain effects, in addition to an impact:

• NONSENSE: assigned to point mutations that result in the creation of a new stop codon
• MISSENSE: assigned to point mutations that result in an amino acid change, but not a new stop codon
• SILENT: assigned to point mutations that result in a codon change, but not an amino acid change or new stop codon
• NONE: assigned to all effects that don't fall into any of the above categories (including all events larger than a point mutation)

The GATK prioritizes effects with functional classes over effects of equal impact that lack a functional class when selecting the most significant effect in VariantAnnotator. This is to enable accurate counts of NONSENSE/MISSENSE/SILENT sites.

Created 2012-07-23 16:49:34 | Updated 2016-01-08 13:55:52 | Tags: variantrecalibrator vqsr applyrecalibration vcf callset variantrecalibration

This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. The first section is a high-level overview aimed at non-specialists. Additional technical details are provided below.

For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article. See the VariantRecalibrator tool doc and the ApplyRecalibration tool doc for a complete description of available command line arguments.

As a complement to this document, we encourage you to watch the workshop videos available on our Events webpage.

## High-level overview

VQSR stands for “variant quality score recalibration”, which is a bad name because it’s not re-calibrating variant quality scores at all; it is calculating a new quality score that is supposedly super well calibrated (unlike the variant QUAL score which is a hot mess) called the VQSLOD (for variant quality score log-odds). I know this probably sounds like gibberish, stay with me. The purpose of this new score is to enable variant filtering in a way that allows analysts to balance sensitivity (trying to discover all the real variants) and specificity (trying to limit the false positives that creep in when filters get too lenient) as finely as possible.

The basic, traditional way of filtering variants is to look at various annotations (context statistics) that describe e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation; things like that -- then choose threshold values and throw out any variants that have annotation values above or below the set thresholds. The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.

The VQSR method, in a nutshell, uses machine learning algorithms to learn from each dataset what is the annotation profile of good variants vs. bad variants, and does so in a way that integrates information from multiple dimensions (like, 5 to 8, typically). The cool thing is that this allows us to pick out clusters of variants in a way that frees us from the traditional binary choice of “is this variant above or below the threshold for this annotation?”

Let’s do a quick mental visualization exercise (pending an actual figure to illustrate this), in two dimensions because our puny human brains work best at that level. Imagine a topographical map of a mountain range, with North-South and East-West axes standing in for two variant annotation scales. Your job is to define a subset of territory that contains mostly mountain peaks, and as few lowlands as possible. Traditional hard-filtering forces you to set a single longitude cutoff and a single latitude cutoff, resulting in one rectangular quadrant of the map being selected, and all the rest being greyed out. It’s about as subtle as a sledgehammer and forces you to make a lot of compromises. VQSR allows you to select contour lines around the peaks and decide how low or how high you want to go to include or exclude territory within your subset.

How this is achieved is another can of worms. The key point is that we use known, highly validated variant resources (omni, 100 Genomes, hapmap) to select a subset of variants within our callset that we’re really confident are probably true positives (that’s the training set). We look at the annotation profiles of those variants (in our own data!), and we from that we learn some rules about how to recognize good variants. We do something similar for bad variants as well. Then we apply the rules we learned to all of the sites, which (through some magical hand-waving) yields a single score for each variant that describes how likely it is based on all the examined dimensions. In our map analogy this is the equivalent of determining on which contour line the variant sits. Finally, we pick a threshold value indirectly by asking the question “what score do I need to choose so that e.g. 99% of the variants in my callset that are also in hapmap will be selected?”. This is called the target sensitivity. We can twist that dial in either direction depending on what is more important for our project, sensitivity or specificity.

## Technical overview

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.

The variant recalibrator contrastively evaluates variants in a two step process, each performed by a distinct tool:

• VariantRecalibrator
Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants. This step produces a recalibration file.

• ApplyRecalibration
Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the specified lod threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.

### How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.

### How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), what happens is that the program looks at what is the VQSLOD value above which 99.9% of the variants in the training callset are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.

### Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!

### Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority than one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (90) which has the lowest value of truth sensitivity but the highest value of novel Ti/Tv, is exceedingly specific but less sensitive. Each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.

### Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

• The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
• The truth sensitivity (TS) approach is conceptual cleaner than deciding on a novel Ti/Tv target for your dataset
• The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality (~99% truly variable in the population) sets of sites should work just as well. In our experience, with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.

### Finally, a couple of Frequently Asked Questions

#### - Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

#### - Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside. We also make use of the ggplot2 library so please be sure to install that package as well. See the Common Problems section of the Guide for more details.

No posts found with the requested search criteria.

Created 2016-02-10 10:31:35 | Updated | Tags: vcf qc

Hi,

Is there a tool in GATK - or in any other program - which counts the number of different genotype calls per individual in a vcf?

Specifically this would be the total number of hom.ref, het, hom.var and no.call genotype calls - for each individual.

This would be fairly straightforward to do on a small-ish and simple genotypes file, with perl, R etc but I'd rather do it on a vcf.

Sincerely,

William Gilks

Created 2016-02-03 20:40:44 | Updated | Tags: vcf variant-calling

From the GATK documentation I found that for PL field, the most likely genotype (assigned in the GT field) is 0. However, from the vcf file called by GATK, I found that this is not always the case. Below are a few examples:

GT:AD:DP:GQ:PL 0/1:0,232:444:99:0,382,4800 1/1:1,0:1:2:1,0,0 1/1:1,0:1:99:0,3,35 1/1:1,0:1:87:0,292,247

Also, GQ is equal to the second smallest PL, unless that PL is greater than 99. However, I also found cases like the following, where the GQ is 1.76:

I am wondering that whether they have specific meanings or it is due to calling error, or due to the low quality of calling. Thank you very much.

Created 2016-02-02 19:41:46 | Updated | Tags: vcf

Hi

In GATK version 3.5 I see the following in VCF headers:

##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

However, the number of values in the AD field should always be the number of alleles (including the reference), right? The 4.2 VCF spec has a value R to represent this. Therefore, could the header line be changed to the following in future GATK releases?

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

Why is this important? Well, one reason is that bcftools norm uses this when splitting multiallelic variants into multiple biallelics. With the '.' this isn't done correctly, but with the 'R' it is. See the comment from freeseek at https://github.com/samtools/bcftools/issues/40 for further details.

Created 2016-01-11 14:47:05 | Updated | Tags: vcf r vcf-filtering

Hi,

I'm trying to finalise good hard-filtering parameters. Does anyone know why the quality-by-depth distribution having has two peaks (See attached graphs, last column"QD").

This happens even after very strict filtering by basic metrics. There seems to be a lot of variants contributing to the two peaks so I'm guessing it's not due to a particular genomic region. (The graph lines are Drosophila chromosomes. Chr4 in blue is clearly poor. The Inbreeding coefficient and allele frequency are expected to be weird-looking due to our breeding design.)

Filter parameters are (MQ > 61, MQ < 68, FS < 5, AN > 420, InbreedingCoeff > -1, DP < 10000, DP > 1000, ReadPosRankSum > -1, ReadPosRankSum < 1, ClippingRankSum > -.5, ClippingRankSum < .5, BaseQRankSum > -1, BaseQRankSum < 1, MQRankSum > -.5, MQRankSum < .5, EVENTLENGTH < 1, EVENTLENGTH > -1).

Cheers,

Will

Created 2015-12-15 23:23:57 | Updated | Tags: readbackedphasing vcf bam htsjdk

I got a Contig X does not have a length field error when I run the phasing tool. By digging in more, it seems a user exception from htsjdk code, here is the trace of the coding line, has anyone experienced this issue before? Is it a problem in contig reference sequence or bad VCF file, or bad bam file?

Thanks for any help,

--------------------------------error exception raised in following code line-------------------------------------------------------------------

public SAMSequenceRecord getSAMSequenceRecord() { final String lengthString = this.getGenericFieldValue("length"); if (lengthString == null) throw new TribbleException("Contig " + this.getID() + " does not have a length field.");

which is called by RMDTrackBuilder.java validateVariantAgainstSequenceDictionary(final String name, final String descriptorName, final AbstractFeatureReader reader, final SAMSequenceDictionary dict )

in vcfContigRecords.add(contig.getSAMSequenceRecord()); -- crashed here

Created 2015-11-25 21:38:23 | Updated 2015-11-25 21:40:23 | Tags: selectvariants vcf gatk

Hi,

I have multi-sample vcf file and an example variant is shown below: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 03-071 04-051 04-071 06-044 07-085 10-009

chr1 6526093 . T C 197.77 . AC1=1;AC=1;AF1=0.5 GT:GQ:DP:PL:AD 0/1 1/1 0/0 1/1 1/1 0/1

For each variant i would need to retrieve the sample names based on genotype. If the genotype is "0/1" could it be possible to select only the samples for which the genotype is "0/1" and similarly for the genotype "1/1"?

SelectVariants filters variants based on the different quality values provided. Here i would need to subset the vcf by sample for each variant based on the genotype. Are there any tools which can do this?

Created 2015-11-17 15:48:32 | Updated | Tags: vcf

Hi, I have been scouring the forums but have not found an answer that is clear to me yet. Perhaps someone here can clarify for me please? I am confused about the different DP values in my vcf (GATK HC v 3.2-2).

example: 17 7579472 rs1042522 G C 515.77 . AC=2;AF=1.00;AN=2;DB;DP=65;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=7.93 GT:AD:DP:GQ:PL 1/1:0,22:22:66:544,66,0

The first DP value is 65 my understanding that this values is across all samples. In this case the vcf is for 1 sample. That leads me to assume that the DP value of 22 in the last column reflects the good quality reads at this position used for variant calling. Can I interpret this as there are 22 good quality reads at this position and all those reads are the non reference C nucleotide? The other 43 reads were of poor quality or shifted in the recalibration and therefore not considered at the variant calling step? Thank you****

Created 2015-11-05 02:36:31 | Updated | Tags: vcf fastaalternatereferencemaker incompatible-contigs

Hello, I am using FastaAlterenateReferenceMaker to generate new reference genome using SNPs in .vcf file. I used following command: java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R genome.fa -o newgenome.fa -V snps.vcf

I got this error: ##### ERROR ##### ERROR MESSAGE: Input files variant and reference have incompatible contigs: The contig order in variant and referenceis not the same; to fix this please see (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files. ##### ERROR variant contigs = [10, 1, 2, 3, 4, 5, 6, 7, 8, 9, MT, Pt] ##### ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, MT, Pt]

vfc file contains SNPs for only one genome (ch6) and does not have any contig description line, and using the script in the link didn't work. I tried changing the order of the genome.fa to be same as variant contigs but this time, the chromosomes of the newgenome.fa were wrong (e.g. ch7 had the sequence of ch6, ch1 had the sequence of ch10, etc.)

How can I fix this? Thank you in advance.

Created 2015-10-12 19:01:16 | Updated | Tags: vcf bam

I have a problem processing a BAM file with GATK 3.4-46. This is my command:

java -Xmx3g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly19.fasta -T UnifiedGenotyper -I /tmp/foo.bam -L Y -rf BadCigar -o /tmp/foo.vcf --output_mode EMIT_ALL_CONFIDENT_SITES

and I get an error:

ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@32734443 appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 63; please see the GATK --help documentation for options related to this error I searched on the web and found that adding some more flags has helped some people to resolve the problem so I modified the command: java -Xmx3g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly19.fasta -T UnifiedGenotyper -I /tmp/foo.bam -L Y -rf BadCigar -o /tmp/foo.vcf --output_mode EMIT_ALL_CONFIDENT_SITES --fix_misencoded_quality_scores -fixMisencodedQuals Now it fails in a different way: ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool Any idea how to fix this? Created 2015-10-10 09:58:15 | Updated | Tags: vcf Dear all, When I combined multi-sample variants with tumor-normal comparison vcf files, following errors emerged. It is normal for triallelic variants in tumor. Could you please tell me how to fix the error? Thanks a lot. ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.4-46-gbc02625): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 2457: unparsable vcf record with allele A/C ##### ERROR ------------------------------------------------------------------------------------------ Created 2015-10-06 20:13:14 | Updated | Tags: vcf indels gatk Have a long standing stable pipeline, which I need to tweak as can be explained and justified. What is the difference between the old file of known indels (from Mills paper(s)) - Homo_sapiens_assembly19.kown_indels.vcf And the current version of known indels with 1000G added - Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz I know the newer reference is smaller - just sites. It also has significantly fewer records/line counts. What else was removed or added from the old version until now? Thanks in advance. Created 2015-09-28 14:11:22 | Updated | Tags: vcf rna-seq Hello All, I am using I am using GATK RNA-seq variant pipeline for finding muttaion/vatiants on the list of gene given in teh follwoing command line java-1.7 -jar -Xincgc -Xmx1586M GenomeAnalysisTK-3.2-2.jar -T HaplotypeCaller --filter_reads_with_N_cigar -R human_genome37_gatk.fa -D dbsnp_137.hg19.vcf -I sample_split.bam -o sample.vcf -L mylist.intervals And the resulting VCF files has for variants AF either 100 % or 50 % . It would be great if anyone would explain me what does AF means in INFO column from VCF file. I AF means allele frequency or it has to be calculated separately from VCF file and if so how can I do it..??? Thank you Created 2015-09-17 01:02:25 | Updated 2015-09-17 01:03:59 | Tags: snp vcf variant-calling I am dealing with my data using vcftools, in the generated vcf file, I found that when DP=0, why it stills give a Genotype, like 0/1 0/0 or something? I attached a pic, a screenshot from a vcf file. there are 2 lines showed DP=0,and still have a GT value! The question bothered me for a long time, HELP!!! Created 2015-09-14 16:27:46 | Updated | Tags: haplotypecaller vcf bam rnaseq variant-calling Hello, I'm using GATK to call variants in my RNA-Seq data. I'm noticing something strange, perhaps someone can help? For a number of sites the VCF is reporting things I cannot replicate from BAMs. How can I recover the reads that contribute to a variant call? Here is an example for 1 site in 1 sample, but I've observed this at many sites/samples: $ grep 235068463 file.vcf
chr1    235068463   .   T   C   1795.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-3.530;ClippingRankSum=-0.535;DP=60;FS=7.844;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.401;QD=29.93;ReadPosRankSum=3.557   GT:AD:DP:GQ:PL  0/1:5,55:60:44:1824,0,44

60 reads, 5 T, 55 C.
But loading the bam in IGV, I do not see any T reads. Similarly:

samtools view -uh file.md.realn.bam chr1:235068463-235068463 |samtools mpileup - |grep 235068463 [mpileup] 1 samples in 1 input files <mpileup> Set max per-file depth to 8000 chr1 235068463 N 60 cCCccccCCCcccccCcccccccccCCCccCCCCCcCcccccCCCcCcCCccCCCCccCC >CA@B@>A>BA@BCABACCC:@@ACABBBCAACBBCABCB@CABBAB?>A?CBBAAAABA There are just 60 C's at that location. How do I decide what the genotype here is? C/C or C/T ? For methodology I'm using gatk/3.2.0. I tried using HC from gatk/3.3.1 and got the same result. The bam and vcf files come from the final two lines: -2 pass STAR -Mark Dups -SplitNCigarReads -RealignerTargetCreator -IndelRealigner -BaseRecalibrator -PrintReads -MergeSamFiles.jar -Mark Dups -RealignerTargetCreator -IndelRealigner -HaplotyeCaller Thanks, Kipp Created 2015-08-25 12:58:19 | Updated | Tags: vcf normalization Hi, I just read the article "Unified Representation of Genetic Variants" (A. Tan) They propose a way to normalized variants. As mention in this forum, GATK has a tools to normalized biallelic polymorphism, but does the latest version manage multiallelic polymorphism ? It is also mentioned that Mills files are note normalized correctly, are the files provided in the package normalized ? at least for biallelic polymorphism ? Created 2015-08-18 11:55:19 | Updated 2015-08-18 11:57:08 | Tags: unifiedgenotyper vcf lowqual emit-all-confident-sites emit-all-sites Hi, I've run UnifiedGenotyper on a BAM file with both EMIT_ALL_SITES and EMIT_ALL_CONFIDENT_SITES. I've noticed some of the calls that have been omitted with the EMIT_ALL_SITES option seem to be of comparable quality to others that have been emitted. For example, sites like HE670865 368605 are removed as non-confident sites while the site 368591 has not been. I understand why the positions flagged as "LowQual" are not present when using EMIT_ALL_CONFIDENT_SITES. But I can't see why other positions (such as HE670865 368605 and HE670865 368600) are not being emitted. Of particular importance is the fact that some of these seemingly 'good sites' that have been removed are SNPs that might be of biological importance. These are the two commands used together with some relevant lines from the resulting vcfs: java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -nt 4 -out_mode EMIT_ALL_SITES -baq CALCULATE_AS_NECESSARY -hets 0.015 -R genome_ref.fasta -o output_all.vcf -I input_divergentscaffs.realign.bam &> divergentscaffs_out.GATKlog & java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -nt 4 -out_mode EMIT_ALL_CONFIDENT_SITES -baq CALCULATE_AS_NECESSARY -hets 0.015 -R genome_ref.fasta -o output_confident.vcf -I input_divergentscaffs.realign.bam &> divergentscaffs_confident.GATKlog & These are a sample of lines from the outputted VCF file when using EMIT_ALL_SITES: HE670865 368589 . A . 46.89 . AN=16;DP=441;MQ=43.26;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:58 0/0:38 0/0:75 0/0:64 0/0:64 0/0:47 HE670865 368590 . A . 50.15 . AN=16;DP=447;MQ=43.41;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:60 0/0:39 0/0:76 0/0:66 0/0:65 0/0:47 HE670865 368591 . A . 39.12 . AN=16;DP=448;MQ=43.32;MQ0=12 GT:DP 0/0:36 0/0:45 0/0:61 0/0:39 0/0:77 0/0:66 0/0:66 0/0:46 HE670865 368592 . C T 5437.28 . AC=16;AF=1.00;AN=16;BaseQRankSum=1.036;DP=445;Dels=0.06;FS=0.000;HaplotypeScore=99.1839;MLEAC=16;MLEAF=1.00;MQ=43.48;MQ0=9;MQRankSum=4.214;QD=12.22;ReadPosRankSum=6.263 GT:AD:DP:GQ:PL 1/1:1,27:29:27:296,27,0 1/1:0,45:45:30:344,30,0 1/1:6,43:52:33:327,33,0 1/1:1,38:39:48:506,48,0 1/1:5,68:76:99:1100,102,0 1/1:4,61:65:68:1068,68,0 1/1:0,65:66:99:1035,99,0 1/1:0,46:46:69:774,69,0 HE670865 368593 . A . 50.76 . AN=16;DP=448;MQ=43.48;MQ0=9 GT:DP 0/0:29 0/0:45 0/0:53 0/0:41 0/0:76 0/0:65 0/0:66 0/0:47 HE670865 368594 . G . 69.75 . AN=16;DP=449;MQ=43.48;MQ0=9 GT:DP 0/0:28 0/0:46 0/0:51 0/0:41 0/0:76 0/0:66 0/0:66 0/0:47 HE670865 368595 . A . 66.75 . AN=16;DP=447;MQ=43.43;MQ0=9 GT:DP 0/0:30 0/0:46 0/0:50 0/0:41 0/0:75 0/0:66 0/0:66 0/0:47 HE670865 368596 . A . 72.75 . AN=16;DP=445;MQ=43.54;MQ0=9 GT:DP 0/0:31 0/0:46 0/0:48 0/0:41 0/0:76 0/0:66 0/0:67 0/0:47 HE670865 368597 . G . 66.75 . AN=16;DP=442;MQ=43.48;MQ0=9 GT:DP 0/0:25 0/0:46 0/0:41 0/0:40 0/0:74 0/0:67 0/0:67 0/0:47 HE670865 368598 . A . 75.75 . AN=16;DP=444;MQ=43.44;MQ0=9 GT:DP 0/0:25 0/0:46 0/0:41 0/0:39 0/0:75 0/0:69 0/0:67 0/0:47 HE670865 368599 . A . 72.75 . AN=16;DP=445;MQ=43.49;MQ0=9 GT:DP 0/0:24 0/0:45 0/0:39 0/0:39 0/0:76 0/0:69 0/0:68 0/0:48 HE670865 368600 . T A 4405.18 . AC=8;AF=0.500;AN=16;BaseQRankSum=2.278;DP=441;Dels=0.08;FS=6.510;HaplotypeScore=100.3222;MLEAC=8;MLEAF=0.500;MQ=43.26;MQ0=9;MQRankSum=0.498;QD=24.75;ReadPosRankSum=1.103 GT:AD:DP:GQ:PL 1/1:3,19:22:48:570,48,0 1/1:2,45:47:99:1386,114,0 1/1:4,33:37:96:1196,96,0 1/1:0,38:38:99:1307,105,0 0/0:76,2:78:99:0,178,2196 0/0:65,0:65:99:0,159,2002 0/0:70,0:70:99:0,168,2042 0/0:49,0:49:99:0,129,1549 HE670865 368601 . T . 63.75 . AN=16;DP=442;MQ=43.37;MQ0=9 GT:DP 0/0:19 0/0:47 0/0:37 0/0:38 0/0:77 0/0:66 0/0:72 0/0:48 HE670865 368602 . A . 66.75 . AN=16;DP=444;MQ=43.44;MQ0=9 GT:DP 0/0:19 0/0:47 0/0:38 0/0:38 0/0:78 0/0:65 0/0:74 0/0:48 HE670865 368603 . C . 66.75 . AN=16;DP=441;MQ=43.50;MQ0=11 GT:DP 0/0:20 0/0:47 0/0:39 0/0:37 0/0:76 0/0:64 0/0:73 0/0:48 HE670865 368604 . A . 63.75 . AN=16;DP=441;MQ=43.24;MQ0=11 GT:DP 0/0:21 0/0:47 0/0:39 0/0:38 0/0:74 0/0:64 0/0:74 0/0:48 HE670865 368605 . G . 60.75 . AN=16;DP=436;MQ=43.17;MQ0=12 GT:DP 0/0:20 0/0:47 0/0:40 0/0:38 0/0:73 0/0:63 0/0:73 0/0:47 HE670865 368606 . A . 60.75 . AN=16;DP=436;MQ=43.05;MQ0=12 GT:DP 0/0:22 0/0:45 0/0:40 0/0:39 0/0:73 0/0:63 0/0:75 0/0:46 HE670865 368607 . T C 3984.32 . AC=15;AF=0.938;AN=16;BaseQRankSum=0.225;DP=433;Dels=0.06;FS=11.987;HaplotypeScore=229.2107;MLEAC=15;MLEAF=0.938;MQ=42.94;MQ0=12;MQRankSum=1.624;QD=9.20;ReadPosRankSum=-0.463 GT:AD:DP:GQ:PL 1/1:7,18:26:12:109,12,0 1/1:2,39:44:57:475,57,0 1/1:6,36:44:51:426,51,0 1/1:1,36:39:39:328,39,0 1/1:6,67:73:44:603,44,0 1/1:0,63:63:99:1016,105,0 1/1:4,69:74:90:817,90,0 0/1:22,23:46:99:232,0,373 HE670865 368608 . A . 24.45 LowQual AN=16;DP=438;MQ=42.86;MQ0=12 GT:DP 0/0:34 0/0:45 0/0:55 0/0:39 0/0:74 0/0:63 0/0:74 0/0:47 HE670865 368609 . A . 29.97 LowQual AN=14;DP=434;MQ=42.67;MQ0=12 GT:DP ./. 0/0:44 0/0:53 0/0:39 0/0:67 0/0:61 0/0:74 0/0:47 HE670865 368610 . A . 20.62 LowQual AN=16;DP=432;MQ=42.47;MQ0=12 GT:DP 0/0:33 0/0:45 0/0:54 0/0:39 0/0:67 0/0:59 0/0:71 0/0:47 HE670865 368611 . C A 434.85 . AC=13;AF=0.813;AN=16;BaseQRankSum=0.922;DP=420;Dels=0.17;FS=0.883;HaplotypeScore=330.1362;MLEAC=14;MLEAF=0.875;MQ=42.34;MQ0=12;MQRankSum=0.871;QD=1.13;ReadPosRankSum=-2.717 GT:AD:DP:GQ:PL 0/0:17,11:28:3:0,3,23 1/1:27,10:38:3:32,3,0 1/1:23,23:46:6:64,6,0 1/1:12,16:29:3:32,3,0 1/1:22,30:53:9:73,9,0 1/1:18,35:53:6:56,6,0 1/1:26,34:60:18:188,18,0 0/1:23,17:40:16:16,0,61 HE670865 368612 . A . 6.62 LowQual AN=16;DP=409;MQ=42.14;MQ0=12 GT:DP 0/0:27 0/0:35 0/0:36 0/0:26 0/0:49 0/0:47 0/0:55 0/0:31 HE670865 368613 . G A 325 . AC=11;AF=0.917;AN=12;BaseQRankSum=3.200;DP=404;Dels=0.46;FS=1.094;HaplotypeScore=347.5139;MLEAC=11;MLEAF=0.917;MQ=41.95;MQ0=12;MQRankSum=1.300;QD=1.02;ReadPosRankSum=1.782 GT:AD:DP:GQ:PL 1/1:4,16:21:3:22,3,0 ./. 1/1:0,25:26:3:23,3,0 1/1:1,15:21:3:26,3,0 0/1:4,28:36:10:62,0,10 1/1:1,35:36:6:52,6,0 1/1:2,35:39:15:156,15,0 ./. HE670865 368614 . A . 17.30 LowQual AN=12;DP=416;MQ=41.75;MQ0=12 GT:DP ./. 0/0:25 0/0:36 ./. 0/0:47 0/0:41 0/0:50 0/0:25 HE670865 368615 . A . 16.42 LowQual AN=12;DP=411;MQ=41.66;MQ0=11 GT:DP ./. 0/0:25 0/0:35 ./. 0/0:45 0/0:40 0/0:49 0/0:22 HE670865 368616 . A . 17.42 LowQual AN=10;DP=406;MQ=41.71;MQ0=10 GT:DP ./. 0/0:24 ./. ./. 0/0:40 0/0:40 0/0:47 0/0:22 HE670865 368617 . T C 94.41 . AC=4;AF=0.667;AN=6;BaseQRankSum=0.262;DP=385;Dels=0.38;FS=7.489;HaplotypeScore=362.4384;MLEAC=5;MLEAF=0.833;MQ=41.72;MQ0=9;MQRankSum=1.775;QD=0.84;ReadPosRankSum=4.850 GT:AD:DP:GQ:PL ./. 0/0:15,1:20:3:0,3,32 ./. ./. ./. 1/1:5,31:36:9:85,9,0 1/1:14,28:42:3:26,3,0 ./. HE670865 368618 . C . 17.27 LowQual AN=6;DP=370;MQ=41.65;MQ0=8 GT:DP ./. 0/0:10 ./. ./. ./. 0/0:34 0/0:37 ./. HE670865 368619 . G A 35.14 . AC=4;AF=0.500;AN=8;BaseQRankSum=-3.320;DP=365;Dels=0.54;FS=0.000;HaplotypeScore=399.7009;MLEAC=4;MLEAF=0.500;MQ=41.71;MQ0=8;MQRankSum=-1.237;QD=0.35;ReadPosRankSum=-4.879 GT:AD:DP:GQ:PL ./. 1/1:0,4:8:3:28,3,0 ./. ./. 0/0:17,3:22:3:0,3,23 0/0:31,3:34:3:0,3,22 1/1:28,8:36:3:25,3,0 ./. HE670865 368620 . T . 15.82 LowQual AN=2;DP=358;MQ=41.72;MQ0=8 GT:DP ./. 0/0:8 ./. ./. ./. ./. ./. ./. HE670865 368621 . T . . . DP=358;MQ=41.63;MQ0=8 GT ./. ./. ./. ./. ./. ./. ./. ./. HE670865 368622 . T . 14.11 LowQual AN=6;DP=358;MQ=41.53;MQ0=8 GT:DP ./. ./. ./. ./. ./. 0/0:11 0/0:19 0/0:17 HE670865 368623 . T . 14.11 LowQual AN=6;DP=365;MQ=41.46;MQ0=7 GT:DP ./. 0/0:24 ./. 0/0:16 ./. 0/0:20 ./. ./. HE670865 368624 . A . 13.85 LowQual AN=8;DP=366;MQ=41.48;MQ0=7 GT:DP 0/0:22 0/0:26 ./. 0/0:23 ./. 0/0:20 ./. ./. HE670865 368625 . T . 14.12 LowQual AN=6;DP=353;MQ=41.39;MQ0=7 GT:DP 0/0:20 0/0:26 ./. ./. ./. 0/0:18 ./. ./. HE670865 368626 . T . . . DP=348;MQ=41.31;MQ0=7 GT ./. ./. ./. ./. ./. ./. ./. ./. HE670865 368627 . T A 15.18 LowQual AC=2;AF=1.00;AN=2;BaseQRankSum=-1.221;DP=348;Dels=0.55;FS=0.000;HaplotypeScore=403.2395;MLEAC=2;MLEAF=1.00;MQ=41.29;MQ0=7;MQRankSum=-0.971;QD=0.23;ReadPosRankSum=-0.344 GT:AD:DP:GQ:PL ./. ./. ./. ./. 1/1:30,4:34:3:26,3,0 ./. ./. ./. HE670865 368628 . T . 13.85 LowQual AN=8;DP=357;MQ=41.07;MQ0=8 GT:DP ./. 0/0:22 ./. 0/0:10 0/0:33 0/0:15 ./. ./. HE670865 368629 . A . 15.14 LowQual AN=10;DP=357;MQ=40.99;MQ0=8 GT:DP 0/0:14 0/0:22 ./. ./. 0/0:34 0/0:15 0/0:21 ./. HE670865 368630 . T . 18.75 LowQual AN=10;DP=358;MQ=40.89;MQ0=8 GT:DP 0/0:15 0/0:23 ./. 0/0:12 ./. 0/0:16 0/0:21 ./. HE670865 368631 . C . 18.33 LowQual AN=14;DP=359;MQ=40.70;MQ0=8 GT:DP 0/0:15 0/0:23 0/0:14 ./. 0/0:30 0/0:16 0/0:20 0/0:33 HE670865 368632 . G . 17.87 LowQual AN=16;DP=361;MQ=40.63;MQ0=9 GT:DP 0/0:15 0/0:23 0/0:15 0/0:12 0/0:29 0/0:16 0/0:21 0/0:33 HE670865 368633 . T . 18.90 LowQual AN=14;DP=364;MQ=40.51;MQ0=10 GT:DP 0/0:17 0/0:24 0/0:16 ./. 0/0:32 0/0:17 0/0:22 0/0:34 HE670865 368634 . A . 17 LowQual AN=14;DP=369;MQ=40.67;MQ0=10 GT:DP 0/0:18 0/0:24 0/0:18 ./. 0/0:36 0/0:19 0/0:22 0/0:34 HE670865 368635 . T . 17.22 LowQual AN=14;DP=364;MQ=40.66;MQ0=10 GT:DP 0/0:18 0/0:24 0/0:22 ./. 0/0:37 0/0:20 0/0:21 0/0:32 HE670865 368636 . C . 16.03 LowQual AN=14;DP=362;MQ=40.70;MQ0=10 GT:DP 0/0:16 ./. 0/0:15 0/0:14 0/0:36 0/0:20 0/0:22 0/0:29 HE670865 368637 . A . 0.02 LowQual AN=8;DP=356;MQ=40.91;MQ0=11 GT:DP 0/0:26 ./. 0/0:49 0/0:31 0/0:55 ./. ./. ./. HE670865 368638 . T G 24.10 LowQual AC=3;AF=0.250;AN=12;BaseQRankSum=-6.075;DP=355;Dels=0.14;FS=8.612;HaplotypeScore=349.9679;MLEAC=3;MLEAF=0.250;MQ=40.80;MQ0=12;MQRankSum=-2.113;QD=0.24;ReadPosRankSum=1.216 GT:AD:DP:GQ:PL ./. 0/0:22,3:25:6:0,6,44 0/0:46,3:49:6:0,6,44 0/0:29,1:30:6:0,6,49 0/0:46,8:54:3:0,3,22 1/1:40,2:42:3:29,3,0 0/1:47,5:52:17:18,0,17 ./. HE670865 368639 . A . 18.78 LowQual AN=16;DP=352;MQ=40.91;MQ0=13 GT:DP 0/0:25 0/0:26 0/0:48 0/0:30 0/0:52 0/0:42 0/0:50 0/0:26 HE670865 368640 . T . 20.94 LowQual AN=16;DP=351;MQ=40.93;MQ0=13 GT:DP 0/0:25 0/0:26 0/0:48 0/0:30 0/0:51 0/0:42 0/0:50 0/0:26 HE670865 368641 . C . 23.55 LowQual AN=16;DP=347;MQ=40.91;MQ0=13 GT:DP 0/0:24 0/0:25 0/0:49 0/0:26 0/0:45 0/0:39 0/0:47 0/0:18 HE670865 368642 . G T 24 LowQual AC=4;AF=0.250;AN=16;BaseQRankSum=-1.989;DP=349;Dels=0.21;FS=7.313;HaplotypeScore=259.1726;MLEAC=3;MLEAF=0.188;MQ=40.80;MQ0=13;MQRankSum=-2.006;QD=0.14;ReadPosRankSum=1.275 GT:AD:DP:GQ:PL 0/1:21,3:24:11:11,0,88 0/0:23,3:26:18:0,18,152 0/0:47,2:49:30:0,30,252 0/0:25,0:25:30:0,30,256 0/0:38,7:45:18:0,18,164 0/1:37,2:39:27:27,0,177 0/1:43,4:47:2:2,0,144 0/1:13,6:19:13:13,0,79 HE670865 368643 . T . 27.64 LowQual AN=16;DP=345;MQ=40.82;MQ0=13 GT:DP 0/0:24 0/0:26 0/0:49 0/0:26 0/0:44 0/0:38 0/0:46 0/0:20 HE670865 368644 . A . 26.97 LowQual AN=14;DP=343;MQ=40.74;MQ0=13 GT:DP ./. 0/0:22 0/0:47 0/0:26 0/0:36 0/0:37 0/0:42 0/0:17 HE670865 368645 . T . 20.86 LowQual AN=14;DP=340;MQ=40.85;MQ0=12 GT:DP ./. 0/0:21 0/0:47 0/0:26 0/0:36 0/0:36 0/0:41 0/0:17 HE670865 368646 . G C,T 237.20 . AC=5,7;AF=0.417,0.583;AN=12;BaseQRankSum=-1.775;DP=353;Dels=0.25;FS=0.000;HaplotypeScore=196.6140;MLEAC=5,6;MLEAF=0.417,0.500;MQ=40.93;MQ0=12;MQRankSum=-0.594;QD=0.89;ReadPosRankSum=1.392 GT:AD:DP:GQ:PL ./. 2/2:13,14,7:34:3:30,30,30,3,3,0 ./. 1/2:2,13,5:20:15:80,24,15,56,0,53 1/2:10,27,9:46:14:64,20,14,44,0,41 1/2:4,32,2:38:18:61,24,18,37,0,34 2/2:8,30,8:46:9:77,77,77,9,9,0 1/1:18,11,1:30:6:43,6,0,43,6,43 HE670865 368647 . T . 29.30 LowQual AN=16;DP=368;MQ=40.74;MQ0=12 GT:DP 0/0:31 0/0:37 0/0:54 0/0:36 0/0:52 0/0:42 0/0:50 0/0:34 HE670865 368648 . G . 51.60 . AN=16;DP=366;MQ=40.87;MQ0=12 GT:DP 0/0:34 0/0:40 0/0:57 0/0:37 0/0:59 0/0:44 0/0:53 0/0:42 These are the comparable lines from the outputted VCF when using EMIT_ALL_CONFIDENT_SITES: HE670865 368589 . A . 46.89 . AN=16;DP=441;MQ=43.26;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:58 0/0:38 0/0:75 0/0:64 0/0:64 0/0:47 HE670865 368590 . A . 50.15 . AN=16;DP=447;MQ=43.41;MQ0=13 GT:DP 0/0:35 0/0:45 0/0:60 0/0:39 0/0:76 0/0:66 0/0:65 0/0:47 HE670865 368591 . A . 39.12 . AN=16;DP=448;MQ=43.32;MQ0=12 GT:DP 0/0:36 0/0:45 0/0:61 0/0:39 0/0:77 0/0:66 0/0:66 0/0:46 HE670865 368648 . G . 51.60 . AN=16;DP=366;MQ=40.87;MQ0=12 GT:DP 0/0:34 0/0:40 0/0:57 0/0:37 0/0:59 0/0:44 0/0:53 0/0:42 HE670865 368649 . T . 75.72 . AN=16;DP=368;MQ=40.89;MQ0=12 GT:DP 0/0:34 0/0:41 0/0:57 0/0:38 0/0:59 0/0:44 0/0:53 0/0:42 HE670865 368650 . T . 72.74 . AN=16;DP=369;MQ=40.87;MQ0=12 GT:DP 0/0:34 0/0:41 0/0:57 0/0:38 0/0:59 0/0:44 0/0:53 0/0:43 HE670865 368651 . T . 87.71 . AN=16;DP=369;MQ=40.88;MQ0=12 GT:DP 0/0:33 0/0:42 0/0:58 0/0:37 0/0:59 0/0:44 0/0:53 0/0:43 HE670865 368652 . T . 93.72 . AN=16;DP=373;MQ=40.89;MQ0=12 GT:DP 0/0:32 0/0:43 0/0:59 0/0:37 0/0:59 0/0:45 0/0:54 0/0:43 Any help or insight would be greatly appreciated. Created 2015-08-12 23:44:49 | Updated 2015-08-12 23:53:26 | Tags: liftovervariants vcf hapmap picard liftovervcf I am having a problem with picard's LiftoverVcf. I am trying to Liftover hapmap files (downloaded plink files from hapmap and converted to vcf using plink) from ncbi36 to hg38. I was able to do this with GATK LiftoverVariants. My problem came when I had to merge the hapmap.hg38 with some genotype files (that I liftover from hg19 to hg38 using GATK LiftoverVariants). I am merging them so that I can run population stratification using plink. I used vcf-merge but it complained that a SNP has different reference allele in both files: rs3094315, should be reference allele G (which was correct in the genotype.hg38 files but in the hapmap.hg38 files it was wrong). I also first tried to lift hapmap.ncbi36 to hg19 then to hg38 but the offending allele was still there. So I decided to try and lift the hapmap.ncbi36 using LiftoverVCF from picard. 1. I downloaded the newest picard build (20 hours old) picard-tools-1.138. 2. Used the command: java -jar -Xmx6000m ../../../tools/picard-tools-1.138/picard.jar LiftoverVcf I=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.vcf O=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.vcf C=../../../tools/liftover/chain_files/hg18ToHg38.over.chain REJECT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.reject.vcf R=../../../data/assemblies/hg38/hg38.fa VERBOSITY=ERROR Here is the run: [Thu Aug 13 00:43:40 CEST 2015] picard.vcf.LiftoverVcf INPUT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.vcf OUTPUT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.vcf CHAIN=......\tools\liftover\chain_files\hg18ToHg38.over.chain REJECT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.reject.vcf REFERENCE_SEQUENCE=......\data\assemblies\hg19\assemble\hg38.fa VERBOSITY=ERROR QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json Here is the error: Exception in thread "main" java.lang.IllegalStateException: Allele in genotype A not in the variant context [T, C] at htsjdk.variant.variantcontext.VariantContext.validateGenotypes(VariantContext.java:1357) at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1295) at htsjdk.variant.variantcontext.VariantContext.(VariantContext.java:410) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:496) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:490) at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:200) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105) 1. I have no idea which SNP is the problem. 2. I do not know what T* means (does not seem to exist in the file). 3. I am new to picard so I thought VERBOSE=ERROR will give me something more but nothing more appeared. 4. Given that lifting hapmap.ncbi36 to hg19 then to hg38 produced the same erroneous reference allele I suppose lifting will not fix this and I will have to work with dnsnp to correct my file. Do you know how I can change reference allele in a vcf? Is there a tool for this? Is there a liftover tool for dbsnp? 5. As a side note I want to make picard work because I read that you will be deprecating the GATK liftover and will support the picard liftover (at some point in the future) so help with this tool will be appreciated. Created 2015-08-09 21:20:47 | Updated | Tags: fastaalternatereference vcf fastaalternatereferencemaker I am trying to use a VCF containing snps variants to change the mouse reference (GRCm38- c57BL/6J) with BALB/cJ snps. After running this command: java -jar ~/programs/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R ~/genome/mouse_GRCm38.p4/GRCm38.primary_assembly/GRCm38.primary_assembly.fa -o ~/BALBcJ.snp.primary.fa -V ~/BALB_cJ.snps.vcf The following ERROR shows up: ##### ERROR MESSAGE: Input files /home/tiagocastro/BALB_cJ.snps.vcf and reference have incompatible contigs: The contig order in /home/tiagocastro/BALB_cJ.snps.vcf and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files.. ##### ERROR /home/tiagocastro/BALB_cJ.snps.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, X, Y] ##### ERROR reference contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, JH584299.1, GL456233.1, JH584301.1, GL456211.1, GL456350.1, JH584293.1, GL456221.1, JH584297.1, JH584296.1, GL456354.1, JH584294.1, JH584298.1, JH584300.1, GL456219.1, GL456210.1, JH584303.1, JH584302.1, GL456212.1, JH584304.1, GL456379.1, GL456216.1, GL456393.1, GL456366.1, GL456367.1, GL456239.1, GL456213.1, GL456383.1, GL456385.1, GL456360.1, GL456378.1, GL456389.1, GL456372.1, GL456370.1, GL456381.1, GL456387.1, GL456390.1, GL456394.1, GL456392.1, GL456382.1, GL456359.1, GL456396.1, GL456368.1, JH584292.1, JH584295.1] So Trying to fix, I used the perl script in the link to sort properly within the reference. I did this: ./sortByRef.pl ~/BALB_cJ.snps.vcf /home/tiagocastro/genome/mouse_GRCm38.p4/GRCm38.primary_assembly/GRCm38.primary_assembly.fa.fai > ~/BALB_cJ.snps_sorted.vcf using the new vcf file, a new error is shown: ##### ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file '/home/tiagocastro/BALB_cJ.snps_sorted.vcf' could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types: ##### ERROR Name FeatureType Documentation ##### ERROR BCF2 VariantContext (this is an external codec and is not documented within GATK) ##### ERROR VCF VariantContext (this is an external codec and is not documented within GATK) ##### ERROR VCF3 VariantContext (this is an external codec and is not documented within GATK) looking the head of each, sorted and basic vcf, I can see that is different. Can someone help me? Created 2015-08-04 23:35:08 | Updated 2015-08-04 23:36:54 | Tags: haplotypecaller vcf genotypegvcfs It appears that there are some cases in which an upstream site is heterozygous for a deletion, and a downstream site should be heterozygous for something like 1/2 (where 2 is the * allele), but GATK is still making erroneous 0/1 calls. For instance, in my dataset I have a deletion at position 773 (I'm leaving out some fields and GT info for additional samples for brevity). scaffold_1 773 . CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA C ... 0/1:102,70:172:99:0|1:773_CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA_C:283 9,0,4099 And just 2 bases downstream, there's a SNP for the same sample that has non- calls on both strands, which seems impossible. Shouldn't there be a allele here, e.g (ALT==G,* ; GT==1/2)? Here's the call: scaffold_1 775 . A G ... 0/1:60,134:194:99:4507,0,1642 Could this be a bug, such that when there's a heterozygous ALT call on the strand without the deletion, the half of the GT field for deleted strand is still getting a reference (0) call when it should be getting a * ? At another site overlapping with the same deletion, at which the strand without the deletion for this sample is a REF call , I get something that's more like what I'd expect. scaffold_1 805 . G T,* ... 0/2:235,0,33:268:99:0|1:773_CAAAACATAGTTTCCCATGTCCGCCATCCCATGTTCTGCATCCGTGCA_C:676,1384,11251,0,9868,9762 This was done with GenotypeGVCFs from GATK 3.4-46. Created 2015-07-28 12:20:30 | Updated | Tags: haplotypecaller vcf ngs variant-calling Hi! As part of a variant calling pipeline i'm interested in lowering the threshold for allele frequency tolerance in GATK's HaplotypeCaller variant caller to 0.01 (1%), if possible. If not, is there another variant calling tool that doesn't fliter out variants with low allele frequencies? Background: I know for a fact that there's at least one variant in my sample that doesn't appear in my VCF file if allele frequencies below 0.1 (10%) are filtered out during the variant calling as it's the default in some programs. I can see the variant when I inspect the corresponding bam file with samtools tview and another lab called that specific variant itself. Thanks in advance, Alon Created 2015-07-22 11:46:00 | Updated | Tags: combinevariants haplotypecaller best-practices vcf gatk genotypegvcfs combinegvcfs gvcf I was trying to do combine sets of vcf files for all my samples so that I have one single vcf output using this command option below java -d64 -Xmx48g -jar{GATK}/GenomeAnalysisTK.jar \ -R {REF} \ -T GenotypeGVCFs \ --variant A.g.vcf \ --variant B.g.vcf \ --variant C.g.vcf \ -stand_emit_conf 30 \ -stand_call_conf 30 \ -o genotype.vcf but I got this error message “The following invalid GT allele index was encountered in the file: END=21994810”. I have tried to locate where the problem could be coming from but I do not understand this. Could you please advise me. Created 2015-07-21 12:56:53 | Updated | Tags: snp vcf haplotype Hi, I would like to know if it is possible to get haplotype sequences for samples from a vcf SNP file and a reference sequence. The fact is that I took back from a paper a vcf file and a reference sequence, but I don't have bam files. The only solution I see is to ask for bam files... Thanks in advance, vschill Created 2015-07-20 09:00:51 | Updated | Tags: vcf Hello, I use HC to call variants, and there is a record like this: chr15 90172419 rs143394914 GGGGTGGGGGCTGTGGGCTGGGT G 88.77 . BaseQRankSum=0.406;DB;DP=6;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=61.78;MQ0=0;MQRankSum=-4.060e-01;QD=2.02;ReadPosRankSum=-4.060e-01;SOR=0.693 GT:AD:GQ:PL 0/1:3,3:99:117,0,117 There are equal reads for ref allele and alt allele, why the genotype is 0/1 instead of 1/1? Though I know GT is inferred from PL and GQ, I still can not understand why PL for 0/1 is much lower than that of 1/1? Thank you! Emma Created 2015-07-16 20:23:04 | Updated | Tags: unifiedgenotyper vcf Hello! I am writing to seek your help on a curious result I received from running the Unified Genotyper. I ran Unified Genotyper to obtain all confident sites (both invariant and variant) on a suite of different bam files. For certain regions of the genome that overlap certain bam file reads, it appears that it did not output invariant sites, only variant sites. I do not know if this is due to differences among bam files, my call to GATK, or is a glitch. Here is a little more background. • First, I am using Unified Genotyper to maintain continuity with some analyses I did over a year go. • My data: • I have 92 low coverage bam files from samples produced from illumina sequencing a RADtag type enzyme digestion library. The bams contain single-end reads, each 44-bp long. I will refer to these reads as ‘RADtags’. • I also have 18 high coverage bams produced from illumina sequencing whole genome paired-end libraries, with read lengths varying from 36 to 100. I will refer to these as WGS samples. • Previously I called SNPs using Unified Genotyper with mode set to EMIT_VARIANTS_ONLY, on all this data at once, to increase the chances of identifying quality SNPs in the low coverage RADtag data and to get genotypes from the high coverage WGS data at the same sites. The data seemed fine. • Current issue: • I ran the Unified Genotyper with the mode set to EMIT_ALL_CONFIDENT_SITES for all the same data. The vcf includes both invariant and variant calls. However, it seems I am not getting any invariant sites from regions that overlap the RADtag reads. It seems the program is not emitting invariant sites within RADtags, but is reporting invariant sites outside of RADtags and only for WGS samples. I have attached a subset of my vcf file, and a snapshot image from viewing these samples in igv. In the igv view, the middle samples are the RADtag samples, the higher coverage samples at the top and bottom are the WGS samples. As you can see, there should be plenty of invariant sites in the regions overlapping the RADTAGs and there should be calls that include these data regions and samples. Can you think of a reason that could be causing this? I have pasted my call to GATK below. Thank you! Amanda java -jar /usr/local/gatk/3.0-0/GenomeAnalysisTK.jar \ -R mg.new.ref.fa \ -T UnifiedGenotyper \ -I AHQT1G_q29.paired.nodup.realign.bam -I BOG10G_q29.paired.nodup.realign.bam -I CAC6G_q29.paired.nodup.realign.bam \ -I CAC9N_q29.paired.nodup.realign.bam -I DUNG_q29.paired.nodup.realign.bam -I IM62.JGI_q29.paired.nodup.realign.bam \ -I KOOTN_q29.paired.nodup.realign.bam -I LMC24G_q29.paired.nodup.realign.bam -I MAR3G_q29.paired.nodup.realign.bam \ -I MED84G_q29.paired.nodup.realign.bam -I MEN104_q29.paired.nodup.realign.bam -I NHN26_q29.paired.nodup.realign.bam \ -I PED5G_q29.paired.nodup.realign.bam -I REM8G_q29.paired.nodup.realign.bam -I SF5N_q29.paired.nodup.realign.bam \ -I SLP9G_q29.paired.nodup.realign.bam -I SWBG_q29.paired.nodup.realign.bam -I YJS6G_q29.paired.nodup.realign.bam \ -I CACid.277_index2.GCTCGG.sortedrealign.bam -I CACid.279b_index2.CTGATT.sortedrealign.bam -I CACid.247_index2.AATACT.sortedrealign.bam -I CACid.237_index2.CGCTGT.sortedrealign.bam \ -I CACid.240b_index2.GTCTCT.sortedrealign.bam -I CACid.272_index2.ATCTCC.sortedrealign.bam -I CACid.282b_index2.TCTGCT.sortedrealign.bam -I CACid.179_index2.TATAGT.sortedrealign.bam \ -I CACid.187_index2.CAGCAT.sortedrealign.bam -I CACid.189_index2.TAATCC.sortedrealign.bam -I CACid.192_index2.GCAGTT.sortedrealign.bam -I CACid.339_index2.CCTGAA.sortedrealign.bam \ -I CACid.235_index2.AAGCGA.sortedrealign.bam -I CACid.236_index2.AACTTA.sortedrealign.bam -I CACid.249_index2.GTTCCT.sortedrealign.bam -I CACid.370_index2.CTAATA.sortedrealign.bam \ -I CACid.372_index2.CCGAAT.sortedrealign.bam -I CACid.215_index2.CTTGTA.sortedrealign.bam -I CACid.218_index2.CCCTCG.sortedrealign.bam -I CACid.219_index2.ACTGAC.sortedrealign.bam \ -I CACid.221_index2.GTGACT.sortedrealign.bam -I CACid.225_index2.ACTAGC.sortedrealign.bam -I CACid.226_index2.CGACTA.sortedrealign.bam -I CACid.231_index2.GTGAGA.sortedrealign.bam \ -I CACid.232_index2.AATCTA.sortedrealign.bam -I CACid.233_index2.CAAGCT.sortedrealign.bam -I CACid.212_index2.CTCTCA.sortedrealign.bam -I CACid.211_index2.ACTCCT.sortedrealign.bam \ -I CACid.403_index2.GATCAA.sortedrealign.bam -I CACid.280_index2.GGCTTA.sortedrealign.bam -I CACid.283_index2.GTGGAA.sortedrealign.bam -I CACid.285_index2.AAATGA.sortedrealign.bam \ -I CACid.003_index2.GGGTAA.sortedrealign.bam -I CACid.007_index2.CGTCAA.sortedrealign.bam -I CACid.008_index2.GTTGCA.sortedrealign.bam -I CACid.055_index2.ATATAC.sortedrealign.bam \ -I CACid.056_index2.TTAGTA.sortedrealign.bam -I CACid.071_index2.TCGCTT.sortedrealign.bam -I CACid.313_index2.TCGTCT.sortedrealign.bam -I CACid.040_index2.CAATAT.sortedrealign.bam \ -I CACid.292_index2.TCATGG.sortedrealign.bam -I CACid.080_index2.CAGGAA.sortedrealign.bam -I CACid.088_index2.ATCGAC.sortedrealign.bam -I CACid.322_index2.TTGCGA.sortedrealign.bam \ -I CACid.114_index1.GCTCGG.sortedrealign.bam -I CACid.279a_index1.CTGATT.sortedrealign.bam -I CACid.181_index1.AATACT.sortedrealign.bam -I CACid.184_index1.CGCTGT.sortedrealign.bam \ -I CACid.185_index1.GTCTCT.sortedrealign.bam -I CACid.191_index1.ATCTCC.sortedrealign.bam -I CACid.194_index1.TCTGCT.sortedrealign.bam -I CACid.240a_index1.TATAGT.sortedrealign.bam \ -I CACid.238_index1.CAGCAT.sortedrealign.bam -I CACid.371_index1.TAATCC.sortedrealign.bam -I CACid.253_index1.GCAGTT.sortedrealign.bam -I CACid.260_index1.CCTGAA.sortedrealign.bam \ -I CACid.262_index1.AAGCGA.sortedrealign.bam -I CACid.257_index1.AACTTA.sortedrealign.bam -I CACid.258_index1.GTTCCT.sortedrealign.bam -I CACid.216_index1.CTAATA.sortedrealign.bam \ -I CACid.220_index1.CCGAAT.sortedrealign.bam -I CACid.223_index1.CTTGTA.sortedrealign.bam -I CACid.224_index1.CCCTCG.sortedrealign.bam -I CACid.228_index1.ACTGAC.sortedrealign.bam \ -I CACid.229_index1.GTGACT.sortedrealign.bam -I CACid.281_index1.ACTAGC.sortedrealign.bam -I CACid.282a_index1.CGACTA.sortedrealign.bam -I CACid.284_index1.GTGAGA.sortedrealign.bam \ -I CACid.286_index1.AATCTA.sortedrealign.bam -I CACid.287_index1.CAAGCT.sortedrealign.bam -I CACid.288_index1.CTCTCA.sortedrealign.bam -I CACid.072_index1.ACTCCT.sortedrealign.bam \ -I CACid.077_index1.GATCAA.sortedrealign.bam -I CACid.078_index1.GGCTTA.sortedrealign.bam -I CACid.084_index1.GTGGAA.sortedrealign.bam -I CACid.087a_index1.AAATGA.sortedrealign.bam \ -I CACid.087b_index1.GGGTAA.sortedrealign.bam -I CACid.091_index1.CGTCAA.sortedrealign.bam -I CACid.092_index1.GTTGCA.sortedrealign.bam -I CACid.103_index1.CTCACT.sortedrealign.bam \ -I CACid.095_index1.TCCATA.sortedrealign.bam -I CACid.100_index1.ACTCGA.sortedrealign.bam -I CACid.101_index1.GTTCGA.sortedrealign.bam -I CACid.106_index1.ATATAC.sortedrealign.bam \ -I CACid.108_index1.TTAGTA.sortedrealign.bam -I CACid.110_index1.TCGCTT.sortedrealign.bam -I CACid.112_index1.TCGTCT.sortedrealign.bam -I CACid.144_index1.CAATAT.sortedrealign.bam \ -I CACid.149_index1.TCATGG.sortedrealign.bam -I CACid.321_index1.CAGGAA.sortedrealign.bam -I CACid.323_index1.ATCGAC.sortedrealign.bam -I CACid.324_index1.TTGCGA.sortedrealign.bam \ -o CACid.MIM.sites.q40.6.14.15.vcf \ --metrics_file CACid.MIM.sites.q40.6.14.15.metrics.txt \ --genotype_likelihoods_model SNP \ --output_mode EMIT_ALL_CONFIDENT_SITES \ -stand_call_conf 40 \ -stand_emit_conf 40 \ -l INFO Created 2015-07-10 22:42:44 | Updated | Tags: vcf picard Hi I would like to know how can I return the entire contents of a VCF file without having to know the chromosome/start/end parameters. The VCFFileReader class has a query method that can return parts of the VCF like as: VCFFileReader.query(chromosome, startPos, endPos); What I would like is to be able to say: 'return everything the VCF has'. Is there a way to do this with picard? Martin Created 2015-07-07 07:35:22 | Updated | Tags: vqsr vcf Hi all! I've got a questions concerning the VQSR. The situation is as follows: • I've got more than 100 Single Sample VCFs • Unfortunately I wont be able to re-call the VCFs • Merging the Files into a single Multi-Sample VCF is, in my opinion, a bad idea due to the loss of the information stored in the INFO field • Creating Multi-Sample VCFs with the help of 1000G would require re-calling or merging, so this also no option. Therefore, more or less just to see what happens, I specified multiple inputs for the VariantRecalibrator Walker and was able to produce a recal and tranches file. However, its probably still a bad idea to use the recal file for Recalibration since now there are multiple entries for the same variant (this is most likely due to the same variant in multiple single-sample VCFs?) chr1 871334 . N . . END=871334;POSITIVE_TRAIN_SITE;VQSLOD=1.9214;culprit=MQRankSum chr1 871334 . N . . END=871334;POSITIVE_TRAIN_SITE;VQSLOD=2.0305;culprit=MQ I guess during the ApplyRecalibration, its not possible to decide which entry for a variant in Single Sample VCF X1 is the correct one. However this would be crucial since the entries show different VQSLOD values. So in my opinion, its probably not possible to use VQSR in my specific case. However, since I really would like to use it, I thought maybe one of you guys knows a possibility to use it despite all the problems. Thanks a lot! Created 2015-06-24 14:28:42 | Updated | Tags: variantannotator vcf Hi! So we use GATK a lot in our research, it works amazingly well most of the time, so first of all, thanks for creating it! We have this one problem that we were unable to solve on our own. Say we have a VCF file that contains called variants, and we want to annotate it using an external database, clinvar as one example. We used to use VariantAnnotator for this purpose until we found out (both by reading documentation and doing a quick experimental check) that it annotates variants based solely on position, ignoring the actual mutation that happened. Imagine for example that variant A → C was called at a specific position, but data for A → G is recorded at clinvar. In this case VariantAnnotator will still carry over INFO fields from clinvar into our VCF. We ideally do not want this to happen, because strictly speaking clinvar data was recorded for a completely different mutation and might not be relevant at all in our case. My question: is there an option for VariantAnnotator to make it check REF and ALT fields in the process of annotation? (Although I fear it wouldn't be possible because it uses RodWalker class to traverse the variants.) Or, alternatively, can this be achieved using combination of other GATK commands? Or will we have to write a custom walker to accomplish what we want? (The latter is obviously the worst case, but hopefully we can manage that.) All the best, Kirill Created 2015-06-23 16:19:39 | Updated | Tags: vcf developer bam bug error I hava a question wish to get help from the developers: I am using GATK with two modles: 1, I just use the UnifiedGenotyper to call the variants from a prepared bam file, then I get a vcf file. (Call it A.vcf) 2, I run the UnifiedGenotype by Chr, one by one, say, by using the "-L" arg, then I get a sets of small vcf files. But, the result of ChrY is confusing me a lot.... the ChrY part in the A.vcf is QUITE different from the small vcf file that generated by "-L chrY", the difference seems to be larger than 50%. That means, the result is DIFFERENT for chrY. However, I have also checked the other Chromosomes, the difference is slight. ONLY the ChrY has this problem. Our script pasted here: java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T RealignerTargetCreator \ -I{sampleName}.bam \ -o {sampleName}.intervals \ -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -known 1000G_phase1.indels.hg19.sites.vcf java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T IndelRealigner \ -targetIntervals{sampleName}.intervals \ -I ${sampleName}.bam \ -o${sampleName}.realigned.bam \ -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -known 1000G_phase1.indels.hg19.sites.vcf

samtools index {sampleName}.realigned.bam java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T BaseRecalibrator \ -nct 8 \ -I{sampleName}.realigned.bam \ -knownSites dbsnp_138.hg19.vcf \ -knownSites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -knownSites 1000G_phase1.indels.hg19.sites.vcf \ -o ${sampleName}.recal_data.grp java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T PrintReads \ -nct 8 \ -I${sampleName}.realigned.bam \ -BQSR ${sampleName}.recal_data.grp \ -o${sampleName}.realigned.recal.bam

java -Xmx1g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs -V vcfs.chr${numchr}.new.list -o final_chr${numchr}.vcf -L ${numchr} # CatVariants (generate 1 vcf file with all inds and all chrs) java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted # VQSR java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches # Used excludeFiltered here java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches # Used excludeFiltered here java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf # Genotype Refinement Workflow java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS) The entire example is below: 1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0 Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high? Thank you very much in advance, Alva Created 2015-03-02 05:53:27 | Updated 2015-03-02 05:54:21 | Tags: variantstotable vcf multi-sample Hi- Will VariantsToTable work for multi-sample VCF files? I've tried the following command, but it only outputs the column headers: java -jar ~/tools/GenomeAnalysisTK.jar -R chr1.fa -T VariantsToTable -V cg_100.vcf.gz -F CHROM -F POS -F ID -F REF -F ALT -F F-101-009 -o test.table Thanks, Summer Created 2015-02-20 21:43:46 | Updated | Tags: vcf gatk calculategenotypeposteriors Hi, I used CalculateGenotypePosteriors with the supporting file called ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf, obtained from 1000 Genomes. It contains both indels and SNPs, and I was able to use the file for the first step of the Genotype Refinement Workflow. My question is: is it an issue that I didn't remove the indels from the supporting file? I would presume not since first of all, both the indels and the SNPs from Phase 3 of 1000G should be high confidence, and second, my recalibrated vcf file includes both indels and snps, so it should be in my interest to have as much information as possible, actually, so I should consider indels as well. Just want to check whether my reasoning is correct. Thanks, Alva Created 2015-02-19 22:47:05 | Updated | Tags: unifiedgenotyper ploidy vcf Hi, I'm running on the Unified Genotyper (Version=3.3-0-g37228af) on a pooled sample. The ploidy is set to 32. I'm trying to get allele frequency information. I'm trying to filter sites based on depth of coverage but the DP flag is not consistent within a loci groupIV 756 . C T 141.24 . AC=11;AF=0.344;AN=32;BaseQRankSum=-2.927;DP=37;Dels=0.65;FS=0.000;HaplotypeScore=166.3715;MLEAC=11;MLEAF=0.344;MQ=86.28;MQ0=0;MQRankSum=-0.696;QD=3.82;ReadPosRankSum=-2.927;SOR=1.022 GT:AD:DP:GQ:PL 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1/1/1/1:8,5:13:0:155,39,25,17,12,9,6,4,2,1,1,0,0,0,0,1,2,2,4,5,7,9,11,14,17,21,25,31,38,47,60,2147483647,2147483647 The DP at the start says there is a depth of 37, but at the end it says its 13. Judging from neighbouring sites, the actual depth seems to be 37, but I'm confused why it seems to only be using 13 of them. At other sites the scores match. Created 2015-02-18 15:22:27 | Updated | Tags: haplotypecaller vcf Dear all, I need to generate vcf file with GATK and I need to have TI (transcript information) in INFO field and VF and GQX in FORMAT field. Could you help me please with arguments. My agruments are to call variants: java -jar$gatk -T $variant_caller -R$reference -I in.bam -D dbsnp -L bed_file -o .raw.vcf

I still have no TI info. and in FORMAT i have GT:AD:DP:GQ:PL and i NEED GT:AD:DP:GQ:PL:VF:GQX.

Thank you for any help with command line arguments.

Paul.

Created 2015-02-10 05:49:09 | Updated | Tags: unifiedgenotyper vcf

Hi, How do I know based on the REF and ALT column of a VCF file the actual position where an indel event happened? I usually see an event that occur in the 2nd base. For example, REF ALT GGCGTGGCGT G,GCGCGTGGCGT --deletion; insertion of "C" ATTT A,ATT --deletions

But I saw a VCF format documentation(http://samtools.github.io/hts-specs/VCFv4.2.pdf) allowing a different case: GTC G,GTCT --deletion; insertion of "T" at the end

How is UnifiedGenotyper formatting the indels? Does it differ from the one used in the 1000Genome Project? Thank you!

Roven

Created 2015-02-04 05:14:44 | Updated | Tags: variantrecalibrator vqsr vcf gatk

Hi,

I have generated vcf files using GenotypeGVCFs; each file contains variants corresponding to a different chromosome. I would like to use VQSR to perform the recalibration on all these data combined (for maximum power), but it seems that VQSR only takes a single vcf file, so I would have to combine my vcf files using CombineVariants. Looking at the documentation for CombineVariants, it seems that this tool always produces a union of vcfs. Since each vcf file is chromosome-specific, there are no identical sites across files. My questions are: Is CombineVariants indeed the appropriate tool for me to merge chromosome-specific vcf files, and is there any additional information that I should specify in the command-line when doing this? Do I need to run VariantAnnotator afterwards (I would assume not, since these vcfs were generated using GenotypeGVCFs and the best practices workflow more generally)? I just want to be completely sure that I am proceeding correctly.

Thank you very much in advance, Alva

Created 2015-02-04 04:17:31 | Updated 2015-02-04 04:18:06 | Tags: unifiedgenotyper vcf

Hi

I'm wondering why some positions in VCF overlap. Can GATK skip emitting positions that are already part of an indel/concatenated ref? We need to use EMIT_ALL_SITES but it's quite confusing if a position is represented multiple times especially if indel is present. I understand that there is always a duplicated position preceding an indel, but the positions after the indel should ideally be not overlapping the neighboring positions. Please see some examples below:

chr02 28792823 . C . 34.23 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:12 chr02 28792823 . CAA . 10000037.27 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:AD:DP 0/0:12:12 chr02 28792824 . A . . . DP=12;MQ=57.31;MQ0=0 GT ./. chr02 28792825 . A . 31.22 . AN=2;DP=12;MQ=57.31;MQ0=0 GT:DP 0/0:7

chr02 314879 . T . 100.23 . AN=2;DP=27;MQ=55.82;MQ0=0 GT:DP 0/0:27 chr02 314879 . TAGAG T,TAG 647.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=27;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=55.82;MQ0=0;QD=23.97;RPA=8,6,7;RU=AG;STR GT:AD:DP:GQ:PL 1/2:0,9,11:26:99:989,445,424,395,0,335 chr02 314880 . A . 43.23 . AN=2;DP=26;MQ=55.66;MQ0=0 GT:DP 0/0:5 chr02 314881 . G . 40.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:4 chr02 314882 . A . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16 chr02 314883 . G . 73.23 . AN=2;DP=25;MQ=55.47;MQ0=0 GT:DP 0/0:16

This next example even calls a SNP for position 10221400 even if deletion occurs in 10221399.

chr02 10221399 . C . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17 chr02 10221399 . CA CCTA,C 143.19 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=8.42 GT:AD:DP:GQ:PL 1/2:0,8,6:17:99:527,131,112,185,0,143 chr02 10221400 . A C,T 287.29 . AC=1,1;AF=0.500,0.500;AN=2;DP=17;Dels=0.29;FS=0.000;HaplotypeScore=6.7569;MLEAC=1,1;MLEAF=0.500,0.500;MQ=58.74;MQ0=0;QD=16.90 GT:AD:DP:GQ:PL 1/2:0,4,8:12:88:407,294,285,112,0,88 chr02 10221401 . A C 163.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.854;DP=17;Dels=0.00;FS=1.913;HaplotypeScore=9.7562;MLEAC=1;MLEAF=0.500;MQ=58.74;MQ0=0;MQRankSum=1.558;QD=9.63;ReadPosRankSum=2.764 GT:AD:DP:GQ:PL 0/1:11,6:17:99:192,0,392 chr02 10221402 . A . 76.23 . AN=2;DP=17;MQ=58.74;MQ0=0 GT:DP 0/0:17

Thanks in advance for your help. rfuentes

Created 2015-01-29 03:11:54 | Updated | Tags: vcf

I'm getting the output from GATK ChrSy 198904 . C . 44.99 LowQual AN=2;DP=8;MQ=39.35;MQ0=0 GT:DP 0/0:8 ChrSy 198904 . CGTCCGATATTTGCGAAATATCG . Infinity . DP=8;MQ=39.35;MQ0=0 GT ./.

ChrSy 464065 . A . 113.99 . AN=2;DP=37;MQ=25.38;MQ0=0 GT:DP 0/0:37 ChrSy 464066 . G . 113.99 . AN=2;DP=37;MQ=25.38;MQ0=0 GT:DP 0/0:37 ChrSy 464066 . GTT . 60.03 . AN=2;DP=37;MQ=25.38;MQ0=0 GT:DP 0/0:28 ChrSy 464069 . T . 113.99 . AN=2;DP=37;MQ=25.38;MQ0=0 GT:DP 0/0:37

Why do positions 198904 and 464066 have multiple-base REF but ALT="."? Why did GATK merge them in a single line unlike other positions(emit-all sites VCF) and still have a duplicated position preceding it? I thought only indel has an anchor base(position before the event)? Does ./. mean "same with the reference"?

Thank you!

Created 2015-01-27 20:25:42 | Updated | Tags: variantrecalibrator vqsr vcf gatk

Hi,

I ran VariantRecalibrator and ApplyRecalibration, and everything seems to have worked fine. I just have one question: if there are no reference alleles besides "N" in my recalibrate_SNP.recal and recalibrate_INDEL.recal files, and in the "alt" field simply displays , does that mean that none of my variants were recalibrated? Just wanted to be completely sure. My original file (after running GenotypeGVCFs) has the same number of variants as the recalibrated vcf's.

Thanks, Alva

Created 2015-01-21 19:53:26 | Updated | Tags: baserecalibrator haplotypecaller vcf bam merge rnaseq

Hi, I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.

I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when haplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.

When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?

Alternatively, can BaseRecallibrator and/or HaplotypeCaller simultaneously analyze multiple bam files?

Is it common practice to combine bam files for discovering sequence variants?

Created 2015-01-09 16:47:35 | Updated | Tags: bqsr haplotypecaller variantfiltration vcf bam workflows snps gatk3

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

• 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
• 02- BAM: MarkDuplicates
• 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
• 04- BAM: Calling variants using HaplotypeCaller
• 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
• 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
• 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
• 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Created 2014-12-19 17:35:25 | Updated | Tags: combinevariants vcf error

Hi GATK team,

I am attempting to combine a HaplotypeCaller generated VCF with some indels called using pindel using the following arguments (GATK v3.3-0-g37228af):

-R /data/shared/ref/b37/human_g1k_v37.fasta -T CombineVariants --variant:GATK var.HiSeqDecember.raw.vcf --variant:pindel pindel_combined.vcf -o var.HiSeqDecember.pindel.raw.vcf -genotypeMergeOptions PRIORITIZE -priority GATK,pindel

However I get the following error:

##### ERROR MESSAGE: Badly formed variant context at location 1:157718231; getEnd() was 157718235 but this VariantContext contains an END key with value 157718231

The variants in question are (from GATK):

1 157718231 . CAAAT C,CAAATAAAT 2533.56 PASS AC=3,13;AF=0.125,0.542;AN=24;BaseQRankSum=1.762;ClippingRankSum=-0.327;DP=126;FS=0.000;HOMLEN=39;HOMSEQ=AAATAAATAAATAAATAAATAAATAAATAAATAAATAAA;InbreedingCoeff=-0.1260;MLEAC=3,13;MLEAF=0.125,0.542;MQ=70.00;MQ0=0;MQRankSum=0.920;QD=22.22;ReadPosRankSum=-0.893;SOR=0.976;SVLEN=4;SVTYPE=INS;set=Intersection GT:DP:GQ 0/0:10:30 0/2:9:18 2/2:6:18 2/2:5:15 0/1:10:99 0/2:14:99 2/2:8:24 2/2:6:18 2/2:7:21 0/2:17:99 0/1:5:75 0/1:6:27

and (from pindel):

1 157718231 . C CAAAT . PASS AC=2;AF=0.143;AN=14;END=157718231;HOMLEN=39;HOMSEQ=AAATAAATAAATAAATAAATAAATAAATAAATAAATAAA;SVLEN=4;SVTYPE=INS;set=variant3-variant4-variant6-variant7-variant8-variant9-variant10 GT:AD ./. ./. 0/0:0,7 0/0:0,6 ./. 0/0:0,9 0/0:0,8 0/0:0,7 0/0:0,8 1/1:0,12 ./. ./.

It is worth noting that the pindel VCF here was merged together from several pindel-generated VCFs using CombineVariants without any complaint from the GATK. It looks to me that the END key is correct for the pindel variant (a simple insertion), but the GATK is confused due to the mixed deletion/insertion variant generated by the HaplotypeCaller at the same position (without an END key).

I can rerun the command after stripping all END tags from the pindel VCF and the command completes successfully, so this is not a showstopper for me but I assume this is a bug(?) and if so, it would be great if there were a fix.

Cheers,

Dave

Created 2014-12-09 21:47:52 | Updated | Tags: vcf genotypegvcfs gvcf variant-calling

Hi,

I used GenotypeGVCFs with 3 input gvcf files (3 individuals) to create a vcf file, and this seems to work, but when I examine the sites in the final vcf file, there are sites that are missing. I am in the process of calculating exactly how many sites are missing, but taking an initial section of the vcf file and initial sections of my 3 gvcf files, the initial set of variant positions in the 3 gvcf files combined (called "test file") is:

16050036 16050612 16050822 16050933 16051556 16051968 16051994 16052080 16052167 16052239 16052250 16052357 etc.

whereas the initial set of sites in my final vcf file is:

16050822 16050933 16051347 16051497 16051556 16051968 16051994 16052080 16052167 16052169 16052239 16052357 etc.

First of all, there are sites in the final vcf file (in bold) that are fixed for the 3 individuals, but that are still included (the individuals are all 0/1 at these positions). I removed these positions when I created my test file, so they don't show up there, as you can see. Second, there are sites in the test file that are not in the final vcf file (in bold), even though these are most definitely variant sites (I checked them - f.ex., 16050036 is a SNP). I'm not sure why these discrepancies are occurring.

I also got this warning 3 times when running GenotypeGVCFs: WARN 20:04:45,102 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 22:21483632 has 7 alternate alleles so only the top alleles will be used How do I relax the requirements of "at most 6 alternate alleles" to allow more in the case of indels? I am using the newest version of GATK (3.3).

FYI, this is the command I used for GenotypeGVCFs: java -Xmx2g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs --variant file1.vcf --variant file2.vcf --variant file3.vcf -o final.vcf -L 22 (only running this for chr22)

Thank you in advance, Alva

Created 2014-12-05 15:27:21 | Updated 2014-12-05 18:08:01 | Tags: haplotypecaller best-practices vcf gatk gvcf

Hi,

I used HaplotypeCaller in GVCF mode to generate a single sample GVCF, but when I checked my vcf file I see that the reference allele is not showing up:

22  1   .   N   <NON_REF>   .   .   END=16050022    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
22 16050023    .   C   <NON_REF>   .   .   END=16050023    GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,37
22  16050024    .   A   <NON_REF>   .   .   END=16050026    GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,73
22  16050027    .   A   <NON_REF>   .   .   END=16050035    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,110
22  16050036    .   A   C,<NON_REF> 26.80   .   BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB   0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0
22  16050037    .   G   <NON_REF>   .   .   END=16050037    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,109
22  16050038    .   A   <NON_REF>   .   .   END=16050039    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,153

I am not sure where to start troubleshooting for this, since all the steps prior to using HaplotypeCaller did not generate any obvious errors.

The basic command that I used was: java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Have you encountered this problem before? Where should I start troubleshooting?

Thanks very much in advance, Alva

Created 2014-12-04 19:17:24 | Updated | Tags: baserecalibrator haplotypecaller vcf convergence bootstrap

I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecallibrator, it is recommended that when lacking a reliable set of sequence variants: "You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecallibrator in a loop should be fairly strait forward. What is a good strategy for comparing VCF files and assessing convergence?

Created 2014-12-04 18:54:17 | Updated | Tags: haplotypecaller vcf rnaseq

Specifically, what does the 'start' component of this flag mean? Do the reads all have to start in exactly the same location? Alternatively, does the flag specify the total number of reads that must overlap a putative variant before that variant will be considered for calling?

Created 2014-12-01 06:55:25 | Updated 2014-12-01 14:03:45 | Tags: vcf genotype-likelihood

Hi,

We found this line in a VCF file and we're confused why GATK gave 1/1 when Ref has greater reads than Alt

chr10   5129483 .       A       C       49.28   .       AC=2;AF=1.00;AN=2;BaseQRankSum=-2.152;DP=29;    GT:AD:DP:GQ:PL  1/1:16,13:29:9:77,9,0

Is PL computed from filtered AD?
Thank you!

Created 2014-11-22 19:53:56 | Updated 2014-11-22 19:54:25 | Tags: haplotypecaller vcf

The SB field of HaplotypeCaller output is not described terribly well as far as I can find.
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">

What exactly happens when there are multiple alternate alleles? For example:
scaffold_1 2535 . T C,TTC,<NON_REF> 611.83 . DP=15;MLEAC=1,1,0;MLEAF=0.500,0.500,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/2:0,5,10,0:15:99:630,397,379,210,0,180,607,394,210,604:0,0,12,3
It doesn't seem to be particularly informative in this case (a case which is rather common for our data).

If it isn't already part of the possible annotations...
Perhaps the most sensible approach would be to output field with num-fwd, num-rev for each allele (rev, alt1, alt2, ...). SDP for "strand-depth" might be a reasonable name.

Created 2014-10-14 15:55:25 | Updated | Tags: diagnosetargets vcf

I have used a VCF file that was produced by GATK for the -L option of DiagnoseTargets, but I get the alternative allele from the original VCF as the reference allele on the output vcf from Diagnose targets: input VCF:

chr1 10425470 . G A 139.99 SNPBHF AC=1;AF=0.042;AN=24;BaseQRankSum=-5.862e+00;Clipp......

chr1 10425470 . A <DT> . PASS END=10425470;GC=1.00;IDP=805.00 FT:IDP:LL:ZL......

The GC content reflects the reference allele though. Is this normal behaviour or a bug?

Created 2014-09-30 19:53:52 | Updated | Tags: best-practices vcf

Hi, I am following the Best Practices for DNAseq analysis and have 2 quick questions:

1. I wanted to confirm if the VCF files produced after the VariantRecalibrator and ApplyRecalibration steps for SNP and for Indels are completely independent of each other. In other words, the VCF file produced after these two steps for SNPs (mode SNP) is just for SNPs and for Indels (mode INDEL) is just for Indels.
2. What is the source of the ALT alleles in the VCF file - is it the various annotation files that were used during the analysis? Thanks,
• Pankaj

Created 2014-09-17 16:45:28 | Updated | Tags: vcf genotypeconcordance

Hello, I just ran genotype concordance in order to determine how similar two samples were. However, in the output, everything is showing up as zero, the NRD determined is 1 and the overall genotype concordance is also 1. Has anyone encountered this before. The two samples I am using for --eval and --comp are actually the same sample that were sequenced using two different methodologies (HiSeq and Ion proton). So it is odd that I am getting this output. Any help is appreciated.

Thanks, Ricky

Created 2014-09-15 16:05:28 | Updated | Tags: combinevariants vcf

Hello everyone, I have used the ion proton system to sequence 15 sample exomes so far. In this system, only 3 samples can be run at a given time, so I essentially have five sets of samples. Is it still possible to use CombineVariants on the VCF files of samples that were from different sequencing runs? Thank you, Ricky

Created 2014-09-09 15:44:02 | Updated | Tags: vcf variant-recalibration

Hello, What annotations are recommended to be used for the variant recalibrator for an exome sequencing project? I am using the Ion proton system so my VCF files are automatically generated after the sequencing run. Also, would I still have to mark duplicates and realign indels in my BAM files before viewing them on a genome viewer like IGV? Since I already have my VCF files, I don't know if this will be necessary. Thank you, these forums are very helpful! Ricky

Created 2014-08-11 09:50:56 | Updated | Tags: unifiedgenotyper vcf qual variant-calling

I used the EMIT_ALL_SITES option with Unified Genotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. But for monomorphic sites, (at this site, we have an homozygote for the reference allele), what is the definition of the quality score ? and how is it computed ? Many thanks for your explanation.

Created 2014-06-30 08:33:39 | Updated 2014-06-30 08:34:34 | Tags: vcf genotypegvcfs

Hi

I've just noticed this in some records, in some bi-allelic sites the AD field for some samples can have far more than 2 columns and the values can be separated oddly. For example in the record at the bottom of the 0/1:21,11,0,0,0:32:99:398,0,677 the site is biallelic so I'd expect the AD to be 21,11 with a DP of 32, but instead we have 21,11,0,0,0.

Then at another position here we have 0/1:12,0,9,0,0:21:99:220,0,792 with the AD 12,0,9,0,0 DP 21 when I would have expected to see AD 12,9 DP 21. The 0 between the ref and alt alleles is particularly troubling.

 chr1    110326664       rs209200395     A       C       20726.45        .       AC=65;AF=0.192;AN=338;BaseQRankSum=1.05;ClippingRankSum=-6.610e-01;DB;DP=3547;FS=0.000;InbreedingCoeff=0.0055;MLEAC=66;MLEAF=0.195;MQ=60.00;MQ0=0;MQRankSum=-3.900e-02;QD=16.71;ReadPosRankSum=0.638    GT:AD:DP:GQ:PL  0/0:.:5:15:0,15,156     0/0:.:2:6:0,6,69        0/0:.:23:60:0,60,900    0/0:.:9:27:0,27,275     0/0:.:26:63:0,63,945    0/0:.:41:99:0,102,1433  0/0:.:23:60:0,60,774    0/0:.:23:60:0,60,900    0/0:.:6:9:0,9,135       0/0:.:3:9:0,9,86        0/0:.:6:12:0,12,180     0/0:.:9:24:0,24,288     0/0:.:3:6:0,6,90        0/1:21,11,0,0,0:32:99:398,0,677 0/0:.:29:61:0,61,939    0/1:9,11,0,0,0:20:99:350,0,241  0/0:.:26:43:0,43,832    0/0:.:22:63:0,63,945    0/0:.:30:68:0,68,916    0/0:.:21:57:0,57,855    0/0:.:23:62:0,62,719    0/1:10,14,0,0,0:24:99:509,0,270 1/1:0,23,0,0,0:23:85:1001,85,0  0/1:18,13,0,0,0:31:99:426,0,485 0/0:.:25:68:0,68,812    0/1:17,10,0,0,0:27:99:357,0,518 0/0:.:38:72:0,72,1223   0/0:.:25:64:0,64,823    0/0:.:22:60:0,60,900    0/0:.:21:60:0,60,707    0/1:3,6,0,0,0:9:86:184,0,86     0/0:.:25:60:0,60,900    0/0:.:26:72:0,72,1080   0/0:.:22:60:0,60,900    0/0:.:37:81:0,81,1257   0/0:.:24:65:0,65,838    0/0:.:32:65:0,65,1055   0/0:.:32:73:0,73,1069   0/1:8,0,12,0,0:20:99:467,0,494  0/0:.:22:60:0,60,900    0/1:3,3,0,0,0:6:81:197,0,81     0/1:8,12,0,0,0:20:99:369,0,228  0/0:.:19:26:0,26,615    0/0:.:21:60:0,60,761    1/1:0,24,0,0,0:24:60:984,60,0   0/0:.:24:60:0,60,900    0/0:.:21:60:0,60,747    0/0:.:34:87:0,87,1166   0/0:.:21:60:0,60,701    0/0:.:19:41:0,41,646    0/0:.:21:63:0,63,720    0/0:.:18:29:0,29,609    0/0:.:26:69:0,69,1035   0/1:8,3,0,0,0:11:99:104,0,229   0/0:.:23:60:0,60,900    0/0:.:20:60:0,60,681    0/1:7,0,8,0,0:15:99:327,0,517   0/0:.:2:6:0,6,68        ./.:.:0 0/0:.:31:66:0,66,1067   0/0:.:2:6:0,6,44        0/0:.:3:9:0,9,87        0/0:.:2:6:0,6,66        0/0:.:1:3:0,3,30        0/0:.:2:6:0,6,68        0/0:.:18:26:0,26,583    0/0:.:2:6:0,6,64        0/1:3,5,0,0,0:8:79:150,0,79     0/0:.:2:6:0,6,65        0/0:.:5:15:0,15,150     0/1:11,13,0,0,0:24:99:445,0,323 0/0:.:49:75:0,75,1662   0/1:7,11,0,0,0:18:99:328,0,172  0/0:.:22:63:0,63,753    0/1:11,11,0,0,0:22:99:400,0,313 0/0:.:35:37:0,37,1051   0/0:.:24:66:0,66,990    0/1:10,0,16,0,0:26:99:542,0,538 0/0:.:23:63:0,63,945    0/0:.:21:60:0,60,900    0/1:7,0,5,0,0:12:99:149,0,507   0/0:.:29:74:0,74,990    0/0:.:2:0:0,0,4 0/1:5,12,0,0,0:17:99:397,0,148  0/1:8,0,11,0,0:19:99:364,0,489  0/0:.:11:27:0,27,405    0/1:10,16,0,0,0:26:99:508,0,259 0/1:9,0,6,0,0:15:99:253,0,579   0/0:.:28:54:0,54,894    0/1:16,6,0,0,0:22:99:198,0,460  0/0:.:26:60:0,60,886    0/1:9,8,0,0,0:17:99:270,0,238   0/0:.:30:78:0,78,1170   0/1:9,14,0,0,0:23:99:442,0,246  0/0:.:17:21:0,21,531    0/0:.:25:64:0,64,832    0/0:.:30:65:0,65,1002   0/0:.:28:65:0,65,926    ./.:.:0 0/0:.:2:6:0,6,64        0/1:8,9,0,0,0:17:99:320,0,216   0/0:.:2:6:0,6,56        0/1:12,14,1,0,0:27:99:419,0,351 0/0:.:29:81:0,81,1215   0/1:12,0,12,0,0:24:99:307,0,782 0/0:.:24:69:0,69,809    0/0:.:23:60:0,60,900    0/0:.:28:62:0,62,990    0/1:9,0,8,0,0:17:99:357,0,776   1/1:0,10,0,0,0:10:32:355,32,0   1/1:0,12,0,0,0:12:35:435,35,0   0/1:15,13,0,0,0:28:99:418,0,442 0/1:10,0,9,0,0:19:99:326,0,845  0/0:.:29:61:0,61,970    0/0:.:27:60:0,60,900    0/0:.:33:61:0,61,1049   0/1:12,13,0,0,0:25:99:416,0,361 0/0:.:25:69:0,69,1035   0/1:10,13,0,0,0:23:99:451,0,269 0/1:21,14,0,0,0:35:99:441,0,599 0/0:.:12:22:0,22,376    0/1:10,14,0,0,0:24:99:457,0,273 0/1:10,0,14,0,0:24:99:525,0,646 0/1:15,12,0,0,0:27:99:319,0,416 0/1:5,8,0,0,0:13:99:297,0,123   0/0:.:14:7:0,7,411      0/0:.:18:19:0,19,515    0/0:.:21:63:0,63,745    0/0:.:27:72:0,72,1080   0/1:17,8,0,0,0:25:99:358,0,478  0/1:11,15,0,0,0:26:99:627,0,290 0/0:.:16:13:0,13,506    0/0:.:21:60:0,60,900    0/1:10,0,19,0,0:29:99:579,0,793 0/0:.:33:63:0,63,1077   0/0:.:13:39:0,39,432    0/0:.:8:18:0,18,270     1/1:0,15,0,0,0:15:54:630,54,0   0/0:.:9:27:0,27,297     0/0:.:31:62:0,62,1030   1/1:0,17,0,0,0:17:50:601,50,0   0/0:.:31:59:0,59,977    0/1:18,16,0,0,0:34:99:550,0,526 0/0:.:24:61:0,61,796    0/0:.:9:27:0,27,296     0/1:5,5,0,0,0:10:99:206,0,103   0/1:17,11,0,0,0:28:99:406,0,511 0/1:8,9,0,0,0:17:99:241,0,224   0/0:.:13:33:0,33,495    0/0:.:44:84:0,84,1358   0/0:.:13:20:0,20,495    0/0:.:15:21:0,21,387    0/1:17,6,0,0,0:23:99:117,0,498  0/1:18,13,0,0,0:31:99:451,0,519 1/1:0,34,0,0,0:34:99:1375,116,0 0/0:.:18:32:0,32,564    0/0:.:16:20:0,20,535    0/0:.:29:62:0,62,995    0/1:13,0,4,0,0:17:92:92,0,740   0/0:.:17:21:0,21,559    0/0:.:30:57:0,57,1006   0/0:.:23:60:0,60,900    0/1:12,0,9,0,0:21:99:220,0,792  0/0:.:30:56:0,56,967    0/0:.:23:63:0,63,945    0/0:.:29:69:0,69,999    0/1:8,12,0,0,0:20:99:412,0,237  0/0:.:26:60:0,60,980    0/0:.:27:62:0,62,872    0/1:19,9,0,0,0:28:99:285,0,587  0/1:15,14,0,0,0:29:99:422,0,430

Created 2014-06-11 08:23:18 | Updated 2014-06-11 11:59:33 | Tags: vcf

Dear GATK team,

I have got an error message while using RealignTargetCreator -- ERROR MESSAGE: Your input file has a malformed header: there are not enough columns present in the header line: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE However, after checked the standard VCF format, I found it seems that there is no problem with my VCF file.

The VCF file is generated by Dindel. That would be great if you could told me why I got this error message. Many thanks!

Here is the header and sample line:

##fileformat=VCFv4.0
##source=Dindel
##reference=ref.fa
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total number of reads in haplotype window">
##INFO=<ID=HP,Number=1,Type=Integer,Description="Reference homopolymer tract length">
##INFO=<ID=NF,Number=1,Type=Integer,Description="Number of reads covering non-ref variant on forward strand">
##INFO=<ID=NR,Number=1,Type=Integer,Description="Number of reads covering non-ref variant on reverse strand">
##INFO=<ID=NFS,Number=1,Type=Integer,Description="Number of reads covering non-ref variant site on forward strand">
##INFO=<ID=NRS,Number=1,Type=Integer,Description="Number of reads covering non-ref variant site on reverse strand">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##ALT=<ID=DEL,Description="Deletion">
##FILTER=<ID=q20,Description="Quality below 20">
##FILTER=<ID=hp10,Description="Reference homopolymer length was longer than 10">
##FILTER=<ID=fr0,Description="Non-ref allele is not covered by at least one read on both strands">
##FILTER=<ID=wv,Description="Other indel in window had higher likelihood">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE

Created 2014-06-10 20:57:36 | Updated | Tags: variantannotator best-practices vcf developer performance walkers java gatk gatk-walkers

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

Created 2014-06-03 11:58:36 | Updated 2014-06-03 12:10:00 | Tags: baserecalibrator vcf vcftools index-file

I am performing analysis on mm9 (mouse). Downloaded VCF file from UCSC. Removed chr*_random and chrM Used vcftool v4.0 prepared two files ONLY INDEL and ONLY SNPS vcf files (DBINDEL, DBSNP) Please kindly suggest me what needs to be done in this case? Command used java -Xmx30g -XX:+UseGCOverheadLimit -jar GenomeAnalysisTK.3.1-1-g07a4bf8.jar --log_to_file gatk.realigned.fixed.log --performanceLog gatk.realigned.fixed.perflog --keep_program_records -T BaseRecalibrator -R REFERENCE -I INPUTBAMNAME.realigned.fixed.bam -o OUTPUT_table --knownSites DBINDEL --knownSites DBSNP --deletions_default_quality 45 --insertions_default_quality 45 --low_quality_tail 2 --solid_nocall_strategy LEAVE_READ_UNRECALIBRATED --solid_recal_mode SET_Q_ZERO --lowMemoryMode --no_standard_covs --bqsrBAQGapOpenPenalty 30.0

some times I get this error: NFO 12:42:13,747 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21 INFO 12:42:13,748 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 12:42:13,748 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 12:42:13,752 HelpFormatter - Program Args: --log_to_file /media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.gatk.realigned.fixed.log --performanceLog /media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.gatk.realigned.fixed.perflog --keep_program_records -T BaseRecalibrator -R /media/NAS1/PFlab/Mano/Genome/reference.fa -I /media/NAS1/PFlab/Mano/AllBAM/GATK/H3K4me1_run22_F3_5.sorted.marked.intervals.realigned.fixed.bam -o /media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.intervals.realigned.after_recal_data.table --knownSites /media/NAS1/PFlab/Mano/Fwithoutchrsorted.vcf --knownSites /media/NAS1/PFlab/Mano/New.SNPs_only_withoutCHR_sort.vcf --deletions_default_quality 45 --insertions_default_quality 45 --low_quality_tail 2 --solid_nocall_strategy LEAVE_READ_UNRECALIBRATED --solid_recal_mode SET_Q_ZERO --lowMemoryMode --no_standard_covs --bqsrBAQGapOpenPenalty 30.0 INFO 12:42:13,756 HelpFormatter - Executing as mano@balrog01 on Linux 3.2.0-27-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 12:42:13,756 HelpFormatter - Date/Time: 2014/06/03 12:42:13 INFO 12:42:13,756 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:42:13,757 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:42:14,556 GenomeAnalysisEngine - Strictness is SILENT INFO 12:42:14,634 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 12:42:14,643 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:42:14,687 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 INFO 12:42:15,979 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.broad.tribble.index.IndexFactory.loadIndex(IndexFactory.java:171) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:324) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:308) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:267) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:213) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:140) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool. (ReferenceOrderedDataSource.java:208) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource. (ReferenceOrderedDataSource.java:88) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:973) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:767) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:284) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.broad.tribble.index.IndexFactory.loadIndex(IndexFactory.java:167) ... 14 more Caused by: java.io.EOFException at org.broad.tribble.util.LittleEndianInputStream.readFully(LittleEndianInputStream.java:134) at org.broad.tribble.util.LittleEndianInputStream.readLong(LittleEndianInputStream.java:76) at org.broad.tribble.index.linear.LinearIndexChrIndex.read(LinearIndex.java:268) at org.broad.tribble.index.AbstractIndex.read(AbstractIndex.java:359) at org.broad.tribble.index.linear.LinearIndex. (LinearIndex.java:98) ... 19 more ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: java.lang.reflect.InvocationTargetException ##### ERROR ------------------------------------------------------------------------------------------ Current error: INFO 13:49:09,583 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21 INFO 13:49:09,583 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:49:09,583 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:49:09,588 HelpFormatter - Program Args: --log_to_file /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.gatk.realigned.fixed.log --performanceLog /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.gatk.realigned.fixed.perflog --keep_program_records -T BaseRecalibrator -R /home/mano/media/NAS1/PFlab/Mano/Genome/reference.fa -I /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/H3K4me1_run22_F3_5.sorted.marked.intervals.realigned.fixed.bam -o /home/mano/media/NAS1/PFlab/Mano/AllBAM/GATK/QualityScore/H3K4me1_run22_F3_5.sorted.marked.intervals.realigned.after_recal_data.table --knownSites /home/mano/media/NAS1/PFlab/Mano/Fwithoutchrsorted.vcf --knownSites /home/mano/media/NAS1/PFlab/Mano/SNPs_only_withoutCHR_sort.vcf --deletions_default_quality 45 --insertions_default_quality 45 --low_quality_tail 2 --solid_nocall_strategy LEAVE_READ_UNRECALIBRATED --solid_recal_mode SET_Q_ZERO --lowMemoryMode --no_standard_covs --bqsrBAQGapOpenPenalty 30.0 INFO 13:49:09,591 HelpFormatter - Executing as mano@balrog04 on Linux 3.2.0-26-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13. INFO 13:49:09,592 HelpFormatter - Date/Time: 2014/06/02 13:49:09 INFO 13:49:09,592 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:49:09,592 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:49:10,461 GenomeAnalysisEngine - Strictness is SILENT INFO 13:49:10,538 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 13:49:10,547 SAMDataSourceSAMReaders - Initializing SAMRecords in serial INFO 13:49:10,574 SAMDataSourceSAMReaders - Done initializing BAM readers: total time 0.03 INFO 13:50:17,644 RMDTrackBuilder - Writing Tribble index to disk for file /home/mano/media/NAS1/PFlab/Mano/SNPs_only_withoutCHR_sort.vcf.idx INFO 13:50:29,885 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: I/O error loading or writing tribble index file for /home/mano/media/NAS1/PFlab/Mano/SNPs_only_withoutCHR_sort.vcf ##### ERROR ------------------------------------------------------------------------------------------ Please note that .idx file is formed but only in Kilobytes Created 2014-05-28 08:06:16 | Updated | Tags: haplotypecaller vcf variant-calling Hi GATK team, I ran HaplotypeCaller on a bam file that I pre-pprocessed according to your Best-Practices. When looking in a genome browser at this bam, I encountered a position that seems to have a variant, but the variant doesn't appear in the vcf file. This is the line from the vcf file (the variant I'm talking about is at position 75680995): 3 75680992 . C . . END=75680995 GT:DP:GQ:MIN_DP:PL 0/0:155:0:147:0,0,0 Attached is the picture from the genome browser - At the same position (75680995) there is a variant with the allele 'A', while the reference is 'C'. The coverage at this position is 195, and 171 of those reads are for the 'A' allele. Why is this happen? Thank you, Maya Created 2014-05-27 10:53:30 | Updated 2014-05-27 10:55:27 | Tags: vcf multi-sample vcf-format genotypegvcfs Hi, I'm doing a variant analysis of genomic DNA from 2 related samples. I followed the up-to-date Best practices using HaplotypeCaller in GVCF mode for both samples followed by GenotypeGVCF to compute a common vcf of variant loci. I'm looking at variants that would be sample2-specific (present in sample2 but not in sample1) Here is a line of this file: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 chrIII 91124 . A AATAAGAGGAATTAGGCT 1132.42 . AC=2;AF=0.500;AN=4;DP=47;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=58.85;MQ0=0;QD=7.99 GT:AD:DP:GQ:PL 1/1:0,25:25:55:1167,55,0 0/0:.:22:33:0,33,495 In the Genotype Field, sample2.AD is a . (dot) meaning that no reads passed the Quality filters. However, sample2.DP=22 meaning that 22 reads covered this position. This line suggest that this variation is specific to sample1 (genotype HomVar 1/1) and is not present in sample2 (HomRef 0/0). But given the biological relationship between sample1 and 2 (the way they were generated), I doubt that this variation is true: it is very likely to be present in sample2 as well. It's a false I have 416 loci like this. For the vast majority of them, sample1 and 2 likely share the same variation. But since it is not impossible that a very few of them are really sample1=HomVar and sample2=HomRef, could you suggest me a way to detect those guys? What about comparing sample1.PL(1/1) and sample2.PL(0/0) ? For example could you suggest a rule of thumb to determine their ratio ? Created 2014-05-19 18:38:56 | Updated | Tags: vcf unparsable Hi! I try run GATK with SelectVariants. But the error message was generated. ##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 37: unparsable vcf record with allele X My vcf file have X. Do I need to change format? I made this with Bowtie2 -> Samtools -> Bcftools Here is my 50 lines of vcf file. ## fileformat=VCFv4.1 ## samtoolsVersion=0.1.19-44428cd ## reference=file://NODE_2_leng.fasta ## contig= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## INFO= ## FORMAT= ## FORMAT= ## FORMAT= ## FORMAT= ## FORMAT= ## FORMAT= ## FORMAT= # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT pe_repeat_sort.bam NODE_2_length_24922_cov_5660.89_ID_3 1 . T X . DP=5656;I16=5430,120,0,0,186033,6369057,0,0,230246,9637234,0,0,117493,2780997,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 2 . G X . DP=5959;I16=5703,150,0,0,195867,6688005,0,0,243427,10211483,0,0,119037,2816345,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 3 . T X . DP=6026;I16=5772,149,0,0,198149,6767651,0,0,246413,10342791,0,0,120850,2853280,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 4 . G T,X . DP=6418;I16=6110,193,2,0,210264,7157322,53,1517,263032,11066766,65,2257,122440,2884596,50,1250;QS=0.999747,0.000253,0.000000,0.000000;VDB=7.840000e-02;RPB=2.061348e+00 PL 0,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 5 . T X . DP=6471;I16=6156,194,0,0,211750,7207204,0,0,265091,11157299,0,0,124426,2918388,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 6 . T X . DP=6590;I16=6231,229,0,0,214810,7294948,0,0,269745,11355423,0,0,126359,2952585,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 7 . A T,X . DP=6799;I16=6344,308,3,0,221029,7499167,78,2178,277797,11694735,132,5808,128237,2985471,17,97;QS=0.999645,0.000355,0.000000,0.000000;VDB=2.063840e-02;RPB=-1.796363e+00 PL 0,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 8 . A X . DP=6940;I16=6453,374,0,0,229907,7902269,0,0,285040,11996704,0,0,131453,3049447,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 9 . G T,A,X . DP=7067;I16=6506,397,8,0,232521,8000427,275,9515,288326,12138944,347,15061,132809,3068393,146,3280;QS=0.998813,0.000894,0.000294,0.000000;VDB=4.127108e-03;RPB=-7.938902e-01 PL 0,255,255,255,255,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 10 . C X . DP=7122;I16=6564,398,0,0,235379,8123921,0,0,290896,12250986,0,0,135314,3114576,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 11 . T X . DP=7156;I16=6590,398,0,0,237130,8213824,0,0,292077,12304019,0,0,137422,3153354,0,0;QS=1.000000,0.000000,0.000000,0.000000 PL 0,255,255 NODE_2_length_24922_cov_5660.89_ID_3 12 . G T,X . DP=7392;I16=6719,483,13,0,244199,8451047,409,13477,301084,12684774,529,21969,139336,3191190,254,5850;QS=0.998319,0.001681,0.000000,0.000000;VDB=2.994939e-02;RPB=1.872494e-02 PL 0,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 13 . A G,T,X . DP=7491;I16=6768,524,2,1,246463,8505451,102,3486,304841,12842749,116,4528,141728,3238198,53,1145;QS=0.999584,0.000265,0.000151,0.000000;VDB=5.361253e-02;RPB=2.817384e-01 PL 0,255,255,255,255,255,255,255,255,255 NODE_2_length_24922_cov_5660.89_ID_3 14 . T C,X . DP=7508;I16=6779,530,2,0,248025,8592981,47,1325,305546,12872218,84,3528,144261,3290147,50,1250;QS=0.999810,0.000190,0.000000,0.000000;VDB=6.720000e-02;RPB=1.155135e+00 PL 0,255,255,255,255,255 Thank you. Won Created 2014-05-16 21:22:48 | Updated | Tags: baserecalibrator vcf realignment Hello I am following the gatk best practices guide, and using the 2 stages BaseRecalibrator process. The first recalibration is going well, the second one, produces an error like described below, I find it strange since it is not mentioned in the documentation (neither the tutorials that BaseRecalibrator require a vcf input Here is my command java -Xmx2g -jar GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -R GRCh37-lite.fa \ -I SA495-Tumor.sorted.realigned.bam \ -BQSR SA495-Tumor.sorted.realigned.grp \ -o SA495-Tumor.sorted.post_recal.grp2 here is the error message INFO 14:03:38,425 GATKRunReport - Uploaded run statistics report to AWS S3 ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8): ERROR ERROR This means that one or more arguments or inputs in your command are incorrect. ERROR The error message below tells you what is the problem. ERROR ERROR If the problem is an invalid argument, please check the online documentation guide ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ERROR ERROR Visit our website and forum for extensive documentation and answers to ERROR commonly asked questions http://www.broadinstitute.org/gatk ERROR ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ERROR ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation. Any idea ? Thanks Rad Created 2014-05-08 23:30:18 | Updated | Tags: combinevariants vcf genotypegvcfs I've noticed a small bug with GATK tools and VCF files. CombineVariants and GenotypeGVCF can generate files where some samples have fewer fields than the format column. For instance, this is part of a line from the VQSR-ed output VCF of GenotypeGVCF: 1 15820 rs200482301 G T 5909.59 VQSRTrancheSNP99.90to100.00 AC=21;AF=0.154;AN=136..... GT:AD:DP:GQ:PL 0/0:.:40:66:0,66,990 0/0:.:41:69:0,69,1035 1/1:0,20,1:21:78:985,78,0 0/0:.:35:60:0,60,900 ./.:.:1 0/0:.:7:21:0,21,233 .............. The second to last sample entry is ./.:.:1 (3 fields), while the format entry is GT:AD:DP:GQ:PL (5 fields). I think that GT=./., AD=., and DP=1, so the data is not getting messed up. This might even be within the rules of VCF, but one of the software that I use will not parse VCF files when 1 < # sample fields < # format fields. If sample entries were extended by ":." for every empty FORMAT field (unless only . or ./. was present in sample column), it would make the file parsable. It's not too hard of a manual fix, but it might be nice to add the functionality into the toolkit. I've seen it happen with CombineVariants as well, when the input VCF files have different numbers of FORMAT fields. Created 2014-05-06 23:07:46 | Updated | Tags: selectvariants vcf Is there a Walker to select a certain number of random variants from a VCF file? I have checked only the SelectVariants, but the only option similar is --select_random_fraction, that select a fraction of mutations instead of exact number. Created 2014-04-23 07:14:49 | Updated | Tags: realignertargetcreator vcf 1000g idx Hi, I'm trying to use the RealignerTargetCreator as a test with 1 know file; 1000G_phase1.indels.b37.vcf. At first the contigs didn't match with my .BAM file (chr1/chr2 vs 1/2), so I adjusted that. Now when running it for the first time; java - jar [path to genomeAnalysisTK.jar] -T RealignerTargetCreator -R [.fasta] -I [.bam] -o [.intervals] -known [path to 1000g_phase1_adjusted.indels.b37.vcf] it gives the following error: "I/O error loading or writing tribble index file for [path to 1000g]". When running it the second time, I get the following error; "Problem detecting index type", because the .idx file is not correctly created. What am I doing wrong? Created 2014-04-22 17:53:55 | Updated | Tags: snp vcf igv Hi, I start working with IGV, but I have some doubts in how to identify a good SPN in this program. First I download the new Soybean Genome on Phytozome (Gmax_275_v2.0.fa and Gmax_275_Wm82.a2.v1.gene.gff3 files), and then I upload my files (sample.vcf, sample.bam and sample.bam.bai) into the program. I indexed which files that program needed, so that's OK! But my doubt is which parameters should I consider for a good SNP? For example, what I need to see on Alleles, Genotypes and Variant Attributes? See the example below. Chr: Chr06 Position: 35170948 ID: . Reference: C* Alternate: T Qual: 160 Type: SNP Is Filtered Out: No Alleles: No Call: 0 Allele Num: 2 Allele Count: 4 Allele Frequency: 1 Minor Allele Fraction: 1 Genotypes: Non Variant: 0 • No Call: 0 • Hom Ref: 0 Variant: 1 • Het: 0 • Hom Var: 1 Variant Attributes AF1: 1 RPB: 5.557190e-01 VDB: 1.587578e-01 Depth: 18 FQ: -54 DP4: [1, 1, 6, 8] AC1: 2 Mapping Quality: 25 PV4: [1, 0.22, 1, 0.24] Created 2014-04-15 20:38:58 | Updated 2014-04-15 20:46:35 | Tags: vcf bug hapllotypecaller Hi I've just been having a go with the new Haplotype Caller method and I've getting some odd or malformed lines in the VCF file for example: For example the format line has declared we should have 4 fields for each Sample record but instead we have samples with 2 records. Two examples are shown here: See Block 1 GT:DP:GQ:PL 1/1:.:3:32,3,0 ./.:0 Or where the format block declares 5 fields and we get 3 instead: GT:AD:DP:GQ:PL 0/0:1,1,0,0,0,0,0,0,0:2:3:0,3,24,3,24,24 ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,40,3,41,43 ./.:.:3 Full blocks, Block 1 chr1 3489714 . G A 9667.23 . AC=154;AF=1.00;AN=154;DP=0;FS=0.000;InbreedingCoeff=-0.0391;MLEAC=154;MLEAF=1.00;MQ=0.00;MQ0=0;EFF=INTRAGENIC(MODIFIER|||||TIAM1||CODING|||1),INTRON(MODIFIER||||649|TIAM1|protein_coding|CODING|ENSBTAT00000064124|1|1);CSQ=A|ENSBTAG00000017839|ENSBTAT00000064124|Transcript|intron_variant||||||||1/5||1|TIAM1|HGNC| GT:DP:GQ:PL 1/1:.:3:32,3,0 1/1:.:18:207,18,0 1/1:.:6:78,6,0 1/1:.:12:140,12,0 1/1:.:9:101,9,0 ./.:0 1/1:.:9:97,9,0 1/1:.:9:96,9,0 1/1:.:21:244,21,0 1/1:.:12:138,12,0 1/1:.:12:124,12,0 1/1:.:9:105,9,0 1/1:.:15:164,15,0 1/1:.:15:153,15,0 1/1:.:27:265,27,0 1/1:.:12:125,12,0 1/1:.:18:214,18,0 ./.:0 1/1:.:9:108,9,0 1/1:.:15:169,15,0 ./.:0 1/1:.:6:76,6,0 1/1:.:6:66,6,0 1/1:.:12:140,12,0 1/1:.:3:28,3,0 1/1:.:3:10,3,0 1/1:.:12:128,12,0 ./.:0 1/1:.:18:181,18,0 1/1:.:9:98,9,0 1/1:.:15:161,15,0 1/1:.:15:185,15,0 1/1:.:12:133,12,0 1/1:.:15:175,15,0 1/1:.:18:178,18,0 1/1:.:12:133,12,0 1/1:.:9:105,9,0 1/1:.:12:141,12,0 1/1:.:15:166,15,0 1/1:.:9:108,9,0 1/1:.:15:160,15,0 1/1:.:27:267,27,0 1/1:.:21:218,21,0 1/1:.:9:107,9,0 1/1:.:3:28,3,0 1/1:.:9:80,9,0 1/1:.:6:46,6,0 ./.:0 1/1:.:6:61,6,0 1/1:.:21:241,21,0 1/1:.:15:161,15,0 1/1:.:6:82,6,0 1/1:.:12:143,12,0 1/1:.:9:109,9,0 1/1:.:21:249,21,0 1/1:.:6:40,6,0 1/1:.:9:94,9,0 1/1:.:15:185,15,0 1/1:.:12:129,12,0 1/1:.:12:132,12,0 ./.:0 1/1:.:21:207,21,0 1/1:.:12:136,12,0 1/1:.:12:109,12,0 1/1:.:18:192,18,0 ./.:0 1/1:.:9:68,9,0 1/1:.:12:138,12,0 1/1:.:6:73,6,0 1/1:.:9:105,9,0 1/1:.:9:98,9,0 1/1:.:6:65,6,0 ./.:0 1/1:.:6:65,6,0 ./.:0 1/1:.:6:58,6,0 1/1:.:12:131,12,0 ./.:0 ./.:01/1:.:3:38,3,0 1/1:.:3:37,3,0 1/1:.:21:227,21,0 1/1:.:12:131,12,0 1/1:.:6:66,6,0 1/1:.:9:100,9,0 1/1:.:21:209,21,0 1/1:.:6:63,6,0 1/1:.:6:69,6,0 Block 2 chr1 55248 . ACCC A,CCCC 179.69 . AC=13,6;AF=0.100,0.046;AN=130;BaseQRankSum=0.736;ClippingRankSum=0.736;DP=347;FS=0.000;InbreedingCoeff=0.2231;MLEAC=10,4;MLEAF=0.077,0.031;MQ=53.55;MQ0=0;MQRankSum=0.736;QD=4.99;ReadPosRankSum=0.736;EFF=INTERGENIC(MODIFIER||||||||||1),INTERGENIC(MODIFIER||||||||||2);CSQ=-||||intergenic_variant||||||||||||| GT:AD:DP:GQ:PL 0/0:1,1,0,0,0,0,0,0,0:2:3:0,3,24,3,24,24 ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,40,3,41,43 0/0:1,0,0,0,0,0,0,0,0:1:2:0,2,44,3,45,46 0/0:.:5:0:0,0,103,0,103,103 0/0:3,0,1,0,0,0,0,0,0:4:9:0,9,81,9,81,81 0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2 ./.:.:1 0/0:.:3:0:0,0,41,0,41,41 0/0:1,0,0,0,0,0,0,0,0:1:9:0,9,73,9,73,73 0/1:1,0,0,1,0,0,0,0,0:2:22:28,0,73,28,22,46 0/0:1,0,0,0,0,0,0,0,0:1:4:0,4,25,4,25,25 0/0:2,0,0,0,0,0,0,0,0:2:21:0,21,273,21,273,273 0/1:5,0,0,1,0,0,0,0,0:6:11:11,0,235,26,158,175 ./.:0,0,0,0,1,0,0,0,0:1 0/0:.:4:0:0,0,46,0,46,46 ./.:.:3 ./.:.:2 0/1:1,0,1,1,0,0,0,0,0:3:21:28,0,44,31,21,52 0/0:.:4:0:0,0,77,0,77,77 0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2 0/0:.:2:6:0,6,51,6,51,51 0/0:.:6:2:0,2,151,2,151,151 0/0:2,0,1,0,0,0,0,0,0:3:7:0,7,74,7,74,74 ./.:.:0 0/0:2,0,0,0,0,0,0,0,0:2:5:0,7,59,5,37,35 1/2:0,0,0,1,0,0,0,0,0:1:1:27,1,40,26,0,25 0/0:2,0,0,0,0,0,0,0,0:2:9:0,9,83,9,83,83 ./.:.:0 2/2:0,0,0,0,0,3,0,0,0:3:9:59,59,59,9,9,0 ./.:.:16 0/0:2,0,0,0,0,0,0,0,0:2:7:0,7,59,7,59,59 0/0:2,0,0,0,0,0,0,0,0:2:9:0,9,83,9,83,83 0/0:2,0,0,0,0,0,0,0,0:2:11:0,11,68,11,68,68 0/2:6,0,0,0,0,2,0,0,0:8:18:18,39,384,0,346,340 ./.:.:1 0/1:3,0,0,1,0,0,0,0,0:4:25:25,0,96,34,100,134 0/0:2,0,0,0,0,0,0,0,0:2:12:0,12,105,12,105,105 0/1:1,0,0,1,0,0,0,0,0:2:17:17,0,94,21,26,44 0/0:.:2:6:0,6,64,6,64,64 0/0:1,0,0,0,0,0,0,0,0:1:2:0,2,24,3,25,27 0/0:.:2:6:0,6,48,6,48,48 0/0:.:2:6:0,6,63,6,63,63 0/0:0,0,1,0,0,0,0,0,0:1:0:0,0,1,0,1,1 ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,6,3,8,9 0/0:1,0,0,0,0,0,0,0,0:1:15:0,15,124,15,124,124 ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:4:0,4,19,4,19,19 0/1:2,0,0,1,0,0,0,0,0:3:28:28,0,66,34,70,104 0/0:.:4:0:0,0,18,0,18,18 0/2:2,0,0,0,0,4,0,0,0:6:57:68,74,143,0,69,57 ./.:.:0 0/0:4,0,0,0,0,0,0,0,1:5:15:0,15,109,15,109,109 0/0:.:10:0:0,0,182,0,182,182 0/0:.:1:3:0,3,31,3,31,31 0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2 0/0:2,0,0,0,0,0,0,0,0:2:5:0,5,76,6,78,79 0/0:1,1,0,0,0,0,0,0,0:2:4:0,4,28,4,28,28 ./.:.:1 1/1:0,0,0,2,0,0,0,0,0:2:6:67,6,0,67,6,67 ./.:.:1 0/0:.:2:0:0,0,14,0,14,14 ./.:.:0 ./.:.:1 0/0:.:5:1:0,1,120,1,120,120 0/0:.:4:0:0,0,8,0,8,8 0/0:.:6:0:0,0,94,0,94,94 0/1:1,0,0,2,0,0,0,0,0:3:1:27,0,1,29,6,35 0/0:.:2:0:0,0,3,0,3,3 ./.:.:6 ./.:.:0 ./.:.:0 0/2:3,0,0,0,0,2,0,0,0:5:27:46,27,89,0,62,81 ./.:.:0 0/0:.:1:0:0,0,6,0,6,6 0/0:.:9:0:0,0,86,0,86,86 ./.:.:1 ./.:.:0 1/1:.:.:0:1,1,0,1,0,0 0/0:0,0,0,0,0,0,0,0,0:0:3:0,3,26,3,26,26 0/1:4,0,0,2,0,0,0,0,0:6:49:49,0,93,61,100,160 0/0:.:3:0:0,0,14,0,14,14 0/0:.:8:0:0,0,60,0,60,60 0/0:2,1,0,0,0,0,0,0,0:3:6:0,6,37,6,37,37 ./.:.:2 0/0:.:9:0:0,0,81,0,81,81 0/0:0,0,0,0,0,0,0,0,0:0:1:0,1,2,1,2,2 Any idea what the issue is? Created 2014-04-03 23:13:39 | Updated | Tags: queue vcf qscript scala Hi, Thanks to previous replies can run Queue and the relevant walker on a distributed computing server. The question was if I define my scala script to require an argument for the output file, using the -o parameter like so:  // Required arguments. All initialized to empty values. .... @Input(doc="Output file.", shortName="o") var outputFile: File = _ How do I direct the output to pipe the result to a specified directory? Currently I have the code: genotyper.out = swapExt(qscript.bamFile, "bam", outputFile, "unfiltered.vcf") Currently when I include the string -o /path/to/my/output/files/MyResearch.vcf The script creates a series of folders within the directory where I execute Queue from. In this case my results were sent to: /Queue-2-8-1-g932cd3a/MyResearch./path/to/my/output/files/MyResearch.unfiltered.vcf when all I wanted was the output to appear in the path: /path/to/my/output/files/MyResearch.unfiltered.vcf As always any help is much appreciated. Created 2014-03-11 08:10:55 | Updated | Tags: vcf bam rodwalker RodWalkers are really fast and great. Is there any way to connect a BAM file to it and get the pileup in the given positions? If not, what is the appropriate way to get pileups for a small set of selection positions? Created 2014-02-26 13:11:55 | Updated | Tags: unifiedgenotyper vcf genotyping Dear GATK team, Could you please help me how to explain genotyping in my vcf file. I have Illumina data and vcf caller was GATK. My variant frequency (Alt variant freq) is 99.7%. DP = 4622 (AD = 16, 4606) - so I would expect that this sample is alternate homozygous. But when I check PL value, which is - PL = 1655,0,323 - after calculating my likelihood - REF= G ALT= A P(D|GG) = 10 ^ -165.5 = small P(D|AG) = 10 ^ 0 = 1 P(D|AA) = 10 ^ 32.3 = small we can see it is heterozygous. Can anybody help me how to interpret my result? How it is possible that likelihoods show me heterozygous and coverage and VF show me homozygous? Here is part of my vcf file: chr13 32899193 . G A 1625.01 PASS AC=1;AF=0.5;AN=2;DP=4622;QD=0.35;TI=NM_000059;GI=BRCA2;FC=Silent GT:AD:DP:GQ:PL:VF:GQX 0/1:16,4606:5000:99:1655,0,323:0.997:99 Thank you for any explanation. Paul. Created 2014-02-20 08:40:28 | Updated | Tags: vcf fastaalternatereferencemaker Dear all I am writing to you to understand why FastaAlternateReferenceMaker is missing a deletion in the final consensus. I am using gatk v. 2.8-1 and this commandline: gatk -R ref.fa -T FastaAlternateReferenceMaker -o reads_vs_ref_gatk_consensus.fa --variant reads_vs_ref_var.vcf output from gatk was: INFO 08:57:46,123 HelpFormatter - -------------------------------------------------------------------------------- INFO 08:57:46,126 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15 INFO 08:57:46,126 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 08:57:46,126 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 08:57:46,130 HelpFormatter - Program Args: -R ref.fa -T FastaAlternateReferenceMaker -o reads_vs_ref_gatk_consensus.fa --variant reads_vs_ref_var.vcf INFO 08:57:46,130 HelpFormatter - Date/Time: 2014/02/20 08:57:46 INFO 08:57:46,131 HelpFormatter - -------------------------------------------------------------------------------- INFO 08:57:46,131 HelpFormatter - -------------------------------------------------------------------------------- INFO 08:57:46,137 ArgumentTypeDescriptor - Dynamically determined type of reads_vs_ref_var.vcf to be VCF INFO 08:57:46,856 GenomeAnalysisEngine - Strictness is SILENT INFO 08:57:46,978 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 08:57:47,010 RMDTrackBuilder - Creating Tribble index in memory for file reads_vs_ref_var.vcf INFO 08:57:47,079 RMDTrackBuilder - Writing Tribble index to disk for file /home/aprea/test/consensus/reads_vs_ref_var.vcf.idx INFO 08:57:47,223 GenomeAnalysisEngine - Preparing for traversal INFO 08:57:47,224 GenomeAnalysisEngine - Done preparing for traversal INFO 08:57:47,225 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 08:57:47,226 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 08:57:47,546 ProgressMeter - done 1.05e+04 0.0 s 30.0 s 99.0% 0.0 s 0.0 s INFO 08:57:47,547 ProgressMeter - Total runtime 0.32 secs, 0.01 min, 0.00 hours INFO 08:57:51,452 GATKRunReport - Uploaded run statistics report to AWS S3 There are 3 variants in the vcf file: the first is a SNP and is reported, the second is an insertion and is reported but the third, a deletion, is missing. Could you please tell me where's the problem? I attached both reference and vcf file. Many thanks. Created 2014-02-19 15:41:36 | Updated 2014-02-19 16:26:36 | Tags: combinevariants vcf developer I believe that I may have found an issue with the CombineVariants tool of GATK that manifests itself when there is a repeated ID in a given VCF. For us, the reason to have repeated IDs in a VCF file is to detect inconsistencies in our sample by calling variants on 2 different DNA samples and then checking the concordance. Our current process is: 1) Generate a VCF containing unique IDs (using GATK CallVariants) 2) Replace the VCF header with potentially non-unique IDs (using tabix -r) 3) Merge a single VCF to uniqify the IDs (using GATK CombineVariants) It seems that the genotypes in the merged VCF are off by one column. I've attached 3 files that demonstrate the issue: "combined" which is the result of step 1, "combined.renamed", which is the output of step 2, and "combined.renamed.merged", which is the output of step 3. The relevant lines are as follows: combined: HG00421@123910725 HG00422 HG00422@123910706 HG00423@123910701 NA12801 NA12802 0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127 0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127  combined.renamed: HG00421 HG00422 HG00422 HG00423 NA12801 NA12802 0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127 0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127  combined.renamed.merged: HG00421 HG00422 HG00423 NA12801 NA12802 0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:299 0/0:127 0/0:299 0/0:293 0/0:127  Using the depth argument here, we can see that in the merged dataset, NA12801 has depths 290,293 whereas in the original and renamed datasets the depths were 127,127. The 290,293 depths correspond to HG00423, which is the column before. I have confirmed this behavior in both GATK 2.7-4 and 2.8-1. If there's any more information that you need, please let me know, and I would be happy to provide it. Also, if you might know where this issue arises, I would be happy to try to provide a patch. Thanks, John Wallace Created 2014-02-13 12:42:10 | Updated | Tags: selectvariants vcf Is there a way to include only variant sites and no-calls in your final vcf. I know during SNP calls you can only emit variants, or only confident sites or all. However is there a way to reduce your vcf in the end to only variant sites (vsqr passed) and places where no calls could be made. So the end vcfs have only variant sites and missing data - and everything that is not listed in the vcf file is reference. I need such a file for merging with other vcf files - so that every position that is not in the vcfs while merging can be called ref. So far i have called snps with emit-all and done vsqr - I now want to reduce vcfs in size by excluding NO_VARINATION sites (but want to keep information on "missing" sites) Created 2014-02-11 02:58:14 | Updated | Tags: unifiedgenotyper ploidy vcf I using the UnifiedGenotyper to call SNPs in a pooled sample of 30 diploid individuals (i.e., I am setting the ploidy to 60). Does this mean that if the coverage is < 60 at a given variant site, the vcf file will read "./." for all alleles at that site? In other words, does it require the coverage to be >= the ploidy or it won't produce a called variant at that site? I'm just trying to make sure I am interpreting the vcf file correctly, in that: if there is a called genotype at a given variant site that I can interpret that as the estimated allele frequency in that pool. Thanks in advance for any advice. Best, Jon Created 2014-02-10 19:57:41 | Updated 2014-02-10 20:24:46 | Tags: variantannotator variantstovcf variantfiltration vcf variant-calling minor-allele-frequency Hi, I have annotated my vcf file of 20 samples from Unified genotyper using the following steps. Unified genotyper->Variantrecalibration->Applyrecalibration->VariantAnnotator My question is how should I proceed if I have to select rare variants (MAF<1%) for the candidate genes that I have,for each of these 20 samples? Created 2014-02-06 08:15:40 | Updated | Tags: unifiedgenotyper vcf Hi, I'm trying to understand how do you usually calculate the GQ of a SNP. I understood the model to calculate the Likelihood of all the genotypes (AA,AC, AG,AT,CC,CG,CT,GG,GT,TT). Once all the likelihood have been calculated, correct me if I'm wrong, you normalize the likelihood of the best genotype to 1 and all the other likelihoods according to that scale. So the PL field in the VCF should be the Phred-scaled values of the normalized values. But is not clear to me how do you finally calculate the GQ value. What values do you use to calculate that quality (normalized or phred scaled)? And what's the right formula? I've tried to debug the code but it ends to be really tricky. I really hope that you would help me. I would be really thankfull for that. Created 2014-01-21 13:32:12 | Updated | Tags: vqsr selectvariants vcf I just wanted to select variants from a VCF with 42 samples. After 3 hours I got the following Error. How to fix this? please advise. Thanks I had the same problem when I used "VQSR". How can I fix this problem? INFO 20:28:17,247 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:28:17,250 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 INFO 20:28:17,250 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 20:28:17,251 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 20:28:17,255 HelpFormatter - Program Args: -T SelectVariants -rf BadCigar -R /groups/body/JDM_RNA_Seq-2012/GATK/bundle-2.3/ucsc.hg19/ucsc.hg19.fasta -V /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf -L chr1 -L chr2 -L chr3 -selectType SNP -o /hms/scratch1/mahyar/Danny/data/Filter/extract_SNP_only3chr.vcf INFO 20:28:17,256 HelpFormatter - Date/Time: 2014/01/20 20:28:17 INFO 20:28:17,256 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:28:17,256 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:28:17,305 ArgumentTypeDescriptor - Dynamically determined type of /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf to be VCF INFO 20:28:18,053 GenomeAnalysisEngine - Strictness is SILENT INFO 20:28:18,167 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 20:28:18,188 RMDTrackBuilder - Creating Tribble index in memory for file /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf INFO 23:15:08,278 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.NegativeArraySizeException at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:97) at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:116) at org.broad.tribble.readers.AsciiLineReaderIteratorTupleIterator.advance(AsciiLineReaderIterator.java:84) at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:73) at net.sf.samtools.util.AbstractIterator.next(AbstractIterator.java:57) at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:46) at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:24) at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73) at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35) at org.broad.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40) at org.broad.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:428) at org.broad.tribble.index.IndexFactoryFeatureIterator.next(IndexFactory.java:390) at org.broad.tribble.index.IndexFactory.createIndex(IndexFactory.java:288) at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:278) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:388) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:274) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:211) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:140) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:964) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:758) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:284) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Code exception (see stack trace for error itself) ##### ERROR ------------------------------------------------------------------------------------------ Created 2014-01-17 20:34:26 | Updated | Tags: unifiedgenotyper vcf How I can remove SNPs with Low quality or out of Hardy-Weinberg in my VCF file. In column 7 of my VCF file I have only "LowQual" or ".". There is no "PASS" word in this column. Is there something wrong in my VCF or any problem with UnifiedGenotyper procedure? thanks Created 2013-12-17 17:07:40 | Updated 2013-12-17 17:08:47 | Tags: unifiedgenotyper vcf empty Hey there, I was trying to build an analysis pipeline for paired reads with BWA, Duplicate Removal Local Realignment and Base Quality Score Recalibration to finally use GATK's UnifiedGenotyper for SNP and Indel calling. However, for both SNPs and Indels, I receive no called variants no matter how low my used thresholds are. Quality values of the reads look ok, leaving out dbSNP does not change results. I have used the same reference throughout the whole pipeline. I use GATK 2.7, nevertheless a switch to GATK 1.6 did not change anything. This is my shell command for SNP calling on chromosome X (GATK delivers no results for all chromosomes): java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o test.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf Entries in my BAM file look like this: SRR389458.1885965 113 X 10092397 37 76M = 10092397 1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1885965 177 X 10092397 37 76M = 10092397 -1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U SRR389458.1888837 113 X 14748343 37 76M = 14748343 1 TCGTGAAAGTCGTTTTAATTTTAGCGGTTATGGGATGGGTCACTGCCTCCAAGTGAAAGATGGAACAGCTGTCAAG 889999:9988;98:9::9;9986::::99:8:::::999988989:8;;9::989:999:9;9:;:99:98:999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:76 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 MQ:i:37 XT:A:U And this is the output of the UnifiedGenotyper: INFO 17:57:00,575 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,578 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 INFO 17:57:00,578 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:57:00,578 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 17:57:00,582 HelpFormatter - Program Args: -T UnifiedGenotyper -R /hana/exchange/reference_genomes/hg19/Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o testX.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:00,583 HelpFormatter - Date/Time: 2013/12/17 17:57:00 INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,583 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:57:00,943 ArgumentTypeDescriptor - Dynamically determined type of /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf to be VCF INFO 17:57:01,579 GenomeAnalysisEngine - Strictness is SILENT INFO 17:57:02,228 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 INFO 17:57:02,237 SAMDataSourceSAMReaders - Initializing SAMRecords in serial INFO 17:57:02,364 SAMDataSource\$SAMReaders - Done initializing BAM readers: total time 0.13 INFO 17:57:02,594 RMDTrackBuilder - Loading Tribble index from disk for file /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf INFO 17:57:02,867 IntervalUtils - Processing 155270560 bp from intervals INFO 17:57:02,943 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 17:57:03,166 GenomeAnalysisEngine - Done preparing for traversal INFO 17:57:03,167 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:57:03,167 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 17:57:33,171 ProgressMeter - X:11779845 1.01e+07 30.0 s 2.0 s 7.6% 6.6 m 6.1 m INFO 17:58:03,173 ProgressMeter - X:24739805 1.89e+07 60.0 s 3.0 s 15.9% 6.3 m 5.3 m INFO 17:58:33,175 ProgressMeter - X:37330641 3.25e+07 90.0 s 2.0 s 24.0% 6.2 m 4.7 m INFO 17:59:03,177 ProgressMeter - X:49404077 4.94e+07 120.0 s 2.0 s 31.8% 6.3 m 4.3 m INFO 17:59:33,178 ProgressMeter - X:64377965 5.36e+07 2.5 m 2.0 s 41.5% 6.0 m 3.5 m INFO 18:00:03,180 ProgressMeter - X:75606869 7.54e+07 3.0 m 2.0 s 48.7% 6.2 m 3.2 m INFO 18:00:33,189 ProgressMeter - X:88250233 7.74e+07 3.5 m 2.0 s 56.8% 6.2 m 2.7 m INFO 18:01:03,190 ProgressMeter - X:100393213 9.94e+07 4.0 m 2.0 s 64.7% 6.2 m 2.2 m INFO 18:01:33,192 ProgressMeter - X:110535705 1.09e+08 4.5 m 2.0 s 71.2% 6.3 m 109.0 s INFO 18:02:03,193 ProgressMeter - X:121257489 1.20e+08 5.0 m 2.0 s 78.1% 6.4 m 84.0 s INFO 18:02:33,195 ProgressMeter - X:132533757 1.32e+08 5.5 m 2.0 s 85.4% 6.4 m 56.0 s INFO 18:03:03,197 ProgressMeter - X:144498909 1.41e+08 6.0 m 2.0 s 93.1% 6.4 m 26.0 s INFO 18:03:30,079 ProgressMeter - done 1.55e+08 6.4 m 2.0 s 100.0% 6.4 m 0.0 s INFO 18:03:30,079 ProgressMeter - Total runtime 386.91 secs, 6.45 min, 0.11 hours INFO 18:03:30,080 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 150 total reads (0.00%) INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing BadMateFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter INFO 18:03:32,167 GATKRunReport - Uploaded run statistics report to AWS S3

Do I miss anything here?

Best,

Cindy

Created 2013-08-13 21:38:39 | Updated | Tags: vcf gatk picard

I finally got the filtered VCF file from PWA + PiCard + GATK pipeline, and have 11 exome-seq data files which were processed as a list of input to GATK. In the process of getting VCF, I did not see an option of separating the 11 samples. Now, I've got two VCF files (one for SNPs and the other for indels) that each has 11 samples. My question is how to proceed from here?

Should I separate the 11 files before annotation? or annotation first then split them 11 samples to individual files? Big question here is how to split the samples from vcf files? thanks

Created 2013-03-27 19:31:47 | Updated 2013-03-27 19:38:27 | Tags: vcf

Hi,

quality score (phred score) is defined as below. (i.e. 1% error rate is equal to phred score of 20 (-10xlog 0.01))

QUAL phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong). If ALT is ”.” (no variant) then this is -10log_10 p(variant), and if ALT is not ”.” this is -10log_10p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric)

Using GATK to generate vcf files and looking through the quality column of those files, I found out that the max quality score is 441,453 which is extremely huge number.

I wonder if the quality score of GATK tool follows the phred score system; if not, how do you calculate the quality score and what do the numbers of quality score represent?

Look forward to hearing back from you soon and thank you very much.

Created 2012-12-19 02:20:20 | Updated 2013-01-07 19:19:38 | Tags: pedigree vcf plink

Before there is webpage for how to convert plink ped format to vcf format. But it seems that this link disappeared.

Thank you very much in advance.

Created 2012-12-11 23:07:30 | Updated | Tags: unifiedgenotyper vcf

(There was another question about a similar symptom, but the answer does not appear to apply to what I'm seeing.)

I get an empty VCF file that just contains the header lines. The input VCF file is 1.8Gb, and as far as I can tell the content is OK - it has MAPQ scores, the flags seem reasonable, etc. I've attached a copy of the console output, and the beginning of the input file in SAM format. Let me know if you have any suggestions. Thanks,

Ravi

Created 2012-12-11 02:33:32 | Updated | Tags: vcf gatk-lite

This is not a bug per se in that it does not cause incorrect output, but I think it would be accurately described as an "unintended consequence" of very poorly compressed VCF output files.

GATK allows for output VCF files to be written using Picard's BlockCompressedOutputStream when the the output file is specified with the extension .vcf.gz, which I consider to be very good behavior. However, I noticed after doing some minor external manipulation that the files produced this way are "suboptimally" compressed. By suboptimal, I mean that sometimes the files are even larger than the uncompressed VCF files.

Since the problem occurs in GATK-Lite, I was able to look through the source code to see what is going on. From what I can tell, the issue is that VCFWriter calls mWriter.flush() at the end of VCFWriter.add() for each variant. Per the documentation for BlockCompressedOutputStream.flush():

WARNING: flush() affects the output format, because it causes the current contents of uncompressedBuffer to be compressed and written, even if it isn't full.

As a result, instead of the default of blocks of about 64k, the bgzf-formatted .vcf.gz files produced by GATK have blocks for each line. That reduces the amount repetition for gzip to take advantage of. Not being sure what issues led to requiring a call to flush after every variant, I'm not sure how to best address this, but it may be necessary to wrap BlockCompressedOutputStream when used by VCFWriter to catch this flush in order to get effective compression.

Of course, it is possible to simply write the file and then compress it in a separate step, but this leads to disk IO that should be preventable.

Created 2012-10-12 09:44:57 | Updated 2012-10-18 00:59:49 | Tags: combinevariants vcf merge

Dear team, I am new to GATK and I am having a hard time simply trying to merge vcf files. I have tried to solve the problem by referring to the guide and to previous posts, but nothing woked. Actually I found several discussions about the very same error message I receive, but it seems that no clear answere was provided. Here is this message:

##### ERROR ------------------------------------------------------------------------------------------

I have tried three different MS Dos commands to do the task (see belbow), but the message didn't change:

1. java -jar GenomeAnalysisTK.jar -T CombineVariants -R E:\RessourcesGATK\ucsc.hg19.fasta -V:sample1 E:\TestGATK\sample1.vcf -V:sample2 E:\TestGATK\sample2.vcf -o combined.vcf

2. java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions UNIQUIFY

3.java -jar GenomeAnalysisTK.jar -R E:\RessourcesGATK\ucsc.hg19.fasta  -T CombineVariants  --variant E:\TestGATK\sample1.vcf  --variant E:\TestGATK\sample2.vcf  -o output.vcf  -genotypeMergeOptions PRIORITIZE  -priority foo,bar

I have also tried to use the reference human_g1k_v37.fasta as a resource, but it was the same. I have suppressed the # before CHROM in the header line, tested vcf generated by Samtools or by GATK, but it did not change anything. Is this a problem of environment? I haven't read anything mentioning that GATK could not work with MS Dos.

Thank you very much for your help. S.

Created 2012-08-10 18:19:46 | Updated 2013-01-07 19:45:12 | Tags: unifiedgenotyper vcf indels

Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.

Here are deletions:

d02 264482 . TT T 389.93 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.224;DP=25;FS=0.000;HaplotypeScore=58.0966;MLEAC=2;MLEAF=1.00;MQ=47.51;MQ0=0;MQRankSum=1.714;QD=15.60;ReadPosRankSum=0.075;SB=-1.112e+02 GT:AD:DP:GQ:PL 1/1:15,9:20:9:431,9,0

Ld04 26597 . CCC C 2923.96 . AC=2;AF=1.00;AN=2;BaseQRankSum=-2.217;DP=98;FS=2.954;HaplotypeScore=145.2957;MLEAC=2;MLEAF=1.00;MQ=59.82;MQ0=0;MQRankSum=0.283;QD=29.84;ReadPosRankSum=1.150;SB=-1.306e+03 GT:AD:DP:GQ:PL 1/1:58,31:98:99:2966,195,0

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Ld04 26880 . T TTATT 4973 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.552;DP=99;FS=2.935;HaplotypeScore=116.0082;MLEAC=2;MLEAF=1.00;MQ=59.46;MQ0=0;MQRankSum=0.192;QD=50.23;ReadPosRankSum=1.834;SB=-1.950e+03 GT:AD:DP:GQ:PL 1/1:49,38:98:99:4973,247,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0

Ld04 19963 . A G 3571.01 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.895;DP=96;Dels=0.00;FS=0.000;HaplotypeScore=2.4678;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=1.769;QD=37.20;ReadPosRankSum=-0.316;SB=-1.523e+03 GT:AD:DP:GQ:PL 1/1:3,93:96:99:3571,205,0

I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools.

PS I used this command line of V 2.0-38:

java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf

Thank you.

Hideo