SelectVariants is a GATK tool used to subset a VCF file by many arbitrary criteria listed in the command line options below. The output VCF wiil have the AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) annotations updated as necessary to accurately reflect the file's new contents.
Select Variants operates on VCF files (ROD Tracks) provided in the command line using the GATK's built in
--variant option. You can provide multiple tracks for Select Variants but at least one must be named 'variant' and this will be the file all your analysis will be based of. Other tracks can be named as you please. Options requiring a reference to a ROD track name will use the track name provided in the -B option to refer to the correct VCF file (e.g. --discordance / --concordance ). All other analysis will be done in the 'variant' track.
Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, i.e. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented here in the FAQ article on JEXL expressions; it is particularly important to note the section on working with complex expressions.
For a complete, detailed argument reference, refer to the GATK document page here.
Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.
BOB MARY LINDA 1/0:20 0/0:30 1/1:50
In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).
Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:
BOB MARY 1/0:20 0/0:30
The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.
Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like
BOB MARY 1/0:20 ./.:.
with AN=2, AC=1, AF=0.5, and DP=20.
SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:
1 82154 rs4477212 A G . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205 1 534247 SNP1-524110 C T . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491 1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471 1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().
isVariant => is there an ALT allele?
isPolymorphic => is some sample non-ref in the samples?
In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this.
For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:
1 82154 rs4477212 A . . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205 1 534247 SNP1-524110 C . . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491 1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471 1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.
Some VCFs may have repeated header entries with the same key name, for instance:
##fileformat=VCFv3.3 ##FILTER=ABFilter,"AB > 0.75" ##FILTER=HRunFilter,"HRun > 3.0" ##FILTER=QDFilter,"QD < 5.0" ##UG_bam_file_used=file1.bam ##UG_bam_file_used=file2.bam ##UG_bam_file_used=file3.bam ##UG_bam_file_used=file4.bam ##UG_bam_file_used=file5.bam ##source=UnifiedGenotyper ##source=VariantFiltration ##source=AnnotateVCFwithMAF ...
Here, the "UG_bam_file_used" and "source" header lines appear multiple times. When SelectVariants is run on such a file, the program will emit warnings that these repeated header lines are being discarded, resulting in only the first instance of such a line being written to the resulting VCF. This behavior is not ideal, but expected under the current architecture.
For information on how to construct regular expressions for use with this tool, see the "Summary of regular-expression constructs" section here.
JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.
In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.
JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:
"QUAL > 30.0"
QUALis a key: the name of the annotation we want to look at
30.0is a value: the threshold that we want to use to evaluate variant quality against
>is an operator: it determines which "side" of the threshold we want to select
The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:
"MY_STRING_KEY == 'foo'"
You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:
"QUAL / DP < 10.0"
You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):
"QUAL > 30.0 && DP == 10"
&& is the logical "AND".
Or if you want to select variants that have at least one of several conditions fulfilled:
"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
|| is the logical "OR".
It is very important to note that the JEXL evaluation subprogram cannot correctly handle cases where the annotations requested by the JEXL expression are missing for some variants in a VCF record. It will throw an exception (i.e. fail with an error) when it encounters this scenario. The default behavior of the GATK is to handle this by having the entire expression evaluate to FALSE in such cases (although some tools provide options to change this behavior). This is extremely important especially when constructing complex expressions, because it affects how you should interpret the result.
For example, looking again at that last expression:
"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
When run against a VCF record with INFO field
it will evaluate to
TRUE because the
FS value is greater than 200.0.
But when run against a VCF record with INFO field
it will evaluate to
FALSE because there is no
ReadPosRankSum value defined at all and JEXL fails to evaluate it.
This means that when you're trying to filter out records with VariantFiltration, for example, the previous record would be marked as PASSing, even though it contains a bad
For this reason, we highly recommend that complex expressions involving OR operations be split up into separate expressions whenever possible. For example, the previous example would have 3 distinct expressions:
"QD < 2.0",
"ReadPosRankSum < -20.0", and
"FS > 200.0". This way, although the
ReadPosRankSum expression evaluates to FALSE when the annotation is missing, the record can still get filtered (again using the example of VariantFiltration) when the
FS value is greater than 200.0.
Currently, VCF INFO field keys are case-sensitive. That means that if you have a
QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (
qual or whatever) in your JEXL expression.
The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).
Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.
If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the
For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'
Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:
! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263
The classic way of evaluating a boolean goes like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'
But you can also use the VariantContext object like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'
Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
I have been trying get variants out of a VCF file where the Allele Frequency (AF) is greater than 4%. I have tried both VariantFiltration and SelectVariants but I get different errors with each. Here is my call for SelectVariants:
java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T SelectVariants -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf -select "AF > 0.04" -sn "S05-16209-1C_S4_L001_R1_001"
The error is:
MESSAGE: Invalid command line: Invalid JEXL expression detected for select-0 with message ![0,9]: 'AF > 0.04;' > error
For VariantFiltration the call is:
java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T VariantFiltration -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf --filterExpression 'AF > 0.040' --filterName "3perc"
The error is:
java.lang.ArithmeticException: Double coercion: java.util.ArrayList:([0.010, 0.010]) at org.apache.commons.jexl2.JexlArithmetic.toDouble(JexlArithmetic.java:1023) at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:699) at org.apache.commons.jexl2.JexlArithmetic.greaterThan(JexlArithmetic.java:790) at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:796) at org.apache.commons.jexl2.parser.ASTGTNode.jjtAccept(ASTGTNode.java:18) at org.apache.commons.jexl2.Interpreter.interpret(Interpreter.java:232) at org.apache.commons.jexl2.ExpressionImpl.evaluate(ExpressionImpl.java:65) at org.broadinstitute.sting.utils.variantcontext.JEXLMap.evaluateExpression(VariantJEXLContext.java:267) at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:233) at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:118) at org.broadinstitute.sting.utils.variantcontext.VariantContextUtils.match(VariantContextUtils.java:293) at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.filter(VariantFiltration.java:331) at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:270) at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:80) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
For both I have tried variations of double quotes and different sigfigs. Also, it works when I select on parameters other than AF.
Am I missing something?
Hello I would like to subset a VCF file to only save a few specific regions of the whole genome. I know some of your tools allow for an interval list to be used to subset the region analyzed. Do you have a tool or are you aware of a tool that would allow me to quickly do this from an interval list or something similar? I could make a little script myself, but I figure sub setting and printing out a specific genomic region of interest in a VCF file has to be a solved problem by GATK.
Thanks for your help! ~Sean
INFO 20:55:27,108 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-10-gdbc86ec, Compiled 2012/09/27 05:17:49 INFO 20:55:27,109 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 20:55:27,109 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 20:55:27,109 HelpFormatter - Program Args: -R ../generated_consensus/ASW_woSNPsindelsdelsdupsinvsEUR.fa -T SelectVariants --variant:VCF /media/Data/raw_seq/ASW_allsamples_ASW_coords.vcf -o /media/Data/raw_seq/ASW_NA19914_ASW_coords.vcf -sn NA19914 INFO 20:55:27,110 HelpFormatter - Date/Time: 2012/09/27 20:55:27 INFO 20:55:27,110 HelpFormatter - --------------------------------------------------------------------------------- INFO 20:55:27,110 HelpFormatter - --------------------------------------------------------------------------------- INFO 20:55:27,132 GenomeAnalysisEngine - Strictness is SILENT INFO 20:55:27,208 RMDTrackBuilder - Creating Tribble index in memory for file /media/Data/raw_seq/ASW_allsamples_ASW_coords.vcf INFO 20:55:37,632 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.IllegalArgumentException: Start must be greater than 0! at org.broad.tribble.index.interval.MutableInterval.setStart(IntervalIndexCreator.java:134) at org.broad.tribble.index.interval.IntervalIndexCreator.addFeature(IntervalIndexCreator.java:72) at org.broad.tribble.index.DynamicIndexCreator.addFeature(DynamicIndexCreator.java:155) at org.broad.tribble.index.IndexFactory.createIndex(IndexFactory.java:245) at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:237) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:362) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:259) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:203) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:132) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.<init>(ReferenceOrderedDataSource.java:205) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.<init>(ReferenceOrderedDataSource.java:85) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:852) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:713) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:244) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.1-10-gdbc86ec): ##### ERROR ##### ERROR Please visit the wiki to see if this is a known problem ##### ERROR If not, please post the error, with stack trace, to the GATK forum ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Start must be greater than 0! ##### ERROR ------------------------------------------------------------------------------------------
I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.
While the raw VCF has plenty of QUAL scores that are very high, not a single call has a PASS associated with it in the Filter field- all are "." If I use SelectVaraints to filter the VCF based on high QUAL or DP values, or combination, the Filter field remains "." for the returned variants.
Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?
Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.
My current workflow for analysing mouse exome-sequencing (based on v4 of Best Practices) can require me to use slightly different VCFs as
--known parameters in BQSR, indel realignment etc. Basically, I have a "master" VCF that I subset using
SelectVariants. The choice of subset largely depends on the strain of the mice being sequenced but also on other things such as
AF'. It'd be great to be able to do this on-the-fly in conjunction with--known' in tools that required knownSites rather than having to create project-specific (or even tool-specific) VCFs.
Is there a way to do this that I've overlooked? Is this a feature that might be added to GATK?
I'm looking to find all the entries that change between two calls to UG on the same data. I would like to find all the entries where the call in the variant tract are different from those in the comparison track. So in effect I want those entries that would not be result from -using -conc in SelectVariants. From the documentation is is unclear if the -disc option does this:
A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track.
What if the comp is HOM_VAR and the variant track is HET? Or if they are both HET but disagree on the specific allele?
I'm trying to use JEXL to filter variants but something isn't working and I can't figure it out. I'm hoping someone can point me in the right direction. My VCF file contains an INFO field 1000g2012Apr_ALL. Some of the variants in my VCF have an entry for this field, some don't. I want to filter my VCF file for entries that are below a certain value or are NULL (empty).
Here's what my command looks like:
java -Xmx4G -jar GenomeAnalysisTK.jar -T SelectVariants -R hg19.fa -V my.vcf -o my.1kgfiltered.vcf -select 'vc.getAttribute("1000g2012Apr_ALL") < 0.01' -select '!vc.hasAttribute("1000g2012Apr_ALL")'
The problem is the 2nd select statement. I can't seem to get a JEXL select statement to give me the entries where 1000g2012Apr_ALL are empty. How do I accomplish this?
Just thought I should report a possible small bug with SelectVariant --maxIndelSize as I think it's only filtering insertions greater than the given value and not deletions. Of course using JEXL expressions I can still get the variants I'm after...
I am using the following command to generate SNPs and Indels from matching tumor normal pair bam files (GenomeAnalysisTK-2.4-9-g532efad)
java -jar GenomeAnalysisTK.jar -glm BOTH -R ucsc.hg19.fasta -T UnifiedGenotyper -I tumor.bam -I normal.bam -D dbsnp_135.hg19.vcf -o raw.snps.indels.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0.
I would like to know the specific command (in SelectVariants) to separate SNPs unique to tumor but not to normal sample
Can you use SelectVariants with a combined vcf to produce a new vcf containing only variants present in a particular sample eg. you can select out de novo mutations from a combined family vcf?
I've a couple of essentially documentation-centric questions...
Firstly, the SelectVariants documentation describes selecting 1000 random variants from a vcf using '-number 1000', however when I try that (with the command "
java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta --variant variants.vcf -o 1000.vcf -number 1000") it produces the error
'Argument with the name number isn't defined'. Trying --number instead doesn't make any difference, while the --help output does not list this argument (GATK 2.2.2). It this option no longer available?
Secondly, the 'gatkforums.broadinstitute.org/discussion/48/using-varianteval#Understanding_Genotype_Concordance_values_from_Variant_Eval' section of the 'Using VariantEval' page has a series of images explaining the concordance values, however the images are missing. Please could these be restored?
Many thanks, James
Hi, I wanted to double check my methods for some targeted capture data. I ran 96 samples through UG to produce a multisample VCF. I separated snps and indels into separate files using SelectVariants, and applied filters:
For snps "QD < 2.0", "MQ < 40.0", "FS > 60.0", "HaplotypeScore > 13.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0"
For indels "QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0"
I then went back through with SelectVariants, pulling out each sample one at a time into their own filtered VCF.
My results are... lets say, wrong. I am wondering if it would be better practice to select each sample first and then apply the filters, or if it does not matter and my errors lie elsewhere. Thank you.
Greetings GATK team!
I hope I'm not making a duplicate question here, but I couldn't find anything regarding this in the forum.
Basically, what I want to do is to use SelectVariants to filter against another call set, but I do not want to be as strict as using -discordance (i.e. 100% discordance rate between the two call sets). I want to say for example: "filter call set A against variants that occur in >90% of call set B".
Is there a way to do this with JEXL expressions maybe?
My indel calling VCF has the following information:
##INFO=<ID=N_MQ,Number=2,Type=Float,Description="In NORMAL: average mapping quality of consensus indel-supporting reads/reference-supporting reads">
so one example of my indels is:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT exom_aladaz chr1 54690639 . G GCCC . . N_AC=0,0;N_DP=19;N_MM=0.0,0.36842105;N_MQ=0.0,33.894737;N_NQSBQ=0.0,31.80368;N_NQSMM=0.0,0.0;N_SC=0,0,19,0;SOMATIC;T_AC=6,6;T_DP=11;T_MM=0.5,0.2;T_MQ=44.5,29.0;T_NQSBQ=30.85,33.36;T_NQSMM=0.016666668,0.0;T_SC=6,0,5,0 GT:GQ 0/1:0
If I want to filter indel calls with N_MQ <30 for both consensus indel-supporting reads/reference-supporting reads, how should I write the SelectVariants for N_MQ=0.0,33.894737?
I wouldn't classify this as a bug, but thought I would just put this up here in anyways.
I'm creating a vcf file from exome targeted sequencing, using unified genotyper to just call SNVs on one sample, then using SelectVariants instead of VariantFiltration using hard-cutoffs. This is just a intermediate QC file (something that we can use to do a rough TiTv estimate, concordance to GWAS arrays, etc) in order to make the determination as to whether or not this sample was good enough as is to go into the pool of samples to run unified genotyper/haplotyper caller on together and then VQSR.
So when using the expression;
-select 'MQRankSum > -12.5'
any non-reference homozygote that has no reads containing an alternate allele logically won't have this annotation calculated and since this is missing then this record is removed. Nothing major. I plan on redoing it using the VariantFiltration walker instead and the just use the SelectVariants to pull out PASS records, but I though I would put this up in any case. I was trying to think of a way to use a more complex expression to say if GT was 0/1 then MQRankSum, but couldn't wrap around my head for the case were GT was 1/1 and MQRankSum was present and greater than -12.5.
I was using GenomeAnalysisTK-2.2-4-g4a174fb at the time and my command line is below.
java -jar $GATK/GenomeAnalysisTK.jar \ -T SelectVariants \ -R $REF_GENOME \ --variant $CORE_PATH/$OUTPATH/temp/$SM_TAG".QC.raw.OnBait.vcf" \ -select 'QD > 2.0' \ -select 'MQ > 30.0' \ -select 'FS < 40.0' \ -select 'HaplotypeScore < 13.0' \ -select 'MQRankSum > -12.5' \ -select 'ReadPosRankSum > -8.0' \ -select 'DP > 8.0' \ -o $CORE_PATH/$OUTPATH/SNV/QC/Aggregate/filtered_on_bait/$SM_TAG".QC.OnBait.vcf"
I used the UnifiedGenotyper (GATK 1.6) on a multi-sample set to call variants, and for some of the positions I get multiple mutated alleles. The genotype entries in the combined VCF file look like (GT:AD:DP:GQ:PL):
so it's three AD values per entry. Running SelectVariants yields the following line for the second example from above:
Although it changed the genotype from 0/2 to 0/1, it did not update the AD field. I checked the forums, but I could not really find anything discussing specifically the update of AD, except for the GATK 2.2 release notes where it says SelectVariants: "Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file."
I was wondering whether you could confirm if cases like the one above would benefit from the bugfix, or if the bug description applies to something else.
Thanks a lot for all your hard work, Markus
I have completed filtering my SNP data using VariantFiltration, and now I want to use SelectVariants to output all calls marked "PASS" in the FILTER field. I used the following script, but only the header information writes to the output file.
java -Xmx20g -jar GenomeAnalysisTK.jar -T SelectVariants -R HC.fa --variant HC.SNPs.filtered.vcf -select "FILTER == 'PASS'" -o HC.SNPs.passed.vcf
My input file contains many records that should evaluate as true. Any idea why this doesn't this work?
I am trying to filter variant calls which have "GQ>=20.0".
GATK SelectVariants, gives no error but gives only the header in the output file
java -Xmx2g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -R xxx -T SelectVariants --variant xxx.var.flt.vcf -o xxx.vcf -select "GQ >= 20.0"
So, I tried using VariantFiltration followed by SelectVariants. The variant filtration seems to work fine adding FT tag to the format field. And then I am trying to get records having FT tag using the following commands
java -Xmx2g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -R xxx -T VariantFiltration --variant xxx.var.flt.vcf -o xxx_filtered.vcf --genotypeFilterExpression "GQ >= 20.0" --genotypeFilterName "qual_1_filters" java -Xmx4g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -T SelectVariants -R xxx --variant xxx_filtered.vcf -select 'vc.hasAttribute("FT")' -o xxx_qual20.vcf
but I only get header in the output vcf file.
I am not sure if this is the right approach. Any help would be appreciated.
I ran in to the situation now a couple of times that I need to extract a set of private SNPs from a multisample VCF file. For example in a forward genetics knockout screen of a large set of samples.
It is possible with vcf-contrast from vcf-tools:
vcf-contrast +sample1 -sample2 -sample3 -n input.vcf > private sample1.vcf
vcf-contrast -sample1 +sample2 -sample3 -n input.vcf > private sample2.vcf
vcf-contrast -sample1 -sample2 +sample3 -n input.vcf > private sample3.vcf
After this I still would have to filter out the private 0/0 calls and doing this for a large multisample VCF means entering this command for all the combinations which is not really nice.
Surely this must be possible with GATK. Does anyone know how to do this with GATK.
Maybe it is somewhere in the SelectVariants? The --discordance option looked promissing but there is something about that the samples should be the same? Or is it possible to write another variant walker or a JEXL expression?
P.S. By accident I also posted this question in the XHMM, an admin could remove it there.