Selecting Variants by searching the FORMAT field array
Posted in Ask the team

Hello all,

First post. Thank you for these amazing tools. I have spent two days pulling my hair out, trying all enumerations, searching the documentation and forums, and in the end I come to you for help. Please forgive me if these topics have been covered elsewhere.

I have several VCFs generated by SomaticSniper that I'd like to filter based on the SomaticScore (SSC in the FORMAT field). I have been working with VariantFiltration and SelectVariants, trying different options as well as regular expressions, to select those calls with an SSC over 40. I have been unable to do so. I also tried to figure out JEXL, following the last example on the documentation page, which uses the VariantContext object to drill into an array. I cannot get it to recognize the SSC entry of the FORMAT field and then filter on the TUMOR sample.

With VariantFiltration I first used -select (I understand now that this searches the INFO field only). I then used --genotypeFilterExpression, but it did not add the FT tag to the FORMAT field as the documentation said it would; it just applied PASS to everything.

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantFiltration -R ~/Documents/reference/human_g1k_v37.fasta -V '/home/registry3/Desktop/merged/104024sniperRAWSNPS.vcf' --G_filter "SSC < 40.0" --G_filterName "myFilter" -o '/home/registry3/Desktop/merged/104024sniperFILTEREDSNPS.vcf'

With SelectVariants I used -sn to select the TUMOR sample and then -select_expressions, but I guess this also only works on the INFO field. I then tried --sample_expression, which accepts a regular expression, but I never got good results; it did no filtering and output the entire input file. Does the regular expression only apply to the sample name, and not to the content of each line? I am trying to select SSC over 40 from a line like this:

1   10177   0   A   C   0   0   AC=1;AF=0.500;AN=2;DP=62    GT:AMQ:BCOUNT:BQ:DP:DP4:GQ:IGT:MQ:SS:SSC:VAQ    0/1:16,15:40,22,0,0:28,25:62:31,9,10,12:37:0/1:16:2:19:37

I used a command such as this to look at the second-to-last number in the FORMAT data, based on the : dividers:

java -Xmx4g -jar GenomeAnalysisTK.jar -R ~/Documents/reference/human_g1k_v37.fasta -T SelectVariants --variant '/home/registry3/Desktop/merged/104024sniperSNPs.vcf' -o '/home/registry3/Desktop/merged/104024sniperSSC40.vcf' -se ".*:[4,5,6,7,8,9][1,2,3,4,5,6,7,8,9]:[1234567890]{2,3}$"

I am not a coder, as you can probably see, but I'm trying to get this worked out. The command still output the entire file, with SSC values both above and below 40.
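To look at the pattern on its own, apart from GATK, here is a minimal Python sketch that runs the same regular expression against bare sample columns. It assumes Python's `re` module treats these constructs the same way as the Java regex engine, and every test string other than the one from the example line is made up for illustration. Two things to watch for: commas inside `[...]` are literal characters, and the second digit class leaves out 0, so the pattern as written would skip SSC values such as 40 or 50 as well as any three-digit score.

```python
import re

# The expression passed to -se, unchanged.
PATTERN = re.compile(r".*:[4,5,6,7,8,9][1,2,3,4,5,6,7,8,9]:[1234567890]{2,3}$")


def matches(sample_column: str) -> bool:
    """Return True if the pattern matches the given sample column string."""
    return PATTERN.match(sample_column) is not None


# Sample column from the example line above (SSC = 19, VAQ = 37).
real = "0/1:16,15:40,22,0,0:28,25:62:31,9,10,12:37:0/1:16:2:19:37"

# Hypothetical columns that differ only in the SSC value.
ssc_45 = real.replace(":19:37", ":45:37")
ssc_40 = real.replace(":19:37", ":40:37")
ssc_120 = real.replace(":19:37", ":120:37")

for label, column in [("19", real), ("45", ssc_45), ("40", ssc_40), ("120", ssc_120)]:
    print("SSC", label, "matched:", matches(column))
```

Running this only tests the expression against strings; it says nothing about which part of the VCF the -se option actually compares the expression to.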

I also looked into using the vc.getGenotype array access, but I could not find much documentation about VariantContext. I went through the files on GitHub, reading the code and looking for samples, since the .getAD() call from the documentation seems to work, but alas, there is no .getSSC() available. Is using vc. the best way to drill into an array (the FORMAT field) and search for what I want?
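To make the goal concrete, this is the selection I am after, sketched in plain Python rather than GATK. It is only an illustration under a few assumptions: the function names `format_value` and `passes_ssc` are made up, the sample of interest is taken by column position (`sample_index`, hypothetical) rather than by the name TUMOR, and the line is split on any whitespace because the example above is space-separated (a real VCF is tab-separated).

```python
def format_value(vcf_line: str, key: str, sample_index: int = 0) -> str:
    """Return the value stored under FORMAT key `key` for one sample.

    sample_index counts sample columns from 0, i.e. column 10 of the VCF.
    """
    fields = vcf_line.split()                  # whitespace split; real VCFs use tabs
    format_keys = fields[8].split(":")         # e.g. ["GT", "AMQ", ..., "SSC", "VAQ"]
    sample_values = fields[9 + sample_index].split(":")
    return sample_values[format_keys.index(key)]


def passes_ssc(vcf_line: str, threshold: int = 40, sample_index: int = 0) -> bool:
    """Return True if the sample's SSC is strictly greater than threshold."""
    return int(format_value(vcf_line, "SSC", sample_index)) > threshold


# The example line from above.
line = (
    "1   10177   0   A   C   0   0   AC=1;AF=0.500;AN=2;DP=62    "
    "GT:AMQ:BCOUNT:BQ:DP:DP4:GQ:IGT:MQ:SS:SSC:VAQ    "
    "0/1:16,15:40,22,0,0:28,25:62:31,9,10,12:37:0/1:16:2:19:37"
)

print(format_value(line, "SSC"))   # value found under the SSC key
print(passes_ssc(line))            # whether that value is over 40
```

Looking the value up by its FORMAT key, rather than by counting positions from the end, keeps the check correct even if the order of the FORMAT entries changes between files.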

I didn't post all the code and output, trying to keep this as short as possible. I can post pastebin outputs if that would be helpful. Thank you, David