Tagged with #vcf-format
0 documentation articles | 0 announcements | 16 forum discussions

No articles to display.

No articles to display.

Created 2016-04-22 22:22:46 | Updated | Tags: vcf dbsnp vcf-format

Comments (1)

Hi. So I want to take a dbSNP file and highlight SNPs that are found in some BAM files. Instead of filtering these I want to search for these SNPs only. I will then use these SNP positions as markers to trace recombination events that have occured between some strains I am studying. So far I havn't found a good strategy to accomplish this. I was hoping someone over here knows how to parse the dbSNP, and could guide me through it. Alternatively you could use this idea to generate a new tool! Please let me know if you have thoughts, or questions. Thanks! -Keller

Created 2016-04-22 10:39:43 | Updated 2016-04-22 10:45:25 | Tags: variantfiltration vcf-format

Comments (2)

Hi there,

I want to report this in GATK's VCF output format after VariantFiltration task.

I was trying to use vcftools to read the output VCF file from that program but an error came out when parsing the header. Specifically I found this is related with the string related to filters used. The following are the filters as reported in the VCF:

##FILTER=<ID=HARD_TO_VALIDATE,Description=""QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0""> ##FILTER= 30.0 && QUAL < 100.0"">

The double quotes are repeated, as you can see in all the three filters and this is causing Vcf.pm perl library to fail with:

_Could not parse header line: FILTER=<ID=HARD_TOVALIDATE,Description=""QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0""> Stopped at [QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0""].

When I go to remove the repeated double-quotes than I solved this. Infact with ##FILTER= 30.0 && QUAL < 100.0"> the error is not there.

Moreover, one more error is there because of VariantFiltration. Few lines later the Command line printed has all the parameters given to VariantFiltration and this filter again:

##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.4-46-gbc02625,Date="Sat Apr 16 12:01:58 CEST 2016",Epoch=1460800918456,CommandLineOptions=.. .. .. ..... filterExpression=["QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0", "QUAL < 30.0", "QUAL > 30.0 && QUAL < 100.0"] filterName=[HARD_TO_VALIDATE, VeryLowQual, LowQual] genotypeFilterExpression=[] genotypeFilterName=[] . .. .... > And Vcf.pm stops with:

Could not parse header line: GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.4-46-gbc02625 .....

I solved in a similar way removing the quotes from the filter elements:

filterExpression=[QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0, QUAL < 30.0, QUAL > 30.0 && QUAL < 100.0]

I do not know if it is a problem in the Vcf.pm library or the VCF format is not respected in VariantFiltration.

Hope this will be useful


Francesco Musacchia

Created 2016-04-14 00:23:07 | Updated 2016-04-14 00:23:46 | Tags: vcf-format mutect2

Comments (2)

I'm using MuTect2 to call variants. The output vcf is in 4.2 format. I was wondering if there's a way to configure MuTect2 to use vcf 4.1 format instead?

Created 2016-04-07 16:31:11 | Updated | Tags: haplotypecaller dp vcf-format genotypegvcfs

Comments (3)


I've run into the (already reported http://gatkforums.broadinstitute.org/dsde/discussion/5598/missing-depth-dp-after-haplotypecaller ) bug of the missing DP format field in my callings.

I've run the following (relevant) commands:

Haplotype Caller -> Generate GVCF:

    java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
       -T HaplotypeCaller \
       -R ${ref} \
       -I ${NEWTMPDIR}/${prefix}.realigned.fixed.recal.bam \
       -L ${reg} \
       -ERC GVCF \
       -nct ${nct} \
       --genotyping_mode DISCOVERY \
       -stand_emit_conf 10 \
       -stand_call_conf 30  \
       -o ${prefix}.raw_variants.annotated.g.vcf \
       -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage

That generates GVCF files that DO HAVE the DP field for all reference positions, but DO NOT HAVE the DP format field for any called variant (but still keep the DP in the INFO field):

18      11255   .       T       <NON_REF>       .       .       END=11256       GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
18      11257   .       C       G,<NON_REF>     229.77  .       BaseQRankSum=1.999;DP=20;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-1.377;ReadPosRankSum=0.489      GT:AD:GQ:PL:SB  0/1:10,8,0:99:258,0,308,288
18      11258   .       G       <NON_REF>       .       .       END=11260       GT:DP:GQ:MIN_DP:PL      0/0:17:48:16:0,48,530

Later, I ran Genotype GVCF joining all the samples with the following command:

java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R ${ref} \
   -L ${pos} \
   -o ${prefix}.raw_variants.annotated.vcf \
   --variant ${variant} [...]

This generated vcf files where the DP field is present in the format description, it IS present in the Homozygous REF samples, but IS MISSING in any Heterozygous or HomoALT samples.

22  17280388    .   T   C   18459.8 PASS    AC=34;AF=0.340;AN=100;BaseQRankSum=-2.179e+00;DP=1593;FS=2.526;InbreedingCoeff=0.0196;MLEAC=34;MLEAF=0.340;MQ=60.00;MQRankSum=0.196;QD=19.76;ReadPosRankSum=-9.400e-02;SOR=0.523    GT:AD:DP:GQ:PL  0/0:29,0:29:81:0,81,1118    0/1:20,22:.:99:688,0,682    1/1:0,27:.:81:1018,81,0 0/0:22,0:22:60:0,60,869 0/1:20,10:.:99:286,0,664    0/1:11,17:.:99:532,0,330    0/1:14,14:.:99:431,0,458    0/0:28,0:28:81:0,81,1092    0/0:35,0:35:99:0,99,1326    0/1:14,20:.:99:631,0,453    0/1:13,16:.:99:511,0,423    0/1:38,29:.:99:845,0,1231   0/1:20,10:.:99:282,0,671    0/0:22,0:22:63:0,63,837 0/1:8,15:.:99:497,0,248 0/0:32,0:32:90:0,90,1350    0/1:12,12:.:99:378,0,391    0/1:14,26:.:99:865,0,433    0/0:37,0:37:99:0,105,1406   0/0:44,0:44:99:0,120,1800   0/0:24,0:24:72:0,72,877 0/0:30,0:30:84:0,84,1250    0/0:31,0:31:90:0,90,1350    0/1:15,25:.:99:827,0,462    0/0:35,0:35:99:0,99,1445    0/0:29,0:29:72:0,72,1089    1/1:0,32:.:96:1164,96,0 0/0:21,0:21:63:0,63,809 0/1:21,15:.:99:450,0,718    1/1:0,40:.:99:1539,120,0    0/0:20,0:20:60:0,60,765 0/1:11,9:.:99:293,0,381 1/1:0,35:.:99:1306,105,0    0/1:18,14:.:99:428,0,606    0/0:32,0:32:90:0,90,1158    0/1:24,22:.:99:652,0,816    0/0:20,0:20:60:0,60,740 1/1:0,30:.:90:1120,90,0 0/1:15,13:.:99:415,0,501    0/0:31,0:31:90:0,90,1350    0/1:15,18:.:99:570,0,480    0/1:22,13:.:99:384,0,742    0/1:19,11:.:99:318,0,632    0/0:28,0:28:75:0,75,1125    0/0:20,0:20:60:0,60,785 1/1:0,27:.:81:1030,81,0 0/0:30,0:30:90:0,90,1108    0/1:16,16:.:99:479,0,493    0/1:14,22:.:99:745,0,439    0/0:31,0:31:90:0,90,1252
22  17280822    .   G   A   5491.56 PASS    AC=8;AF=0.080;AN=100;BaseQRankSum=1.21;DP=1651;FS=0.000;InbreedingCoeff=-0.0870;MLEAC=8;MLEAF=0.080;MQ=60.00;MQRankSum=0.453;QD=17.89;ReadPosRankSum=-1.380e-01;SOR=0.695   GT:AD:DP:GQ:PL  0/0:27,0:27:72:0,72,1080    0/0:34,0:34:90:0,90,1350    0/1:15,16:.:99:528,0,491    0/0:27,0:27:60:0,60,900 0/1:15,22:.:99:699,0,453    0/0:32,0:32:90:0,90,1350    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:87:0,87,1305    0/0:40,0:40:99:0,108,1620   0/1:20,9:.:99:258,0,652 0/0:26,0:26:72:0,72,954 0/1:16,29:.:99:943,0,476    0/0:27,0:27:69:0,69,1035    0/0:19,0:19:48:0,48,720 0/0:32,0:32:81:0,81,1215    0/0:36,0:36:99:0,99,1435    0/0:34,0:34:99:0,99,1299    0/0:35,0:35:99:0,102,1339   0/0:38,0:38:99:0,102,1520   0/0:36,0:36:99:0,99,1476    0/0:31,0:31:81:0,81,1215    0/0:31,0:31:75:0,75,1125    0/0:35,0:35:99:0,99,1485    0/0:37,0:37:99:0,99,1485    0/0:35,0:35:90:0,90,1350    0/0:20,0:20:28:0,28,708 0/1:16,22:.:99:733,0,474    0/0:32,0:32:90:0,90,1350    0/0:35,0:35:99:0,99,1467    0/1:27,36:.:99:1169,0,831   0/0:28,0:28:75:0,75,1125    0/0:36,0:36:81:0,81,1215    0/0:35,0:35:90:0,90,1350    0/0:28,0:28:72:0,72,1080    0/0:31,0:31:81:0,81,1215    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:84:0,84,1260    0/0:39,0:39:99:0,101,1575   0/0:37,0:37:96:0,96,1440    0/0:34,0:34:99:0,99,1269    0/0:30,0:30:81:0,81,1215    0/0:36,0:36:99:0,99,1485    0/1:17,17:.:99:567,0,530    0/0:26,0:26:72:0,72,1008    0/0:18,0:18:45:0,45,675 0/0:33,0:33:84:0,84,1260    0/0:25,0:25:61:0,61,877 0/1:9,21:.:99:706,0,243 0/0:35,0:35:81:0,81,1215    0/0:35,0:35:99:0,99,1485

I've just discovered this issue, and I need to run an analysis trying on the differential depth of coverage in different regions, and if there is a DP bias between called/not-called samples.

I have thousands of files and I've spent almost 1 year generating all these callings, so redoing the callings is not an option.

What would be the best/fastest strategy to either fix my final vcfs with the DP data present in all intermediate gvcf files (preferably) or, at least, extracting this data for all snps and samples?

Thanks in advance,


PS: Recalling the individual samples from bamfiles is not an option. Fixing the individual gvcfs and redoing the joint GenotypeGVCFs could be.

Created 2016-03-28 00:54:06 | Updated | Tags: depthofcoverage vcf-format

Comments (1)


Although I have read through the related topics, I'm still quite confused about the significance of "Depth across all samples" (DP) in INFO in the vcf file. Does "across samples" mean it addition the read depth of all the samples together or is it a mean over all the samples? In my vcf file (after joint genotyping in gvcf mode), I obtained DP in INFO between 30 and 99 while the sample-DP are much less in general. I think the DP in INFO is a sum of depth, am I right?

Created 2016-03-27 14:51:57 | Updated | Tags: combinevariants variantcontext vcf-format genomestrip-2-0

Comments (2)


I've generated vcfs for deletions, genotyped with genomestrip, multiple indivuals/samples, for each of six chromosomes, and I'd like to combine them. The problem is that:

. /etc/profile.d/modules.sh
module load jre/1.7.0_25
module load gatk/3.4-0

GenomeAnalysisTK -R ${refseq} \
-T CombineVariants --genotypemergeoption UNSORTED \
-V ${vcf}.chr2L.vcf \
-V ${vcf}.chr2R.vcf \
-V ${vcf}.chr3L.vcf \
-V ${vcf}.chr3R.vcf \
-V ${vcf}.chrX.vcf \
-V ${vcf}.chr4.vcf \
    -o unfilt.${vcf}.vcf

returns ....

ERROR stack trace 
java.lang.IllegalStateException: Key CNF found in VariantContext field FORMAT at chr2L:16495 but this key isn't defined in the VCFHeader.
ERROR MESSAGE: Key CNF found in VariantContext field FORMAT at chr2L:16495 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.

In the input vcfs, the CNF key in the VariantContext field FORMAT is shown here:

GT:CNF:FT:GL:GP:GQ      0/0:1.9552:PASS:-0.00,-7.56,-194.96:-0.00,-9.62,-198.31:96 

So I guess I have to edit the vcf header include some indication of the CNF field.. does that sound correct ?

Created 2016-03-24 12:31:15 | Updated 2016-03-24 12:32:14 | Tags: vcf-format gt

Comments (1)

Dear all,

After running Haplotyper caller, I get a VCF file. To understand the vcf file, I am following this document http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it.

According to GT section, in the output there should be one option among the following

0/0 - the sample is homozygous reference
0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
1/1 - the sample is homozygous alternate

However, in my output vcf file we also find values like GT 1/2, I am not sure how to interpret this output.

Could you please suggest what does this mean? I am using Haplotyper caller to call the variants

Created 2015-12-27 00:37:35 | Updated | Tags: readbackedphasingwalker readbackedphasing vcf-format

Comments (2)

I have been trying to use ReadBackedPhasing, but its is producing an output with no phasing tags (HP tag). It works with a vcf file that is been called on a single bam file, but does not produce any output from vcf files generated from vcf-subset. In short, I would like to know what INFO/FORMAT fields it looks for while assigning HP tag.

Created 2015-11-09 16:28:47 | Updated | Tags: catvariants vcf-format

Comments (4)

I tried to join chromosome datasets for my SNPs data separated in different chromosomes.

input command: java -Xmx160g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R /nobackup/data7/jose/Medicago_truncatula_genome/JCVI.Medtr.v4.20130313.fasta -V /nobackup/data7/jose/GBSv1/Msativa_diploids/hapmap/mergedSNPs/myGBSGenos_mergedSNPs_chr1_ReadDepthFiltered.vcf -V /nobackup/data7/jose/GBSv1/Msativa_diploids/hapmap/mergedSNPs/myGBSGenos_mergedSNPs_chr2_ReadDepthFiltered.vcf -V /nobackup/data7/jose/GBSv1/Msativa_diploids/hapmap/mergedSNPs/myGBSGenos_mergedSNPs_chr3_ReadDepthFiltered.vcf -V /nobackup/data7/jose/GBSv1/Msativa_diploids/hapmap/mergedSNPs/myGBSGenos_mergedSNPs_chr4_ReadDepthFiltered.vcf -V /nobackup/data7/jose/GBSv1/Msativa_diploids/hapmap/mergedSNPs/myGBSGenos_mergedSNPs_chr5_ReadDepthFiltered.vcf -V /nobackup/data7/jose/GBSv1/Msativa_diploids/hapmap/mergedSNPs/myGBSGenos_mergedSNPs_chr6_ReadDepthFiltered.vcf -V /nobackup/data7/jose/GBSv1/Msativa_diploids/hapmap/mergedSNPs/myGBSGenos_mergedSNPs_chr7_ReadDepthFiltered.vcf -V /nobackup/data7/jose/GBSv1/Msativa_diploids/hapmap/mergedSNPs/myGBSGenos_mergedSNPs_chr8_ReadDepthFiltered.vcf -out /nobackup/data7/jose/GBSv1/Msativa_diploids/hapmap/mergedSNPs/myGBSGenos_mergedSNPs_chr1-8_ReadDepthFiltered.vcf -assumeSorted

error message:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.setIndexSequenceDictionary(IndexingVariantContextWriter.java:178) at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:133) at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:210) at org.broadinstitute.gatk.tools.CatVariants.execute(CatVariants.java:303) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.tools.CatVariants.main(CatVariants.java:311)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Any help would be appreciated.

Created 2015-11-03 13:15:55 | Updated | Tags: vcf-format new-tag

Comments (2)

Hello! Is it possible add a tag in a vcf when I calculate the ratio of the second filed of AD and DP? for example in this case: chr13 32912299 . T C 2372.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.631;ClippingRankSum=-0.154;DP=266;FS=1.546;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.144;QD=8.92;ReadPosRankSum=0.162;SOR=0.596 GT:AD:DP:GQ:PL 0/1:155,111:266:99:2401,0,4319

I would like calculate 111/226 Thank you very much in advance

Created 2015-08-10 18:17:31 | Updated | Tags: haplotypecaller vcf-format combinegvcfs

Comments (6)


I've a problem with a VCF when combiningGVCFs, gatk/v3.4. "headerKey END found in VariantContext field INFO at chr2L:64003 but this key isn't defined in the VCFHeader"

HaplotypeCaller -nct 30 --analysis_type HaplotypeCaller --reference_sequence ../../reference_sequences/dmel/v6.0/dm6.fa \ --input_file ../../read_mapping/reference_line/RGnb/RGnb.bam \ --variant_index_type LINEAR --variant_index_parameter 128000 -mbq 10 -L chr2L \ --out ../../read_mapping/reference_line/RGnb/RGnb.g.vcf

VCF CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RGnb chr2L 43265 . A G 2147.77 . AC=2;AF=1.00;AN=2;DP=54;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.59;SOR=1.118 GT:AD:DP:GQ:PL 1/1:0,54:54:99:2176,162,0 chr2L 64003 . GT G 57.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.987;ClippingRankSum=-0.189;DP=45;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.85;MQRankSum=0.525;QD=1.28;ReadPosRankSum=1.149;SOR=0.622 GT:AD:DP:GQ:PL 0/1:25,8:33:95:95,0,521

Combine three gvcfs --analysis_type CombineGVCFs --reference_sequence ../../reference_sequences/dmel/v6.0/dm6.fa \ --variant rg_GVCF.list --out ./RG.g.vcf

Error message: java.lang.IllegalStateException: Key END found in VariantContext field INFO at chr2L:64003 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF

Previous issues for "key isn't defined in the VCFHeader" http://gatkforums.broadinstitute.org/discussion/3962/genotypeandvalidate-error-key-callstatus-found-in-variantcontext-field-info#latest http://gatkforums.broadinstitute.org/discussion/2331/select-variants-says-vcf-header-is-missing https://github.com/charite/jannovar/issues/70 https://groups.google.com/forum/#!topic/gawed-awesome-analyzers/RjXtIuYQ_V8

Could anyone advise me on how to fix this?



Created 2015-06-21 13:08:56 | Updated 2015-06-21 13:10:27 | Tags: picard vcftools vcf-format

Comments (2)

I am trying to open small uncompressed VCFs via picard-tools-1.119, via SplitVcfs.jar.

File vcfFile = new File("test2.vcf"); VCFFileReader vcfReader = new VCFFileReader(vcfFile,false);

Even though I've specified that an index is not required in the constructor, this generates the following error:

htsjdk.tribble.TribbleException: Index not found for: MYLOCATION\test2.vcf at htsjdk.tribble.TribbleIndexedFeatureReader.query(TribbleIndexedFeatureReader.java:253) at htsjdk.variant.vcf.VCFFileReader.query(VCFFileReader.java:124) (if I don't specify the index parameter, then it generates a different error:"An index is required, but none found" )

I've attached a test file so you can see it for yourself. ( NOTE: I've had to compress it into a .zip, otherwise the forum didn't allow me to attach the file itself, but if you want to replicate my results you will need to uncompress it and try loading the plain file (test2.vcf).

My question: is there any way to open uncompressed VCF files in picard via VCFFileReader or otherwise?

Thanks Martin

PS: these are small VCFs that would be exported out from larger datasets, so compressing/indexing them would seem like an unnecessary extra step.

Created 2015-04-06 15:40:41 | Updated | Tags: variantstovcf vcf-format

Comments (2)

Hi there can any body help me in knowing how we can calculate or find frequency of a heterozygous carrier from the Allele frequency in vcf file Regards

Created 2014-05-27 10:53:30 | Updated 2014-05-27 10:55:27 | Tags: vcf multi-sample vcf-format genotypegvcfs

Comments (3)


I'm doing a variant analysis of genomic DNA from 2 related samples. I followed the up-to-date Best practices using HaplotypeCaller in GVCF mode for both samples followed by GenotypeGVCF to compute a common vcf of variant loci. I'm looking at variants that would be sample2-specific (present in sample2 but not in sample1)

Here is a line of this file:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 chrIII 91124 . A AATAAGAGGAATTAGGCT 1132.42 . AC=2;AF=0.500;AN=4;DP=47;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=58.85;MQ0=0;QD=7.99 GT:AD:DP:GQ:PL 1/1:0,25:25:55:1167,55,0 0/0:.:22:33:0,33,495

In the Genotype Field, sample2.AD is a . (dot) meaning that no reads passed the Quality filters. However, sample2.DP=22 meaning that 22 reads covered this position. This line suggest that this variation is specific to sample1 (genotype HomVar 1/1) and is not present in sample2 (HomRef 0/0). But given the biological relationship between sample1 and 2 (the way they were generated), I doubt that this variation is true: it is very likely to be present in sample2 as well. It's a false

I have 416 loci like this. For the vast majority of them, sample1 and 2 likely share the same variation. But since it is not impossible that a very few of them are really sample1=HomVar and sample2=HomRef, could you suggest me a way to detect those guys? What about comparing sample1.PL(1/1) and sample2.PL(0/0) ? For example could you suggest a rule of thumb to determine their ratio ?

Created 2014-05-15 20:29:50 | Updated | Tags: haplotypecaller snp indels vcf-format

Comments (7)

Hi, GATK Team

I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel:

6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214

This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position:

6   16327909    .   ATGCTGATGCTGC   ATGC,CTGCTGATGCTGC,A    1390.70 .   AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358   GT:AD:DP:GQ:PL

However, this mode of representing a SNP is causing processing and analysis problems further downstream after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way?



Created 2014-02-27 15:25:19 | Updated 2014-02-27 22:40:50 | Tags: unifiedgenotyper genotype vcf-format gt

Comments (10)

generated with gatk 2.8-1-g932cd3a

Although it is rare I see Genotype Fields that are inconsistent with the AD values (Read as table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802

This shows that for example sample 5 has a AD value of '23,12' and a GT of 'A/A' aka homyzougous reference allele. I've included a screenshot wich shows low base quality and complete strand bias (Which I suspect to mis variants). So whats the prob? and how can i recalculate the GT's based on AD? because i cannot filter based on genotypes when they are buggy....