Tagged with #vcf-format
0 documentation articles | 0 announcements | 5 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.

Created 2015-06-21 13:08:56 | Updated 2015-06-21 13:10:27 | Tags: picard vcftools vcf-format
Comments (2)

I am trying to open small uncompressed VCFs via picard-tools-1.119, via SplitVcfs.jar.

File vcfFile = new File("test2.vcf"); VCFFileReader vcfReader = new VCFFileReader(vcfFile,false);

Even though I've specified that an index is not required in the constructor, this generates the following error:

htsjdk.tribble.TribbleException: Index not found for: MYLOCATION\test2.vcf at htsjdk.tribble.TribbleIndexedFeatureReader.query(TribbleIndexedFeatureReader.java:253) at htsjdk.variant.vcf.VCFFileReader.query(VCFFileReader.java:124) (if I don't specify the index parameter, then it generates a different error:"An index is required, but none found" )

I've attached a test file so you can see it for yourself. ( NOTE: I've had to compress it into a .zip, otherwise the forum didn't allow me to attach the file itself, but if you want to replicate my results you will need to uncompress it and try loading the plain file (test2.vcf).

My question: is there any way to open uncompressed VCF files in picard via VCFFileReader or otherwise?

Thanks Martin

PS: these are small VCFs that would be exported out from larger datasets, so compressing/indexing them would seem like an unnecessary extra step.

Created 2015-04-06 15:40:41 | Updated | Tags: variantstovcf vcf-format
Comments (2)

Hi there can any body help me in knowing how we can calculate or find frequency of a heterozygous carrier from the Allele frequency in vcf file Regards

Created 2014-05-27 10:53:30 | Updated 2014-05-27 10:55:27 | Tags: vcf multi-sample vcf-format genotypegvcfs
Comments (3)


I'm doing a variant analysis of genomic DNA from 2 related samples. I followed the up-to-date Best practices using HaplotypeCaller in GVCF mode for both samples followed by GenotypeGVCF to compute a common vcf of variant loci. I'm looking at variants that would be sample2-specific (present in sample2 but not in sample1)

Here is a line of this file:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 chrIII 91124 . A AATAAGAGGAATTAGGCT 1132.42 . AC=2;AF=0.500;AN=4;DP=47;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=58.85;MQ0=0;QD=7.99 GT:AD:DP:GQ:PL 1/1:0,25:25:55:1167,55,0 0/0:.:22:33:0,33,495

In the Genotype Field, sample2.AD is a . (dot) meaning that no reads passed the Quality filters. However, sample2.DP=22 meaning that 22 reads covered this position. This line suggest that this variation is specific to sample1 (genotype HomVar 1/1) and is not present in sample2 (HomRef 0/0). But given the biological relationship between sample1 and 2 (the way they were generated), I doubt that this variation is true: it is very likely to be present in sample2 as well. It's a false

I have 416 loci like this. For the vast majority of them, sample1 and 2 likely share the same variation. But since it is not impossible that a very few of them are really sample1=HomVar and sample2=HomRef, could you suggest me a way to detect those guys? What about comparing sample1.PL(1/1) and sample2.PL(0/0) ? For example could you suggest a rule of thumb to determine their ratio ?

Created 2014-05-15 20:29:50 | Updated | Tags: haplotypecaller snp indels vcf-format
Comments (7)

Hi, GATK Team

I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel:

6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214

This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position:

6   16327909    .   ATGCTGATGCTGC   ATGC,CTGCTGATGCTGC,A    1390.70 .   AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358   GT:AD:DP:GQ:PL

However, this mode of representing a SNP is causing processing and analysis problems further downstream after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way?



Created 2014-02-27 15:25:19 | Updated 2014-02-27 22:40:50 | Tags: unifiedgenotyper genotype vcf-format gt
Comments (10)

generated with gatk 2.8-1-g932cd3a

Although it is rare I see Genotype Fields that are inconsistent with the AD values (Read as table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802

This shows that for example sample 5 has a AD value of '23,12' and a GT of 'A/A' aka homyzougous reference allele. I've included a screenshot wich shows low base quality and complete strand bias (Which I suspect to mis variants). So whats the prob? and how can i recalculate the GT's based on AD? because i cannot filter based on genotypes when they are buggy....