Hi the GATK team,

I hate the VCF format :-)

I want structured output, and I'd like to promote the use of XML/JSON to store variations. I think the best way to achieve this is to integrate the new format into the GATK itself rather than create another tool that converts VCF to XML/JSON. In an ideal world, I could insert the result of, say, the Ensembl API (e.g. http://beta.rest.ensembl.org/vep/human/9:22125503-22125502:1/C/consequences?content-type=text/xml ) into each 'variation' element.

I've forked the GATK and created a new class to handle the XML output:

https://github.com/lindenb/gatk/commit/dbffd2fa3e7a043a6951d8ac58dd619e68a6caa8

Now, in VariantContextWriterFactory, when the filename ends with ".xml", the factory creates a new XMLVariantContextWriter rather than a VCFWriter.
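
The dispatch described above could look roughly like the following. This is only a minimal sketch of the idea, not the code from the linked commit: the class `WriterDispatchSketch`, the enum `OutputFormat`, and the method `chooseFormat` are hypothetical names, and the case-insensitive match is an assumption.

```java
import java.util.Locale;

/** Hypothetical stand-in for the dispatch step; not the actual GATK factory code. */
class WriterDispatchSketch {

    /** Output formats this sketch knows about. */
    enum OutputFormat { VCF, XML }

    /**
     * Picks XML when the file name ends with ".xml" and falls back to VCF otherwise.
     * Matching case-insensitively is an assumption of this sketch.
     */
    static OutputFormat chooseFormat(String filename) {
        if (filename != null && filename.toLowerCase(Locale.ROOT).endsWith(".xml")) {
            return OutputFormat.XML;
        }
        return OutputFormat.VCF;
    }

    public static void main(String[] args) {
        System.out.println(chooseFormat("ex1f.vcf.xml")); // XML
        System.out.println(chooseFormat("ex1f.vcf"));     // VCF
    }
}
```

A real factory would return a writer instance rather than an enum value; the enum just keeps the sketch self-contained.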

I'm currently writing XMLVariantContextWriter and I've only written the header and the chrom/pos for the variations. Here is a sample:

java -jar dist/GenomeAnalysisTK.jar  -T UnifiedGenotyper -o /home/lindenb/package/samtools-0.1.18/examples/ex1f.vcf.xml -R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa -I /home/lindenb/package/samtools-0.1.18/examples/sorted.bam
INFO  17:12:28,358 HelpFormatter - ---------------------------------------------------------------------------------------------------------- 
INFO  17:12:28,361 HelpFormatter - The Genome Analysis Toolkit (GATK) vdbffd2fa3e7a043a6951d8ac58dd619e68a6caa8, Compiled 2012/10/15 16:53:32 
INFO  17:12:28,361 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  17:12:28,361 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  17:12:28,362 HelpFormatter - Program Args: -T UnifiedGenotyper -o /home/lindenb/package/samtools-0.1.18/examples/ex1f.vcf.xml -R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa -I /home/lindenb/package/samtools-0.1.18/examples/sorted.bam 
INFO  17:12:28,363 HelpFormatter - Date/Time: 2012/10/15 17:12:28 
INFO  17:12:28,364 HelpFormatter - ---------------------------------------------------------------------------------------------------------- 
INFO  17:12:28,364 HelpFormatter - ---------------------------------------------------------------------------------------------------------- 
INFO  17:12:28,392 GenomeAnalysisEngine - Strictness is SILENT 
INFO  17:12:28,430 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  17:12:28,444 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  17:12:28,835 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  17:12:28,835 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  17:12:30,721 TraversalEngine - Total runtime 2.00 secs, 0.03 min, 0.00 hours 
INFO  17:12:30,723 TraversalEngine - 108 reads were filtered out during traversal out of 9921 total (1.09%) 
INFO  17:12:30,727 TraversalEngine -   -> 108 reads (1.09% of total) failing UnmappedReadFilter 

Output:

<?xml version="1.0"?>
<vcf xmlns="http://xml.1000genomes.org/">
  <head>
    <metadata key="fileformat">VCFv4.1</metadata>
    <info-list>
      <info ID="FS" type="Float" count="1">Phred-scaled p-value using Fisher's exact test to detect strand bias</info>
      <info ID="AN" type="Integer" count="1">Total number of alleles in called genotypes</info>
      <info ID="BaseQRankSum" type="Float" count="1">Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities</info>
      <info ID="MQ" type="Float" count="1">RMS Mapping Quality</info>
      (....)
      <info ID="AF" type="Float">Allele Frequency, for each ALT allele, in the same order as listed</info>
    </info-list>
    <format-list>
      <format ID="DP" type="Integer" count="1">Approximate read depth (reads with MQ=255 or with bad mates are filtered)</format>
      <format ID="GT" type="String" count="1">Genotype</format>
      <format ID="PL" type="Integer">Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification</format>
      <format ID="GQ" type="Integer" count="1">Genotype Quality</format>
      <format ID="AD" type="Integer">Allelic depths for the ref and alt alleles in the order listed</format>
    </format-list>
    <filters-list>
      <filter ID="LowQual"/>
    </filters-list>
    <contigs-list>
      <contig ID="seq1" index="0"/>
      <contig ID="seq2" index="1"/>
    </contigs-list>
    <samples-list>
      <sample id="1">ex1</sample>
      <sample id="2">ex1b</sample>
    </samples-list>
  </head>
  <body>
    <variations>
      <variation chrom="seq1" pos="285">
        <id>.</id>
        <ref>T</ref>
        <alt>A</alt>
      </variation>
      <variation chrom="seq1" pos="287">
        <id>.</id>
        <ref>C</ref>
        <alt>A</alt>
      </variation>
      (....)
    </variations>
  </body>
</vcf>
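
For readers who want to try producing the `<variation>` elements shown above, here is a minimal sketch using the JDK's standard `javax.xml.stream.XMLStreamWriter`. It is not the XMLVariantContextWriter from the commit: the classes `XmlVariationSketch` and `Variation` are hypothetical, the `<head>` section is omitted, and a real writer would read these fields from GATK's own variant objects.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical sketch: serializes a few variations in the layout of the sample above. */
class XmlVariationSketch {

    /** Hypothetical minimal record; a real variant object carries far more. */
    static final class Variation {
        final String chrom;
        final int pos;
        final String id;
        final String ref;
        final String alt;

        Variation(String chrom, int pos, String id, String ref, String alt) {
            this.chrom = chrom;
            this.pos = pos;
            this.id = id;
            this.ref = ref;
            this.alt = alt;
        }
    }

    /** Writes <name>text</name>. */
    private static void writeTextElement(XMLStreamWriter w, String name, String text)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(text);
        w.writeEndElement();
    }

    /** Returns an XML document with vcf/body/variations and one variation element per entry. */
    static String toXml(List<Variation> variations) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument("1.0");
            w.writeStartElement("vcf");
            w.writeDefaultNamespace("http://xml.1000genomes.org/");
            w.writeStartElement("body");
            w.writeStartElement("variations");
            for (Variation v : variations) {
                w.writeStartElement("variation");
                w.writeAttribute("chrom", v.chrom);
                w.writeAttribute("pos", Integer.toString(v.pos));
                writeTextElement(w, "id", v.id);
                writeTextElement(w, "ref", v.ref);
                writeTextElement(w, "alt", v.alt);
                w.writeEndElement(); // variation
            }
            w.writeEndElement(); // variations
            w.writeEndElement(); // body
            w.writeEndElement(); // vcf
            w.writeEndDocument();
            w.close();
        } catch (XMLStreamException e) {
            // Writing to an in-memory buffer should not fail; rethrow unchecked if it does.
            throw new IllegalStateException("could not serialize variations", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Variation> demo = Arrays.asList(
                new Variation("seq1", 285, ".", "T", "A"),
                new Variation("seq1", 287, ".", "C", "A"));
        System.out.println(toXml(demo));
    }
}
```

A streaming writer fits here because variations are emitted one at a time during traversal, so nothing needs to be held in memory as a full document tree. The output is not indented; pretty-printing would need a separate step.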

Would you accept a pull request for this project?

(I'd like to create a JSON output too.)

Pierre