Hi GATK team,
I hate the VCF format :-)
I want structured output, and I'd like to promote the use of XML/JSON to store variations. I think the best way to achieve this is to integrate the new format into the GATK rather than to create yet another tool that converts VCF to XML/JSON. Ideally, I could then insert the result of, say, the ENSEMBL API (e.g. http://beta.rest.ensembl.org/vep/human/9:22125503-22125502:1/C/consequences?content-type=text/xml ) into each 'variation' element.
I've forked the GATK and created a new class to handle the XML output:
https://github.com/lindenb/gatk/commit/dbffd2fa3e7a043a6951d8ac58dd619e68a6caa8
Now, in VariantContextWriterFactory, when the filename ends with ".xml", the factory creates a new XMLVariantContextWriter rather than a VCFWriter.
I'm still writing XMLVariantContextWriter; so far it only outputs the header and the chrom/pos of each variation. Here is a sample:
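To illustrate the idea of choosing the writer from the file extension, here is a minimal, standalone sketch. It is not the code from the commit: the OutputFormat enum and the formatFor method are hypothetical names made up for this example, and the case-insensitive match is my assumption.

```java
import java.util.Locale;

public class WriterDispatchSketch {
    /** Hypothetical stand-in for the available output formats; not a GATK type. */
    enum OutputFormat { VCF, XML }

    /** Picks the output format from the file name extension (case-insensitive, by assumption). */
    static OutputFormat formatFor(String filename) {
        String lower = filename.toLowerCase(Locale.ROOT);
        // Anything that does not end with ".xml" falls back to plain VCF.
        return lower.endsWith(".xml") ? OutputFormat.XML : OutputFormat.VCF;
    }

    public static void main(String[] args) {
        System.out.println(formatFor("ex1f.vcf.xml")); // prints XML
        System.out.println(formatFor("ex1f.vcf"));     // prints VCF
    }
}
```

Keeping VCF as the fallback means existing command lines keep their current behaviour; only a ".xml" suffix opts in to the new writer.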
java -jar dist/GenomeAnalysisTK.jar -T UnifiedGenotyper -o /home/lindenb/package/samtools-0.1.18/examples/ex1f.vcf.xml -R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa -I /home/lindenb/package/samtools-0.1.18/examples/sorted.bam
INFO 17:12:28,358 HelpFormatter - ----------------------------------------------------------------------------------------------------------
INFO 17:12:28,361 HelpFormatter - The Genome Analysis Toolkit (GATK) vdbffd2fa3e7a043a6951d8ac58dd619e68a6caa8, Compiled 2012/10/15 16:53:32
INFO 17:12:28,361 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:12:28,361 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:12:28,362 HelpFormatter - Program Args: -T UnifiedGenotyper -o /home/lindenb/package/samtools-0.1.18/examples/ex1f.vcf.xml -R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa -I /home/lindenb/package/samtools-0.1.18/examples/sorted.bam
INFO 17:12:28,363 HelpFormatter - Date/Time: 2012/10/15 17:12:28
INFO 17:12:28,364 HelpFormatter - ----------------------------------------------------------------------------------------------------------
INFO 17:12:28,364 HelpFormatter - ----------------------------------------------------------------------------------------------------------
INFO 17:12:28,392 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:12:28,430 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 17:12:28,444 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 17:12:28,835 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 17:12:28,835 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 17:12:30,721 TraversalEngine - Total runtime 2.00 secs, 0.03 min, 0.00 hours
INFO 17:12:30,723 TraversalEngine - 108 reads were filtered out during traversal out of 9921 total (1.09%)
INFO 17:12:30,727 TraversalEngine - -> 108 reads (1.09% of total) failing UnmappedReadFilter
output:
<?xml version="1.0"?>
<vcf xmlns="http://xml.1000genomes.org/">
<head>
<metadata key="fileformat">VCFv4.1</metadata>
<info-list>
<info ID="FS" type="Float" count="1">Phred-scaled p-value using Fisher's exact test to detect strand bias</info>
<info ID="AN" type="Integer" count="1">Total number of alleles in called genotypes</info>
<info ID="BaseQRankSum" type="Float" count="1">Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities</info>
<info ID="MQ" type="Float" count="1">RMS Mapping Quality</info>
(....)
<info ID="AF" type="Float">Allele Frequency, for each ALT allele, in the same order as listed</info>
</info-list>
<format-list>
<format ID="DP" type="Integer" count="1">Approximate read depth (reads with MQ=255 or with bad mates are filtered)</format>
<format ID="GT" type="String" count="1">Genotype</format>
<format ID="PL" type="Integer">Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification</format>
<format ID="GQ" type="Integer" count="1">Genotype Quality</format>
<format ID="AD" type="Integer">Allelic depths for the ref and alt alleles in the order listed</format>
</format-list>
<filters-list>
<filter ID="LowQual"/>
</filters-list>
<contigs-list>
<contig ID="seq1" index="0"/>
<contig ID="seq2" index="1"/>
</contigs-list>
<samples-list>
<sample id="1">ex1</sample>
<sample id="2">ex1b</sample>
</samples-list>
</head>
<body>
<variations>
<variation chrom="seq1" pos="285">
<id>.</id>
<ref>T</ref>
<alt>A</alt>
</variation>
<variation chrom="seq1" pos="287">
<id>.</id>
<ref>C</ref>
<alt>A</alt>
</variation>
(....)
</variations>
</body>
</vcf>
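For readers who want to see how the 'variation' elements above could be emitted, here is a rough sketch using the StAX XMLStreamWriter that ships with the JDK. This is only an assumption about one possible approach, not the actual XMLVariantContextWriter; the class and method names (XmlVariationSketch, writeVariation, render) are hypothetical, and the header section is left out.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class XmlVariationSketch {
    /** Writes a simple element containing only text, e.g. <ref>T</ref>. */
    static void writeText(XMLStreamWriter w, String name, String text) throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(text);
        w.writeEndElement();
    }

    /** Writes one <variation> element; the parameters mirror the sample, nothing more. */
    static void writeVariation(XMLStreamWriter w, String chrom, int pos,
                               String id, String ref, String alt) throws XMLStreamException {
        w.writeStartElement("variation");
        w.writeAttribute("chrom", chrom);
        w.writeAttribute("pos", String.valueOf(pos));
        writeText(w, "id", id);
        writeText(w, "ref", ref);
        writeText(w, "alt", alt);
        w.writeEndElement();
    }

    /** Builds a small document with the two variations from the sample (header omitted). */
    static String render() {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument("1.0");
            w.writeStartElement("vcf");
            w.writeDefaultNamespace("http://xml.1000genomes.org/");
            w.writeStartElement("body");
            w.writeStartElement("variations");
            writeVariation(w, "seq1", 285, ".", "T", "A");
            writeVariation(w, "seq1", 287, ".", "C", "A");
            w.writeEndDocument(); // closes variations, body and vcf
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```

A streaming writer fits here because variations arrive one at a time during traversal, so nothing has to be held in memory until the end of the run.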
Would you accept a pull request for this work?
(I'd like to create a JSON output too.)
Pierre