Tagged with #malformedvcf
0 documentation articles | 0 announcements | 2 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.
Comments (8)


I've tried my best to find the problem but without luck, so my first post...

First the background:

I have used HaplotypeCaller to make GVCFs per individual for 2 cohorts of 270 and 300 individuals each.
I then used CombineGVCFs to merged all the individuals in each cohort across 10Mb chunks across the genome.

These steps appear to have worked OK! Then I have been attempting to call genotypes from pairs GVCFs (one from each cohort) per region using GenotypeGVCFs with a command such as:

java -jar -Xmx6G /gpfs0/apps/well/gatk/3.2-2/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ../hs37d5/hs37d5.fa \ --variant /well/hill/kate/exomes//haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf \ --variant /well/hill/kate/exomes//haploc/sepsis300/sepsis300_1:230000000-240000000_merged_gvcf.vcf \ -o /well/hill/kate/exomes//haploc/cap270-sepsis300/cap270-sepsis300_1:230000000-240000000_raw.vcf

This seems to start fine (and works perfectly to completion for some regions), but then throws a error for some of the files. For this example the error is:

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 519534: ./.:29:75:25:0,63,945 is not a valid start position in the VCF format, for input source: /well/hill/kate/exomes/haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf

So to try and solve this I have grepped "./.:29:75:25:0,63,945 " from the VCF file (expecting to find a truncated line), but could find nothing wrong with the lines containing this expression. I looked carefully at the first line containing this after the last position that was written in the output - nothing out of the ordinary that I can see.

Any help/advice gratefully received! Kate

Comments (5)


I am having trouble calling variants using Haplotype Caller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices Pre-Processing pipeline. The pre-processed data worked fine with UnifiedGenotyper but with HaplotypeCaller, though it runs without errors and seems to walk across the genome, only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference - when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.

Any help or guidance would be greatly appreciated!

Command Used for HaplotypeCaller:

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG

Commands Used for pre-processing (run in sequence using a Perl script):

java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf

java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf

java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam

samtools index exome.realigned.dedup

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate

java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam