Tagged with #malformedvcf
0 documentation articles | 0 announcements | 3 forum discussions

No articles to display.

No articles to display.

Created 2014-09-19 14:34:45 | Updated | Tags: haplotypecaller malformedvcf genotypegvcfs

Comments (9)


I've tried my best to find the problem but without luck, so my first post...

First the background:

I have used HaplotypeCaller to make GVCFs per individual for 2 cohorts of 270 and 300 individuals each.
I then used CombineGVCFs to merged all the individuals in each cohort across 10Mb chunks across the genome.

These steps appear to have worked OK! Then I have been attempting to call genotypes from pairs GVCFs (one from each cohort) per region using GenotypeGVCFs with a command such as:

java -jar -Xmx6G /gpfs0/apps/well/gatk/3.2-2/GenomeAnalysisTK.jar \ -T GenotypeGVCFs \ -R ../hs37d5/hs37d5.fa \ --variant /well/hill/kate/exomes//haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf \ --variant /well/hill/kate/exomes//haploc/sepsis300/sepsis300_1:230000000-240000000_merged_gvcf.vcf \ -o /well/hill/kate/exomes//haploc/cap270-sepsis300/cap270-sepsis300_1:230000000-240000000_raw.vcf

This seems to start fine (and works perfectly to completion for some regions), but then throws a error for some of the files. For this example the error is:

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 519534: ./.:29:75:25:0,63,945 is not a valid start position in the VCF format, for input source: /well/hill/kate/exomes/haploc/cap270/cap270_1:230000000-240000000_merged_gvcf.vcf

So to try and solve this I have grepped "./.:29:75:25:0,63,945 " from the VCF file (expecting to find a truncated line), but could find nothing wrong with the lines containing this expression. I looked carefully at the first line containing this after the last position that was written in the output - nothing out of the ordinary that I can see.

Any help/advice gratefully received! Kate

Created 2013-04-11 22:33:51 | Updated | Tags: mutect java malformedvcf

Comments (6)

In addition to the standard mutect output, I'm interested in vcf output, and was happy to find
a previous related question showing how to output vcf. However, I seem to be having some
trouble with what I think is misformed output. Specifically, the genotype field is "0" for
normal and "0/1" for tumor on every line

#CHROM  POS       ID           REF  ALT  QUAL  FILTER  INFO               FORMAT             normal  tumor  
7       55230840  rs7781264    A    G    .     REJECT  DB                 GT:AD:BQ:DP:FA     0:0,1:.:1:1.00                0/1:0,27:29:27:1.00           
7       55233109  rs150899403  G    A    .     PASS    DB;SOMATIC;VT=SNP  GT:AD:BQ:DP:FA:SS  0:134,0:.:132:0.00:0          0/1:380,296:24:688:0.438:2    
7       55233265  .            A    C    .     REJECT  .                  GT:AD:BQ:DP:FA     0:6,0:.:6:0.00                0/1:251,24:12:275:0.087       

The corresponding lines in the mutect output file are

## muTector v1.0.47986
contig  position  context  ref_allele  alt_allele  tumor_name   normal_name   score  dbsnp_site    covered    power     tumor_power  normal_power  total_pairs  improper_pairs  map_Q0_reads  t_lod_fstar  tumor_f   contaminant_fraction  contaminant_lod  t_ref_count  t_alt_count  t_ref_sum  t_alt_sum  t_ref_max_mapq  t_alt_max_mapq  t_ins_count  t_del_count  normal_best_gt  init_n_lod   n_ref_count  n_alt_count  n_ref_sum  n_alt_sum  judgement
7       55230840  ACTxTGC  A           G           tumor        normal        0      DBSNP         UNCOVERED  0         0.612407     0             30           1               0             93.647278    1         0.02                  -0.236654        0            27           0          808        0               70              0            0            GG              -3.882263    0            1            0          30         REJECT 
7       55233109  TGTxCCA  G           A           tumor        normal        0      DBSNP+COSMIC  COVERED    1         1            1             1140         3               7             665.64967    0.43787   0.02                  28.941878        380          296          10901      7263       70              70              0            0            GG              40.298595    134          0            4097       0          KEEP   
7       55233265  CCCxCAG  A           C           tumor        normal        0      NOVEL         UNCOVERED  0         1            0             305          5               0             8.112745     0.087273  0.02                  2.430961         251          24           6289       302        70              70              0            0            AA              1.803681     6            0            154        0          REJECT 

If it matters, this was with openjdk 1.6:

$ /usr/lib/jvm/java-1.6.0-openjdk- -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Any idea what might be causing this, and is there anything you or I can do to fix it?

Thanks, Kevin

Comments (5)


I am having trouble calling variants using Haplotype Caller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices Pre-Processing pipeline. The pre-processed data worked fine with UnifiedGenotyper but with HaplotypeCaller, though it runs without errors and seems to walk across the genome, only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference - when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.

Any help or guidance would be greatly appreciated!

Command Used for HaplotypeCaller:

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG

Commands Used for pre-processing (run in sequence using a Perl script):

java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf

java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf

java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam

samtools index exome.realigned.dedup

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate

java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam