This is a classic problem: you get some VCF files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the references don't match.
So what do you do? If you can, you find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbsnp, you're in luck -- we provide versions of dbsnp and similar resources derived from the major human reference builds in our resource bundle (see FAQs for access details).
location: ftp.broadinstitute.org username: gsapubftp-anonymous
If that's not an option, then you'll have to "liftover" -- specifically, liftover the mismatching VCF to the reference you need to work with. The procedure for doing so is described below.
This procedure involves three steps:
We provide a script that performs those three steps for you, called
liftOverVCF.pl, which is available in our public source repository under the 'perl' directory. Instructions for pulling down our source are available here.
The example below shows how you would run the script:
./liftOverVCF.pl \ -vcf calls.b36.vcf \ # input vcf -chain b36ToHg19.broad.over.chain \ # chain file -out calls.hg19.vcf \ # output vcf -gatk gatk_source \ # path to source code -newRef Homo_sapiens_assembly19 \ # path to new reference base name (without extension) -oldRef human_b36_both \ # path to old reference prefix (without extension) -tmp /broad/shptmp [defaults to /tmp] # temp file location (defaults to /tmp)
We provide several chain files to liftover between the major human reference builds, also in our resource bundle (mentioned above) in the
Liftover_Chain_Files directory. If you are working with non-human organisms, we can't help you -- but others may have chain files, so ask around in your field.
Note that if you're at the Broad, you can access chain files to liftover from b36/hg18 to hg19 on the
The GATK expects specific information in the header of BAM files (as detailed in the input requirements FAQs), and will fail with an error if it does not find that information.
So what do you do? You use a Picard tool called AddOrReplaceReadGroups to add the missing information to your BAM file.
Here's an example:
# throws an error java -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R reference.fasta \ -I reads_without_RG.bam \ -o output.vcf # fix the read groups java -jar picard.jar AddOrReplaceReadGroups \ I= reads_without_RG.bam \ O= reads_with_RG.bam \ SORT_ORDER=coordinate \ RGID=foo \ RGLB=bar \ RGPL=illumina \ RGSM=Sample1 \ CREATE_INDEX=True # runs without error java -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R reference.fasta \ -I reads_with_RG.bam \ -o output.vcf
Note that if you don't know what information to put in the read groups, you should ask whoever performed the sequencing or provided the BAM to give you the metadata you need.
This tool is part of the Picard package.
This error occurs when for example, a collaborator gives you a BAM that's derived from what was originally the same reference as you are using, but for whatever reason the contigs are not sorted in the same order .The GATK can be particular about the ordering of a BAM file so it will fail with an error in this case.
So what do you do? You use a Picard tool called ReorderSam to, well, reorder your BAM file.
Here's an example usage where we reorder a BAM file that was sorted lexicographically so that the output will be another BAM, but this time sorted karyotypically :
java -jar picard.jar ReorderSam \ I= lexicographic.bam \ O= kayrotypic.bam \ REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta
This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!
This tool is part of the Picard package.
This document describes how to use the DepthOfCoverage tool, which is mostly intended for QC'ing coverage in whole-genome data. For exome data, consider using DiagnoseTargets instead.
For a complete, detailed argument reference, refer to the GATK document page here.
DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:
For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq gene list.
DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.
To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument
The provided gene list must be of the following format:
585 NM_001005484 chr1 + 58953 59871 58953 59871 1 58953, 59871, 0 OR4F5 cmpl cmpl 0, 587 NM_001005224 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F3 cmpl cmpl 0, 587 NM_001005277 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F16 cmpl cmpl 0, 587 NM_001005221 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F29 cmpl cmpl 0, 589 NM_001005224 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F3 cmpl cmpl 0, 589 NM_001005277 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F16 cmpl cmpl 0, 589 NM_001005221 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F29 cmpl cmpl 0,
If you have access to the broad network, the properly-formatted file containing refseq genes and transcripts is located at
If you are not, you can generate your own as described here.
If you supply the -geneList argument, DepthOfCoverage will output an additional summary file that looks as follows:
Gene_Name Total_Cvg Avg_Cvg Sample_1_Total_Cvg Sample_1_Avg_Cvg Sample_1_Cvg_Q3 Sample_1_Cvg_Median Sample_1_Cvg_Q1 SORT1 594710 238.27 594710 238.27 165 245 330 NOTCH2 3011542 357.84 3011542 357.84 222 399 >500 LMNA 563183 186.73 563183 186.73 116 187 262 NOS1AP 513031 203.50 513031 203.50 91 191 290
Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The
-geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes,
-geneList is ignored if
-omitIntervals is thrown.