Tagged with #utilities
3 documentation articles | 0 announcements | 0 forum discussions

Created 2012-07-23 18:15:52 | Updated 2015-11-21 03:41:00 | Tags: fastareference vcf utilities callset
Comments (56)

These errors occur when the contig names, sizes or order doesn't match between input files. The most common case is when you try to use a VCF that does not match the reference.

VCF not matching the reference

This is a classic problem: you get some VCF files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the references don't match.


So what do you do? If you can, you find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbsnp, you're in luck -- we provide versions of dbsnp and similar resources derived from the major human reference builds in our resource bundle (see FAQs for access details).

location: ftp.broadinstitute.org
username: gsapubftp-anonymous

If that's not an option, then you'll have to "liftover" -- specifically, liftover the mismatching VCF to the reference you need to work with. The best tool for liftover is Picard's LiftoverVCF.

Liftover procedure with older versions of GATK

This procedure involves three steps:

  1. Run GATK LiftoverVariants on your VCF file
  2. Run a script to sort the lifted-over file
  3. Filter out records whose REF field does not match the new reference

We provide a script that performs those three steps for you, called liftOverVCF.pl, which is available in our public source repository under the 'perl' directory. Instructions for pulling down our source are available here.

The example below shows how you would run the script:

./liftOverVCF.pl \
    -vcf calls.b36.vcf \                    # input vcf
    -chain b36ToHg19.broad.over.chain \     # chain file
    -out calls.hg19.vcf \                   # output vcf
    -gatk gatk_source \                     # path to source code
    -newRef Homo_sapiens_assembly19 \       # path to new reference base name (without extension)
    -oldRef human_b36_both \                # path to old reference prefix (without extension)
    -tmp /broad/shptmp [defaults to /tmp]   # temp file location (defaults to /tmp)

We provide several chain files to liftover between the major human reference builds, also in our resource bundle (mentioned above) in the Liftover_Chain_Files directory. If you are working with non-human organisms, we can't help you -- but others may have chain files, so ask around in your field.

Note that if you're at the Broad, you can access chain files to liftover from b36/hg18 to hg19 on the humgen server.


Created 2012-07-23 18:04:20 | Updated 2015-11-20 19:40:12 | Tags: bam readgroup utilities addorreplacereadgroups
Comments (0)

What are read groups?

See the Dictionary entry on read groups.

Errors about missing or undefined read groups

As detailed in the FAQs about input requirements, GATK expects all read groups appearing in the read data to be specified in the file header, and will fail with an error if it does not find that information (whether there is no read group information in the file, or a subset of reads do not have read groups).


You can use a Picard tool called AddOrReplaceReadGroups to add the missing information to your input file.

Here's an example:

# throws an error
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I reads_without_RG.bam \
    -o output.vcf

# fix the read groups
java -jar picard.jar AddOrReplaceReadGroups \
    I= reads_without_RG.bam \
    O=  reads_with_RG.bam \
    SORT_ORDER=coordinate \
    RGID=foo \
    RGLB=bar \
    RGPL=illumina \
    RGSM=Sample1 \

# runs without error
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I reads_with_RG.bam \
    -o output.vcf

Note that if you don't know what information to put in the read groups, you should ask whoever performed the sequencing or provided the BAM to give you the metadata you need.

Created 2012-07-23 18:02:24 | Updated 2015-05-15 09:13:41 | Tags: bam utilities picard reordersam sorting order
Comments (17)

This error occurs when for example, a collaborator gives you a BAM that's derived from what was originally the same reference as you are using, but for whatever reason the contigs are not sorted in the same order .The GATK can be particular about the ordering of a BAM file so it will fail with an error in this case.

So what do you do? You use a Picard tool called ReorderSam to, well, reorder your BAM file.

Here's an example usage where we reorder a BAM file that was sorted lexicographically so that the output will be another BAM, but this time sorted karyotypically :

java -jar picard.jar ReorderSam \
    I= lexicographic.bam \
    O= kayrotypic.bam \
    REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta

This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!

This tool is part of the Picard package.

No posts found with the requested search criteria.
No posts found with the requested search criteria.