GATK resource bundle

A collection of standard files for working with human resequencing data with the GATK.

The standard reference sequence we use in the GATK is the b37 edition from the Genome Reference Consortium. All of the key GATK data files are available against this reference sequence. Additionally, we used to use UCSC-style chromosome names (chr1, not 1) with build hg18, and we provide files lifted over from b37 to hg18 for those still using that build.
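Because the two naming styles are not interchangeable, it can help to check which one a file uses before pairing it with a reference. The following is a minimal sketch, assuming an uncompressed VCF; `contig_style` is a hypothetical helper and not part of the GATK.

```shell
# Report the contig naming style of an uncompressed VCF by looking at the
# first non-header record. Prints "ucsc" for names such as chr1 and "plain"
# for names such as 1. Prints "empty" and returns 1 if there are no records.
contig_style() {
    local vcf="$1" first
    first=$(awk '!/^#/ { print $1; exit }' "$vcf")
    case "$first" in
        "")   echo "empty"; return 1 ;;
        chr*) echo "ucsc" ;;
        *)    echo "plain" ;;
    esac
}
```

Calling `contig_style some_file.vcf` on a b37 file should report `plain`, while an hg18 or hg19 file should report `ucsc`.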

Obtaining the bundle

Inside of the Broad, the latest bundle will always be available in:

/humgen/gsa-hpprojects/GATK/bundle/current

with a subdirectory for each reference sequence containing its associated data files.

External users can download these files (or the corresponding .gz versions) from the bundle directory on the GSA FTP server. Gzipped files should be unzipped before use. Note that there is no current link on the FTP server; users should download from the highest-numbered directory under bundle, which holds the most recent data set.
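Once a copy of the bundle directory has been downloaded, picking the newest release and unzipping its files can be sketched as follows. This assumes the release directories are named with plain version numbers (for example 1.2 or 1.5); `latest_bundle_dir` and `gunzip_bundle` are hypothetical helper names, and no FTP address is assumed here.

```shell
# Print the name of the highest-numbered subdirectory of "$1".
# Assumption: release directories are named with dotted numbers only.
latest_bundle_dir() {
    local root="$1" d
    for d in "$root"/*/; do
        [ -d "$d" ] && basename "$d"
    done | grep -E '^[0-9][0-9.]*$' | sort -t. -k1,1n -k2,2n | tail -n 1
}

# Decompress every .gz file under "$1" in place; other files are untouched.
gunzip_bundle() {
    local dir="$1"
    find "$dir" -type f -name '*.gz' -exec gunzip {} +
}
```

For example, `gunzip_bundle "bundle/$(latest_bundle_dir bundle)"` would unzip the most recent local copy. The numeric sort compares each dotted field as a number, so 1.10 sorts after 1.5.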

b37 resources: the standard data set

  • Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
  • dbSNP in VCF. This includes two files:
    • The most recent dbSNP release
    • A subset of that file restricted to sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluating the dbSNP rate and Ti/Tv values at novel sites.
  • HapMap genotypes and sites VCFs
  • OMNI 2.5 genotypes for 1000 Genomes samples, as well as a sites VCF
  • The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • A large-scale standard single sample BAM file for testing:
    • NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam containing ~64x reads of NA12878 on chromosome 20
    • The results of the latest UnifiedGenotyper with default arguments run on this data set (NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.vcf)
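The dbSNP subset listed above is provided ready-made in the bundle, but the kind of filter it describes can be sketched in a few lines. This is an illustration only, not the procedure used to build the bundle. It assumes an uncompressed VCF whose INFO column (column 8) carries a `dbSNPBuildID=<number>` entry; `subset_dbsnp_by_build` is a hypothetical helper, and records without that entry are dropped.

```shell
# Keep header lines plus records whose dbSNPBuildID is <= max_build
# (default 129). Records lacking a dbSNPBuildID entry are not printed.
subset_dbsnp_by_build() {
    local vcf="$1" max_build="${2:-129}"
    awk -F'\t' -v max="$max_build" '
        /^#/ { print; next }                 # pass header lines through
        {
            n = split($8, info, ";")         # INFO entries are ;-separated
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^dbSNPBuildID=/) {
                    split(info[i], kv, "=")
                    if (kv[2] + 0 <= max + 0) print
                    break
                }
            }
        }
    ' "$vcf"
}
```

Usage would look like `subset_dbsnp_by_build dbsnp.vcf 129 > dbsnp.upto129.vcf`, where both file names are placeholders.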

Additionally, these files all have supplementary indices, statistics, and other QC data available.
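After downloading, a quick check that the reference sequence has its fai and dict companions can save a confusing failure later. The sketch below assumes the usual naming convention, where the index for `ref.fasta` is `ref.fasta.fai` and the dictionary is `ref.dict`; `check_reference_companions` is a hypothetical helper.

```shell
# Report a missing or empty reference fasta, .fai index, or .dict file.
# Returns non-zero if at least one of the three is missing or empty.
check_reference_companions() {
    local fasta="$1" missing=0 f
    local fai="${fasta}.fai"            # assumed name: ref.fasta.fai
    local dict="${fasta%.*}.dict"       # assumed name: ref.dict
    for f in "$fasta" "$fai" "$dict"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

The function prints one line per problem file and is silent when everything is in place, so it can be used directly in an `if` test.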

hg18 resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted-over VCF files. The refGene track and BAM files are not available. For this genome build we only provide data files that can be lifted over easily from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

b36 resources: lifted over from b37

Includes the 1000 Genomes pilot b36-formatted reference sequence (human_b36_both.fasta) along with all lifted-over VCF files. The refGene track and BAM files are not available. For this genome build we only provide data files that can be lifted over easily from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

hg19 resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.




Creating the GATK resource bundle (GSA members only)

Perhaps there's a better place for these docs?

The resource bundle should be created for each major GATK release, but not necessarily for each minor bug-fix version of the release. The script operates in two phases. The first creates the local gsa-hpprojects bundle files. The second iterates over all of the files in that phase 1 directory, compressing each one and computing its MD5 checksum, and writes the results into the GSA FTP server root.
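To make the second phase concrete, the following sketch mirrors the idea of walking a tree and producing a compressed copy and a checksum for each file. It is not the Queue script and does not reproduce its behavior. The output naming (`.gz` and `.md5` suffixes), the choice to checksum the uncompressed file, and the helper name `publish_tree` are all assumptions, and it relies on the GNU `md5sum` tool being installed.

```shell
# For every file under src, write dest/<relative path>.gz and
# dest/<relative path>.md5 (checksum of the uncompressed file).
# Pass src without a trailing slash.
publish_tree() {
    local src="$1" dest="$2" f rel
    find "$src" -type f | while IFS= read -r f; do
        rel="${f#"$src"/}"                       # path relative to src
        mkdir -p "$dest/$(dirname "$rel")"
        gzip -c "$f" > "$dest/$rel.gz"           # source file is left as-is
        (cd "$(dirname "$f")" && md5sum "$(basename "$f")") > "$dest/$rel.md5"
    done
}
```

Recording the checksum against the bare file name lets a reader verify a file with `md5sum -c` from whichever directory holds the unzipped copy.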

  • cd to the root of the GATK release source tree
  • Run phase 1 of the Queue script:
java -Djava.io.tmpdir=/broad/shptmp/depristo/tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/GATKResourcesBundle.scala \
   -bsub -jobQueue gsa -ver $RELEASE_REVISION_NUMBER -run

where $RELEASE_REVISION_NUMBER is the revision control system (Git) version of the GATK release. This command will create all of the bundle files in the local path:

/humgen/gsa-hpprojects/GATK/bundle/$RELEASE_REVISION_NUMBER

Wait until all of the jobs finish, which takes several hours.

  • Update the bundle current symlink:
rm -f /humgen/gsa-hpprojects/GATK/bundle/current
ln -s /humgen/gsa-hpprojects/GATK/bundle/$RELEASE_REVISION_NUMBER /humgen/gsa-hpprojects/GATK/bundle/current
  • Clean up out-of-date bundles. There may be old bundle releases from long ago in these directories; you may decide to remove them to recover some space.
  • Run phase 2 of the bundle script to push the files out to our FTP server
java -Djava.io.tmpdir=/broad/shptmp/depristo/tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/GATKResourcesBundle.scala \
   -bsub -jobQueue gsa -ver $RELEASE_REVISION_NUMBER -run -phase2

to create:

/humgen/gsa-scr1/pub/bundle/$RELEASE_REVISION_NUMBER

There are no symlinks to update here. You are done!
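As a final sanity check, it may be worth confirming that the current symlink updated in the steps above really points at the new release directory. A minimal sketch, assuming `readlink` is available; `check_current_link` is a hypothetical helper and not part of the bundle script.

```shell
# Succeed only if <root>/current is a symlink whose target is <root>/<rev>.
check_current_link() {
    local root="$1" rev="$2"
    [ -L "$root/current" ] || { echo "current is not a symlink"; return 1; }
    [ "$(readlink "$root/current")" = "$root/$rev" ] || {
        echo "current points to $(readlink "$root/current")"
        return 1
    }
}
```

For the paths used on this page, the call would be `check_current_link /humgen/gsa-hpprojects/GATK/bundle "$RELEASE_REVISION_NUMBER"`.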
