Tagged with #resources
1 documentation article | 0 announcements | 5 forum discussions

Created 2012-07-26 13:55:25 | Updated 2015-07-22 17:28:17 | Tags: bundle resources

Comments (53)

1. Obtaining the bundle

Inside of the Broad, the latest bundle will always be available in:


with a subdirectory for each reference sequence and its associated data files.

External users can download these files (or corresponding .gz versions) from the GSA FTP Server in the directory bundle. Gzipped files should be unzipped before attempting to use them. Note that there is no "current link" on the FTP; users should download the highest numbered directory under current (this is the most recent data set).
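As a minimal sketch of the two steps described above (picking the highest numbered directory, then unzipping), assuming directory names are dotted version numbers such as 2.5 or 2.8: the helper names latest_version and gunzip are hypothetical and are not part of any GATK tooling, and fetching the directory listing from the FTP server is left out.

```python
import gzip
import re
import shutil
from pathlib import Path


def latest_version(names):
    """Return the name with the highest dotted version number, e.g. '2.8'."""
    # Assumption: bundle directories are named like '2.5' or '2.8'.
    versioned = [n for n in names if re.fullmatch(r"\d+(\.\d+)*", n)]
    if not versioned:
        raise ValueError("no version-like directory names found")
    # Compare numerically per component so that '2.10' sorts above '2.8'.
    return max(versioned, key=lambda n: tuple(int(p) for p in n.split(".")))


def gunzip(src):
    """Decompress 'file.gz' to 'file' in the same directory; return the new path."""
    src = Path(src)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    dst = src.with_suffix("")  # strips only the trailing '.gz'
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, latest_version(["2.3", "2.5", "2.8"]) returns "2.8", and gunzip("dbsnp.vcf.gz") (a hypothetical file name) writes dbsnp.vcf alongside the original.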

2. b37 Resources: the Standard Data Set

  • Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
  • dbSNP in VCF. This includes two files:
    • A recent dbSNP release (build 138)
    • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
  • HapMap genotypes and sites VCFs
  • OMNI 2.5 genotypes for 1000 Genomes samples, as well as a sites VCF
  • The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf
  • A large-scale standard single sample BAM file for testing:
    • NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam containing ~64x reads of NA12878 on chromosome 20
    • A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
  • The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website)

Additionally, these files all have supplementary indices, statistics, and other QC data available.
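The dbSNP subset mentioned in the list above (sites discovered in or before dbSNPBuildID 129) can be illustrated with a small filter. This is only a sketch and not how the bundle file was actually produced; it assumes an uncompressed VCF whose INFO column carries a dbSNPBuildID=<number> entry, and the function names are hypothetical.

```python
def build_id(info):
    """Extract dbSNPBuildID from a VCF INFO string, or None if absent."""
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "dbSNPBuildID" and value.isdigit():
            return int(value)
    return None


def subset_by_build(lines, max_build=129):
    """Yield header lines plus records whose dbSNPBuildID is <= max_build."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all header lines unchanged
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # skip malformed records with no INFO column
        build = build_id(cols[7])
        # Records without a build ID are dropped (an assumption of this sketch).
        if build is not None and build <= max_build:
            yield line
```

Because subset_by_build works on any iterable of lines, it can be applied directly to an open file handle and the result written out with writelines.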

3. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted-over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

4. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36-formatted reference sequence (human_b36_both.fasta) along with all lifted-over VCF files. The refGene track and BAM files are not available. For this genome build, we only provide data files that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.

Also includes a chain file to lift over to b37.

5. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.


Created 2016-03-17 15:03:39 | Updated | Tags: variantrecalibrator bundle resources

Comments (1)

Hello all,

We have been working on a project to call variants in multiple samples, going through the steps of HaplotypeCaller, CombineGVCFs and GenotypeGVCFs. So far we have been using the reference Fasta and dbsnp files (for build hg19) from the Broad resource bundle version 2.5.

However, at this point we need to download other resource files (e.g. hapmap, mills, 1000Gphase1) for the VariantRecalibrator step, but we can only do that from the current bundle, which is version 2.8.

We are wondering what would be the effect of using resource files from different bundle versions for the next step. For instance, by checking the md5 for the reference fasta file of bundle 2.5 and 2.8, we can see that they are not the same, and we are not sure about the implications of this.

Thank you in advance!
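The md5 comparison described in this post can be sketched as follows. The file paths in the usage note are hypothetical. A differing checksum only shows that the bytes differ; it does not by itself show whether the sequence content changed.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if the two files are byte-for-byte identical according to MD5."""
    return md5sum(path_a) == md5sum(path_b)
```

For example, same_content("bundle_2.5/ucsc.hg19.fasta", "bundle_2.8/ucsc.hg19.fasta") would report whether the two downloads match, assuming those paths exist locally.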


Created 2015-02-25 02:12:39 | Updated | Tags: vqsr baserecalibrator haplotypecaller knownsites resources variant-recalibration

Comments (2)

Hi, I have a general question about the importance of known VCFs (for BQSR and HC) and resource files (for VQSR). I am working on rice, for which the only known sites are the dbSNP VCF files, and these are built on an older genome version than the reference fasta file I am using. How does this affect the quality/accuracy of the variants? How important is it to have exactly the same genome build as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use a version built on a different genome version for the same species? In other words, which steps (BQSR, HC, VQSR, etc.) can be performed without the known sites/resource file? If the answers to the above questions are too detailed, can you please point me to any document, if available, that might address this issue?

Thanks, NB
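This does not answer how a build mismatch affects accuracy, but one sanity check for whether a known-sites VCF and a reference might be on the same build is to compare contig names and lengths. The sketch below assumes the VCF header contains ##contig lines with ID and length fields and that a .fai index exists for the reference; the function names are hypothetical. Matching names and lengths are necessary but not sufficient evidence that the builds agree.

```python
import re


def fai_contigs(lines):
    """Map contig name -> length from .fai lines (tab-separated: name, length, ...)."""
    contigs = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and cols[1].isdigit():
            contigs[cols[0]] = int(cols[1])
    return contigs


def vcf_contigs(lines):
    """Map contig ID -> length (or None) from '##contig=<...>' VCF header lines."""
    contigs = {}
    for line in lines:
        if not line.startswith("#"):
            break  # header is over once the first record appears
        if not line.startswith("##contig=<"):
            continue
        name = re.search(r"[<,]ID=([^,>]+)", line)
        length = re.search(r"[<,]length=(\d+)", line)
        if name:
            contigs[name.group(1)] = int(length.group(1)) if length else None
    return contigs


def contig_mismatches(reference, vcf):
    """List VCF contigs that are missing from, or differ in length from, the reference."""
    problems = []
    for name, length in vcf.items():
        if name not in reference:
            problems.append(f"{name}: not in reference")
        elif length is not None and length != reference[name]:
            problems.append(f"{name}: VCF length {length} != reference {reference[name]}")
    return problems
```

An empty list from contig_mismatches means no disagreement was detected at this level; it says nothing about VCFs whose headers lack ##contig lines.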

Created 2014-06-11 15:10:28 | Updated | Tags: indelrealigner realignertargetcreator indels resources

Comments (3)

Hello there,

Would you please let us know how "IndelRealigner" makes use of "known" resources? I assume it already has the intervals of interest for realignment from the "RealignerTargetCreator". So it's not clear why it needs the resources again. Would it use them for making any sort of decision to reject or accept realigned indels?

Also, please let us know what happens (algorithmically) if we don't provide the same resources used in "RealignerTargetCreator" to the program.

Thank you Amin Zia

Created 2013-03-12 14:01:12 | Updated | Tags: test license development resources

Comments (3)

I have two licence questions. First off:

There are two licence notes in BaseTest.java. I'm assuming that the top one is the one that should be viewed as currently in use. Is this correct?

Secondly, under which licence are the test resources made available? Am I free to use and redistribute these? I'm asking since I'm in the process of refactoring my pipeline project to use GATK as an external dependency, but some of the tests I've written use the files provided with the GATK as a resource, and I'd like to keep them and distribute them with the project as I've done before in my GATK fork.

Created 2012-11-08 13:13:47 | Updated | Tags: queue resources

Comments (5)

Is there anywhere I can find the integration test file with the md5sum "45d97df6d291695b92668e8a55c54cd0", which is used in the DataProcessingPipelineTest class? Since my tests fail because a different md5sum is calculated, I would like to know what the differences between the files are.