# Tagged with #bundle 5 documentation articles | 0 announcements | 11 forum discussions

Created 2014-09-16 00:04:00 | Updated 2015-04-03 14:13:28 | Tags: tutorials bundle workshop

So you're going to a GATK workshop, and you've been selected to participate in a hands-on session? Fantastic! We're looking forward to walking you through some exercises that will help you master the tools. However -- in order to make the most of the time we have together, we'd like to ask you to come prepared. Specifically, please complete the following steps:

#### Download and install all necessary software as described in this tutorial.

We don't always get around to using RStudio, but all the others are required. Note that if you are a Mac user, you may need to install Apple's Xcode Tools, which are free but fairly large, so plan ahead because it can take a loooong time to download them if your connection is anything less than super-fast.

#### Download one of the following tutorial bundles from our FTP server:

• The basic tutorial data bundle
Speaking of long downloads, this one is also pretty big (740M), so again, don't leave it until the last minute. This mini-bundle contains chromosome 20 of the human genome reference, a BAM file snippet, and the accompanying dbSNP and known-indels files.

• OR the advanced tutorial data bundle.
If you are attending an advanced hands-on session, you'll need some extra files that aren't in the basic tutorial bundle. This add-on bundle is also quite large (870M) because it contains the complete human genome reference and a complete whole-genome callset. Note that it will take around 4G of space on your hard drive once uncompressed, so make sure you have plenty of room available on your machine.

At the start of the session, we'll give you handouts with a walkthrough of the session so you can follow along and take notes (highly recommended!).

With that, you should be all set. See you soon!

Created 2012-08-09 14:28:32 | Updated 2013-03-25 21:51:48 | Tags: official bundle analyst developer paper intermediate

## New WGS and WEx CEU trio BAM files

We have sequenced at the Broad Institute and released to the 1000 Genomes Project the following datasets for the three members of the CEU trio (NA12878, NA12891 and NA12892):

• WEx (150x) sequence
• WGS (>60x) sequence

This is better data to work with than the original DePristo et al. BAM files, so we recommend you download and analyze these files if you are looking for complete, large-scale data sets with which to evaluate the GATK or other tools.

Here are the rough library properties of the BAMs:

## NA12878 Datasets from DePristo et al. (2011) Nature Genetics

Here are the datasets we used in the GATK paper cited below.

DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D and Daly M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498.

Some of the BAM and VCF files are currently hosted by the NCBI: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/

• NA12878.hiseq.wgs.bwa.recal.bam -- BAM file for NA12878 HiSeq whole genome
• NA12878.hiseq.wgs.bwa.raw.bam -- raw reads (in BAM format, see below)
• NA12878.ga2.exome.maq.recal.bam -- BAM file for NA12878 GenomeAnalyzer II whole exome (hg18)
• NA12878.ga2.exome.maq.raw.bam -- raw reads (in BAM format, see below)
• NA12878.hiseq.wgs.vcf.gz -- SNP calls for NA12878 HiSeq whole genome (hg18)
• NA12878.ga2.exome.vcf.gz -- SNP calls for NA12878 GenomeAnalyzer II whole exome (hg18)
• BAM files for CEU + NA12878 whole genome (b36). These are the standard BAM files for the 1000 Genomes pilot CEU samples plus a 4x downsampled version of NA12878 from the pilot 2 data set, available in the DePristoNatGenet2011 directory of the GSA FTP Server
• SNP calls for CEU + NA12878 whole genome (b36) are available in the DePristoNatGenet2011 directory of the GSA FTP Server
• Crossbow comparison SNP calls are available in the DePristoNatGenet2011 directory of the GSA FTP Server as crossbow.filtered.vcf. The raw calls can be viewed by ignoring the FILTER field status
• whole_exome_agilent_designed_120.Homo_sapiens_assembly18.targets.interval_list -- targets used in the analysis of the exome capture data

Please note that we did not collect the indel calls for the paper, as these are only used for filtering SNPs near indels. If you want to call accurate indels, please use the new GATK indel caller in the UnifiedGenotyper.
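As a sketch of what that looks like on the command line (GATK-era syntax; the reference, BAM, and output names here are placeholders, not files from the bundle):

```shell
# Hedged sketch: calling indels with the UnifiedGenotyper.
# -glm INDEL selects the indel genotype likelihoods model.
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R human_g1k_v37.fasta \
    -I sample.bam \
    -glm INDEL \
    -o indel_calls.vcf
```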

### Warnings

Both the GATK and the sequencing technologies have improved significantly since the analyses performed in this paper.

• If you are conducting a review today, we recommend using the newest version of the GATK, which performs much better than the version described in the paper. We would also recommend using the newest version of Crossbow, in case it has improved as well. The GATK calls for NA12878 from the paper (above) will give you a good idea of what a good call set looks like, whole-genome or whole-exome.

• The data sets used in the paper are no longer state-of-the-art. The WEx BAM is GAII data aligned with MAQ on hg18, whereas a state-of-the-art data set would use HiSeq and BWA on hg19. Even the 64x HiSeq WG data set is already more than a year old. For a better assessment, we recommend using a newer data set for these samples if you have the capacity to generate one. This applies less to the WG NA12878 data, which is pretty good, but the NA12878 WEx from the paper is nearly two years old now and notably worse than our most recent data sets.

Obviously, this was an annoyance for us as well, as it would have been nice to use a state-of-the-art data set for the WEx. But we decided to freeze the data used for analysis to actually finish this paper.

### How do I get the raw FASTQ file from a BAM?

If you want the raw machine output for the data analyzed in the GATK framework paper, obtain the raw BAM files above and convert them to FASTQ using the Picard tool SamToFastq.
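A minimal sketch of that conversion, assuming a recent combined picard.jar (older Picard releases shipped a separate SamToFastq.jar) and paired-end data; the output file names are placeholders:

```shell
# Convert a raw BAM back to FASTQ with Picard's SamToFastq.
# For paired-end data, write each mate to its own file.
java -jar picard.jar SamToFastq \
    INPUT=NA12878.hiseq.wgs.bwa.raw.bam \
    FASTQ=NA12878_1.fastq \
    SECOND_END_FASTQ=NA12878_2.fastq
```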

Created 2012-08-02 14:05:29 | Updated 2014-12-17 17:05:58 | Tags: variantrecalibrator bundle vqsr applyrecalibration faq

This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes, to comply with the GATK Best Practices. The recommendations detailed in this document take precedence over any others you may see elsewhere in our documentation (e.g. in Tutorial articles, which are only meant to illustrate usage, or in past presentations, which may be out of date).

The document covers:

• Explanation of resource datasets
• Important notes about exome experiments
• Argument recommendations for VariantRecalibrator
• Argument recommendations for ApplyRecalibration

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).

### Explanation of resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and keep only the SNPs with the highest quality scores. These most-confident sites are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use the sites that are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then adjust to find the most appropriate value empirically. There are many possible avenues of research here; hopefully the model reporting plots generated by the recalibration tools will help facilitate this experimentation.
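The first idea above can be sketched in one line. The QUAL cutoff of 100 and the file names are arbitrary placeholders that you would tune for your own data:

```shell
# Keep the VCF header lines plus only the records whose QUAL (column 6) is
# high, as a rough bootstrap truth/training set for a non-model organism.
awk '/^#/ || $6 >= 100' initial_calls.vcf > bootstrap_truth.vcf
```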

#### Resources for SNPs

• True sites training resource: HapMap
This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

• True sites training resource: Omni
This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

• Non-true sites training resource: 1000G
This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

• Known sites resource, not used in training: dbSNP
This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these sites to stratify output metrics such as the Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

#### Resources for Indels

• Known and true sites training resource: Mills
This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
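The percentages quoted above are just the Phred-scaled priors converted to confidence values, i.e. 1 - 10^(-Q/10). A quick way to check the arithmetic:

```shell
# Convert each Phred-scaled prior Q used above into a confidence percentage.
for q in 15 12 10 2; do
    awk -v q="$q" 'BEGIN { printf "Q%d -> %.2f%%\n", q, 100 * (1 - 10 ^ (-q / 10)) }'
done
# Q15 -> 96.84%, Q12 -> 93.69%, Q10 -> 90.00%, Q2 -> 36.90%
```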

Some of the annotations included in the recommendations given below might not be the best for your particular dataset. In particular, the following caveats apply:

• Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

• You may have seen HaplotypeScore mentioned in older documents. That is a statistic produced by the UnifiedGenotyper and should only be used if you called your variants with UG. The HaplotypeCaller does not produce this statistic because the equivalent math is already built into the likelihood function used when calling full haplotypes.

• The InbreedingCoeff is a population-level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or that include many closely related samples (such as a family), please omit this annotation from the command line.

### Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results, one needs to use an exome SNP and/or indel callset with at least 30 samples. For users whose experiments contain fewer exome samples, there are several options to explore:

• Add additional samples for variant calling, either by sequencing additional samples or by using publicly available exome BAMs from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and then perform joint genotyping on those GVCFs along with your own samples' GVCFs using GenotypeGVCFs.

• You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no publicly available genomes or exomes that you can use to supplement your own cohort.
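The joint-genotyping route in the first option can be sketched as follows, assuming GATK 3.x syntax; the sample and file names are placeholders:

```shell
# Step 1 (per sample, including any public 1000 Genomes BAMs you add):
# produce a GVCF with HaplotypeCaller in reference-model mode.
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R human_g1k_v37.fasta \
    -I my_sample.bam \
    --emitRefConfidence GVCF \
    -o my_sample.g.vcf

# Step 2: joint genotyping across all GVCFs, yours plus the supplementary ones.
java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R human_g1k_v37.fasta \
    --variant my_sample.g.vcf \
    --variant 1000G_supplementary_sample.g.vcf \
    -o joint_calls.vcf
```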

### Argument recommendations for VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement over previous recommended protocols is that hand filters no longer need to be applied at any point in the process. All filtering criteria are learned from the data itself.

#### Common, base command line

This is the first part of the VariantRecalibrator command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R path/to/reference/human_g1k_v37.fasta \
-input raw.input.vcf \
-recalFile path/to/output.recal \
-tranchesFile path/to/output.tranches \
-nt 4 \
[SPECIFY TRUTH AND TRAINING SETS] \
[SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
[SPECIFY WHICH CLASS OF VARIATION TO MODEL] \


#### SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
-mode SNP \


Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see the caveats above for details of why).

Note also that, for the above to work, the input VCF needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are missing, VariantAnnotator needs to be run first so that VariantRecalibrator can run properly.

Also, using the provided sites-only truth data files is important here, as parsing the genotypes in VCF files with many samples increases the runtime of the tool significantly.
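If you only have a full cohort VCF for a custom resource, a sites-only version can be derived by dropping the FORMAT and genotype columns; a minimal sketch (the file names are placeholders):

```shell
# Keep only the 8 fixed VCF columns (CHROM..INFO). Header lines without tabs
# pass through cut unchanged, and the #CHROM line is trimmed to 8 columns.
cut -f1-8 full_cohort.vcf > full_cohort.sites.vcf
```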

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

#### Indel specific recommendations

When modeling indels with the VQSR, we use a training dataset created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset, as well as adding in very high-confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

--maxGaussians 4 \
-resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
-an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
-mode INDEL \


Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed, since there is a conflation between the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of error in the same way that it is for SNPs.


### Argument recommendations for ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

#### Common, base command line

This is the first part of the ApplyRecalibration command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.


java -Xmx3g -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R reference/human_g1k_v37.fasta \
-input raw.input.vcf \
-tranchesFile path/to/input.tranches \
-recalFile path/to/input.recal \
-o path/to/output.recalibrated.filtered.vcf \
[SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
[SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \


#### SNP specific recommendations

For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

--ts_filter_level 99.5 \
-mode SNP \


#### Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

--ts_filter_level 99.0 \
-mode INDEL \


Created 2012-07-26 15:24:14 | Updated 2014-09-02 17:24:53 | Tags: official bundle basic analyst ftp developer

We make various files available for public download from the GSA FTP server, such as the GATK resource bundle and presentation slides. We also maintain a public upload feature for processing bug reports from users.

location: ftp.broadinstitute.org


### Using a browser as FTP client
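As a sketch, assuming the anonymous read-only login this server has historically used (gsapubftp-anonymous with an empty password -- an assumption here; check the current documentation if it has changed), you can point either a browser or a command-line client at the bundle directory:

```shell
# Browser: an FTP URL with the (assumed) anonymous username embedded:
#   ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
# Command-line equivalent with lftp (empty password after the comma):
lftp -u gsapubftp-anonymous, ftp.broadinstitute.org -e 'cd bundle; ls; quit'
```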

Created 2012-07-26 13:55:25 | Updated 2015-07-22 17:28:17 | Tags: bundle resources

### 1. Obtaining the bundle

Inside the Broad, the latest bundle will always be available in:

/humgen/gsa-hpprojects/GATK/bundle/current


with a subdirectory for each reference sequence and its associated data files.

External users can download these files (or the corresponding .gz versions) from the GSA FTP Server, in the bundle directory. Gzipped files should be unzipped before attempting to use them. Note that there is no "current" link on the FTP; users should download the highest-numbered directory, which is the most recent data set.
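Decompression itself is a one-liner; a sketch assuming you downloaded everything into the current directory:

```shell
# Unzip every gzipped bundle file in place; gunzip replaces file.gz with file.
for f in *.gz; do
    [ -e "$f" ] && gunzip "$f"
done
```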

### 2. b37 Resources: the Standard Data Set

• Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
• dbSNP in VCF format. This includes two files:
    • A recent dbSNP release (build 138)
    • This file subsetted to only sites discovered in or before dbSNP build 129, which excludes the impact of the 1000 Genomes Project and is useful for evaluating the dbSNP rate and Ti/Tv values at novel sites
• HapMap genotypes and sites VCFs
• OMNI 2.5 genotypes for 1000 Genomes samples, as well as a sites VCF
• The current best set of known indels to be used for local realignment (note that we no longer use dbSNP for this); use both files:
    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf
• A large-scale standard single-sample BAM file for testing:
    • NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam, containing ~64x reads of NA12878 on chromosome 20
• A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. It will be updated in the near future.
• The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list appropriate for your data, which typically depends on the prep kit used, and should be available from the kit manufacturer's website)

Additionally, these files all have supplementary indices, statistics, and other QC data available.

### 3. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.

### 4. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.

### 5. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted-over VCF files.


Created 2014-08-09 05:44:49 | Updated 2014-08-09 06:04:21 | Tags: bundle resource-bundle ucsc-genome

Hi, it is becoming more and more difficult for those of us analyzing data in a transition period. After the hg38 public release, all the databases were updated to the recent assembly. Now people like us who are using the GATK Best Practices are facing a real problem matching our data with the databases. My basic question is: when are the supportive files for hg38 going to be available on your FTP site as a bundle?

I may have missed it, but can you please give me information about which release version of GRCh37 (p13 or something else) is available in the 2.5 and 2.8 bundles. Thanks in advance.

Created 2014-07-21 07:39:50 | Updated | Tags: bundle baserecalibrator dbsnp

Hi, I have a question. Is there any specific reason why the GATK bundle only has dbSNP 138, since dbSNP 141 has been released for a while? Are you planning to release a new bundle for this? Do you recommend using the latest version, dbSNP 141, for base recalibration? Thanks.

Created 2013-07-07 12:46:50 | Updated 2013-07-07 12:51:16 | Tags: bundle vcf indels

Hi all,

I have been looking for documentation for the INFO column in the VCF of the Mills indels in the GATK resource bundle (Mills_and_1000G_gold_standard.indels.b37.sites.vcf.gz), but to no avail. I made a unique list of all the possible entries for the INFO field, but I have no idea what they mean. Does anybody know? Thank you!!

set=Intersect1000GAll

set=Intersect1000GAll-MillsDoubleCenter

set=Intersect1000GAll-MillsDoubleCenter-MillsTracesUnknown

set=Intersect1000GAll-MillsTracesUnknown

set=Intersect1000GMinusBI

set=Intersect1000GMinusDI

set=Intersect1000GMinusDI-MillsDoubleCenter

set=Intersect1000GMinusOX

set=Intersect1000GMinusOX-MillsCenterUnknown-MillsTracesUnknown

set=Intersect1000GMinusOX-MillsDoubleCenter

set=Intersect1000GMinusOX-MillsDoubleCenter-MillsTracesUnknown

set=Intersect1000GMinusOX-MillsTracesUnknown

set=Intersect1000GMinusSI

set=Intersect1000GMinusSI-MillsDoubleCenter

set=Intersect1000GMinusSI-MillsDoubleCenter-MillsTracesUnknown

set=Intersect1000GMinusSI-MillsTracesUnknown

set=MillsAlleleMatch1000G

set=MillsAlleleMatch1000G-MillsDoubleCenter

set=MillsAlleleMatch1000G-MillsDoubleCenter-MillsTracesUnknown

set=MillsAlleleMatch1000G-MillsTracesUnknown

set=MillsCenterUnknown

set=MillsCenterUnknown-Intersect1000GAll-MillsTracesUnknown

set=MillsCenterUnknown-Intersect1000GMinusBI-MillsTracesUnknown

set=MillsCenterUnknown-Intersect1000GMinusDI-MillsTracesUnknown

set=MillsCenterUnknown-Intersect1000GMinusSI

set=MillsCenterUnknown-Intersect1000GMinusSI-MillsTracesUnknown

set=MillsCenterUnknown-MillsAlleleMatch1000G

set=MillsCenterUnknown-MillsAlleleMatch1000G-MillsTracesUnknown

set=MillsCenterUnknown-MillsTraces3Plus

set=MillsCenterUnknown-MillsTraces3Plus-Intersect1000GMinusSI

set=MillsCenterUnknown-MillsTraces3Plus-MillsAlleleMatch1000G

set=MillsCenterUnknown-MillsTracesUnknown

set=MillsDoubleCenter

set=MillsDoubleCenter-MillsTracesUnknown

set=MillsTraces3Plus

set=MillsTraces3Plus-Intersect1000GMinusSI

set=MillsTraces3Plus-MillsAlleleMatch1000G

set=MillsTracesUnknown

Created 2013-07-02 14:32:42 | Updated | Tags: bundle

I downloaded the b37 bundle and used the dbSNP137.b37 database for variant calling with the UnifiedGenotyper. I obtained various variants with their own "rs" IDs, but I cannot find the frequency (MAF) usually reported in the database. I know that from UCSC it is possible to download a db with common SNPs (MAF > 5%) or all SNPs (MAF > 1%). Do you know which db is in the bundle, or where I can find this information?

Thanks

Created 2013-02-19 17:59:36 | Updated 2013-02-19 19:19:52 | Tags: variantrecalibrator bundle

Hello, I am having a hard time finding the resource VCF files needed for VariantRecalibrator.

hapmap_3.3.b37.sites.vcf
1000G_omni2.5.b37.sites.vcf
dbsnp_135.b37.vcf

I don't seem to find them in the GATK bundle, as suggested.

Any help appreciated. Thanks, G.

Created 2013-02-05 13:56:55 | Updated | Tags: bundle

I am trying to download the latest bundle following the directions at http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server but I keep getting a "421 Service not available, remote server timed out. Connection closed." error. I also tried FileZilla, but there I get a "Could not connect to server" error.

Is there another alternative for getting the resource bundle?

Created 2012-11-16 14:05:36 | Updated | Tags: bundle na12878

Dear all, I am really a newbie in NGS, so I apologize if I say something completely wrong. In order to start to manage this kind of data, I am trying to follow the workflow described in http://www.broadinstitute.org/gatk/guide/topic?name=intro. It's suggested to download the "raw and realigned, recalibrated NA12878 test data from the GATK resource bundle". When I connect to the FTP through FileZilla I cannot find the BAM file of NA12878 (/bundle/hg19), but only the VCF files. Is that correct? Should I download the VCF files?

Sorry again and thank you very much for your availability.

Francesco

Created 2012-11-14 13:21:40 | Updated | Tags: bundle tabix

Hi, I'm getting an error "An index is required, but none found" for files from the GATK resource bundle. In this case hapmap_3.3.b37.vcf.gz from your bundle has hapmap_3.3.b37.vcf.idx.gz in the same directory. I suspect this is because my input VCF file is tabix indexed, and something in GATK has decided that everything else must be too. Is there any way to override this? I'm guessing it might be something to do with Tribble? But I can't seem to find the relevant bit of documentation that explains this. Thanks, Martin

Created 2012-11-13 20:12:34 | Updated | Tags: bundle

Hi all, I have questions on the resource bundles. Are the "hg19" bundle files just a UCSC-style liftover of the "b37" bundles? If so, why are there some variants in only one version and not the other? For example, the variant rs34872315 (on chr1) is in the b37 version of dbsnp137.excluding_sites_after_129.vcf, but not in the hg19 version. At first, I thought it was because of differences in the reference genome (the VCF files in the bundle match the accompanying reference sequences). But reference chromosome 1 was the same in both bundles. Can you help me understand the difference between the b37 and hg19 resource bundles?

Created 2012-11-09 17:09:54 | Updated 2013-01-07 20:07:55 | Tags: bundle

I have BAM files generated by Illumina's CASAVA pipeline, and the reference genome version in the accompanying document is "NCBI37_UCSC". The chromosomes are all named after the UCSC hg19 convention, but the "NCBI37" part in the name confuses me. I'm not sure which bundle files I should use with these BAM files, b37 or hg19.

Created 2012-10-18 23:10:15 | Updated 2012-10-19 17:56:13 | Tags: bundle liftovervariants hapmap

Hi,

We are sequencing some of the HapMap samples (NA19240 for instance) and we compared the calls we get for the HapMap samples with the calls that are reported for these samples in the file "hapmap_3.3.b37.vcf" that was part of the GATK bundle 1.5.

We are surprised to find a lot of discordant calls, but when we verified the calls, we could see no evidence in the HapMap sequence data for many of them.

We see the discordance at many sites and the vast majority of those show multiple alleles for the positions (like this one:

13 32914977 rs11571660 A C,T . PASS AC=1,2785;AF=0.00036,0.99964;AN=2786;set=Intersection GT 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 ... )

We call many of these sites homozygous REF...

##reference=Homo_sapiens_assembly18.fasta


But the file seems to suggest it was done against b37, which implies hg19... Was this file created with liftover?

Also... could it be possible that hg19 was updated to reflect the most common allele in the population (as you can see from the MAF of 0.00036 in this example) when it went from hg18 to hg19, but that the VCF file for the HapMap 3.3 samples was not updated to reflect this?

So the bottom line is that the HapMap 3.3 file cannot be relied on for the actual calls of the HapMap samples, since about 10% of the sites show variant calls while the hg19 reference shows no variance...

I know that HapMap is used for a different purpose in GATK (variant quality score recalibration), but you may want to warn the public that the file cannot be used as a source of variation for the HapMap samples it reports on...

Thanks

Thon