In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.
Since version 2.0, the UnifiedGenotyper has been able to deal with ploidies other than two. Three use cases are currently supported:
In order to enable this feature, you need to set the
-ploidy argument to desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e.
(Ploidy per individual) * (Individuals in pool).
Note that all other UnifiedGenotyper arguments work in the same way.
A full minimal command line would look for example like
java -jar GenomeAnalysisTK.jar \ -R reference.fasta \ -I myReads.bam \ -T UnifiedGenotyper \ -ploidy 4
glm argument works in the same way as in the diploid case - set to
[INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.
Many of these limitations will be gradually removed over time, but for now please keep these in mind.
Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.
Some annotations do not work in non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following:
AF, and Genotype annotations such as
The HaplotypeCaller and ReduceReads currently do not support non-diploid data.
In theory you can use VQSR to filter non-diploid calls, but we currently have no experience with this and therefore cannot offer any support nor best practices on how to do this.
For indels, only a maximum of 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work and an arbitrary set of 4 alleles will be chosen at a site.
You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
Short answer: NO.
Medium answer: no, at least not if you want to run a low-risk pipeline.
Long answer: see below for details.
There are several reasons why you might want to do this: you're using the latest version of GATK and one of the tools has a show-stopping bug, so you'd like to use an older, pre-bug version of that tool, but still use the latest version of all the other tools; or maybe you've been using an older version of GATK and you'd like to use a new tool, but keep using the rest in the version that you've been using to process hundreds of samples already.
The problem: compatibility is not guaranteed
In many cases, when we modify one tool in the GATK, we need to make adjustments to other tools that interact either directly or indirectly with the data consumed or produced by the upgraded tool. If you mix and match tools from different versions of GATK, you risk running into compatibility issues. For example, HaplotypeCaller expects a BAM compressed by Reduce Reads to have its data annotated in a certain way. If the information is formatted differently than what the HC expects (because that's how the corresponding RR from the same version does it), it can blow up -- or worse, do the wrong thing but not tell you there's a problem.
But what if the tools/tasks are in unrelated workflows?
Would it really be so bad to use CountReads from GATK version 2.7 for a quick QC check that's not actually part of my pipeline, which uses version 2.5? Well, maaaaybe not, but we still think it's a source of error, and we do our damnedest to eliminate those.
You shouldn't use tools from different versions within the same workflow, that's for sure. We don't think it's worth the risks. If there's a show-stopping bug, let us know and we promise to fix it as soon as (humanly) possible. For the rest, either accept that you're stuck with the version you started your study with (we may be able to help with workarounds for known issues), or upgrade your entire workflow and start your analysis from scratch. Depending on how far along you are one of those options will be less painful to you; go with that.
The plea bargain, and a warning
If despite our dire warnings you're still going to mix and match tool versions, fine, we can't stop you. But be really careful, and check every version release notes document ever. And keep in mind that when things go wrong, we will deny you support if we think you've been reckless.
The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.
The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.
All BAM files must satisfy the following requirements:
See the BAM specification for more information.
It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:
Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...
UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...
The easiest way to do it is to download Samtools and run the following command to examine the header of your file:
$ samtools view -H /path/to/my.bam @HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:247249719 @SQ SN:2 LN:242951149 @SQ SN:3 LN:199501827 @SQ SN:4 LN:191273063 @SQ SN:5 LN:180857866 @SQ SN:6 LN:170899992 @SQ SN:7 LN:158821424 @SQ SN:8 LN:146274826 @SQ SN:9 LN:140273252 @SQ SN:10 LN:135374737 @SQ SN:11 LN:134452384 @SQ SN:12 LN:132349534 @SQ SN:13 LN:114142980 @SQ SN:14 LN:106368585 @SQ SN:15 LN:100338915 @SQ SN:16 LN:88827254 @SQ SN:17 LN:78774742 @SQ SN:18 LN:76117153 @SQ SN:19 LN:63811651 @SQ SN:20 LN:62435964 @SQ SN:21 LN:46944323 @SQ SN:22 LN:49691432 @SQ SN:X LN:154913754 @SQ SN:Y LN:57772954 @SQ SN:MT LN:16571 @SQ SN:NT_113887 LN:3994 ...
If the order of the contigs here matches the contig ordering specified above, and the
SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.
Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.
A quick Unix command using Samtools will do the trick:
$ samtools view -H /path/to/my.bam | grep '^@RG' @RG ID:0 PL:solid PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:1 PL:solid PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:2 PL:LS454 PU:R_2008_10_02_06_06_12_FLX01080312_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:3 PL:LS454 PU:R_2008_10_02_06_07_08_rig19_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:4 PL:LS454 PU:R_2008_10_02_17_50_32_FLX03080339_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC ...
The presence of the
@RG tags indicate the presence of read groups. Each read group has a
SM tag, indicating the sample from which the reads belonging to that read group originate.
In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,
$ samtools view /path/to/my.bam | grep '^@RG' EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85 EAS139_44:8:59:118:13881 35 1 1 0 51M = 2 52 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_46:3:75:1326:2391 35 1 1 0 51M = 12 62 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 ...
membership in a read group is specified by the
RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).
Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.
For technical details, see the SAM specification on the Samtools website.
|Tag||Importance||SAM spec definition||Meaning|
||Required||Read group identifier. Each
||Ideally, this should be a globally unique identify across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the
||Sample. Use pool name where a pool is being sequenced.||Required. As important as
||The name of the sample sequenced in this read group. GATK tools treat all read groups with the same
||Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.||Important. Not currently used in the GATK, but was in the past, and may return. The only way to known the sequencing technology used to generate the sequencing data .||It's a good idea to use this field.|
||DNA preparation library identify||Essential for MarkDuplicates||MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.|
We do not require value for the
A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following
@RG fields in the header:
Dad's data: @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 Mom's data: @RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 @RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 Kid's data: @RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400 @RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).
Use Picard's AddOrReplaceReadGroups tool to add read group information.
You can use the GATK to do the following:
GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam
and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.
We support the Variant Call Format (VCF) for variant callsets. No other file formats are supported.
No, we like VCF and we think it's important to have a good standard format. Multiplying formats just makes life hard for everyone, both developers and analysts.
We support three types of interval lists, as mentioned here. Interval lists should preferentially be formatted as Picard-style interval lists, with an explicit sequence dictionary, as this prevents accidental misuse (e.g. hg18 intervals on an hg19 file). Note that this file is 1-based, not 0-based (first position in the genome is position 1).
One relatively easy way to combine your intervals is to use the online tool Galaxy, using the
Get Data -> Upload command to upload your intervals, and the
Operate on Genomic Intervals command to compute the intersection or union of your intervals (depending on your needs).
We make various files available for public download from the GSA FTP server, such as the GATK resource bundle and presentation slides. We also maintain a public upload feature for processing bug reports from users.
There are two logins to choose from depending on whether you want to upload or download something:
location: ftp.broadinstitute.org username: gsapubftp-anonymous password: <blank>
location: ftp.broadinstitute.org username: gsapubftp password: 5WvQWSfi
If you are sure that you cannot use VQSR / recalibrate variants (typically because your dataset is too small, or because there are no truth/training resources available for your organism), then you will need to use the VariantFiltration tool to manually filter your variants. To do this, you will need to compose filter expressions as explained here, here and here based on the recommendations detailed further below.
Let's be painfully clear about this: there is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values is very highly dependent on the specific callset, how it was called, what the data was like, etc.
HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.
In case you didn't catch that bit in bold there, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis. You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.
In addition, please note that these recommendations are mainly designed for dealing with very small data sets (in terms of both number of samples or size of targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking on the filtering parameters.
So, here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you:
QD < 2.0
MQ < 40.0
FS > 60.0
HaplotypeScore > 13.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
QD < 2.0
ReadPosRankSum < -20.0
InbreedingCoeff < -0.8
FS > 200.0
The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
For shallow-coverage (<10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.
The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.
HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.
And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?
Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)
We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.
Most GATK tools apply several read filters by default. You can look up exactly what are the defaults for each tool in their respective Technical Documentation pages.
But sometimes you want to specify additional filters yourself (and before you ask, no, you cannot disable the default read filters used by a given tool). This is how you do it:
--read-filter argument (or
-rf for short) allows you to apply whatever read filters you'd like. For example, to add the
MaxReadLengthFilter filter above to
PrintReads, you just add this to your command line:
The read filter will be applied with its default value (which you can also look up in the Tech Docs for that filter). Now, if you want to specify a different value from the default, you pass the relevant argument by adding this right after the read filter:
--read_filter MaxReadLength -maxReadLength 76
It's important that you pass the argument right after the filter itself, otherwise the command line parser won't know that they're supposed to go together.
And of course, you can add as many filters as you like by using multiple copies of the
--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead
This article describes the steps necessary to prepare your reference file (if it's not one that you got from us). As a complement to this article, see the relevant tutorial.
The GATK uses two files to access and safety check access to the reference files: a
.dict dictionary of the contig names and sizes and a
.fai fasta index file to allow efficient random access to the reference bases. You have to generate these files in order to be able to use a Fasta file as reference.
NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.
We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.
> java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict [Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict [Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done. Runtime.totalMemory()=2112487424 44.922u 2.308s 0:47.09 100.2% 0+0k 0+0io 2pf+0w
This produces a SAM-style header file describing the contents of our fasta file.
> cat Homo_sapiens_assembly18.dict @HD VN:1.0 SO:unsorted @SQ SN:chrM LN:16571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb @SQ SN:chr1 LN:247249719 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9ebc6df9496613f373e73396d5b3b6b6 @SQ SN:chr2 LN:242951149 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:b12c7373e3882120332983be99aeb18d @SQ SN:chr3 LN:199501827 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0e48ed7f305877f66e6fd4addbae2b9a @SQ SN:chr4 LN:191273063 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cf37020337904229dca8401907b626c2 @SQ SN:chr5 LN:180857866 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:031c851664e31b2c17337fd6f9004858 @SQ SN:chr6 LN:170899992 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfe8005c536131276d448ead33f1b583 @SQ SN:chr7 LN:158821424 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:74239c5ceee3b28f0038123d958114cb @SQ SN:chr8 LN:146274826 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1eb00fe1ce26ce6701d2cd75c35b5ccb @SQ SN:chr9 LN:140273252 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:ea244473e525dde0393d353ef94f974b @SQ SN:chr10 LN:135374737 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4ca41bf2d7d33578d2cd7ee9411e1533 @SQ SN:chr11 LN:134452384 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:425ba5eb6c95b60bafbf2874493a56c3 @SQ SN:chr12 LN:132349534 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d17d70060c56b4578fa570117bf19716 @SQ SN:chr13 LN:114142980 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c4f3084a20380a373bbbdb9ae30da587 @SQ SN:chr14 LN:106368585 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c1ff5d44683831e9c7c1db23f93fbb45 @SQ SN:chr15 LN:100338915 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:5cd9622c459fe0a276b27f6ac06116d8 @SQ SN:chr16 LN:88827254 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3e81884229e8dc6b7f258169ec8da246 @SQ SN:chr17 LN:78774742 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2a5c95ed99c5298bb107f313c7044588 @SQ SN:chr18 LN:76117153 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3d11df432bcdc1407835d5ef2ce62634 @SQ SN:chr19 LN:63811651 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2f1a59077cfad51df907ac25723bff28 @SQ SN:chr20 LN:62435964 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f126cdf8a6e0c7f379d618ff66beb2da @SQ SN:chr21 LN:46944323 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f1b74b7f9f4cdbaeb6832ee86cb426c6 @SQ SN:chr22 LN:49691432 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2041e6a0c914b48dd537922cca63acb8 @SQ SN:chrX LN:154913754 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d7e626c80ad172a4d7c95aadb94d9040 @SQ SN:chrY LN:57772954 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:62f69d0e82a12af74bad85e2e4a8bd91 @SQ SN:chr1_random LN:1663265 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cc05cb1554258add2eb62e88c0746394 @SQ SN:chr2_random LN:185571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:18ceab9e4667a25c8a1f67869a4356ea @SQ SN:chr3_random LN:749256 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cc571e918ac18afa0b2053262cadab6 @SQ SN:chr4_random LN:842648 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cab2949ccf26ee0f69a875412c93740 @SQ SN:chr5_random LN:143687 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:05926bdbff978d4a0906862eb3f773d0 @SQ SN:chr6_random LN:1875562 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d62eb2919ba7b9c1d382c011c5218094 @SQ SN:chr7_random LN:549659 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:28ebfb89c858edbc4d71ff3f83d52231 @SQ SN:chr8_random LN:943810 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0ed5b088d843d6f6e6b181465b9e82ed @SQ SN:chr9_random LN:1146434 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf @SQ SN:chr10_random LN:113275 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:50be2d2c6720dabeff497ffb53189daa @SQ SN:chr11_random LN:215294 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfc93adc30c621d5c83eee3f0d841624 @SQ SN:chr13_random LN:186858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:563531689f3dbd691331fd6c5730a88b @SQ SN:chr15_random LN:784346 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bf885e99940d2d439d83eba791804a48 @SQ SN:chr16_random LN:105485 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:dd06ea813a80b59d9c626b31faf6ae7f @SQ SN:chr17_random LN:2617613 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:34d5e2005dffdfaaced1d34f60ed8fc2 @SQ SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f3814841f1939d3ca19072d9e89f3fd7 @SQ SN:chr19_random LN:301858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:420ce95da035386cc8c63094288c49e2 @SQ SN:chr21_random LN:1679693 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:a7252115bfe5bb5525f34d039eecd096 @SQ SN:chr22_random LN:257318 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4f2d259b82f7647d3b668063cf18378b @SQ SN:chrX_random LN:1719168 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f4d71e0758986c15e5455bf3e14e5d6f
We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.
> samtools faidx Homo_sapiens_assembly18.fasta 108.446u 3.384s 2:44.61 67.9% 0+0k 0+0io 0pf+0w
This produces a text file with one record per line for each of the fasta contigs. Each record is of the: contig, size, location, basesPerLine, bytesPerLine. The index file produced above looks like:
> cat Homo_sapiens_assembly18.fasta.fai chrM 16571 6 50 51 chr1 247249719 16915 50 51 chr2 242951149 252211635 50 51 chr3 199501827 500021813 50 51 chr4 191273063 703513683 50 51 chr5 180857866 898612214 50 51 chr6 170899992 1083087244 50 51 chr7 158821424 1257405242 50 51 chr8 146274826 1419403101 50 51 chr9 140273252 1568603430 50 51 chr10 135374737 1711682155 50 51 chr11 134452384 1849764394 50 51 chr12 132349534 1986905833 50 51 chr13 114142980 2121902365 50 51 chr14 106368585 2238328212 50 51 chr15 100338915 2346824176 50 51 chr16 88827254 2449169877 50 51 chr17 78774742 2539773684 50 51 chr18 76117153 2620123928 50 51 chr19 63811651 2697763432 50 51 chr20 62435964 2762851324 50 51 chr21 46944323 2826536015 50 51 chr22 49691432 2874419232 50 51 chrX 154913754 2925104499 50 51 chrY 57772954 3083116535 50 51 chr1_random 1663265 3142044962 50 51 chr2_random 185571 3143741506 50 51 chr3_random 749256 3143930802 50 51 chr4_random 842648 3144695057 50 51 chr5_random 143687 3145554571 50 51 chr6_random 1875562 3145701145 50 51 chr7_random 549659 3147614232 50 51 chr8_random 943810 3148174898 50 51 chr9_random 1146434 3149137598 50 51 chr10_random 113275 3150306975 50 51 chr11_random 215294 3150422530 50 51 chr13_random 186858 3150642144 50 51 chr15_random 784346 3150832754 50 51 chr16_random 105485 3151632801 50 51 chr17_random 2617613 3151740410 50 51 chr18_random 4262 3154410390 50 51 chr19_random 301858 3154414752 50 51 chr21_random 1679693 3154722662 50 51 chr22_random 257318 3156435963 50 51 chrX_random 1719168 3156698441 50 51
The GATK is an open source project that has greatly benefited from the contributions of outside users. The GATK team welcomes contributions from anyone who produces useful functionality in line with the goals of the toolkit. You are welcome to branch the GATK main repository and develop your own tools. Sometimes these tools may be useful to the GATK user community and you may want to make it part of the main GATK distribution. If so we ask you to follow our guidelines for submission of patches.
There are a few good GIT practices that you should follow to simplify the ultimate goal, which is, adding your changes to the main GATK repository.
You should always start your code by creating a new branch from the most recent version of the main repository with :
git checkout master (make sure you are in the master branch) git fetch && git rebase origin/master (you can substitute this line for "git pull" if you have no changes in the master branch) git checkout -b newtool (create a new branch for your new tool)
Note: If you have submitted a patch to the group, do not continue development on the same branch as we cannot guarantee that your changes will make it to the main repository unchanged.
Every time before you rebase, you have to update your copy of the main repository. To do this use:
If you are just trying to keep up with the changes in the main repository after a fetch, you can rebase your branch at anytime using (and this should be all you need to do):
git rebase origin/master
In case there are conflicts, resolve them as you would and do:
git rebase --continue
If you don't know how to resolve the conflicts, you can always safely abort the whole process and go back to your branch before you started rebasing:
git rebase --abort
If you are done and want to generate your patches conforming to the latest repository changes, to edit, squash and reorder your commits use :
git rebase -i origin/master
At the prompt, you can follow the instructions to squash, edit and reorder accordingly. You can also do this step from IntelliJ with a visual editor that allows you to select what to edit/squash/reorder. You can also take a look at this nice tutorial on how to use interactive rebase.
It is okay to have a list of commits (numbered) somewhat like this in your local tree:
Before you can send your tools to us, you have to organize these commits so they tell a meaningful history and are self contained. To achieve this you will need to rebase so you can squash, edit and reorder your commits. This tree makes a lot of sense for your development process, but it makes no sense in the main repository history as it becomes hard to pick/revert commits and understand the history at a glance. After rebasing, you should edit your commits to look like this:
Use your commit messages wisely to help quick processing of your patches. Make sure the first line of your commit messages have less than 50 characters (title). Add a blank line and write a paragraph or more explaining what this commit represents (now that it is a package of multiple commits. It is important to have the 50 char title because this is all we see when we look at an extended history to find bugs and it is also our quick access to remember what the commit does to the repository.
A patch should be self contained. Meaning if we decide to adopt feature X and Z but not Y, we should be able to do so by only applying patches 1 and 2. If your patches are co-dependent, you should say so in the commits and justify why you didn't squash the commits together into one tool.
To generate patches, use :
git format-patch since
The since parameter is the last commit you want to generate patches from, for example: HEAD^3 will generate patches for HEAD^2, HEAD^1 and HEAD. You can also specify the commit by its id or by using the head of a branch. This is where using branches will make your life easier. If master is always up to date with the main repo with no changes, you can do:
git format-patch master (provided your master is up to date)
This will generate a patch for each commit you've created and you can simply e-mail them as an attachment to us.
By default, the forum does not send notification messages about new comments or discussions. If you want to turn on notifications or customize the type of notifications you want to receive (email, popup message etc), you need to do the following:
To specifically get new GATK announcements, scroll down to "Category Notifications" and tick off the "Announcements" category for email notification for discussions (and comments if you really want to know everything).
This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.
As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue).
There are two options for multi-threading with the GATK, controlled by the arguments
-nct, respectively, which can be combined:
-nt / --num_threadscontrols the number of data threads sent to the processor
-nct / --num_cpu_threads_per_data_threadcontrols the number of CPU threads allocated to each data thread
For more information on how these multi-threading options work, please read the primer on parallelism for the GATK.
Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory to run, if you use
-nt 4, the multithreaded run will use 8 Gb of memory. In contrast, CPU threads will share the memory allocated to their “mother” data thread, so you don’t need to worry about allocating memory based on the number of CPU threads you use.
-nctwith versions 2.2 and 2.3
Because of the way the
-nct option was originally implemented, in versions 2.2 and 2.3, there is one CPU thread that is reserved by the system to “manage” the rest. So if you use
-nct, you’ll only really start seeing a speedup with
-nct 3 (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation that will be available in versions 2.4 and up.
For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.
Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.
|Tool||Full name||Type of traversal||NT||NCT||SG|
The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.
|Cluster nodes||1||4||4||1||4||4 / 4 / 4|
|CPU threads (
||1||1||8||4-8||1||3 / 6 / 24|
|Data threads (
||24||1||1||1||1||8 / 4 / 1|
|Memory (Gb)||48||4||4||4||4||32 / 16 / 4|
Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue. For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.
Note: only do this if you have been explicitly asked to do so.
You posted a question about a problem you had with GATK tools, we answered that we think it's a bug, and we asked you to submit a detailed bug report.
A snippet file is a slice of the original BAM file which contains the problematic region and is sufficient to reproduce the error. We need it in order to reproduce the problem on our end, which is the first necessary step to finding and fixing the bug. We ask you to provide this as a snippet rather than the full file so that you don't have to upload (and we don't have to process) huge giga-scale files.
-Largument and progressively narrowing down the interval
-Lto write the problematic region (with 500 bp padding on either side) to a new file -- this is your snippet file.
Imagine a simple question like, "What's the depth of coverage at position A of the genome?"
First, you are given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question is then very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. The algorithm must now take several hours in order to compute this value.
Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.
This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and with less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. Thus, it is a fundamental rule of working with the GATK, which is the reason for the Central Dogma of the GATK:
VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. See this page for detailed specifications.
VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.
That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.
The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:
##fileformat=VCFv4.0 ##FILTER=<ID=LowQual,Description="QUAL < 50.0"> ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)"> ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"> ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> ##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions"> ##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"> ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes"> ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads"> ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth"> ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias"> ##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model"> ##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]" #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 chr1 877664 rs3828047 A G 3931.66 PASS AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 chr1 899282 rs28548431 C T 71.77 PASS AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26 chr1 974165 rs9442391 T C 29.84 LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
It seems a bit complex, but the structure of the file is actually quite simple:
[HEADER LINES] #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26 chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.
The first 6 columns of the VCF, which represent the observed variation, are easy to understand because they have a single, well-defined meaning.
CHROM and POS : The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to how indels are represented in a VCF.
ID: The dbSNP
rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.
REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).
QUAL: The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.
FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis.
For more details about these fields, please see this page.
In the excerpt shown above, here is how we interpret the line corresponding to each variant:
The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:
chr1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255 chr1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0 chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
Looking at that last column, here is what the tags mean:
GT : The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:
GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1/, or 1/1 under the model built for the NGS dataset.
AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).
PL: This field provides the likelihoods of the given genotypes (here, 0/0, 0/1, and 1/1). These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the heterozygous case, this is L(data given that the true genotype is 0/1). The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.
With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.
chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
At this site, the called genotype is
GT = 0/1, which is C/T. The confidence indicated by
GQ = 25.92 isn't so good, largely because there were only a total of 4 reads at this site (
DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by
AD=1,3). The lack of certainty is evident in the PL field, where
PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0). There's a chance that the subject is "hom-var" (=homozygous with the variant allele) since
PL(1/1) = 26, which corresponds to 10^(-2.6), or 0.0025, but either way, it's clear that the subject is definitely not "hom-ref" (=homozygous with the reference allele) since
PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.
Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.
chr1 873762 [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 chr1 877664 [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 chr1 899282 [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148
Here are some commonly used built-in annotations and what they mean:
|Annotation tag in VCF||Meaning|
|AC,AF,AN||See the Technical Documentation for Chromosome Counts.|
|DB||If present, then the variant is in dbSNP.|
|DP||See the Technical Documentation for Coverage.|
|DS||Were any of the samples downsampled because of too much coverage?|
|Dels||See the Technical Documentation for SpanningDeletions.|
|MQ and MQ0||See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero.|
|BaseQualityRankSumTest||See the Technical Documentation for Base Quality Rank Sum Test.|
|MappingQualityRankSumTest||See the Technical Documentation for Mapping Quality Rank Sum Test.|
|ReadPosRankSumTest||See the Technical Documentation for Read Position Rank Sum Test.|
|HRun||See the Technical Documentation for Homopolymer Run.|
|HaplotypeScore||See the Technical Documentation for Haplotype Score.|
|QD||See the Technical Documentation for Qual By Depth.|
|VQSLOD||Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model.|
|FS||See the Technical Documentation for Fisher Strand|
|SB||How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls).|
There are four major organizational units for next-generation DNA sequencing processes that used throughout the GATK documentation:
Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.
Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes.
Sample: A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Throughout our documentation, we treat samples as independent individuals whose genome sequence we are attempting to determine. Note that from this perspective, tumor / normal samples are different despite coming from the same individual.
Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individual in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples) we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.
Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost, since running these commands across all of your data simultaneously requires much more computing power. Please see the documentation for each step to understand what is the best way to group or partition your data for that particular process.
The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, and its ability to call indels is far superior. We recommend using HaplotypeCaller in all cases, with only a few exceptions:
In those cases, we recommend using UnifiedGenotyper instead of HaplotypeCaller.
This document describes the resource datasets and arguments to use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes.
Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).
These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.
The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.
If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.
True sites training resource: HapMap
This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).
True sites training resource: Omni
This resource is a set of polymorphic SNP sites produced by the Omni geno- typing array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
Non-true sites training resource: 1000G
This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this re- source may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (%). 17
Known sites resource, not used in training: dbSNP
This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).
The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R path/to/reference/human_g1k_v37.fasta \ -input raw.input.vcf \ -recalFile path/to/output.recal \ -tranchesFile path/to/output.tranches \ -nt 4 \ [SPECIFY TRUTH AND TRAINING SETS] \ [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \ [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \
For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.
Arguments for VariantRecalibrator command:
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \ -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \ -an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP \ -mode SNP \
Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.
Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.
You may notice that these recommendations no longer include the
--numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.
Some of these annotations might not be the best for your particular dataset.
Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with hybrid capture datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.
Additionally, the UnifiedGenotyper produces a statistic called the HaplotypeScore which should be used for SNPs. This statistic isn't necessary for the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes.
In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:
When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.
Arguments for VariantRecalibrator:
--maxGaussians 4 \ -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \ -an DP -an FS -an ReadPosRankSum -an MQRankSum \ -mode INDEL \
Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.
You may notice that these recommendations no longer include the
--numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.
The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.
java -Xmx3g -jar GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -R reference/human_g1k_v37.fasta \ -input raw.input.vcf \ -tranchesFile path/to/input.tranches \ -recalFile path/to/input.recal \ -o path/to/output.recalibrated.filtered.vcf \ [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \ [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.9% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.
--ts_filter_level 99.9 \ -mode SNP \
For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.9% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.
--ts_filter_level 99.9 \ -mode INDEL \
JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.
In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.
JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:
"QUAL > 30.0"
QUALis a key: the name of the annotation we want to look at
30.0is a value: the threshold that we want to use to evaluate variant quality against
>is an operator: it determines which "side" of the threshold we want to select
The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:
"MY_STRING_KEY == 'foo'"
You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:
"QUAL / DP < 10.0"
You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):
"QUAL > 30.0 && DP == 10"
&& is the logical "AND".
Or if you want to select variants that have at least one of several conditions fulfilled:
"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
|| is the logical "OR".
Currently, VCF INFO field keys are case-sensitive. That means that if you have a
QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (
qual or whatever) in your JEXL expression.
The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).
We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.
Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.
If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the
For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'
Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:
! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263
The classic way of evaluating a boolean goes like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'
But you can also use the VariantContext object like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'
Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
The GATK runs natively on most if not all flavors of UNIX, which includes MacOSX, Linux and BSD. It is possible to get it running on Windows using Cygwin, but we don't provide any support nor instructions for that.
The GATK is a Java-based program, so you'll need to have Java installed on your machine. The Java version should be at 1.6 (at this time we don't support 1.7). You can check what version you have by typing
java -version at the command line. This article has some more details about what to do if you don't have the right version. Note that at this time we only support the Sun/Oracle Java JDK; OpenJDK is not supported.
The GATK does not have a Graphical User Interface (GUI). You don't open it by clicking on the
.jar file; you have to use the Console (or Terminal) to input commands. If this is all new to you, we recommend you first learn about that and follow some online tutorials before trying to use the GATK. It's not difficult but you'll need to learn some jargon and get used to living without a mouse...
Some of the GATK tools produce plots using R, so if you want to get the plots you'll need to have R installed, as well as several R libraries. Full details can be found in the Tutorial on prerequisites.
VariantEval accepts two types of modules: stratification and evaluation modules.
CpG is a three-state stratification:
A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.
EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.
Sample is an N-state stratification, where N is the number of samples in the eval files.
Filter is a three-state stratification:
FunctionalClass is a four-state stratification:
CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.
Degeneracy is a six-state stratification:
See the [http://en.wikipedia.org/wiki/Genetic_code#Degeneracy Wikipedia page on degeneracy] for more information.
JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See [[Using JEXL expressions]]
Novelty is a three-state stratification:
CountVariants is an evaluation module that computes the following metrics:
|nProcessedLoci||Number of processed loci|
|nCalledLoci||Number of called loci|
|nRefLoci||Number of reference loci|
|nVariantLoci||Number of variant loci|
|variantRate||Variants per loci rate|
|variantRatePerBp||Number of variants per base|
|nSNPs||Number of snp loci|
|nInsertions||Number of insertion|
|nDeletions||Number of deletions|
|nComplex||Number of complex loci|
|nNoCalls||Number of no calls loci|
|nHets||Number of het loci|
|nHomRef||Number of hom ref loci|
|nHomVar||Number of hom var loci|
|nSingletons||Number of singletons|
|heterozygosity||heterozygosity per locus rate|
|heterozygosityPerBp||heterozygosity per base pair|
|hetHomRatio||heterozygosity to homozygosity ratio|
|indelRate||indel rate (insertion count + deletion count)|
|indelRatePerBp||indel rate per base pair|
|deletionInsertionRatio||deletion to insertion ratio|
CompOverlap is an evaluation module that computes the following metrics:
|nEvalSNPs||number of eval SNP sites|
|nCompSNPs||number of comp SNP sites|
|novelSites||number of eval sites outside of comp sites|
|nVariantsAtComp||number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same)|
|compRate||percentage of eval sites at comp sites|
|nConcordant||number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele)|
|concordantRate||the concordance rate|
A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):
$ grep -v '##' dbsnp.vcf #CHROM POS ID REF ALT QUAL FILTER INFO 1 10327 rs112750067 T C . . ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132
Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:
$ grep -v '##' eval_correct_allele.vcf #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6 1 10327 . T C 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:
$ grep -v '##' eval_incorrect_allele.vcf #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6 1 10327 . T A 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Running VariantEval with just the CompOverlap module:
$ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \ -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \ -L 1:10327 \ -B:dbsnp,VCF dbsnp.vcf \ -B:eval_correct_allele,VCF eval_correct_allele.vcf \ -B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \ -noEV \ -EV CompOverlap \ -o eval.table
We find that the eval.table file contains the following:
$ grep -v '##' eval.table | column -t CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate CompOverlap dbsnp eval_correct_allele none all 1 1 0 1 100.00000000 1 100.00000000 CompOverlap dbsnp eval_correct_allele none known 1 1 0 1 100.00000000 1 100.00000000 CompOverlap dbsnp eval_correct_allele none novel 0 0 0 0 0.00000000 0 0.00000000 CompOverlap dbsnp eval_incorrect_allele none all 1 1 0 1 100.00000000 0 0.00000000 CompOverlap dbsnp eval_incorrect_allele none known 1 1 0 1 100.00000000 0 0.00000000 CompOverlap dbsnp eval_incorrect_allele none novel 0 0 0 0 0.00000000 0 0.00000000
As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.
TiTvVariantEvaluator is an evaluation module that computes the following metrics:
|nTi||number of transition loci|
|nTv||number of transversion loci|
|tiTvRatio||the transition to transversion ratio|
|nTiInComp||number of comp transition sites|
|nTvInComp||number of comp transversion sites|
|TiTvRatioStandard||the transition to transversion ratio for comp sites|
All analyses done with the GATK typically involve several (though not necessarily all) of the following inputs:
This article describes the corresponding file formats that are acceptable for use with the GATK.
The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. All the standard IUPAC bases are accepted, but keep in mind that non-standard bases (i.e. other than ACGT, such as W for example) will be ignored (i.e. those positions in the genome will be skipped).
Some users have reported having issues with reference files that have been stored or modified on Windows filesystems. The issues manifest as "10" characters (corresponding to encoded newlines) inserted in the sequence, which cause the GATK to quit with an error. If you encounter this issue, you will need to re-download a valid master copy of the reference file, or clean it up yourself.
Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see this article for more information on preparing FASTA reference sequences for use with the GATK.
If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.
Our Best Practice recommendation is that you use a standard GATK reference from the GATK resource bundle.
The only input format for sequence reads that the GATK itself supports is the [Sequence Alignment/Map (SAM)] format. See [SAM/BAM] for more details on the SAM/BAM format as well as Samtools and Picard, two complementary sets of utilities for working with SAM/BAM files.
If you don't find the information you need in this section, please see our FAQs on BAM files.
If you are starting out your pipeline with raw reads (typically in FASTQ format) you'll need to make sure that when you map those reads to the reference and produce a BAM file, the resulting BAM file is fully compliant with the GATK requirements. See the Best Practices documentation for detailed instructions on how to do this.
In addition to being in SAM format, we require the following additional constraints in order to use your file with the GATK:
Below is an example well-formed SAM field header and fields (with @SQ dictionary truncated to show only the first two chromosomes for brevity):
@HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 @SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e @RG ID:ERR000162 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC @RG ID:ERR000252 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC @RG ID:ERR001684 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC @RG ID:ERR001685 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC @PG ID:GATK TableRecalibration VN:v2.2.16 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, defau t_read_group=DefaultReadGroup, default_platform=Illumina, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, except on_if_no_tile=false, pQ=5, maxQ=40, smoothing=137 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:b4eb71ee878d3706246b7c1dbef69299 @PG ID:bwa VN:0.5.5 ERR001685.4315085 16 1 9997 25 35M * 0 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@? XT:A:U XN:i:4 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001685 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???> ERR001689.1165834 117 1 9997 0 * = 9997 0 CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@ RG:Z:ERR001689 OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>??????? ERR001689.1165834 185 1 9997 25 35M = 9997 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT 758A:?>>8?=@@>>?;4<>=??@@==??@?==?8 XT:A:U XN:i:4 SM:i:25 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001689 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>> ERR001688.2681347 117 1 9998 0 * = 9998 0 CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG 5@BA@A6B???A?B??>B@B??>B@B??>BAB??? RG:Z:ERR001688 OQ:Z:=>>>><4><<?><??????????????????????
The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To resort the BAM file please use ReorderSam.
If you don't find the information you need in this section, please see our FAQs on interval lists.
The GATK accept interval files for processing subsets of the genome in Picard-style interval lists. These files typically have an extension such as
'.list' or more explicitly,.interval_list`, and look like this:
@HD VN:1.0 SO:coordinate @SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens @SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens 1 30366 30503 + target_1 1 69089 70010 + target_2 1 367657 368599 + target_3 1 621094 622036 + target_4 1 861320 861395 + target_5 1 865533 865718 + target_6 ...
consisting of aSAM-file-like sequence dictionary (the header), and targets in the form of
<chr> <start> <stop> + <target_name>. These interval lists are tab-delimited. They are also 1-based (first position in the genome is position 1, not position 0). The easiest way to create such a file is to combine your reference file's sequence dictionary (the file stored alongside the reference fasta file with the
.dict extension) and your intervals into one file.
You can also specify a list of intervals formatted as
<chr>:<start>-<stop> (one interval per line). No sequence dictionary is necessary. This file format also uses 1-based coordinates.
Finally, we also accept BED style interval lists. Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1.
The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:
name is the name in the GATK tool (like "eval" in VariantEval),
type is the type of the file, such as VCF or dbSNP, and
file is the path to the file containing the ROD data.
The GATK supports several common file formats for reading ROD data:
Note that we no longer support the PED format. See here for converting .ped files to VCF.
Since September, 2010, the GATK has had a "phone-home" feature that sends us information about each GATK run via the Broad filesystem (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). This feature is enabled by default.
The information provided by the phone-home feature is critical in driving improvements to the GATK
Below are two example GATK Run Reports showing exactly what information is sent to us each time the GATK phones home.
<GATK-run-report> <id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id> <start-time>2012/03/10 20.21.19</start-time> <end-time>2012/03/10 20.21.19</end-time> <run-time>0</run-time> <walker-name>CountReads</walker-name> <svn-version>1.4-483-g63ecdb2</svn-version> <total-memory>85000192</total-memory> <max-memory>129957888</max-memory> <user-name>depristo</user-name> <host-name>10.0.1.10</host-name> <java>Apple Inc.-1.6.0_26</java> <machine>Mac OS X-x86_64</machine> <iterations>105</iterations> </GATK-run-report>
<GATK-run-report> <id>yX3AnltsqIlXH9kAQqTWHQUd8CQ5bikz</id> <exception> <message>Failed to parse Genome Location string: 20:10,000,000-10,000,001x</message> <stacktrace class="java.util.ArrayList"> <string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:377)</string> <string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string> <string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string> <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string> <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string> <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string> <string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string> <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string> <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string> <string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string> </stacktrace> <cause> <message>Position: '10,000,001x' contains invalid chars.</message> <stacktrace class="java.util.ArrayList"> <string>org.broadinstitute.sting.utils.GenomeLocParser.parsePosition(GenomeLocParser.java:411)</string> <string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:374)</string> <string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string> <string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string> <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string> <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string> <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string> <string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string> <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string> <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string> <string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string> </stacktrace> <is-user-exception>false</is-user-exception> </cause> <is-user-exception>true</is-user-exception> </exception> <start-time>2012/03/10 20.19.52</start-time> <end-time>2012/03/10 20.19.52</end-time> <run-time>0</run-time> <walker-name>CountReads</walker-name> <svn-version>1.4-483-g63ecdb2</svn-version> <total-memory>85000192</total-memory> <max-memory>129957888</max-memory> <user-name>depristo</user-name> <host-name>10.0.1.10</host-name> <java>Apple Inc.-1.6.0_26</java> <machine>Mac OS X-x86_64</machine> <iterations>0</iterations> </GATK-run-report>
Note that as of GATK 1.5 we no longer collect information about the command-line executed, the working directory, or tmp directory.
The GATK is currently in the process of evolving to require interaction with Amazon S3 as a normal part of each run. For this reason, and because the information contained in the GATK run reports is so critical in driving improvements to the GATK, we strongly discourage our users from disabling the phone-home feature.
At the same time, we recognize that some of our users do have legitimate reasons for needing to run the GATK with phone-home disabled, and we don't wish to make it impossible for these users to run the GATK.
Technical reasons: Your local network might have restrictions in place that don't allow the GATK to access external resources, or you might need to run the GATK in a network-less environment.
Organizational reasons: Your organization's policies might forbid the dissemination of one or more pieces of information contained in the GATK run report.
For such users we have provided an
-et NO_ET option in the GATK to disable the phone-home feature. To use this option in GATK 1.5 and later, you need to contact us to request a key. Instructions for doing so are below.
To obtain a GATK key, please fill out the request form.
Running the GATK with a key is simple: you just need to append a
-K your.key argument to your customary command line, where
your.key is the path to the key file you obtained from us:
java -jar dist/GenomeAnalysisTK.jar \ -T PrintReads \ -I public/testdata/exampleBAM.bam \ -R public/testdata/exampleFASTA.fasta \ -et NO_ET \ -K your.key
-K argument is only necessary when running the GATK with the
If you get an error message from the GATK saying that your key is corrupt, unreadable, or has been revoked, please email '''firstname.lastname@example.org''' to ask for a replacement key.
If you get an error message stating that the GATK public key could not be located or read, then something is likely wrong with your build of the GATK. If you're running the binary release, try downloading it again. If you're compiling from source, try doing an
ant clean and re-compiling. If all else fails, please ask for help on our community forum.
We use the phone home data for three main purposes. First, we monitor the input logs for errors that occur in the GATK, and proactively fix them in the codebase. Second, we monitor the usage rates of the GATK in general and specific versions of the GATK to explain how widely used the GATK is to funding agencies and other potential supporters. Finally, we monitor adoption rates of specific GATK tools to understand how quickly new tools reach our users. Many of these analyses require us to aggregate the data by unique user, which is why we still collect the username of the individual who ran the GATK (as you can see in the plots). Examples of all three uses are shown in the Tableau graphs below, which update each night and are sent to the GATK members each morning for review.
Please note that GATK-Lite was retired in February 2013 when version 2.4 was released. See the announcement here.
You probably know by now that GATK-Lite is a free-for-everyone and completely open-source version of the GATK (licensed under the original MIT license).
But what's in the box? What can GATK-Lite do -- or rather, what can it not do that the full version (let's call it GATK-Full) can? And what does that mean exactly, in terms of functionality, reliability and power?
To really understand the differences between GATK-Lite and GATK-Full, you need some more information on how the GATK works, and how we work to develop and improve it.
As explained here, the engine handles all the common work that's related to data access, conversion and traversal, as well as high-performance computing features. The engine is supported by an infrastructure of software libraries. If the GATK was a car, that would be the engine and chassis. What we call the **tools* are attached on top of that, and they provide the various analytical and processing functionalities like variant calling and base or variant recalibration. On your car, that would be headlights, airbags and so on.
We do all our development work on a single codebase. This means that everything --the engine and all tools-- is on one common workbench. There are not different versions that we work on in parallel -- that would be crazy to manage! That's why the version numbers of GATK-Lite and GATK-Full always match: if the latest GATK-Full version is numbered 2.1-13, then the latest GATK-Lite is also numbered 2.1-13.
The most important consequence of this setup is that when we make improvements to the infrastructure and engine, the same improvements will end up in GATK Lite and in GATK Full. So for the purposes of power, speed and robustness of the GATK that is determined by the engine, there is no difference between them.
For the tools, it's a little more complicated -- but not much. When we "build" the GATK binaries (the
.jar files), we put everything from the workbench into the Full build, but we only put a subset into the Lite build. Note that this Lite subset is pretty big -- it contains all the tools that were previously available in GATK 1.x versions, and always will. We also reserve the right to add previews or not-fully-featured versions of the new tools that are in Full, at our discretion, to the Lite build.
We have a new tool that performs a brand new function (which wasn't available in GATK 1.x), and we only include it in the Full build.
We have a tool that has some new add-on capabilities (which weren't possible in GATK 1.x); we put the tool in both the Lite and the Full build, but the add-ons are only available in the Full build.
Reprising the car analogy, GATK-Lite and GATK-Full are like two versions of the same car -- the basic version and the fully-equipped one. They both have the exact same engine, and most of the equipment (tools) is the same -- for example, they both have the same airbag system, and they both have headlights. But there are a few important differences:
The GATK-Full car comes with a GPS (sat-nav for our UK friends), for which the Lite car has no equivalent. You could buy a portable GPS unit from a third-party store for your Lite car, but it might not be as good, and certainly not as convenient, as the Full car's built-in one.
Both cars have windows of course, but the Full car has power windows, while the Lite car doesn't. The Lite windows can open and close, but you have to operate them by hand, which is much slower.
The underlying engine is exactly the same in both GATK-Lite and GATK-Full. Most functionalities are available in both builds, performed by the same tools. Some functionalities are available in both builds, but they are performed by different tools, and the tool in the Full build is better. New, cutting-edge functionalities are only available in the Full build, and there is no equivalent in the Lite build.
We hope this clears up some of the confusion surrounding GATK-Lite. If not, please leave a comment and we'll do our best to clarify further!
One of the key challenges of working with next-gen sequence data is that input files are usually very large. We can’t just make the program open the files, load all the data into memory and perform whatever analysis is needed on all of it in one go. It’s just too much work, even for supercomputers.
Instead, we make the program cut the job into smaller tasks that the computer can easily process separately. Then we have it combine the results of each step into the final result.
Map/Reduce is the technique we use to achieve this. It consists of three steps formally called
reduce. Let’s apply it to an example case where we want to find out what is the average depth of coverage in our dataset for a certain region of the genome.
filter determines what subset of the data needs to be processed in each task. In our example, the program lists all the reference positions in our region of interest.
map applies the function, i.e. performs the analysis on each subset of data. In our example, for each position in the list, the program looks into the BAM file, pulls out the pileup of bases and outputs the depth of coverage at that position.
reduce combines the elements in the list of results output by the
map function. In our example, the program takes the coverage numbers that were calculated separately for all the reference positions and calculates their average, which is the final result we want.
This may seem trivial for such a simple example, but it is a very powerful method with many advantages. Among other things, it makes it relatively easy to parallelize operations, which makes the tools run much faster on large datasets.
All the tools in the GATK are built from the ground up to take advantage of this method. That’s why we call them walkers: because they “walk” across the genome, getting things done.
Note that even though it’s not included in the Map/Reduce technique’s name, the
filter step is very important. It determines what data get presented to the tool for analysis, selecting only the appropriate data for each task and discarding anything that’s not relevant. This is a key part of the Map/Reduce technique, because that’s what makes each task “bite-sized” enough for the computer to handle easily.
Each tool has filters that are tailored specifically for the type of analysis it performs. The filters rely on traversal engines, which are little programs that are designed to “traverse” the data (i.e. walk through the data) in specific ways.
There are three major types of traversal: Locus Traversal, Read Traversal and Active Region Traversal. In our interval coverage example, the tool’s filter uses the Locus Traversal engine, which walks through the data by locus, i.e. by position along the reference genome. Because of that, the tool is classified as a Locus Walker. Similarly, the Read Traversal engine is used, you’ve guessed it, by Read Walkers.
The GATK engine comes packed with many other ways to walk through the genome and get the job done seamlessly, but those are the ones you’ll encounter most often.
A GATKReport is simply a text document that contains well-formatted, easy to read representation of some tabular data. Many GATK tools output their results as GATKReports, so it's important to understand how they are formatted and how you can use them in further analyses.
Here's a simple example:
#:GATKReport.v1.0:2 #:GATKTable:true:2:9:%.18E:%.15f:; #:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads cycle errorrate.61PA8.7 qualavg.61PA8.7 0 7.451835696110506E-3 25.474613284804366 1 2.362777171937477E-3 29.844949954504095 2 9.087604507451836E-4 32.875909752547310 3 5.452562704471102E-4 34.498999090081895 4 9.087604507451836E-4 35.148316651501370 5 5.452562704471102E-4 36.072234352256190 6 5.452562704471102E-4 36.121724890829700 7 5.452562704471102E-4 36.191048034934500 8 5.452562704471102E-4 36.003457059679770 #:GATKTable:false:2:3:%s:%c:; #:GATKTable:TableName:Description key column 1:1000 T 1:1001 A 1:1002 C
This report contains two individual GATK report tables. Every table begins with a header for its metadata and then a header for its name and description. The next row contains the column names followed by the data.
We provide an R library called
gsalib that allows you to load GATKReport files into R for further analysis. Here are four simple steps to getting
gsalib, installing it and loading a report.
$ R R version 2.11.0 (2010-04-22) Copyright (C) 2010 The R Foundation for Statistical Computing ISBN 3-900051-07-0 R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.
gsaliblibrary from CRAN
gsalib library is available on the Comprehensive R Archive Network, so you can just do:
From within R (we use RStudio for convenience).
In some cases you need to explicitly tell R where to find the library; you can do this as follows:
$ cat .Rprofile .libPaths("/path/to/Sting/R/")
> d = gsa.read.gatkreport("/path/to/my.gatkreport") > summary(d) Length Class Mode CountVariants 27 data.frame list CompOverlap 13 data.frame list
Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.
In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.
If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.
If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on What are the standard resources for non-human genomes? in which we hope people with non-human genomics experience will share their knowledge.
And if it turns out that there is as yet no suitable set of known sites for your organisms, here's how to make your own for the purposes of BaseRecalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence. Good luck!
Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
|Tool||dbSNP 129 -||- dbSNP >132 -||- Mills indels -||- 1KG indels -||- HapMap -||- Omni|
These tools require known indels passed with the
-known argument to function properly. We use both the following files:
This tool requires known SNPs and indels passed with the
-knownSites argument to function properly. We use all the following files:
These tools do NOT require known sites, but if SNPs are provided with the
-dbsnp argument they will use them for variant annotation. We use this file:
For VariantRecalibrator, please see the FAQ article on VQSR training sets and arguments.
This tool requires known SNPs passed with the
-dbsnp argument to function properly. We use the following file:
The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation.
GATK variant manipulation tools are able to recognize the following types of alleles:
Of our two variant callers, UnifiedGenotyper is the more limited, as it only calls SNPs and indels, and does so separately (even if you run in calling mode BOTH, the program performs separate calling operations internally). The HaplotypeCaller is more sophisticated and calls different types of variants at the same time. So in addition to SNPs and indels, it is capable of emitting mixed records by default. It is also capable of emitting MNPs and symbolic alleles, but the modes to do so are not enabled by default and they are not part of our recommended best practices for the tool.
The GATK currently does not handle SVs (structural variations) or CNVs (copy number variations), but there are some third-party software packages built on top of GATK that provide this functionality. See GenomeSTRiP for SVs and XHMM for CNVs.
Inside of the Broad, the latest bundle will always be available in:
with a subdirectory containing for each reference sequence and associated data files.
External users can download these files (or corresponding .gz versions) from the GSA FTP Server in the directory bundle. Gzipped files should be unzipped before attempting to use them. Note that there is no "current link" on the FTP; users should download the highest numbered directory under current (this is the most recent data set).
Additionally, these files all have supplementary indices, statistics, and other QC data available.
Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.
Also includes a chain file to lift over to b37.
Includes the 1000 Genomes pilot b36 formated reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.
Also includes a chain file to lift over to b37.
Includes the UCSC-style hg19 reference along with all lifted over VCF files.
We know this field can be confusing or even overwhelming to newcomers, and getting to grips with a large and varied toolkit like the GATK can be a big challenge. We have produce a presentation that we hope will help you review all the background information that you need to know in order to use the GATK:
In addition, the following links feature a lot of useful educational material about concepts and terminology related to next-generation sequencing:
A basic review of the sequencing process.
An excellent, detailed overview of the myriad next-gen sequencing methdologies.
A nice piece explaining the problems inherent in trying to analyze terabytes of data. The GATK addresses this issue by requiring all datasets be in reference order, so only small chunks of the genome need to be in memory at once, as explained here.
Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.
Let's say we have this example data:
At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.
The example data becomes:
Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.
The example data becomes:
You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.
The example data becomes:
This gets you two big wins:
People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.
Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.
We have sequenced at the Broad Institute and released to the 1000 Genomes Project the following datasets for the three members of the CEU trio (NA12878, NA12891 and NA12892):
This is better data to work with than the original DePristo et al. BAMs files, so we recommend you download and analyze these files if you are looking for complete, large-scale data sets to evaluate the GATK or other tools.
Here's the rough library properties of the BAMs:
These data files can be downloaded from the 1000 Genomes DCC
Here are the datasets we used in the GATK paper cited below.
DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D and Daly, M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498.
Some of the BAM and VCF files are currently hosted by the NCBI: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/
-- targetsused in the analysis of the exome capture data
Please note that we have not collected the indel calls for the paper, as these are only used for filtering SNPs near indels. If you want to call accurate indels, please use the new GATK indel caller in the Unified Genotyper.
Both the GATK and the sequencing technologies have improved significantly since the analyses performed in this paper.
If you are conducting a review today, we would recommend that the newest version of the GATK, which performs much better than the version described in the paper. Moreover, we would also recommend one use the newest version of Crossbow as well, in case they have improved things. The GATK calls for NA12878 from the paper (above) will give one a good idea what a good call set looks like whole-genome or whole-exome.
The data sets used in the paper are no longer state-of-the-art. The WEx BAM is GAII data aligned with MAQ on hg18, but a state-of-the-art data set would use HiSeq and BWA on hg19. Even the 64x HiSeq WG data set is already more than one year old. For a better assessment, we would recommend you use a newer data set for these samples, if you have the capacity to generate it. This applies less to the WG NA12878 data, which is pretty good, but the NA12878 WEx from the paper is nearly 2 years old now and notably worse than our most recent data sets.
Obviously, this was an annoyance for us as well, as it would have been nice to use a state-of-the-art data set for the WEx. But we decided to freeze the data used for analysis to actually finish this paper.
If you want the raw, machine output for the data analyzed in the GATK framework paper, obtain the raw BAM files above and convert them from SAM to FASTQ using the Picard tool SamToFastq.
As featured in this forum question.
Two main things account for these kinds of differences, both linked to default behaviors of the tools:
The tools downsample to different depths of coverage
The tools apply different read filters
In both cases, you can end up looking at different sets or numbers of reads, which causes some of the annotation values to be different. It's usually not a cause for alarm. Remember that many of these annotations should be interpreted relatively, not absolutely.
Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs) and in the HaplotypeCaller (for all variants including indels). So, before you post this issue in our support forum, please do a little bit of investigation on your own, as follows.
To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.
Here is a checklist of questions you should ask yourself:
The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the
--max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.
Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.
Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.
A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!
Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
By default the UG will only consider a certain number of alternate alleles. This value can be set using the
--max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.
SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site? If so, you are probably seeing false positives.
There are three potential strategies for calling genetic variants on multiple samples:
We recommend joint calling because it can dramatically improve consistency across batches and cause fewer artefacts due to three key advantages:
Batch-calling does not output a genotype call at sites where no member in the batch has evidence for a variant; it is thus impossible to distinguish such sites from locations missing data. In contrast, joint calling emits genotype calls at every site where any individual in the call set has evidence for variation.
By sharing information across all samples, joint calling makes it possible to “rescue” genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location.
The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.
Figure 1: Power of joint calling in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable. (right) Importance of joint calling to square off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. However, merging the results from single sample calling will incorrectly treat both of these samples identically as being non-informative.
We recently participated in a large-scale project in which we applied joint calling approaches to raw sequencing data from approximately 57,000 human exomes representing a wide range of human population diversity. This was done in collaboration with other groups studying the genetic basis of complex and Mendelian diseases.
We performed two pilot studies as part of this project, one focused on sample QC and one focused specifically on evaluating the joint calling approach. In that pilot, we performed complete joint calling across chromosomes 11, 20, 21 and 22, which represents approximately 11.4% of the human exome. In a nutshell, we found that large-scale joint calling results in greater sensitivity to low-frequency variants, an increased ability to remove systematic false positives such as mapping errors, and greater consistency of variant calls across projects (Figure 2).
So, should you call your samples jointly? Yes! But there are a few issues you should be aware of before you start.
Most of the problems we experienced in our joint calling experiments were scaling problems -- we managed to do joint analysis on 50K+ exomes, but that was already pushing the bounds of what our fairly heavyweight infrastructure can support. Anyone with less hardware is going to struggle to reach those numbers. Not to mention the logistical headache of managing access to the data, if it originates from multiple separate projects. But this only really applies to people who are dealing with seriously large projects, involving tens of thousands of samples.
This one is probably more widely applicable. When you’re getting a large-ish number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time, and you analyze each batch as it comes in (whether it’s because the analysis is time-sensitive or your PI is breathing down your back). But that’s not joint calling, that’s batch calling, and it doesn’t give you the same significant gains that joint calling can give you. Unfortunately the current joint calling approaches don’t allow for incremental analysis -- every time you get even one new sample sequence, you have to re-call all samples from scratch.
The good news is that we’re working on a new joint calling procedure that will address these issues (Figure 3) , so stay tuned for updates!