
In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.

### Ploidy-related functionalities

As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.

For earlier versions (back to 2.0), the fallback option is UnifiedGenotyper, which also accepts the -ploidy argument.
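
As a rough sketch of the run-separately-then-combine approach described above, the Python helper below builds one HaplotypeCaller command line per region/ploidy pair. The jar name, file names, and contig names are placeholders, and the male-sample plan deliberately ignores pseudoautosomal regions. Only the -T, -R, -I, -L, -ploidy, -ERC and -o arguments are shown; a real command may need more depending on your GATK version.

```python
def haplotype_caller_command(bam, reference, interval, ploidy, out_gvcf,
                             gatk_jar="GenomeAnalysisTK.jar"):
    """Build the argument list for one HaplotypeCaller run at a single ploidy."""
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return [
        "java", "-jar", gatk_jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", interval,          # restrict this run to one region
        "-ploidy", str(ploidy),  # ploidy assumed for the sample in this region
        "-ERC", "GVCF",          # GVCF output, so the runs can be combined later
        "-o", out_gvcf,
    ]


# Hypothetical plan for a male sample: one autosome (diploid), X and Y (haploid).
regions = [("20", 2), ("X", 1), ("Y", 1)]
commands = [
    haplotype_caller_command("male.bam", "ref.fasta", contig, ploidy,
                             f"male.{contig}.g.vcf")
    for contig, ploidy in regions
]
for cmd in commands:
    print(" ".join(cmd))
```

The resulting per-region GVCFs would then go to CombineGVCFs or GenotypeGVCFs, neither of which needs the -ploidy argument.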

### Cases where ploidy needs to be specified

1. Native variant calling in haploid or polyploid organisms.
2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
3. Pooled validation/genotyping at known sites.

For normal organism ploidy, you just set the -ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).
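
The arithmetic above is simple, but it is easy to get wrong when pool sizes vary between barcodes. A minimal sketch follows; the function name and its arguments are illustrative and not part of the GATK.

```python
def pooled_ploidy(ploidy_per_individual, individuals_in_pool=1):
    """Value to pass to -ploidy: chromosomes per barcoded sample."""
    if ploidy_per_individual < 1 or individuals_in_pool < 1:
        raise ValueError("both arguments must be positive integers")
    return ploidy_per_individual * individuals_in_pool


print(pooled_ploidy(2, 10))  # pool of 10 diploid individuals -> 20
print(pooled_ploidy(4))      # single tetraploid organism -> 4
```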

### Important limitations

Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF, and Genotype annotations such as PL, AD, GT, etc.

You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.

Medium answer: no, at least not if you want to run a low-risk pipeline.

Long answer: see below for details.

### The rationale

There are several reasons why you might want to do this: you're using the latest version of GATK and one of the tools has a show-stopping bug, so you'd like to use an older, pre-bug version of that tool, but still use the latest version of all the other tools; or maybe you've been using an older version of GATK and you'd like to use a new tool, but keep using the rest in the version that you've been using to process hundreds of samples already.

### The problem: compatibility is not guaranteed

In many cases, when we modify one tool in the GATK, we need to make adjustments to other tools that interact either directly or indirectly with the data consumed or produced by the upgraded tool. If you mix and match tools from different versions of GATK, you risk running into compatibility issues. For example, HaplotypeCaller expects a BAM compressed by Reduce Reads to have its data annotated in a certain way. If the information is formatted differently than what the HC expects (because that's how the corresponding RR from the same version does it), it can blow up -- or worse, do the wrong thing but not tell you there's a problem.

### But what if the tools/tasks are in unrelated workflows?

Would it really be so bad to use CountReads from GATK version 2.7 for a quick QC check that's not actually part of my pipeline, which uses version 2.5? Well, maaaaybe not, but we still think it's a source of error, and we do our damnedest to eliminate those.

### The conclusion

You shouldn't use tools from different versions within the same workflow, that's for sure. We don't think it's worth the risks. If there's a show-stopping bug, let us know and we promise to fix it as soon as (humanly) possible. For the rest, either accept that you're stuck with the version you started your study with (we may be able to help with workarounds for known issues), or upgrade your entire workflow and start your analysis from scratch. Depending on how far along you are, one of those options will be less painful to you; go with that.

### The plea bargain, and a warning

If despite our dire warnings you're still going to mix and match tool versions, fine, we can't stop you. But be really careful, and check the release notes for every version involved. And keep in mind that when things go wrong, we will deny you support if we think you've been reckless.

### 1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

### 2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

### 3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

• It must be aligned to one of the references described here.
• It must be sorted in coordinate order (not by queryname and not "unsorted").
• It must list the read groups with sample names in the header.
• The BAM file must pass Picard validation.

### 4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...
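
To compare a header's contig list against the two orderings above, a small check along these lines could be used. It is only a sketch: the function name is made up for illustration, and it looks only at the primary contigs listed above, ignoring whatever follows them.

```python
# Canonical primary-contig orderings as listed above.
B3X_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]
HG1X_ORDER = ["chrM"] + [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]


def canonical_convention(contigs):
    """Return 'b3x' or 'hg1x' if contigs start with that ordering, else None."""
    for name, order in (("b3x", B3X_ORDER), ("hg1x", HG1X_ORDER)):
        if contigs[:len(order)] == order:
            return name
    return None


print(canonical_convention(B3X_ORDER + ["NT_113887"]))  # b3x
print(canonical_convention(["chr1", "chr2", "chrM"]))   # None
```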

### 5. How can I tell if my BAM file is sorted properly?

Use Samtools to examine the header of your file:

    $ samtools view -H /path/to/my.bam
    @HD VN:1.0 GO:none SO:coordinate
    @SQ SN:1 LN:247249719
    @SQ SN:2 LN:242951149
    @SQ SN:3 LN:199501827
    @SQ SN:4 LN:191273063
    @SQ SN:5 LN:180857866
    @SQ SN:6 LN:170899992
    @SQ SN:7 LN:158821424
    @SQ SN:8 LN:146274826
    @SQ SN:9 LN:140273252
    @SQ SN:10 LN:135374737
    @SQ SN:11 LN:134452384
    @SQ SN:12 LN:132349534
    @SQ SN:13 LN:114142980
    @SQ SN:14 LN:106368585
    @SQ SN:15 LN:100338915
    @SQ SN:16 LN:88827254
    @SQ SN:17 LN:78774742
    @SQ SN:18 LN:76117153
    @SQ SN:19 LN:63811651
    @SQ SN:20 LN:62435964
    @SQ SN:21 LN:46944323
    @SQ SN:22 LN:49691432
    @SQ SN:X LN:154913754
    @SQ SN:Y LN:57772954
    @SQ SN:MT LN:16571
    @SQ SN:NT_113887 LN:3994
    ...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

### 6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. You can also use the ReorderSam command to make a BAM file's SQ order match another reference sequence.

### 7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

    $ samtools view -H /path/to/my.bam | grep '^@RG'
    @RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
    @RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
    @RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
    @RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
    @RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
    ...


The presence of the @RG tags indicates the presence of read groups. Each read group has an SM tag, indicating the sample from which the reads belonging to that read group originate.
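
To pull the read-group-to-sample mapping out of such header lines programmatically, something like the following sketch would work. It assumes the @RG lines are tab-separated, as in a real SAM header; the function name is illustrative, and the example lines are abbreviated from the output above.

```python
def samples_by_read_group(header_lines):
    """Map each read group ID to its SM value, from tab-separated @RG lines."""
    mapping = {}
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@RG":
            continue
        # Split each TAG:value pair on the first colon only, since values
        # such as DT timestamps contain colons themselves.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        mapping[tags["ID"]] = tags.get("SM")  # None if SM is missing
    return mapping


header = [
    "@RG\tID:0\tPL:solid\tLB:Lib1\tDT:2008-08-28T20:00:00-0400\tSM:NA12414\tCN:bcm",
    "@RG\tID:4\tPL:LS454\tLB:HL#01_NA11881\tPI:0\tSM:NA11881\tCN:454MSC",
]
print(samples_by_read_group(header))
```

A read group whose value comes back as None has no SM tag and would need fixing before the file is used with the GATK.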

To see which read group each read belongs to, look at the reads themselves:

    $ samtools view /path/to/my.bam
    EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
    EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85
    EAS139_44:8:59:118:13881 35 1 1 0 51M = 2 52 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
    EAS139_46:3:75:1326:2391 35 1 1 0 51M = 12 62 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
    ...

Membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

### 8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.

### 9. What's the meaning of the standard read group fields?

For technical details, see the SAM specification on the Samtools website.

| Tag | Importance | SAM spec definition | Meaning |
|---|---|---|---|
| ID | Required | Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read group IDs may be modified when merging SAM files in order to handle collisions. | Ideally, this should be a globally unique identifier across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model. |
| SM | Required. As important as ID. | Sample. Use pool name where a pool is being sequenced. | The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper. |
| PL | Important. Not currently used in the GATK, but was in the past, and may return. | Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. | The only way to know the sequencing technology used to generate the sequencing data. It's a good idea to use this field. |
| LB | Essential for MarkDuplicates | DNA preparation library identifier | MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes. |

We do not require values for the CN, DS, DT, PG, PI, or PU fields.

A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data.
When the data come off the sequencer, I would create 12 BAM files, with the following @RG fields in the header:

Dad's data:

    @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
    @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
    @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
    @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400

Mom's data:

    @RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
    @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
    @RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
    @RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400

Kid's data:

    @RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
    @RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
    @RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
    @RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400

Note the hierarchical relationship between read groups (unique for each lane), libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

### 10. My BAM file doesn't have read group and sample information. How do I add it?

Use Picard's AddOrReplaceReadGroups tool to add read group information.

### 11. How do I know if my BAM file is valid?

Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.

### 12. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

    GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam

and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

### 1. What file formats do you support for variant callsets?

We support the Variant Call Format (VCF) for variant callsets. No other file formats are supported.

### 2. How can I know if my VCF file is valid?

VCFTools contains a validation tool that will allow you to verify it.

### 3. Are you planning to include any converters from different formats or allow different input formats than VCF?

No, we like VCF and we think it's important to have a good standard format. Multiplying formats just makes life hard for everyone, both developers and analysts.

### 1. What file formats do you support for interval lists?

We support three types of interval lists, as mentioned here. Interval lists should preferably be formatted as Picard-style interval lists, with an explicit sequence dictionary, as this prevents accidental misuse (e.g. hg18 intervals on an hg19 file). Note that this file is 1-based, not 0-based (first position in the genome is position 1).

### 2. I have two (or more) sequencing experiments with different target intervals. How can I combine them?

One relatively easy way to combine your intervals is to use the online tool Galaxy, using the Get Data -> Upload command to upload your intervals, and the Operate on Genomic Intervals command to compute the intersection or union of your intervals (depending on your needs).

We make various files available for public download from the GSA FTP server, such as the GATK resource bundle and presentation slides. We also maintain a public upload feature for processing bug reports from users.
There are two logins to choose from depending on whether you want to upload or download something:

### Downloading

    location: ftp.broadinstitute.org
    username: gsapubftp-anonymous
    password: <blank>

### Uploading

    location: ftp.broadinstitute.org
    username: gsapubftp
    password: 5WvQWSfi

### Using a browser as FTP client

If you use your browser as FTP client, make sure to include the login information in the address, otherwise you will access the general Broad Institute FTP instead of our team FTP. This should work as a direct link (for downloading only):

    ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle

If you are sure that you cannot use VQSR / recalibrate variants (typically because your dataset is too small, or because there are no truth/training resources available for your organism), then you will need to use the VariantFiltration tool to manually filter your variants. To do this, you will need to compose filter expressions as explained here, here and here based on the recommendations detailed further below.

### But first, some caveats

Let's be painfully clear about this: **there is no magic formula that will give you perfect results**. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values is very highly dependent on the specific callset, how it was called, what the data was like, etc.

HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.

In case you didn't catch that bit in bold there, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis.
You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.

In addition, please note that these recommendations are mainly designed for dealing with very small data sets (in terms of either number of samples or size of targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking on the filtering parameters.

So, here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you:

### Filtering recommendations for SNPs:

• QD < 2.0
• MQ < 40.0
• FS > 60.0
• HaplotypeScore > 13.0 (only for variants output by UnifiedGenotyper; for HaplotypeCaller's output it is not informative)
• MQRankSum < -12.5
• ReadPosRankSum < -8.0

### Filtering recommendations for indels:

• QD < 2.0
• ReadPosRankSum < -20.0
• InbreedingCoeff < -0.8
• FS > 200.0

### And now some more IMPORTANT caveats (don't skip this!)

• The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
• For shallow-coverage (<10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.
• The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling, but in practice it is closer to a Gaussian distribution. Regardless, the DP threshold should be set at 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.

### Finally, a note of hope

Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.

HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.

And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?

Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)

We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.

Most GATK tools apply several read filters by default. You can look up exactly what the defaults are for each tool in their respective Technical Documentation pages. But sometimes you want to specify additional filters yourself (and before you ask, no, you cannot disable the default read filters used by a given tool).
This is how you do it:

The --read_filter argument (or -rf for short) allows you to apply whatever read filters you'd like. For example, to add the MaxReadLengthFilter filter to PrintReads, you just add this to your command line:

    --read_filter MaxReadLength

#### Notice that when you specify a read filter, you need to strip the Filter part of its name off!

The read filter will be applied with its default value (which you can also look up in the Tech Docs for that filter). Now, if you want to specify a different value from the default, you pass the relevant argument by adding this right after the read filter:

    --read_filter MaxReadLength -maxReadLength 76

It's important that you pass the argument right after the filter itself, otherwise the command line parser won't know that they're supposed to go together.

And of course, you can add as many filters as you like by using multiple copies of the --read_filter parameter:

    --read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead

This article describes the steps necessary to prepare your reference file (if it's not one that you got from us). As a complement to this article, see the relevant tutorial.

### Why these steps are necessary

The GATK uses two files to access and safety check access to the reference files: a .dict dictionary of the contig names and sizes and a .fai fasta index file to allow efficient random access to the reference bases. You have to generate these files in order to be able to use a Fasta file as reference.

NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.

### Creating the fasta sequence dictionary file

We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.

    > java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
    [Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
    [Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
    Runtime.totalMemory()=2112487424
    44.922u 2.308s 0:47.09 100.2% 0+0k 0+0io 2pf+0w

This produces a SAM-style header file describing the contents of our fasta file.

    > cat Homo_sapiens_assembly18.dict
    @HD VN:1.0 SO:unsorted
    @SQ SN:chrM LN:16571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb
    @SQ SN:chr1 LN:247249719 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9ebc6df9496613f373e73396d5b3b6b6
    @SQ SN:chr2 LN:242951149 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:b12c7373e3882120332983be99aeb18d
    @SQ SN:chr3 LN:199501827 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0e48ed7f305877f66e6fd4addbae2b9a
    @SQ SN:chr4 LN:191273063 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cf37020337904229dca8401907b626c2
    @SQ SN:chr5 LN:180857866 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:031c851664e31b2c17337fd6f9004858
    @SQ SN:chr6 LN:170899992 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfe8005c536131276d448ead33f1b583
    @SQ SN:chr7 LN:158821424 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:74239c5ceee3b28f0038123d958114cb
    @SQ SN:chr8 LN:146274826 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
    @SQ SN:chr9 LN:140273252 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:ea244473e525dde0393d353ef94f974b
    @SQ SN:chr10 LN:135374737 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4ca41bf2d7d33578d2cd7ee9411e1533
    @SQ SN:chr11 LN:134452384 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:425ba5eb6c95b60bafbf2874493a56c3
    @SQ SN:chr12 LN:132349534 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d17d70060c56b4578fa570117bf19716
    @SQ SN:chr13 LN:114142980 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c4f3084a20380a373bbbdb9ae30da587
    @SQ SN:chr14 LN:106368585 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c1ff5d44683831e9c7c1db23f93fbb45
    @SQ SN:chr15 LN:100338915 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:5cd9622c459fe0a276b27f6ac06116d8
    @SQ SN:chr16 LN:88827254 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3e81884229e8dc6b7f258169ec8da246
    @SQ SN:chr17 LN:78774742 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2a5c95ed99c5298bb107f313c7044588
    @SQ SN:chr18 LN:76117153 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3d11df432bcdc1407835d5ef2ce62634
    @SQ SN:chr19 LN:63811651 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2f1a59077cfad51df907ac25723bff28
    @SQ SN:chr20 LN:62435964 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f126cdf8a6e0c7f379d618ff66beb2da
    @SQ SN:chr21 LN:46944323 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
    @SQ SN:chr22 LN:49691432 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2041e6a0c914b48dd537922cca63acb8
    @SQ SN:chrX LN:154913754 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d7e626c80ad172a4d7c95aadb94d9040
    @SQ SN:chrY LN:57772954 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:62f69d0e82a12af74bad85e2e4a8bd91
    @SQ SN:chr1_random LN:1663265 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cc05cb1554258add2eb62e88c0746394
    @SQ SN:chr2_random LN:185571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:18ceab9e4667a25c8a1f67869a4356ea
    @SQ SN:chr3_random LN:749256 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cc571e918ac18afa0b2053262cadab6
    @SQ SN:chr4_random LN:842648 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cab2949ccf26ee0f69a875412c93740
    @SQ SN:chr5_random LN:143687 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:05926bdbff978d4a0906862eb3f773d0
    @SQ SN:chr6_random LN:1875562 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d62eb2919ba7b9c1d382c011c5218094
    @SQ SN:chr7_random LN:549659 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:28ebfb89c858edbc4d71ff3f83d52231
    @SQ SN:chr8_random LN:943810 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0ed5b088d843d6f6e6b181465b9e82ed
    @SQ SN:chr9_random LN:1146434 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
    @SQ SN:chr10_random LN:113275 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:50be2d2c6720dabeff497ffb53189daa
    @SQ SN:chr11_random LN:215294 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfc93adc30c621d5c83eee3f0d841624
    @SQ SN:chr13_random LN:186858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:563531689f3dbd691331fd6c5730a88b
    @SQ SN:chr15_random LN:784346 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bf885e99940d2d439d83eba791804a48
    @SQ SN:chr16_random LN:105485 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:dd06ea813a80b59d9c626b31faf6ae7f
    @SQ SN:chr17_random LN:2617613 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:34d5e2005dffdfaaced1d34f60ed8fc2
    @SQ SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f3814841f1939d3ca19072d9e89f3fd7
    @SQ SN:chr19_random LN:301858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:420ce95da035386cc8c63094288c49e2
    @SQ SN:chr21_random LN:1679693 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:a7252115bfe5bb5525f34d039eecd096
    @SQ SN:chr22_random LN:257318 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4f2d259b82f7647d3b668063cf18378b
    @SQ SN:chrX_random LN:1719168 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f4d71e0758986c15e5455bf3e14e5d6f

### Creating the fasta index file

We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.
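
The byte-offset lookup that this index enables can be sketched as follows. This is only an illustration: the function names are made up, and it assumes the standard five-column .fai layout (contig name, length, offset of the first base, bases per line, bytes per line). The two example records are the chrM and chr1 entries for this reference.

```python
def parse_fai_line(line):
    """Parse one .fai record into (contig, (size, location, bases, bytes))."""
    contig, size, location, bases_per_line, bytes_per_line = line.split()
    return contig, (int(size), int(location),
                    int(bases_per_line), int(bytes_per_line))


def byte_offset(fai, contig, pos):
    """Byte offset in the fasta file of the 1-based position pos on contig."""
    size, location, bases_per_line, bytes_per_line = fai[contig]
    if not 1 <= pos <= size:
        raise ValueError(f"{contig}:{pos} is outside the contig (size {size})")
    # Each full line before the target adds bytes_per_line bytes
    # (the bases plus the line terminator); the remainder is the column.
    full_lines, column = divmod(pos - 1, bases_per_line)
    return location + full_lines * bytes_per_line + column


records = ["chrM 16571 6 50 51", "chr1 247249719 16915 50 51"]
fai = dict(parse_fai_line(r) for r in records)
print(byte_offset(fai, "chrM", 1))   # 6: first base, right after the header line
print(byte_offset(fai, "chr1", 51))  # 16966: first base of the second line
```
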
Run faidx on the reference fasta file:

    > samtools faidx Homo_sapiens_assembly18.fasta
    108.446u 3.384s 2:44.61 67.9% 0+0k 0+0io 0pf+0w

This produces a text file with one record per line for each of the fasta contigs. Each record is of the form: contig, size, location, basesPerLine, bytesPerLine. The index file produced above looks like:

    > cat Homo_sapiens_assembly18.fasta.fai
    chrM 16571 6 50 51
    chr1 247249719 16915 50 51
    chr2 242951149 252211635 50 51
    chr3 199501827 500021813 50 51
    chr4 191273063 703513683 50 51
    chr5 180857866 898612214 50 51
    chr6 170899992 1083087244 50 51
    chr7 158821424 1257405242 50 51
    chr8 146274826 1419403101 50 51
    chr9 140273252 1568603430 50 51
    chr10 135374737 1711682155 50 51
    chr11 134452384 1849764394 50 51
    chr12 132349534 1986905833 50 51
    chr13 114142980 2121902365 50 51
    chr14 106368585 2238328212 50 51
    chr15 100338915 2346824176 50 51
    chr16 88827254 2449169877 50 51
    chr17 78774742 2539773684 50 51
    chr18 76117153 2620123928 50 51
    chr19 63811651 2697763432 50 51
    chr20 62435964 2762851324 50 51
    chr21 46944323 2826536015 50 51
    chr22 49691432 2874419232 50 51
    chrX 154913754 2925104499 50 51
    chrY 57772954 3083116535 50 51
    chr1_random 1663265 3142044962 50 51
    chr2_random 185571 3143741506 50 51
    chr3_random 749256 3143930802 50 51
    chr4_random 842648 3144695057 50 51
    chr5_random 143687 3145554571 50 51
    chr6_random 1875562 3145701145 50 51
    chr7_random 549659 3147614232 50 51
    chr8_random 943810 3148174898 50 51
    chr9_random 1146434 3149137598 50 51
    chr10_random 113275 3150306975 50 51
    chr11_random 215294 3150422530 50 51
    chr13_random 186858 3150642144 50 51
    chr15_random 784346 3150832754 50 51
    chr16_random 105485 3151632801 50 51
    chr17_random 2617613 3151740410 50 51
    chr18_random 4262 3154410390 50 51
    chr19_random 301858 3154414752 50 51
    chr21_random 1679693 3154722662 50 51
    chr22_random 257318 3156435963 50 51
    chrX_random 1719168 3156698441 50 51

The GATK is an open source project that has greatly benefited from the contributions of outside users.
The GATK team welcomes contributions from anyone who produces useful functionality in line with the goals of the toolkit. You are welcome to branch the GATK main repository and develop your own tools. Sometimes these tools may be useful to the GATK user community and you may want to make them part of the main GATK distribution. If so, we ask you to follow our guidelines for submission of patches.

### 1. Good practices

There are a few good Git practices that you should follow to simplify the ultimate goal, which is adding your changes to the main GATK repository.

• Use branches. Every time you start new work that you are going to submit to the GATK team later, do it in a new branch. Make it a habit, as this will simplify many of the following procedures and allow your master branch to always be a fresh (up to date) copy of the GATK main repository. Take a look at how to create a new branch for submission (section 2, "How to create a new submission").

• Never merge. Merging creates a branched history with multiple parent nodes, which makes the history hard to understand and impossible to modify, and makes patches near-impossible to create. Merges are very useful when you need to combine multiple repositories, and they should *only* be used when that makes sense. This means **never merge** and **never pull** (if it's not a fast-forward, a pull will create a merge).

• Commit as often as possible. Every change should be committed to make sure you can go back in time effectively in your own tree. The commit messages don't matter to us as long as they're meaningful to you at this stage. You can essentially do whatever you want in your local tree with your commits, as long as you don't merge.

• Rebase constantly. Your branch is diverging from the master by the minute, so if you keep rebasing as often as you can, you will avoid major conflicts when it's time to send the patches. Take a look at our guide on how to rebase (section 3, "How to rebase").
• Tell a meaningful story. When it's time to submit your patches to us, reorder your commits and write meaningful commit messages. Each commit must be (as much as possible) self contained. These commits must tell a meaningful story to us so we can understand what it is you're adding to the codebase. Take a look at an example commit scenario (section 4, "How to make your commits").

• Generate patches and email them to the group. This part is super easy, provided you've followed the good practices. You just have to generate the patches (section 5, "How to generate the patches") and e-mail them to gsa-patches@broadinstitute.org.

### 2. How to create a new submission

You should always start your code by creating a new branch from the most recent version of the main repository with:

    git checkout master                      # make sure you are in the master branch
    git fetch && git rebase origin/master    # you can use "git pull" instead if you have no changes in the master branch
    git checkout -b newtool                  # create a new branch for your new tool

Note: If you have submitted a patch to the group, do not continue development on the same branch, as we cannot guarantee that your changes will make it to the main repository unchanged.

### 3. How to rebase

Every time before you rebase, you have to update your copy of the main repository.
To do this use:

    git fetch

If you are just trying to keep up with the changes in the main repository after a fetch, you can rebase your branch at any time using the following (and this should be all you need to do):

    git rebase origin/master

In case there are conflicts, resolve them as you normally would and do:

    git rebase --continue

If you don't know how to resolve the conflicts, you can always safely abort the whole process and go back to your branch as it was before you started rebasing:

    git rebase --abort

If you are done and want to generate your patches conforming to the latest repository changes, use the following to edit, squash and reorder your commits:

    git rebase -i origin/master

At the prompt, you can follow the instructions to squash, edit and reorder accordingly. You can also do this step from IntelliJ with a visual editor that allows you to select what to edit/squash/reorder. You can also take a look at this nice tutorial on how to use interactive rebase.

### 4. How to make your commits

It is okay to have a list of commits (numbered) somewhat like this in your local tree:

1. added function X
2. fixed a b and c on X
3. b was actually d
4. started creating feature Y but had to go to the bathroom
5. added Y
6. found bug in X, fixed with e
7. added Z
8. fixed bug in Z with f

Before you can send your tools to us, you have to organize these commits so they tell a meaningful history and are self contained. To achieve this you will need to rebase so you can squash, edit and reorder your commits. This tree makes a lot of sense for your development process, but it makes no sense in the main repository history, as it becomes hard to pick/revert commits and understand the history at a glance. After rebasing, you should edit your commits to look like this:

1. added X (including commits 2, 3 and 6)
2. added Y (including commits 4 and 5)
3. added Z (including commits 7 and 8)

Use your commit messages wisely to help quick processing of your patches.
Make sure the first line of each commit message (the title) has fewer than 50 characters. Add a blank line and write a paragraph or more explaining what this commit represents (now that it is a package of multiple commits). It is important to have the 50-character title because this is all we see when we look at an extended history to find bugs, and it is also our quick way to remember what the commit does to the repository.

A patch should be self contained, meaning that if we decide to adopt features X and Z but not Y, we should be able to do so by applying only patches 1 and 3. If your patches are co-dependent, you should say so in the commits and justify why you didn't squash the commits together into one tool.

### 5. How to generate the patches

To generate patches, use:

    git format-patch since

The since parameter is the commit after which patches are generated (that commit itself is not included). For example, HEAD~3 will generate patches for HEAD~2, HEAD~1 and HEAD. You can also specify the commit by its id or by using the head of a branch. This is where using branches will make your life easier. If master is always up to date with the main repo with no changes, you can do:

    git format-patch master    # provided your master is up to date

This will generate a patch for each commit you've created, and you can simply e-mail them as attachments to us.

By default, the forum does not send notification messages about new comments or discussions. If you want to turn on notifications or customize the type of notifications you want to receive (email, popup message etc.), you need to do the following:

• Go to your profile page by clicking on your user name (in blue box, top left corner);
• Click on "Edit Profile" (button with silhouette of person, top right corner);
• In the menu on the left, click on "Notification Preferences";
• Select the categories that you want to follow and the type of notification you want to receive.
• Be sure to click on Save Preferences.
To specifically get new GATK announcements, scroll down to "Category Notifications" and tick the "Announcements" category for email notification for discussions (and comments if you really want to know everything).

This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.

### Overview

As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue).

### Multi-threading options

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

• -nt / --num_threads controls the number of data threads sent to the processor
• -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

For more information on how these multi-threading options work, please read the primer on parallelism for the GATK.

#### Memory considerations for multi-threading

Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 GB of memory, and you use -nt 4, the multithreaded run will use 8 GB of memory. In contrast, CPU threads will share the memory allocated to their “mother” data thread, so you don’t need to worry about allocating memory based on the number of CPU threads you use.

#### Additional consideration when using -nct with versions 2.2 and 2.3

Because of the way the -nct option was originally implemented, in versions 2.2 and 2.3 there is one CPU thread that is reserved by the system to “manage” the rest. So if you use -nct, you’ll only really start seeing a speedup with -nct 3 (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation that will be available in versions 2.4 and up.
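
The two rules of thumb above (memory scales with -nt but not with -nct, and one CPU thread is reserved in versions 2.2 and 2.3) can be written down as a small Python sketch for planning purposes. The function names are hypothetical, and the handling of -nct 1 and -nct 2 is our reading of the text rather than something taken from the GATK source.

```python
def total_memory_gb(single_run_gb: float, nt: int = 1) -> float:
    """Memory needed for a run with `nt` data threads.

    Each data thread needs the full single-run amount; CPU threads (-nct)
    share their data thread's memory, so they do not appear here.
    """
    if nt < 1:
        raise ValueError("nt must be at least 1")
    return single_run_gb * nt


def working_cpu_threads(nct: int, reserves_manager_thread: bool) -> int:
    """Effective working CPU threads per data thread.

    Assumption: when `reserves_manager_thread` is True (versions 2.2 and 2.3)
    and more than one CPU thread is requested, one thread only manages the rest.
    """
    if nct < 1:
        raise ValueError("nct must be at least 1")
    if reserves_manager_thread and nct > 1:
        return nct - 1
    return nct


print(total_memory_gb(2, nt=4))        # 8, matching the 2 GB x 4 example above
print(working_cpu_threads(3, True))    # 2 working threads with -nct 3
```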
### Scatter-gather

For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

### Applicability of parallelism to the major GATK tools

Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.

| Tool | Full name | Type of traversal | NT | NCT | SG |
|------|-----------|-------------------|----|-----|----|
| RTC | RealignerTargetCreator | RodWalker | + | - | - |
| IR | IndelRealigner | ReadWalker | - | - | + |
| BR | BaseRecalibrator | LocusWalker | - | + | + |
| PR | PrintReads | ReadWalker | - | + | - |
| RR | ReduceReads | ReadWalker | - | - | + |
| UG | UnifiedGenotyper | LocusWalker | + | + | + |

### Recommended configurations

The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.

| Tool | RTC | IR | BR | PR | RR | UG |
|------|-----|----|----|----|----|----|
| Available modes | NT | SG | NCT,SG | NCT | SG | NT,NCT,SG |
| Cluster nodes | 1 | 4 | 4 | 1 | 4 | 4 / 4 / 4 |
| CPU threads (-nct) | 1 | 1 | 8 | 4-8 | 1 | 3 / 6 / 24 |
| Data threads (-nt) | 24 | 1 | 1 | 1 | 1 | 8 / 4 / 1 |
| Memory (GB) | 48 | 4 | 4 | 4 | 4 | 32 / 16 / 4 |

Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue. For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

Note: only do this if you have been explicitly asked to do so.
### Scenario:

You posted a question about a problem you had with GATK tools, we answered that we think it's a bug, and we asked you to submit a detailed bug report.

### Here's what you need to provide:

• The exact command line that you used when you had the problem (in a text file)
• The full log output (program output in the console) from the start of the run to the end or error message (in a text file)
• A snippet of the BAM file if applicable and the index (.bai) file associated with it
• If a non-standard reference (i.e. not available in our resource bundle) was used, we need the .fasta, .fai, and .dict files for the reference
• Any other relevant files such as recalibration plots

A snippet file is a slice of the original BAM file which contains the problematic region and is sufficient to reproduce the error. We need it in order to reproduce the problem on our end, which is the first necessary step to finding and fixing the bug. We ask you to provide this as a snippet rather than the full file so that you don't have to upload (and we don't have to process) multi-gigabyte files.

### Here's how you create a snippet file:

• Look at the error message and see if it cites a specific position where the error occurred
• If not, identify what region caused the problem by running with the -L argument and progressively narrowing down the interval
• Once you have the region, use PrintReads with -L to write the problematic region (with 500 bp padding on either side) to a new file. This is your snippet file.
• Test your command line on this snippet file to make sure you can still reproduce the error on it.

### And finally, here's how you send us the files:

• Put all those files into a .zip or .tar.gz archive
• Upload them onto our FTP server as explained here (make sure you use the proper UPLOAD credentials)
• Post in the original discussion thread that you have done this
• Be sure to tell us the name of your archive file!
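
For the padding step in the snippet instructions above, a small helper can build the interval string to pass to -L. This is only a sketch: `padded_interval` is a hypothetical name, the contig:start-stop string form is assumed for the interval, and clamping to the contig end happens only if you supply the contig length yourself.

```python
from typing import Optional


def padded_interval(contig: str, start: int, end: int,
                    padding: int = 500,
                    contig_length: Optional[int] = None) -> str:
    """Return a contig:start-stop interval widened by `padding` bp on each side.

    Coordinates are 1-based and inclusive. The start is clamped to 1; the end
    is clamped only when `contig_length` is provided.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    padded_start = max(1, start - padding)
    padded_end = end + padding
    if contig_length is not None:
        padded_end = min(contig_length, padded_end)
    return f"{contig}:{padded_start}-{padded_end}"


# A problem reported at a single position, padded by the default 500 bp.
print(padded_interval("chr1", 873762, 873762))  # chr1:873262-874262
```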
### We will get back to you, hopefully with a bug fix, as soon as we can.

Imagine a simple question like, "What's the depth of coverage at position A of the genome?"

First, you are given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question is then very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. The algorithm then takes several hours to compute this value.

Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.

This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and with less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. This is a fundamental rule of working with the GATK, and the reason for the Central Dogma of the GATK:

#### All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the canonical reference sequences.

You may have noticed that a lot of the scores that are output by the GATK are in Phred scale.
The Phred scale was originally used to represent base quality scores emitted by the Phred program in the early days of the Human Genome Project (see this Wikipedia article for more historical background). Phred-scaled scores are now widely used to represent probabilities and confidence scores in other contexts of genome science.

### Phred scale in context

In the context of sequencing, Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer.

In the context of variant calling, Phred-scaled quality scores can be used to represent many types of probabilities. The most commonly used in GATK is the QUAL score, or variant quality score. It is used in much the same way as the base quality score: the variant quality score is a Phred-scaled estimate of how confident we are that the variant caller correctly identified that a given genome position displays variation in at least one sample.

### Phred scale in practice

In today’s sequencing output, by convention, Phred-scaled base quality scores range from 2 to 63. However, Phred-scaled quality scores in general can range anywhere from 0 to infinity. A higher score indicates a higher probability that a particular decision is correct, while conversely, a lower score indicates a higher probability that the decision is incorrect.

The Phred quality score (Q) is logarithmically related to the error probability (E).

$$Q = -10 \log_{10} E$$

So we can interpret this score as an estimate of error, where the error is e.g. the probability that the base is called incorrectly by the sequencer, but we can also interpret it as an estimate of accuracy, where the accuracy is e.g. the probability that the base was identified correctly by the sequencer.
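
As a quick illustration of the relationship between Q and E stated above, the following Python sketch converts between the two and derives the accuracy as 1 - E. The function names are ours, chosen for this example only.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability E for a Phred score Q, i.e. E = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(e: float) -> float:
    """Phred score Q for an error probability E, i.e. Q = -10 * log10(E)."""
    if not 0 < e <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(e)


def phred_to_accuracy(q: float) -> float:
    """Accuracy A = 1 - E for a Phred score Q."""
    return 1 - phred_to_error(q)


# Print error and accuracy for a few common scores.
for q in (10, 20, 30):
    print(f"Q{q}: error={phred_to_error(q):.4f} accuracy={phred_to_accuracy(q):.4f}")
```

Because the scale is logarithmic, every additional 10 points of Q divides the error probability by 10.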
Depending on how we decide to express it, we can make the following calculations:

If we want the probability of error (E), we take:

$$E = 10^{-\frac{Q}{10}}$$

And conversely, if we want to express this as the estimate of accuracy (A), we simply take

$$\begin{eqnarray} A &=& 1 - E \nonumber \\ &=& 1 - 10^{-\frac{Q}{10}} \nonumber \end{eqnarray}$$

Here is a table of how to interpret a range of Phred Quality Scores. It is largely adapted from the Wikipedia page for Phred Quality Score.

For many purposes, a Phred Score of 20 or above is acceptable, because this means that whatever it qualifies is 99% accurate, with a 1% chance of error.

| Phred Quality Score | Error | Accuracy (1 - Error) |
|---------------------|-------|----------------------|
| 10 | 1/10 = 10% | 90% |
| 20 | 1/100 = 1% | 99% |
| 30 | 1/1000 = 0.1% | 99.9% |
| 40 | 1/10000 = 0.01% | 99.99% |
| 50 | 1/100000 = 0.001% | 99.999% |
| 60 | 1/1000000 = 0.0001% | 99.9999% |

And finally, here is a graphical representation of the Phred scores showing their relationship to accuracy and error probabilities.

The red line shows the error, and the blue line shows the accuracy. Of course, as error decreases, accuracy increases symmetrically.

Note: You can see that below Q20 (which is how we usually refer to a Phred score of 20), the curve is really steep, meaning that as the Phred score decreases, you lose confidence very rapidly. In contrast, above Q20, both of the graphs level out. This is why Q20 is a good cutoff score for many basic purposes.

This document describes "regular" (variants-only) VCF files. For information on the gVCF format produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.

### 1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion have been taken over by the Global Alliance for Genomics and Health Data Working group file format team.
VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.

### 2. Basic structure of a VCF file

The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:

    ##fileformat=VCFv4.0
    ##FILTER=<ID=LowQual,Description="QUAL < 50.0">
    ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
    ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
    ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
    ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
    ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
    ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
    ##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
    ##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
    ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
    ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
    ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
    ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
    ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
    ##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
    ##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
    chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
    chr1 877664 rs3828047 A G 3931.66 PASS AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
    chr1 899282 rs28548431 C T 71.77 PASS AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
    chr1 974165 rs9442391 T C 29.84 LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255

It seems a bit complex, but the structure of the file is actually quite simple:

    [HEADER LINES]
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
    chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
    chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
    chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
    chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.

### 3. How variation is represented

The first 6 columns of the VCF, which represent the observed variation, are easy to understand because they have a single, well-defined meaning.

• CHROM and POS: The CHROM and POS give the contig and position at which the variant occurs. For indels this is actually the base preceding the event, due to how indels are represented in a VCF.

• ID: The dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.

• REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).

• QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log10(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.

• FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call.
Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the records that pass the filters (the good ones have PASS in their FILTER field) and those that fail (the FILTER field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis without any of them being explicitly marked as PASS. You should avoid such a situation by always filtering raw variant calls before analysis.

For more details about these fields, please see this page.

In the excerpt shown above, here is how we interpret the line corresponding to each variant:

• chr1:873762 is a novel T/G polymorphism, found with very high confidence (QUAL = 5231.78)
• chr1:877664 is a known A/G SNP (named rs3828047), found with very high confidence (QUAL = 3931.66)
• chr1:899282 is a known C/T SNP (named rs28548431), but has a relatively low confidence (QUAL = 71.77)
• chr1:974165 is a known T/C SNP but we have so little evidence for this variant in our data that although we write out a record for it (for book keeping, really) our statistical evidence is so low that we filter the record out as a bad site, as indicated by the "LowQual" annotation.

### 4. How genotypes are represented

The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

    chr1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
    chr1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
    chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26

Looking at that last column, here is what the tags mean:

• GT: The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate

In the three examples above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.

• GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1, or 1/1 under the model built for the NGS dataset. In practice, the GQ is simply the second most likely PL minus the most likely PL. Because the most likely PL is always 0, GQ = second lowest PL - 0. If the second most likely PL is greater than 99, we still assign a GQ of 99, so the highest value of GQ is 99.

• AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).

• PL: This field provides the likelihoods of the given genotypes (here, 0/0, 0/1, and 1/1). These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the heterozygous case, this is L(data given that the true genotype is 0/1).
The most likely genotype (given in the GT field) is scaled so that its P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.

With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.

    chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26

At this site, the called genotype is GT = 0/1, which is C/T. The confidence indicated by GQ = 25.92 isn't so good, largely because there were only a total of 4 reads at this site (DP = 4), 1 of which was REF (= had the reference base) and 3 of which were ALT (= had the alternate base), as indicated by AD = 1,3. The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0). There's a chance that the subject is "hom-var" (= homozygous with the variant allele) since PL(1/1) = 26, which corresponds to 10^(-2.6), or 0.0025, but either way, it's clear that the subject is definitely not "hom-ref" (= homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.

### 5. Understanding annotations

Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.
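
To show the two layouts side by side, here is a minimal Python sketch that splits an INFO string into tag-value pairs and pairs the FORMAT tags with one sample column, using values from the records discussed in this document. The helper names `parse_info` and `parse_genotype` are hypothetical, all values are kept as strings, and a dedicated VCF library would be the better choice for real analyses.

```python
def parse_info(info: str) -> dict:
    """Split an INFO string into {tag: value}; flag tags such as DB map to True."""
    result = {}
    for entry in info.split(";"):
        if not entry:
            continue
        tag, sep, value = entry.partition("=")
        result[tag] = value if sep else True
    return result


def parse_genotype(format_field: str, sample_field: str) -> dict:
    """Pair the tags in the FORMAT column with the values in a sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


info = parse_info("AC=2;AF=1.00;AN=2;DB;DP=105")
genotype = parse_genotype("GT:AD:DP:GQ:PL", "0/1:1,3:4:25.92:103,0,26")

# PL values are Phred-scaled, so 10^(-PL/10) gives each genotype's
# likelihood relative to the most likely genotype (which has PL = 0).
pl = [float(x) for x in genotype["PL"].split(",")]
relative_likelihoods = [10 ** (-p / 10) for p in pl]

print(info["AC"], info["DB"])   # 2 True
print(genotype["GT"], pl)       # 0/1 [103.0, 0.0, 26.0]
print(sorted(pl)[1])            # 26.0, the second lowest PL, close to the reported GQ
```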

    chr1 873762 [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
    chr1 877664 [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
    chr1 899282 [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148

Here are some commonly used built-in annotations and what they mean:

| Annotation tag in VCF | Meaning |
|-----------------------|---------|
| AC,AF,AN | See the Technical Documentation for Chromosome Counts. |
| DB | If present, then the variant is in dbSNP. |
| DP | See the Technical Documentation for Coverage. |
| DS | Were any of the samples downsampled because of too much coverage? |
| Dels | See the Technical Documentation for SpanningDeletions. |
| MQ and MQ0 | See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero. |
| BaseQualityRankSumTest | See the Technical Documentation for Base Quality Rank Sum Test. |
| MappingQualityRankSumTest | See the Technical Documentation for Mapping Quality Rank Sum Test. |
| ReadPosRankSumTest | See the Technical Documentation for Read Position Rank Sum Test. |
| HRun | See the Technical Documentation for Homopolymer Run. |
| HaplotypeScore | See the Technical Documentation for Haplotype Score. |
| QD | See the Technical Documentation for Qual By Depth. |
| VQSLOD | Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model. |
| FS | See the Technical Documentation for Fisher Strand. |
| SB | How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls). |

If this is your first rodeo, you're probably asking yourself:

• What can GATK do for me? Identify variants in a bunch of sample sequences, with great sensitivity and specificity.

• How do I get GATK to do that?
You run the recommended Best Practices steps, one by one, from start to finish, as described in the Best Practices documentation.

• No but really, how do I know what to do? For each step in the Best Practices, there is a tutorial that details how to run the tools involved, with example commands. The idea is to daisy-chain all those tutorials, in the order that they're referenced in the Best Practices doc, into a pipeline.

• Oh, you mean I can just copy/paste all the tutorial commands as they are? Not quite, because there are a few things that need to be tweaked. For example, the tutorials use the -L/--intervals argument to restrict analysis for demo purposes, but depending on your data and experimental design, you may need to remove it (e.g. for WGS) or adapt it (for WEx). Hopefully it's explained clearly enough in the tutorials.

• Why don't you just provide one script that runs all the tools? It's really hard to build and maintain a one-size-fits-all pipeline solution. Really really hard. And not nearly as much fun as developing new analysis methods. We do provide a pipelining program called Queue that has the advantage of understanding GATK argument syntax natively, but you still have to actually write scripts yourself in Scala to use it. Sorry. Maybe one day we will be able to offer GATK analysis on the Cloud. But not today.

• What if I want to know what a command line argument does or change a parameter? First, check out the basic GATK command syntax FAQ if it's your first time using GATK, then consult the relevant Tool Documentation page. Keep in mind that some arguments are "engine parameters" that are shared by many tools, and are listed in a separate document. Also, you can always use the search box to find an argument description really quickly.

• The documentation seems chaotic. Is there any logic to how it's organized? Sort of. (And, ouch. Tough crowd.) The main category names should be obvious enough (if not, see the "Documentation Categories" tab).
Within categories, everything is just in alphabetical order. In the future, we're going to try to provide more use-case based structure, but for now this is what we have. The best way to find practical information is to either go from the Best Practices doc (which provides links to all FAQs, method articles and tutorials directly related to a given step), or use the search box and search-by-tag functions (see the "Search tab"). Be sure to also check out the Presentations section, which provides workshop materials and videos that explain a lot of the motivation and methods behind the Best Practices.
• Does GATK include other tools besides the ones in the Best Practices? Oh sure, there's a whole bunch of them, all listed in the Tool Documentation section, categorized by type of analysis. But be aware that anything that's not part of the Best Practices is most likely either a tool that was written for a one-off analysis years ago, an experimental feature that we're still not sure is actually useful, or an accessory utility that can be used in many different ways and takes expert inside knowledge to use properly. All these may be buggy, insufficiently documented, or both. We provide support for them as well as humanly possible but ultimately, you use them at your own risk.
• Why do the answers to these questions keep getting longer and longer? I don't know what you're talking about.
• What else should I know before I start? You should probably browse the titles of the Frequently Asked Questions -- there will be at least a handful you'll want to read, but it's hard for us to predict which ones.

There are four major organizational units for next-generation DNA sequencing processes that are used throughout the GATK documentation:

• Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.
• Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots from the same library. The DNA library and its preparation is the natural unit that is being sequenced. For example, if the library has limited complexity, then many sequences are duplicated and will result in a high duplication rate across lanes.
• Sample: A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Throughout our documentation, we treat samples as independent individuals whose genome sequence we are attempting to determine. Note that from this perspective, tumor / normal samples are different despite coming from the same individual.
• Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individuals in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples) we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.

Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost, since running these commands across all of your data simultaneously requires much more computing power. Please see the documentation for each step to understand the best way to group or partition your data for that particular process.

### Together is (almost always) better than alone

If you do joint analysis according to our Best Practices, your analysis will be greatly empowered by the ability to leverage population-wide information from a cohort of multiple samples.
It will allow you to detect variants with great sensitivity and genotype samples as accurately as possible. See the following sections for details of why that is if you are not convinced.

Once you’re convinced: the good news is you don’t actually have to call variants on all your samples together. Since GATK 3.0, you can use the HaplotypeCaller to call variants individually per-sample in -ERC GVCF mode, followed by a joint genotyping step on all samples in the cohort, as described in this method article. This achieves what we call incremental joint discovery, providing you with all the benefits of classic joint calling (as described below) without the drawbacks.

Why "almost always"? Because some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method. For most studies, this is an acceptable tradeoff, but if you are very specifically looking for singletons, you may need to do some careful evaluation before committing to this method.

### Previously established cohort analysis strategies

Until recently, three strategies were available for variant discovery in multiple samples:

- single sample calling: sample BAMs are analyzed individually, and individual call sets are combined in a downstream processing step;
- batch calling: sample BAMs are analyzed in separate batches, and batch call sets are merged in a downstream processing step;
- joint calling: variants are called simultaneously across all sample BAMs, generating a single call set for the entire cohort.

The best of these, from the point of view of variant discovery, was joint calling, because it provided the following benefits:

#### 1. Clearer distinction between homozygous reference sites and sites with missing data

Batch-calling does not output a genotype call at sites where no member in the batch has evidence for a variant; it is thus impossible to distinguish such sites from locations missing data.
In contrast, joint calling emits genotype calls at every site where any individual in the call set has evidence for variation.

#### 2. Greater sensitivity for low-frequency variants

By sharing information across all samples, joint calling makes it possible to “rescue” genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location. However, this does not apply to singletons, which are unique to a single sample. To minimize the chance of missing singletons, we increase the cohort size -- so that singletons themselves have less chance of happening in the first place.

#### 3. Greater ability to filter out false positives

The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies above, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.

Figure 1: (left) Power of joint calling in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable. (right) Importance of joint calling to square off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. However, merging the results from single sample calling will incorrectly treat both of these samples identically as being non-informative.

### Drawbacks of joint calling

There are two major problems with the joint calling strategy.
- Scaling & infrastructure: Joint calling scales very badly -- the calculations involved in variant calling (especially by methods like the HaplotypeCaller’s) become exponentially more computationally costly as you add samples to the cohort. If you don't have a lot of compute available, you run into limitations pretty quickly. Even here at Broad where we have fairly ridiculous amounts of compute available, we can't brute-force our way through the numbers for the larger cohort sizes that we're called on to handle.
- The N+1 problem: When you’re getting a large-ish number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time, and you analyze each batch as it comes in (whether it’s because the analysis is time-sensitive or your PI is breathing down your neck). But that’s not joint calling, that’s batch calling, and it doesn’t give you the same significant gains that joint calling can give you. Unfortunately the joint calling approach doesn’t allow for incremental analysis -- every time you get even one new sample sequence, you have to re-call all samples from scratch.

#### Both of these problems are solved by the single-sample calling + joint genotyping workflow.

Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data. As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.
Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

• If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
• If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
• If you want to analyze more than 100 samples at a time (for performance reasons) (versions 2.x)

### 1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

### 2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"

• QUAL is a key: the name of the annotation we want to look at
• 30.0 is a value: the threshold that we want to use to evaluate variant quality against
• > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

### 3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND". Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".

### 4. Important caveats

#### Sensitivity to case and type

• Case: Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.
• Type: The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

#### Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.

### 5. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves.
But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

#### Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The best way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

#### Examples using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of a JEXL expression that finds all novel variants in the total set that have allele frequency > 0.25 but not 1, are not filtered, and are non-reference in sample 01-0263:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

#### Examples using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

#### Example using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'

### 1. Operating system

The GATK runs natively on most if not all flavors of UNIX, which includes MacOSX, Linux and BSD. It is possible to get it running on Windows using Cygwin, but we don't provide any support or instructions for that.

### 2. Java

The GATK is a Java-based program, so you'll need to have Java installed on your machine. The Java version should be 1.7 (at this time we don't support 1.8, and 1.6 no longer works). You can check what version you have by typing java -version at the command line. This article has some more details about what to do if you don't have the right version. Note that at this time we only support the Sun/Oracle Java JDK; OpenJDK is not supported.

### 3. Familiarity with command-line programs

The GATK does not have a Graphical User Interface (GUI).
You don't open it by clicking on the .jar file; you have to use the Console (or Terminal) to input commands. If this is all new to you, we recommend you first learn about that and follow some online tutorials before trying to use the GATK. It's not difficult but you'll need to learn some jargon and get used to living without a mouse...

### 4. Other

Some of the GATK tools produce plots using R, so if you want to get the plots you'll need to have R installed, as well as several R libraries. Full details can be found in the Tutorial on prerequisites.

So you're going to a GATK workshop, and you've been selected to participate in a hands-on session? Fantastic! We're looking forward to walking you through some exercises that will help you master the tools. However -- in order to make the best of the time we have together, we'd like to ask you to come prepared. Specifically, please complete the following steps:

#### - Download and install all necessary software as described in this tutorial.

We don't always get around to using RStudio, but all others are required. Note that if you are a Mac user, you may need to install Apple's XCode Tools, which are free but fairly large, so plan ahead because it can take a loooong time to download them if your connection is anything less than super-fast.

#### - Download one of the following tutorial bundles from our FTP server:

• The basic tutorial data bundle. Speaking of long downloads, this one is also pretty big (740M), so again, don't leave it until the last minute. This mini-bundle contains chromosome 20 of the human genome reference, a BAM file snippet and accompanying dbsnp + known indels files.
• OR the advanced tutorial data bundle. If you are attending an advanced hands-on session, you'll need some extra files that aren't in the basic tutorial bundle. This add-on bundle is also quite large (870M) because it contains the complete human genome and a complete whole-genome callset.
Note that this will take around 4G of space on your hard drive once it's uncompressed, so make sure you have plenty of space available on your machine.

At the start of the session, we'll give you handouts with a walkthrough of the session so you can follow along and take notes (highly recommended!). With that, you should be all set. See you soon!

VariantEval accepts two types of modules: stratification and evaluation modules.

• Stratification modules will stratify (group) the variants based on certain properties.
• Evaluation modules will compute certain metrics for the variants.

### CpG

CpG is a three-state stratification:

• The locus is a CpG site ("CpG")
• The locus is not a CpG site ("non_CpG")
• The locus is either a CpG or not a CpG site ("all")

A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.

### EvalRod

EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.

### Sample

Sample is an N-state stratification, where N is the number of samples in the eval files.

### Filter

Filter is a three-state stratification:

• The locus passes QC filters ("called")
• The locus fails QC filters ("filtered")
• The locus either passes or fails QC filters ("raw")

### FunctionalClass

FunctionalClass is a four-state stratification:

• The locus is a synonymous site ("silent")
• The locus is a missense site ("missense")
• The locus is a nonsense site ("nonsense")
• The locus is of any functional class ("any")

### CompRod

CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.
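
The CpG definition given in the CpG section above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not VariantEval's actual implementation; the function names `is_cpg_site` and `cpg_strata` and the 0-based string indexing are assumptions made for the example.

```python
from typing import List


def is_cpg_site(reference: str, pos: int) -> bool:
    """Return True if the reference base at 0-based index `pos` is a C
    and the adjacent base in the 3' direction is a G.

    Hypothetical helper: it applies only the definition quoted in the text.
    """
    if pos < 0 or pos + 1 >= len(reference):
        return False  # no 3' neighbour, so the locus cannot be a CpG site
    return reference[pos].upper() == "C" and reference[pos + 1].upper() == "G"


def cpg_strata(reference: str, pos: int) -> List[str]:
    """List the CpG stratification states that a locus falls into."""
    state = "CpG" if is_cpg_site(reference, pos) else "non_CpG"
    # Every locus also belongs to the catch-all "all" state.
    return [state, "all"]


print(cpg_strata("ACGT", 1))  # the C at index 1 is followed by a G
print(cpg_strata("ACGT", 2))  # the G at index 2 is not a C
```

Each locus lands in exactly one of the two specific states plus "all", which is why the stratification is described as three-state.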
### Degeneracy

Degeneracy is a six-state stratification:

• The underlying base position in the codon is 1-fold degenerate ("1-fold")
• The underlying base position in the codon is 2-fold degenerate ("2-fold")
• The underlying base position in the codon is 3-fold degenerate ("3-fold")
• The underlying base position in the codon is 4-fold degenerate ("4-fold")
• The underlying base position in the codon is 6-fold degenerate ("6-fold")
• The underlying base position in the codon is degenerate at any level ("all")

See the [Wikipedia page on degeneracy](http://en.wikipedia.org/wiki/Genetic_code#Degeneracy) for more information.

### JexlExpression

JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See Using JEXL expressions.

### Novelty

Novelty is a three-state stratification:

• The locus overlaps the knowns comp track (usually the dbSNP track) ("known")
• The locus does not overlap the knowns comp track ("novel")
• The locus either overlaps or does not overlap the knowns comp track ("all")

### CountVariants

CountVariants is an evaluation module that computes the following metrics:

| Metric | Definition |
|---|---|
| nProcessedLoci | Number of processed loci |
| nCalledLoci | Number of called loci |
| nRefLoci | Number of reference loci |
| nVariantLoci | Number of variant loci |
| variantRate | Variants per loci rate |
| variantRatePerBp | Number of variants per base |
| nSNPs | Number of SNP loci |
| nInsertions | Number of insertions |
| nDeletions | Number of deletions |
| nComplex | Number of complex loci |
| nNoCalls | Number of no-call loci |
| nHets | Number of het loci |
| nHomRef | Number of hom ref loci |
| nHomVar | Number of hom var loci |
| nSingletons | Number of singletons |
| heterozygosity | heterozygosity per locus rate |
| heterozygosityPerBp | heterozygosity per base pair |
| hetHomRatio | heterozygosity to homozygosity ratio |
| indelRate | indel rate (insertion count + deletion count) |
| indelRatePerBp | indel rate per base pair |
| deletionInsertionRatio | deletion to insertion ratio |

### CompOverlap

CompOverlap is an evaluation module that computes the following metrics:

| Metric | Definition |
|---|---|
| nEvalSNPs | number of eval SNP sites |
| nCompSNPs | number of comp SNP sites |
| novelSites | number of eval sites outside of comp sites |
| nVariantsAtComp | number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same) |
| compRate | percentage of eval sites at comp sites |
| nConcordant | number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele) |
| concordantRate | the concordance rate |

#### Understanding the output of CompOverlap

A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):

$ grep -v '##' dbsnp.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10327   rs112750067     T       C       .       .       ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132


Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:

$ grep -v '##' eval_correct_allele.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
1       10327   .       T       C       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059

Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:

$ grep -v '##' eval_incorrect_allele.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
1       10327   .       T       A       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059
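
To make the difference between nVariantsAtComp and nConcordant concrete, here is a minimal Python sketch that applies the definitions above to the three example records. It is not how VariantEval is implemented; the `Site` tuple and the `comp_overlap` function are hypothetical names introduced for illustration, and only single-allele SNP records are handled.

```python
from typing import Dict, Iterable, NamedTuple


class Site(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def comp_overlap(eval_sites: Iterable[Site], comp_sites: Iterable[Site]) -> Dict[str, int]:
    """Count eval sites that share a locus with a comp site, and how many
    of those also carry the same alternate allele (illustrative only)."""
    comp_by_locus = {(s.chrom, s.pos): s for s in comp_sites}
    counts = {"novelSites": 0, "nVariantsAtComp": 0, "nConcordant": 0}
    for site in eval_sites:
        comp = comp_by_locus.get((site.chrom, site.pos))
        if comp is None:
            counts["novelSites"] += 1  # locus absent from the comp track
            continue
        counts["nVariantsAtComp"] += 1  # same locus, allele not yet compared
        if comp.alt == site.alt:
            counts["nConcordant"] += 1  # same locus and same alternate allele
    return counts


# The three records shown above, reduced to locus and alleles.
dbsnp = [Site("1", 10327, "T", "C")]
eval_correct_allele = [Site("1", 10327, "T", "C")]
eval_incorrect_allele = [Site("1", 10327, "T", "A")]

print(comp_overlap(eval_correct_allele, dbsnp))
print(comp_overlap(eval_incorrect_allele, dbsnp))
```

Both detection sets count one variant at a comp site, but only the set whose alternate allele matches dbSNP counts it as concordant.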


Running VariantEval with just the CompOverlap module:

$ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
-R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
-L 1:10327 \
-B:dbsnp,VCF dbsnp.vcf \
-B:eval_correct_allele,VCF eval_correct_allele.vcf \
-B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
-noEV \
-EV CompOverlap \
-o eval.table


We find that the eval.table file contains the following:

$ grep -v '##' eval.table | column -t
CompOverlap  CompRod  EvalRod                JexlExpression  Novelty  nEvalVariants  nCompVariants  novelSites  nVariantsAtComp  compRate      nConcordant  concordantRate
CompOverlap  dbsnp    eval_correct_allele    none            all      1              1              0           1                100.00000000  1            100.00000000
CompOverlap  dbsnp    eval_correct_allele    none            known    1              1              0           1                100.00000000  1            100.00000000
CompOverlap  dbsnp    eval_correct_allele    none            novel    0              0              0           0                0.00000000    0            0.00000000
CompOverlap  dbsnp    eval_incorrect_allele  none            all      1              1              0           1                100.00000000  0            0.00000000
CompOverlap  dbsnp    eval_incorrect_allele  none            known    1              1              0           1                100.00000000  0            0.00000000
CompOverlap  dbsnp    eval_incorrect_allele  none            novel    0              0              0           0                0.00000000    0            0.00000000

As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.

### TiTvVariantEvaluator

TiTvVariantEvaluator is an evaluation module that computes the following metrics:

| Metric | Definition |
|---|---|
| nTi | number of transition loci |
| nTv | number of transversion loci |
| tiTvRatio | the transition to transversion ratio |
| nTiInComp | number of comp transition sites |
| nTvInComp | number of comp transversion sites |
| TiTvRatioStandard | the transition to transversion ratio for comp sites |

All analyses done with the GATK typically involve several (though not necessarily all) of the following inputs:

• Reference genome sequence
• Sequencing reads
• Intervals of interest
• Reference-ordered data

This article describes the corresponding file formats that are acceptable for use with the GATK.

### 1. Reference Genome Sequence

The GATK requires the reference sequence to be in a single FASTA file, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. All the standard IUPAC bases are accepted, but keep in mind that non-standard bases (i.e.
other than ACGT, such as W for example) will be ignored (i.e. those positions in the genome will be skipped).

Some users have reported having issues with reference files that have been stored or modified on Windows filesystems. The issues manifest as "10" characters (corresponding to encoded newlines) inserted in the sequence, which cause the GATK to quit with an error. If you encounter this issue, you will need to re-download a valid master copy of the reference file, or clean it up yourself.

Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see this article for more information on preparing FASTA reference sequences for use with the GATK.

#### Important note about human genome reference versions

If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The names and order of the contigs in the reference you used must exactly match those of one of the official references' canonical orderings. These are defined by historical karyotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though technically unnecessary, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.

Our Best Practice recommendation is that you use a standard GATK reference from the GATK resource bundle.

### 2. Sequencing Reads

The only input format for sequence reads that the GATK itself supports is the [Sequence Alignment/Map (SAM)] format.
See [SAM/BAM] for more details on the SAM/BAM format as well as Samtools and Picard, two complementary sets of utilities for working with SAM/BAM files.

If you don't find the information you need in this section, please see our FAQs on BAM files.

If you are starting out your pipeline with raw reads (typically in FASTQ format) you'll need to make sure that when you map those reads to the reference and produce a BAM file, the resulting BAM file is fully compliant with the GATK requirements. See the Best Practices documentation for detailed instructions on how to do this.

In addition to being in SAM format, your file must satisfy the following constraints in order to be used with the GATK:

• The file must be binary (with .bam file extension).
• The file must be indexed.
• The file must be sorted in coordinate order with respect to the reference (i.e. the contig ordering in your bam must exactly match that of the reference you are using).
• The file must have a proper bam header with read groups. Each read group must contain the platform (PL) and sample (SM) tags. For the platform value, we currently support 454, LS454, Illumina, Solid, ABI_Solid, and CG (all case-insensitive).
• Each read in the file must be associated with exactly one read group.
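
As a rough illustration of the read group requirements listed above, the following Python sketch checks the @RG lines of a SAM text header for the PL and SM tags. The function name `check_read_groups` is hypothetical, and the check is deliberately incomplete: it does not verify sorting, indexing, or that each read carries a read group. The GATK performs its own validation, so treat this only as a quick pre-flight sketch.

```python
from typing import List

# Platform values listed in the text above, compared case-insensitively.
SUPPORTED_PLATFORMS = {"454", "LS454", "ILLUMINA", "SOLID", "ABI_SOLID", "CG"}


def check_read_groups(header_lines: List[str]) -> List[str]:
    """Return a list of problems found in the @RG lines of a SAM header."""
    problems = []
    read_groups = [line for line in header_lines if line.startswith("@RG")]
    if not read_groups:
        problems.append("no @RG lines found")
    for line in read_groups:
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        rg_id = tags.get("ID", "<unknown>")
        if "SM" not in tags:
            problems.append(f"read group {rg_id}: missing SM tag")
        if "PL" not in tags:
            problems.append(f"read group {rg_id}: missing PL tag")
        elif tags["PL"].upper() not in SUPPORTED_PLATFORMS:
            problems.append(f"read group {rg_id}: unsupported platform {tags['PL']}")
    return problems


# Made-up header: the second read group is deliberately incomplete.
header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@RG\tID:ERR000162\tPL:ILLUMINA\tLB:g1k-sc-NA12776-CEU-1\tSM:NA12776",
    "@RG\tID:broken\tLB:lib1",
]
for problem in check_read_groups(header):
    print(problem)
```

The header lines could come from any tool that prints a BAM header as text; how you obtain them is left out of the sketch.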
Below is an example of a well-formed SAM file header and records (with the @SQ dictionary truncated to show only the first two chromosomes for brevity):

@HD VN:1.0 GO:none SO:coordinate
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
@RG ID:ERR000162 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR000252 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001684 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001685 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@PG ID:GATK TableRecalibration VN:v2.2.16 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, default_read_group=DefaultReadGroup, default_platform=Illumina, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, exception_if_no_tile=false, pQ=5, maxQ=40, smoothing=137 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:b4eb71ee878d3706246b7c1dbef69299
@PG ID:bwa VN:0.5.5
ERR001685.4315085 16 1 9997 25 35M * 0 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@? XT:A:U XN:i:4 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001685 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???>
ERR001689.1165834 117 1 9997 0 * = 9997 0 CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@ RG:Z:ERR001689 OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>???????
ERR001689.1165834 185 1 9997 25 35M = 9997 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT 758A:?>>8?=@@>>?;4<>=??@@==??@?==?8 XT:A:U XN:i:4 SM:i:25 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001689 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>>
ERR001688.2681347 117 1 9998 0 * = 9998 0 CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG 5@BA@A6B???A?B??>B@B??>B@B??>BAB??? RG:Z:ERR001688 OQ:Z:=>>>><4><<?><??????????????????????

#### Note about fixing BAM files with alternative sortings

The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To re-sort the BAM file please use ReorderSam.

### 3. Intervals of interest

If you don't find the information you need in this section, please see our FAQs on interval lists.

The GATK accepts interval files for processing subsets of the genome in Picard-style interval lists. These files typically have an extension such as '.list' or, more explicitly, '.interval_list', and look like this:

@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens
@SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens
1 30366 30503 + target_1
1 69089 70010 + target_2
1 367657 368599 + target_3
1 621094 622036 + target_4
1 861320 861395 + target_5
1 865533 865718 + target_6
...

consisting of a SAM-file-like sequence dictionary (the header), and targets in the form of <chr> <start> <stop> + <target_name>. These interval lists are tab-delimited. They are also 1-based (first position in the genome is position 1, not position 0).
The easiest way to create such a file is to combine your reference file's sequence dictionary (the file stored alongside the reference fasta file with the .dict extension) and your intervals into one file.

You can also specify a list of intervals formatted as <chr>:<start>-<stop> (one interval per line). No sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.

Finally, we also accept BED-style interval lists. Warning: this file format is 0-based for the start coordinates, so start coordinates taken from 1-based formats should be decreased by 1.

### 4. Reference Ordered Data (ROD) file formats

The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:

-argumentName:name,type file

where name is the name in the GATK tool (like "eval" in VariantEval), type is the type of the file, such as VCF or dbSNP, and file is the path to the file containing the ROD data.

The GATK supports several common file formats for reading ROD data:

• VCF : VCF type, the recommended format for representing variant loci and genotype calls. The GATK will only process valid VCF files; VCFTools provides the official VCF validator. See here for a useful poster detailing the VCF specification.
• UCSC-formatted dbSNP : dbSNP type, UCSC dbSNP database output
• BED : BED type, a general purpose format for representing genomic interval data, useful for masks and other interval outputs.

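
Because BED start coordinates are 0-based while the <chr>:<start>-<stop> format is 1-based, converting between the two is a common source of off-by-one errors. The following is a minimal Python sketch of that conversion, not part of the GATK itself; the helper names `interval_to_bed` and `bed_to_interval` are hypothetical, and the sketch assumes simple contig names without colons and intervals that always carry both a start and a stop.

```python
def interval_to_bed(interval):
    """Convert a 1-based, inclusive '<chr>:<start>-<stop>' string to a BED row."""
    chrom, _, span = interval.partition(":")
    start_text, _, stop_text = span.partition("-")
    if not start_text or not stop_text:
        raise ValueError("expected <chr>:<start>-<stop>, got %r" % interval)
    start, stop = int(start_text), int(stop_text)
    # BED start is 0-based, so subtract 1; the BED end is exclusive,
    # which makes it numerically equal to the 1-based inclusive stop.
    return "%s\t%d\t%d" % (chrom, start - 1, stop)


def bed_to_interval(bed_row):
    """Convert the first three BED columns back to '<chr>:<start>-<stop>'."""
    chrom, start, end = bed_row.split("\t")[:3]
    return "%s:%d-%s" % (chrom, int(start) + 1, end)


# target_1 from the interval list example: start 30366 becomes 30365 in BED
print(interval_to_bed("1:30366-30503"))
print(bed_to_interval("1\t30365\t30503"))
```

Only the start coordinate shifts; the stop value is unchanged because the BED end position is exclusive.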
Please note that the BED format is 0-based while most other formats are 1-based.

Note that we no longer support the PED format. See here for converting .ped files to VCF.

If you need additional information on VCF files, please see our FAQs on VCF files here and here.

### 1. What it is and how it helps us improve the GATK

Since September, 2010, the GATK has had a "phone-home" feature that sends us information about each GATK run via the Broad filesystem (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). This feature is enabled by default.

The information provided by the phone-home feature is critical in driving improvements to the GATK:

• By recording detailed information about each error that occurs, it enables GATK developers to identify and fix previously-unknown bugs in the GATK. We are constantly monitoring the errors our users encounter and do our best to fix those errors that are caused by bugs in our code.
• It allows us to better understand how the GATK is used in practice and adjust our documentation and development goals for common use cases.
• It gives us a picture of which versions of the GATK are in use over time, and how successful we've been at encouraging users to migrate from obsolete or broken versions of the GATK to newer, improved versions.
• It tells us which tools are most commonly used, allowing us to monitor the adoption of newly-released tools and abandonment of outdated tools.
• It provides us with a sense of the overall size of our user base and the major organizations/institutions using the GATK.

### 2. What information is sent to us

Below are two example GATK Run Reports showing exactly what information is sent to us each time the GATK phones home.

#### A successful run:

<GATK-run-report>
  <id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
  <start-time>2012/03/10 20.21.19</start-time>
  <end-time>2012/03/10 20.21.19</end-time>
  <run-time>0</run-time>
  <walker-name>CountReads</walker-name>
  <svn-version>1.4-483-g63ecdb2</svn-version>
  <total-memory>85000192</total-memory>
  <max-memory>129957888</max-memory>
  <user-name>depristo</user-name>
  <host-name>10.0.1.10</host-name>
  <java>Apple Inc.-1.6.0_26</java>
  <machine>Mac OS X-x86_64</machine>
  <iterations>105</iterations>
</GATK-run-report>

#### A run where an exception has occurred:

<GATK-run-report>
  <id>yX3AnltsqIlXH9kAQqTWHQUd8CQ5bikz</id>
  <exception>
    <message>Failed to parse Genome Location string: 20:10,000,000-10,000,001x</message>
    <stacktrace class="java.util.ArrayList">
      <string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:377)</string>
      <string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
      <string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
      <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
      <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
      <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
      <string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
      <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
      <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
      <string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
    </stacktrace>
    <cause>
      <message>Position: &apos;10,000,001x&apos; contains invalid chars.</message>
      <stacktrace class="java.util.ArrayList">
        <string>org.broadinstitute.sting.utils.GenomeLocParser.parsePosition(GenomeLocParser.java:411)</string>
        <string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:374)</string>
        <string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
        <string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
        <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
        <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
        <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
        <string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
        <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
        <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
        <string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
      </stacktrace>
      <is-user-exception>false</is-user-exception>
    </cause>
    <is-user-exception>true</is-user-exception>
  </exception>
  <start-time>2012/03/10 20.19.52</start-time>
  <end-time>2012/03/10 20.19.52</end-time>
  <run-time>0</run-time>
  <walker-name>CountReads</walker-name>
  <svn-version>1.4-483-g63ecdb2</svn-version>
  <total-memory>85000192</total-memory>
  <max-memory>129957888</max-memory>
  <user-name>depristo</user-name>
  <host-name>10.0.1.10</host-name>
  <java>Apple Inc.-1.6.0_26</java>
  <machine>Mac OS X-x86_64</machine>
  <iterations>0</iterations>
</GATK-run-report>

Note that as of GATK 1.5 we no longer collect information about the command line executed, the working directory, or the tmp directory.

### 3. Disabling Phone Home

The GATK is currently in the process of evolving to require interaction with Amazon S3 as a normal part of each run. For this reason, and because the information contained in the GATK run reports is so critical in driving improvements to the GATK, we strongly discourage our users from disabling the phone-home feature.

At the same time, we recognize that some of our users do have legitimate reasons for needing to run the GATK with phone-home disabled, and we don't wish to make it impossible for these users to run the GATK.

#### Examples of legitimate reasons for disabling Phone Home

• Technical reasons: Your local network might have restrictions in place that don't allow the GATK to access external resources, or you might need to run the GATK in a network-less environment.
• Organizational reasons: Your organization's policies might forbid the dissemination of one or more pieces of information contained in the GATK run report.

For such users we have provided an -et NO_ET option in the GATK to disable the phone-home feature. To use this option in GATK 1.5 and later, you need to contact us to request a key. Instructions for doing so are below.

#### How to obtain and use a GATK key

To obtain a GATK key, please fill out the request form.

Running the GATK with a key is simple: you just need to append a -K your.key argument to your customary command line, where your.key is the path to the key file you obtained from us:

java -jar dist/GenomeAnalysisTK.jar \
    -T PrintReads \
    -I public/testdata/exampleBAM.bam \
    -R public/testdata/exampleFASTA.fasta \
    -et NO_ET \
    -K your.key

The -K argument is only necessary when running the GATK with the NO_ET option.

#### Troubleshooting key-related problems

• Corrupt/Unreadable/Revoked Keys: If you get an error message from the GATK saying that your key is corrupt, unreadable, or has been revoked, please email gsahelp@broadinstitute.org to ask for a replacement key.
• GATK Public Key Not Found: If you get an error message stating that the GATK public key could not be located or read, then something is likely wrong with your build of the GATK. If you're running the binary release, try downloading it again. If you're compiling from source, try doing an ant clean and re-compiling. If all else fails, please ask for help on our community forum.

### What does GSA use Phone Home data for?

We use the phone home data for three main purposes. First, we monitor the input logs for errors that occur in the GATK, and proactively fix them in the codebase. Second, we monitor the usage rates of the GATK in general and specific versions of the GATK to explain how widely used the GATK is to funding agencies and other potential supporters. Finally, we monitor adoption rates of specific GATK tools to understand how quickly new tools reach our users. Many of these analyses require us to aggregate the data by unique user, which is why we still collect the username of the individual who ran the GATK (as you can see in the plots). Examples of all three uses are shown in the Tableau graphs below, which update each night and are sent to the GATK members each morning for review.

Please note that GATK-Lite was retired in February 2013 when version 2.4 was released. See the announcement here.

You probably know by now that GATK-Lite is a free-for-everyone and completely open-source version of the GATK (licensed under the original MIT license). But what's in the box? What can GATK-Lite do -- or rather, what can it not do that the full version (let's call it GATK-Full) can? And what does that mean exactly, in terms of functionality, reliability and power?

To really understand the differences between GATK-Lite and GATK-Full, you need some more information on how the GATK works, and how we work to develop and improve it.

### First you need to understand what are the two core components of the GATK: the engine and tools (see picture below).

As explained here, the engine handles all the common work that's related to data access, conversion and traversal, as well as high-performance computing features. The engine is supported by an infrastructure of software libraries. If the GATK were a car, that would be the engine and chassis. What we call the **tools** are attached on top of that, and they provide the various analytical and processing functionalities like variant calling and base or variant recalibration. On your car, that would be headlights, airbags and so on.

### Second is how we work on developing the GATK, and what it means for how improvements are shared (or not) between Lite and Full.

We do all our development work on a single codebase. This means that everything --the engine and all tools-- is on one common workbench. There are not different versions that we work on in parallel -- that would be crazy to manage! That's why the version numbers of GATK-Lite and GATK-Full always match: if the latest GATK-Full version is numbered 2.1-13, then the latest GATK-Lite is also numbered 2.1-13.

The most important consequence of this setup is that when we make improvements to the infrastructure and engine, the same improvements will end up in GATK Lite and in GATK Full. So for the purposes of power, speed and robustness of the GATK that is determined by the engine, there is no difference between them.

For the tools, it's a little more complicated -- but not much. When we "build" the GATK binaries (the .jar files), we put everything from the workbench into the Full build, but we only put a subset into the Lite build. Note that this Lite subset is pretty big -- it contains all the tools that were previously available in GATK 1.x versions, and always will. We also reserve the right to add previews or not-fully-featured versions of the new tools that are in Full, at our discretion, to the Lite build.

### So there are two basic types of differences between the tools available in the Lite and Full builds (see picture below).

1. We have a new tool that performs a brand new function (which wasn't available in GATK 1.x), and we only include it in the Full build.
2. We have a tool that has some new add-on capabilities (which weren't possible in GATK 1.x); we put the tool in both the Lite and the Full build, but the add-ons are only available in the Full build.

Reprising the car analogy, GATK-Lite and GATK-Full are like two versions of the same car -- the basic version and the fully-equipped one. They both have the exact same engine, and most of the equipment (tools) is the same -- for example, they both have the same airbag system, and they both have headlights. But there are a few important differences:

1. The GATK-Full car comes with a GPS (sat-nav for our UK friends), for which the Lite car has no equivalent. You could buy a portable GPS unit from a third-party store for your Lite car, but it might not be as good, and certainly not as convenient, as the Full car's built-in one.
2. Both cars have windows of course, but the Full car has power windows, while the Lite car doesn't. The Lite windows can open and close, but you have to operate them by hand, which is much slower.

### So, to summarize:

• The underlying engine is exactly the same in both GATK-Lite and GATK-Full.
• Most functionalities are available in both builds, performed by the same tools.
• Some functionalities are available in both builds, but they are performed by different tools, and the tool in the Full build is better.
• New, cutting-edge functionalities are only available in the Full build, and there is no equivalent in the Lite build.

We hope this clears up some of the confusion surrounding GATK-Lite. If not, please leave a comment and we'll do our best to clarify further!

### Overview

One of the key challenges of working with next-gen sequence data is that input files are usually very large.
We can’t just make the program open the files, load all the data into memory and perform whatever analysis is needed on all of it in one go. It’s just too much work, even for supercomputers.

Instead, we make the program cut the job into smaller tasks that the computer can easily process separately. Then we have it combine the results of each step into the final result.

### Map/Reduce

Map/Reduce is the technique we use to achieve this. It consists of three steps formally called filter, map and reduce. Let’s apply it to an example case where we want to find out the average depth of coverage in our dataset for a certain region of the genome.

• filter determines what subset of the data needs to be processed in each task. In our example, the program lists all the reference positions in our region of interest.
• map applies the function, i.e. performs the analysis on each subset of data. In our example, for each position in the list, the program looks into the BAM file, pulls out the pileup of bases and outputs the depth of coverage at that position.
• reduce combines the elements in the list of results output by the map function. In our example, the program takes the coverage numbers that were calculated separately for all the reference positions and calculates their average, which is the final result we want.

This may seem trivial for such a simple example, but it is a very powerful method with many advantages. Among other things, it makes it relatively easy to parallelize operations, which makes the tools run much faster on large datasets.

### Walkers, filters and traversal types

All the tools in the GATK are built from the ground up to take advantage of this method. That’s why we call them walkers: because they “walk” across the genome, getting things done.

Note that even though it’s not included in the Map/Reduce technique’s name, the filter step is very important.
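
To make the three steps concrete, the average-coverage example from the Map/Reduce section can be sketched in a few lines of Python. This is only an illustration of the pattern, not how the GATK engine is implemented: the `reads` list is a toy stand-in for a BAM file, and the function names are hypothetical.

```python
from functools import reduce

# Toy stand-in for aligned reads on one contig: (start, end), 1-based inclusive.
reads = [(1, 5), (3, 8), (4, 6), (7, 10)]


def filter_positions(region_start, region_stop):
    """filter: list the reference positions that need to be processed."""
    return range(region_start, region_stop + 1)


def map_depth(position):
    """map: compute the depth of coverage at a single position."""
    return sum(1 for start, end in reads if start <= position <= end)


def reduce_sum(running_total, depth):
    """reduce: fold one per-position result into the running total."""
    return running_total + depth


def average_depth(region_start, region_stop):
    positions = filter_positions(region_start, region_stop)
    depths = [map_depth(position) for position in positions]
    total = reduce(reduce_sum, depths, 0)
    return total / len(depths)


print(average_depth(3, 7))  # depths are 2, 3, 3, 2, 2, so this prints 2.4
```

Because each call to `map_depth` depends only on its own position, the map step could be handed to several workers at once, which is the property that makes this pattern easy to parallelize.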
The filter step determines what data get presented to the tool for analysis, selecting only the appropriate data for each task and discarding anything that’s not relevant. This is a key part of the Map/Reduce technique, because that’s what makes each task “bite-sized” enough for the computer to handle easily.

Each tool has filters that are tailored specifically for the type of analysis it performs. The filters rely on traversal engines, which are little programs that are designed to “traverse” the data (i.e. walk through the data) in specific ways.

There are three major types of traversal: Locus Traversal, Read Traversal and Active Region Traversal. In our interval coverage example, the tool’s filter uses the Locus Traversal engine, which walks through the data by locus, i.e. by position along the reference genome. Because of that, the tool is classified as a Locus Walker. Similarly, the Read Traversal engine is used, you’ve guessed it, by Read Walkers.

The GATK engine comes packed with many other ways to walk through the genome and get the job done seamlessly, but those are the ones you’ll encounter most often.

### Further reading

A GATKReport is simply a text document that contains a well-formatted, easy-to-read representation of some tabular data. Many GATK tools output their results as GATKReports, so it's important to understand how they are formatted and how you can use them in further analyses.

Here's a simple example:

#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7     qualavg.61PA8.7
0      7.451835696110506E-3  25.474613284804366
1      2.362777171937477E-3  29.844949954504095
2      9.087604507451836E-4  32.875909752547310
3      5.452562704471102E-4  34.498999090081895
4      9.087604507451836E-4  35.148316651501370
5      5.452562704471102E-4  36.072234352256190
6      5.452562704471102E-4  36.121724890829700
7      5.452562704471102E-4  36.191048034934500
8      5.452562704471102E-4  36.003457059679770

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key     column
1:1000  T
1:1001  A
1:1002  C

This report contains two individual GATK report tables. Every table begins with a header for its metadata and then a header for its name and description. The next row contains the column names followed by the data.

We provide an R library called gsalib that allows you to load GATKReport files into R for further analysis. Here are four simple steps to getting gsalib, installing it and loading a report.

#### 1. Start R (or open RStudio)

$ R

R version 2.11.0 (2010-04-22)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


#### 2. Get the gsalib library from CRAN

The gsalib library is available on the Comprehensive R Archive Network, so you can install it from within R (we use RStudio for convenience):

> install.packages("gsalib")


In some cases you need to explicitly tell R where to find the library; you can do this as follows:

$ cat .Rprofile
.libPaths("/path/to/Sting/R/")


#### 3. Load the gsalib library

> library(gsalib)


#### 4. Finally, load the GATKReport file and have fun

> d = gsa.read.gatkreport("/path/to/my.gatkreport")
> summary(d)
              Length Class      Mode
CountVariants 27     data.frame list
CompOverlap   13     data.frame list
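
If you are not working in R, the table layout described above (a metadata header, a name/description header, a row of column names, then data rows) is simple enough to read by hand. The following Python sketch is an illustration only and is not part of gsalib or the GATK; the function name `parse_gatkreport` is hypothetical, and it assumes whitespace-delimited columns, values without embedded spaces, and a blank line or a new `#:` header after each table.

```python
def parse_gatkreport(text):
    """Return {table_name: [row_dict, ...]} for GATKReport-style text."""
    tables = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        # A table starts with two consecutive "#:GATKTable:" header lines:
        # the metadata header, then the name/description header.
        if (lines[i].startswith("#:GATKTable:")
                and i + 1 < len(lines)
                and lines[i + 1].startswith("#:GATKTable:")):
            name = lines[i + 1].split(":", 3)[2]
            columns = lines[i + 2].split()
            rows = []
            i += 3
            while (i < len(lines) and lines[i].strip()
                   and not lines[i].startswith("#:")):
                rows.append(dict(zip(columns, lines[i].split())))
                i += 1
            tables[name] = rows
        else:
            i += 1
    return tables


report = """#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7  qualavg.61PA8.7
0  7.451835696110506E-3  25.474613284804366
1  2.362777171937477E-3  29.844949954504095

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key  column
1:1000  T
1:1001  A
1:1002  C
"""

tables = parse_gatkreport(report)
for name, rows in tables.items():
    print(name, len(rows))
```

All values are kept as strings here; converting them (for example with `float`) is left to the caller, since the sketch does not interpret the format strings in the metadata header.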


### Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variant analyses.

### Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

### General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

### The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock lines of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the GVCF option (-ERC GVCF).

### Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -ERC GVCF option.

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta


Toward the middle, after the ##FORMAT lines, you see the ##GVCFBlock lines (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)


which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).
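
To see how a GQ value maps onto these bands, here is a minimal Python sketch. It only illustrates the inclusive/exclusive ranges from the ##GVCFBlock lines and is not taken from the HaplotypeCaller code; the function name `gq_band` is hypothetical.

```python
from bisect import bisect_right

GQ_BOUNDARIES = [5, 20, 60]
MAX_GQ = 2147483647  # upper limit of the last band in the example header


def gq_band(gq, boundaries=GQ_BOUNDARIES):
    """Return (minGQ inclusive, maxGQ exclusive) for the band containing gq."""
    edges = [0] + list(boundaries) + [MAX_GQ]
    # bisect_right places a value equal to a boundary in the band above it,
    # which matches "minGQ inclusive, maxGQ exclusive".
    index = bisect_right(list(boundaries), gq)
    return edges[index], edges[index + 1]


for gq in (0, 5, 12, 21, 99):
    print(gq, gq_band(gq))
```

With the boundaries above, GQ values of 12 and 21 land in different bands ((5, 20) and (20, 60)), so two neighboring sites with those values cannot be part of the same non-variant block.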

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

#### Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0


Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
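
As a rough illustration of how the END tag and the per-sample fields fit together, the following Python sketch pulls the block interval and GQ out of a single-sample record. It is a simplified reading of the example records above, not a general VCF parser; the function name `parse_record` is hypothetical, and treating a record without END as covering only its own position is a simplification that ignores the length of the REF allele.

```python
def parse_record(record):
    """Extract the interval and GQ from one single-sample gVCF record."""
    fields = record.split()
    chrom, pos = fields[0], int(fields[1])
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    # Pair the FORMAT keys with the sample values, e.g. GQ -> 99.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    is_block = "END" in info
    end = int(info["END"]) if is_block else pos
    return {
        "chrom": chrom,
        "start": pos,
        "end": end,
        "length": end - pos + 1,
        "is_block": is_block,
        "gq": int(sample["GQ"]),
    }


records = [
    "20 10001419 . T <NON_REF> . . END=10001425 GT:DP:GQ:MIN_DP:PL 0/0:45:12:42:0,9,135",
    "20 10001426 . A <NON_REF> . . END=10001427 GT:DP:GQ:MIN_DP:PL 0/0:49:0:48:0,0,1282",
    "20 10001428 . T <NON_REF> . . END=10001428 GT:DP:GQ:MIN_DP:PL 0/0:49:21:49:0,21,315",
]

for record in records:
    parsed = parse_record(record)
    print(parsed["start"], parsed["end"], parsed["length"], parsed["gq"])
```

These three consecutive blocks have GQ values of 12, 0 and 21, each in a different band of the example header, which is why they appear as separate records.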

There has been a lot of confusion about the difference between QUAL and GQ, and we hope this FAQ will clarify the difference.

The basic difference is that QUAL refers to the variant site whereas GQ refers to a specific sample's GT.

• QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.

• GQ tells you how confident we are that the genotype we assigned to a particular sample is correct. It is the difference between the second lowest PL and the lowest PL; because the PLs are normalized so that the lowest is always 0, GQ is simply the second lowest PL.
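
The relationship between GQ and the PLs can be written out as a short Python sketch. The function name `genotype_quality` is hypothetical, and the cap of 99 is an assumption based on the example records, where no GQ exceeds 99 even when the second lowest PL is larger.

```python
def genotype_quality(pls, cap=99):
    """GQ as the gap between the two lowest PLs (cap value is an assumption)."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], cap)


print(genotype_quality([0, 21, 315]))                         # 21
print(genotype_quality([1771, 168, 0, 1771, 168, 1771]))      # 99 (168, capped)
print(genotype_quality([0, 0, 1282]))                         # 0
```

A GQ of 0, as in the last call, means the two most likely genotypes are equally supported by the data.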

QUAL (or more importantly, its normalized form, QD) is mostly useful in a multisample context. When you are recalibrating a cohort callset, you're going to be looking exclusively at site-level annotations like QD, because at that point what you're looking for is evidence of variation overall. That way you don't rely too much on individual sample calls, which are less robust.

In fact, many cohort studies don't even really care about individual genotype assignments, so they only use site annotations for their entire analysis.

Conversely, QUAL may seem redundant if you have only one sample. If the sample has a good GQ (and more importantly, well-separated PLs), then admittedly you don't really need to look at the QUAL -- you know what you have. If the GQ is not good, the PLs can typically still tell you whether you probably have a variant, even when we're not sure whether it's het or hom-var. If hom-ref is also a possibility, the call may be a false positive.

That said, it is more effective to filter on site-level annotations first, then refine and filter genotypes as appropriate. That's the workflow we recommend, based on years of experience doing this at fairly large scales...

#### Overview

This document describes how GATK commands are structured and how to add arguments to basic command examples.

### Basic java syntax

Commands for GATK always follow the same basic syntax:

java [Java arguments] -jar GenomeAnalysisTK.jar [GATK arguments]


The core of the command is java -jar GenomeAnalysisTK.jar, which starts up the GATK program in a Java Virtual Machine (JVM). Any additional java-specific arguments (such as -Xmx to increase memory allocation) should be inserted between java and -jar, like this:

java -Xmx4G -jar GenomeAnalysisTK.jar [GATK arguments]


The order of arguments between java and -jar is not important.

### GATK arguments

There are two universal arguments that are required for every GATK command (with very few exceptions, namely the clp-type utilities): -R for Reference (e.g. -R human_b37.fasta) and -T for Tool name (e.g. -T HaplotypeCaller).

Additional arguments fall in two categories:

• Engine arguments like -L (for specifying a list of intervals) which can be given to all tools and are technically optional but may be effectively required at certain steps for specific analytical designs (e.g. the -L argument for calling variants on exomes);

• Tool-specific arguments which may be required, like -I (to provide an input file containing sequence reads to tools that process BAM files) or optional, like -alleles (to provide a list of known alleles for genotyping).

The ordering of GATK arguments is not important, but we recommend always passing the tool name (-T) and reference (-R) first for consistency. It is also a good idea to consistently order arguments by some kind of logic in order to make it easy to compare different commands over the course of a project. It’s up to you to choose what that logic should be.

All available engine and tool-specific arguments are listed in the tool documentation section. Arguments typically have both a long name (prefixed by --) and a short name (prefixed by -). The GATK command line parser recognizes both equally, so you can use whichever you like, depending on whether you prefer commands to be more verbose or more succinct.

Finally, a note about flags. Flags are arguments that have boolean values, i.e. TRUE or FALSE. They are typically used to enable or disable specific features; for example, --keep_program_records will make certain GATK tools output additional information in the BAM header that would be omitted otherwise. In GATK, all flags are set to FALSE by default, so if you want to set one to TRUE, all you need to do is add the flag name to the command. You don't need to specify an actual value.
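To make the structure described above concrete, here is a minimal Python sketch that assembles a command as a list of tokens. The helper `build_gatk_command` is hypothetical (it is not part of the GATK); it only encodes the conventions from this section: Java arguments between java and -jar, -T and -R first, and flags added by name with no value.

```python
import shlex


def build_gatk_command(tool, reference, args=None, flags=(),
                       java_args=("-Xmx4G",), jar="GenomeAnalysisTK.jar"):
    """Assemble a GATK command line as a list of tokens.

    Java arguments go between `java` and `-jar`; -T and -R come first among
    the GATK arguments; flags are appended by name only, with no value.
    """
    cmd = ["java", *java_args, "-jar", jar, "-T", tool, "-R", reference]
    for name, value in (args or {}).items():
        cmd += [name, str(value)]
    cmd += list(flags)
    return cmd


cmd = build_gatk_command(
    tool="HaplotypeCaller",
    reference="human_b37.fasta",
    args={"-I": "sample1.bam", "-o": "raw_variants.vcf",
          "-L": "exome_intervals.list"},
)
print(shlex.join(cmd))
```

Passing `flags=["--keep_program_records"]` would append the flag name on its own, which is all that is needed to switch a flag from its FALSE default to TRUE.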

### Examples of complete GATK command lines

This is a very simple command that runs HaplotypeCaller in default mode on a single input BAM file containing sequence data and outputs a VCF file containing raw variants.

java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf


If the data is from exome sequencing, we should additionally provide the exome targets using the -L argument:

java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L exome_intervals.list


If we just want to genotype specific sites of interest using known alleles based on results from a previous study, we can change the HaplotypeCaller’s genotyping mode using -gt_mode, provide those alleles using -alleles, and restrict the analysis to just those sites using -L:

java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L known_alleles.vcf -alleles known_alleles.vcf -gt_mode GENOTYPE_GIVEN_ALLELES


For more examples of commands and for specific tool commands, see the tool documentation section.

### 1. Notes on known sites

#### Why are they important?

Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.

In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.

#### Human genomes

If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.

#### Non-human genomes

If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on What are the standard resources for non-human genomes? in which we hope people with non-human genomics experience will share their knowledge.

And if it turns out that there is as yet no suitable set of known sites for your organism, here's how to make your own for the purposes of base recalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps can be repeated several times until convergence. Good luck!

Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
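The "set intersection" idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: sites are compared by exact (chromosome, position, REF, ALT) match, the function names are hypothetical, and the toy records are made up for the example.

```python
def read_sites(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines, skipping headers."""
    sites = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def concordant_sites(*callsets):
    """Return the sites that are present in every call set."""
    if not callsets:
        return set()
    return set.intersection(*(read_sites(c) for c in callsets))


# Toy records standing in for the output of two different callers
caller_a = ["##fileformat=VCFv4.1",
            "20\t100\t.\tA\tG", "20\t200\t.\tC\tT", "20\t300\t.\tG\tA"]
caller_b = ["20\t100\t.\tA\tG", "20\t200\t.\tC\tG", "20\t300\t.\tG\tA"]

print(sorted(concordant_sites(caller_a, caller_b)))
# [('20', 100, 'A', 'G'), ('20', 300, 'G', 'A')]
```

The site at position 200 is dropped because the two callers disagree on the alternate allele; only calls that agree exactly survive into the candidate known-sites set.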

### 2. Recommended sets of known sites per tool

#### Summary table

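| Tool | dbSNP 129 | dbSNP >132 | Mills indels | 1KG indels | HapMap | Omni |
|---|---|---|---|---|---|---|
| RealignerTargetCreator | | | X | X | | |
| IndelRealigner | | | X | X | | |
| BaseRecalibrator | | X | X | X | | |
| (UnifiedGenotyper / HaplotypeCaller) | | X | | | | |
| VariantRecalibrator | | X | X | | X | X |
| VariantEval | X | | | | | |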

#### RealignerTargetCreator and IndelRealigner

These tools require known indels passed with the -known argument to function properly. We use both the following files:

• Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

#### BaseRecalibrator

This tool requires known SNPs and indels passed with the -knownSites argument to function properly. We use all the following files:

• The most recent dbSNP release (build ID > 132)
• Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

#### UnifiedGenotyper / HaplotypeCaller

These tools do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:

• The most recent dbSNP release (build ID > 132)

#### VariantRecalibrator

For VariantRecalibrator, please see the FAQ article on VQSR training sets and arguments.

#### VariantEval

This tool requires known SNPs passed with the -dbsnp argument to function properly. We use the following file:

• A version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.

The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation.

GATK variant manipulation tools are able to recognize the following types of alleles:

• SNP (single nucleotide polymorphism)
• INDEL (insertion/deletion)
• MIXED (combination of SNPs and indels at a single position)
• MNP (multi-nucleotide polymorphism, e.g. a dinucleotide substitution)
• SYMBOLIC (generally, a very large allele or one that's fuzzy and not fully modeled; i.e. there's some event going on here but we don't know what exactly)

Of our two variant callers, UnifiedGenotyper is the more limited, as it only calls SNPs and indels, and does so separately (even if you run in calling mode BOTH, the program performs separate calling operations internally). The HaplotypeCaller is more sophisticated and calls different types of variants at the same time. So in addition to SNPs and indels, it is capable of emitting mixed records by default. It is also capable of emitting MNPs and symbolic alleles, but the modes to do so are not enabled by default and they are not part of our recommended best practices for the tool.

The GATK currently does not handle SVs (structural variations) or CNVs (copy number variations), but there are some third-party software packages built on top of GATK that provide this functionality. See GenomeSTRiP for SVs and XHMM for CNVs.

### 1. Obtaining the bundle

Inside of the Broad, the latest bundle will always be available in:

/humgen/gsa-hpprojects/GATK/bundle/current


with a subdirectory for each reference sequence containing the associated data files.

External users can download these files (or corresponding .gz versions) from the GSA FTP Server in the directory bundle. Gzipped files should be unzipped before attempting to use them. Note that there is no "current link" on the FTP; users should download the highest numbered directory under current (this is the most recent data set).

### 2. b37 Resources: the Standard Data Set

• Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
• dbSNP in VCF. This includes two files:
    • The most recent dbSNP release
    • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
• HapMap genotypes and sites VCFs
• OMNI 2.5 genotypes for 1000 Genomes samples, as well as sites, VCF
• The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• A large-scale standard single sample BAM file for testing:
    • NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam containing ~64x reads of NA12878 on chromosome 20
    • The results of the latest UnifiedGenotyper with default arguments run on this data set (NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.vcf)

Additionally, these files all have supplementary indices, statistics, and other QC data available.

### 3. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.

### 4. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.

### 5. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted over VCF files.

The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

### In a nutshell, if you’re doing:

- Whole genome analysis: no need to include intervals
- Whole exome analysis: you need to provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you need to provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

#### Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a BAM or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a BAM file since doing so will omit some data from the output.

#### Example Use of -L:

-L 20 (for chromosome 20 in b36/b37 build)

-L chr20:1-100 (for chromosome 20 positions 1-100 in hg18/hg19 build)

### Whole genome analysis

It is not necessary to use -L in whole genome analysis. You should be interested in the whole genome!

Nevertheless, in some cases, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere). You can do this with -XL, which does the exact opposite of -L; it excludes the provided intervals.

### Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
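To illustrate what padding does to a target, here is a short Python sketch. The GATK engine performs this step itself through its interval padding argument (in GATK 3 releases this is `--interval_padding`, short form `-ip`; check the engine documentation for your version), so the hypothetical `pad_interval` function below is only a model of the idea, and the clamping at the contig boundaries is our assumption.

```python
def pad_interval(start, end, padding=100, contig_length=None):
    """Extend a 1-based inclusive interval by `padding` bases on each side.

    The start is clamped at position 1; the end is clamped at the contig
    length when one is provided.
    """
    padded_start = max(1, start - padding)
    padded_end = end + padding
    if contig_length is not None:
        padded_end = min(contig_length, padded_end)
    return padded_start, padded_end


print(pad_interval(10001000, 10001200))  # (10000900, 10001300)
print(pad_interval(50, 400))             # (1, 500)
```

Because every step that takes -L sees the same padded coordinates only if the same padding value is used each time, it is easiest to fix one value (e.g. 100) for the whole project.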

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

| Tool | -L? | Why / why not |
|---|---|---|
| RealignerTargetCreator | YES | Faster since RTC will only look for regions that need to be realigned within the input interval; no time wasted on the rest. |
| IndelRealigner | NO | IR will only try to realign the regions output from RealignerTargetCreator, so there is nothing to be gained by providing the capture targets. |
| BaseRecalibrator | YES | This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration. |
| PrintReads | NO | Output is a BAM file; using -L would lead to lost data. |
| UnifiedGenotyper / HaplotypeCaller | YES | We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives. |
| Next steps | NO | No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step. |

### Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

### Debugging / troubleshooting

You can go crazy with -L while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval. This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).

We know this field can be confusing or even overwhelming to newcomers, and getting to grips with a large and varied toolkit like the GATK can be a big challenge. We have produced a presentation that we hope will help you review all the background information you need in order to use the GATK:

• Introduction to NGS Analysis: all you need to know to use the GATK: slides and video

In addition, the following links feature a lot of useful educational material about concepts and terminology related to next-generation sequencing:

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Let's say we have this example data:

• sample1_lane1.fq
• sample1_lane2.fq
• sample2_lane1.fq
• sample2_lane2.fq

#### 1. Run all core steps per-lane once

At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.

The example data becomes:

• sample1_lane1.dedup.realn.recal.bam
• sample1_lane2.dedup.realn.recal.bam
• sample2_lane1.dedup.realn.recal.bam
• sample2_lane2.dedup.realn.recal.bam

#### 2. Merge lanes per sample

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.

The example data becomes:

• sample1.merged.bam
• sample2.merged.bam
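The bookkeeping for steps 1 and 2 can be sketched as follows. This Python snippet only works out which per-lane files belong in which merged file; it does not run any tool. It assumes the `<sample>_<lane>` naming pattern of the example data, and the function name is hypothetical.

```python
from collections import defaultdict


def group_lanes_by_sample(lane_bams):
    """Map each sample name to the sorted list of its per-lane BAM files.

    Assumes file names of the form <sample>_<lane>.<suffixes>.bam, as in
    the example data; this naming convention is an assumption.
    """
    groups = defaultdict(list)
    for bam in lane_bams:
        sample = bam.split("_", 1)[0]
        groups[sample].append(bam)
    return {sample: sorted(bams) for sample, bams in sorted(groups.items())}


lane_bams = [
    "sample1_lane1.dedup.realn.recal.bam",
    "sample1_lane2.dedup.realn.recal.bam",
    "sample2_lane1.dedup.realn.recal.bam",
    "sample2_lane2.dedup.realn.recal.bam",
]

for sample, bams in group_lanes_by_sample(lane_bams).items():
    print(f"{sample}.merged.bam <- {', '.join(bams)}")
```

In real data the sample identity that the GATK relies on comes from the read group SM tag in the BAM header rather than from the file name, so treat file-name parsing as a convenience only.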

#### 3. Per-sample refinement

You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.

The example data becomes:

• sample1.merged.dedup.realn.bam
• sample2.merged.dedup.realn.bam

This gets you two big wins:

• Dedupping per-sample eliminates PCR duplicates across all lanes in addition to optical duplicates (which are by definition only per-lane)
• Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

People also often ask whether it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.

## New WGS and WEx CEU trio BAM files

We have sequenced at the Broad Institute and released to the 1000 Genomes Project the following datasets for the three members of the CEU trio (NA12878, NA12891 and NA12892):

• WEx (150x) sequence
• WGS (>60x) sequence

This is better data to work with than the original DePristo et al. BAM files, so we recommend you download and analyze these files if you are looking for complete, large-scale data sets to evaluate the GATK or other tools.

Here are the rough library properties of the BAMs:

## NA12878 Datasets from DePristo et al. (2011) Nature Genetics

Here are the datasets we used in the GATK paper cited below.

DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D and Daly, M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498.

Some of the BAM and VCF files are currently hosted by the NCBI: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/

• NA12878.hiseq.wgs.bwa.recal.bam -- BAM file for NA12878 HiSeq whole genome
• NA12878.hiseq.wgs.bwa.raw.bam -- Raw reads (in BAM format, see below)
• NA12878.ga2.exome.maq.recal.bam -- BAM file for NA12878 GenomeAnalyzer II whole exome (hg18)
• NA12878.ga2.exome.maq.raw.bam -- Raw reads (in BAM format, see below)
• NA12878.hiseq.wgs.vcf.gz -- SNP calls for NA12878 HiSeq whole genome (hg18)
• NA12878.ga2.exome.vcf.gz -- SNP calls for NA12878 GenomeAnalyzer II whole exome (hg18)
• BAM files for CEU + NA12878 whole genome (b36). These are the standard BAM files for the 1000 Genomes pilot CEU samples plus a 4x downsampled version of NA12878 from the pilot 2 data set, available in the DePristoNatGenet2011 directory of the GSA FTP Server
• SNP calls for CEU + NA12878 whole genome (b36) are available in the DePristoNatGenet2011 directory of the GSA FTP Server
• Crossbow comparison SNP calls are available in the DePristoNatGenet2011 directory of the GSA FTP Server as crossbow.filtered.vcf. The raw calls can be viewed by ignoring the FILTER field status
• whole_exome_agilent_designed_120.Homo_sapiens_assembly18.targets.interval_list -- targets used in the analysis of the exome capture data

Please note that we have not collected the indel calls for the paper, as these are only used for filtering SNPs near indels. If you want to call accurate indels, please use the new GATK indel caller in the UnifiedGenotyper.

### Warnings

Both the GATK and the sequencing technologies have improved significantly since the analyses performed in this paper.

• If you are conducting a review today, we would recommend using the newest version of the GATK, which performs much better than the version described in the paper. We would also recommend using the newest version of Crossbow, in case it has improved. The GATK calls for NA12878 from the paper (above) will give you a good idea of what a good whole-genome or whole-exome call set looks like.

• The data sets used in the paper are no longer state-of-the-art. The WEx BAM is GAII data aligned with MAQ on hg18, but a state-of-the-art data set would use HiSeq and BWA on hg19. Even the 64x HiSeq WG data set is already more than one year old. For a better assessment, we would recommend you use a newer data set for these samples, if you have the capacity to generate it. This applies less to the WG NA12878 data, which is pretty good, but the NA12878 WEx from the paper is nearly 2 years old now and notably worse than our most recent data sets.

Obviously, this was an annoyance for us as well, as it would have been nice to use a state-of-the-art data set for the WEx. But we decided to freeze the data used for analysis to actually finish this paper.

### How do I get the raw FASTQ file from a BAM?

If you want the raw machine output for the data analyzed in the GATK framework paper, obtain the raw BAM files above and convert them from BAM to FASTQ using the Picard tool SamToFastq.

There are two types of GATK tools that are able to use pedigree (family structure) information:

### Tools that require a pedigree to operate

PhaseByTransmission and CalculateGenotypePosteriors will not run without a properly formatted pedigree file. These tools are part of the Genotype Refinement workflow, which is documented here.

### Tools that are able to generate standard variant annotations

The two variant callers (HaplotypeCaller and the deprecated UnifiedGenotyper) as well as VariantAnnotator and GenotypeGVCFs are all able to use pedigree information if you request an annotation that involves population structure (e.g. Inbreeding Coefficient). To be clear though, the pedigree information is not used during the variant calling process; it is only used during the annotation step at the end.

If you already have VCF files that were called without pedigree information, and you want to add pedigree-related annotations (e.g. to use Variant Quality Score Recalibration (VQSR) with InbreedingCoeff as a feature annotation), don't panic. Just run the latest version of the VariantAnnotator to re-annotate your variants, requesting any missing annotations, and make sure you pass your PED file to the VariantAnnotator as well. If you forget to provide the pedigree file, the tool will run successfully but pedigree-related annotations may not be generated (this behavior is different in some older versions).

The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.

For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.
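As an illustration of the six-column layout, here is a minimal Python sketch that parses and validates PED lines. The column meanings follow the PLINK convention as we understand it (family ID, individual ID, paternal ID, maternal ID, sex, phenotype); the `parse_ped` helper, the family ID `FAM1` and the phenotype values are hypothetical, chosen only for the example.

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str   # "0" when the father is not in the file
    maternal_id: str   # "0" when the mother is not in the file
    sex: str           # PLINK convention: 1 = male, 2 = female
    phenotype: str     # PLINK convention: -9 = missing


def parse_ped(lines):
    """Parse whitespace-delimited PED lines, requiring exactly six columns."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(
                f"line {number}: expected 6 columns, found {len(fields)}")
        records.append(PedRecord(*fields))
    return records


# The CEU trio written as a six-column PED file
ped_lines = [
    "FAM1 NA12891 0 0 1 -9",
    "FAM1 NA12892 0 0 2 -9",
    "FAM1 NA12878 NA12891 NA12892 2 -9",
]
for record in parse_ped(ped_lines):
    print(record.individual_id, record.paternal_id, record.maternal_id)
```

A line that still carries genotype columns after the sixth field is rejected, which mirrors the requirement that the file contain no alleles.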

This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes, to comply with the GATK Best Practices. The recommendations detailed in this document take precedence over any others you may see elsewhere in our documentation (e.g. in Tutorial articles, which are only meant to illustrate usage, or in past presentations, which may be out of date).

The document covers:

• Explanation of resource datasets
• Important notes about exome experiments
• Argument recommendations for VariantRecalibrator
• Argument recommendations for ApplyRecalibration

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).

### Explanation of resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. The sites called with the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.

#### Resources for SNPs

• True sites training resource: HapMap
This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

• True sites training resource: Omni
This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

• Non-true sites training resource: 1000G
This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

• Known sites resource, not used in training: dbSNP
This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

#### Resources for Indels

• Known and true sites training resource: Mills
This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
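The percentages quoted next to each prior follow from the standard Phred relationship, where a score Q corresponds to a probability of 1 - 10^(-Q/10) that a site in the resource is real. The short Python sketch below reproduces them; the function name is hypothetical.

```python
def phred_to_probability(q):
    """Convert a Phred-scaled prior Q into the probability of being correct."""
    return 1.0 - 10.0 ** (-q / 10.0)


priors = [("HapMap", 15), ("Omni", 12), ("1000G", 10), ("Mills", 12), ("dbSNP", 2)]
for name, q in priors:
    print(f"{name}: Q{q} = {phred_to_probability(q):.2%}")
# HapMap: Q15 = 96.84%
# Omni: Q12 = 93.69%
# 1000G: Q10 = 90.00%
# Mills: Q12 = 93.69%
# dbSNP: Q2 = 36.90%
```

The same conversion is handy when you choose a prior for a home-made resource: a starting value of Q10 expresses roughly 90% confidence that a site in the set is a true variant.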

Some of the annotations included in the recommendations given below might not be the best for your particular dataset. In particular, the following caveats apply:

• Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

• You may have seen HaplotypeScore mentioned in older documents. That is a statistic produced by UnifiedGenotyper that should only be used if you called your variants with UG. This statistic isn't produced by the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes with HC.

• InbreedingCoeff is a population-level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or that include many closely related samples (such as a family), please omit this annotation from the command line.

### Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

• Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome BAMs from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.

• You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort.

### Argument recommendations for VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

#### Common, base command line

This is the first part of the VariantRecalibrator command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R path/to/reference/human_g1k_v37.fasta \
-input raw.input.vcf \
-recalFile path/to/output.recal \
-tranchesFile path/to/output.tranches \
-nt 4 \
[SPECIFY TRUTH AND TRAINING SETS] \
[SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
[SPECIFY WHICH CLASS OF VARIATION TO MODEL] \


#### SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
-mode SNP \


Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration, because the depth to which targets are captured varies too much for DP to be informative.

Note also that, for the above to work, the input VCF needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are missing, then VariantAnnotator needs to be run first so that VariantRecalibrator can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

#### Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

--maxGaussians 4 \
-resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
-an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
-mode INDEL \


Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed, since the length of an indel in a read is conflated with the degradation in mapping quality that the aligner assigns to the read. This covariation is not necessarily indicative of error in the same way that it is for SNPs.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.
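
As a rough illustration of how the annotation caveats above interact with the recommended -an lists, here is a minimal Python sketch. The helper names (`select_annotations`, `annotation_args`) and parameter names (`is_exome`, `many_related_samples`) are hypothetical and not part of GATK; only the annotation names and the -an flag come from the recommendations above. The sketch drops DP for exomes, and drops InbreedingCoeff for cohorts with fewer than 10 samples or with many closely related samples.

```python
# Annotation lists copied from the SNP and indel recommendations above.
SNP_ANNOTATIONS = ["QD", "MQ", "MQRankSum", "ReadPosRankSum",
                   "FS", "SOR", "DP", "InbreedingCoeff"]
INDEL_ANNOTATIONS = ["QD", "DP", "FS", "SOR", "ReadPosRankSum",
                     "MQRankSum", "InbreedingCoeff"]


def select_annotations(mode, n_samples, is_exome=False, many_related_samples=False):
    """Return the recommended annotations that still apply to this cohort.

    Hypothetical helper: it only encodes the caveats stated in this document.
    """
    if mode == "SNP":
        recommended = SNP_ANNOTATIONS
    elif mode == "INDEL":
        recommended = INDEL_ANNOTATIONS
    else:
        raise ValueError("mode must be 'SNP' or 'INDEL'")

    skipped = set()
    if is_exome:
        # Capture depth varies too much between targets for DP to be useful.
        skipped.add("DP")
    if n_samples < 10 or many_related_samples:
        # InbreedingCoeff is a population-level statistic.
        skipped.add("InbreedingCoeff")
    return [name for name in recommended if name not in skipped]


def annotation_args(annotations):
    """Turn annotation names into the repeated '-an NAME' arguments."""
    args = []
    for name in annotations:
        args += ["-an", name]
    return args


# Example: an 8-sample exome cohort in SNP mode.
print(" ".join(annotation_args(select_annotations("SNP", n_samples=8, is_exome=True))))
# -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR
```

The printed fragment can be pasted in place of the -an line in the SNP-specific recommendations; the rest of the command line is unchanged.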

### Argument recommendations for ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

#### Common, base command line

This is the first part of the ApplyRecalibration command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.


java -Xmx3g -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R reference/human_g1k_v37.fasta \
-input raw.input.vcf \
-tranchesFile path/to/input.tranches \
-recalFile path/to/input.recal \
-o path/to/output.recalibrated.filtered.vcf \
[SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
[SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \


#### SNP specific recommendations

For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

--ts_filter_level 99.5 \
-mode SNP \


#### Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

--ts_filter_level 99.0 \
-mode INDEL \
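
To make the meaning of --ts_filter_level more concrete, the Python sketch below shows one way to think about it: rank the truth sites found in your callset by their recalibrated score (VQSLOD), then find the score that retains the requested percentage of them. This is a simplified illustration under that assumption, not the tool's actual implementation, and the function name and example scores are made up.

```python
import math


def vqslod_cutoff(truth_site_scores, ts_filter_level):
    """Lowest VQSLOD that still keeps ts_filter_level percent of truth sites.

    truth_site_scores: VQSLOD values of callset variants that overlap the truth set.
    ts_filter_level:   target truth sensitivity in percent, e.g. 99.5.
    """
    if not truth_site_scores:
        raise ValueError("need at least one truth site")
    if not 0 < ts_filter_level <= 100:
        raise ValueError("ts_filter_level must be in (0, 100]")

    ranked = sorted(truth_site_scores, reverse=True)  # best scores first
    n_keep = max(1, math.ceil(len(ranked) * ts_filter_level / 100.0))
    return ranked[n_keep - 1]


# Made-up scores for ten truth sites.
truth_scores = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0, -10.0]
print(vqslod_cutoff(truth_scores, 90.0))   # -3.0
print(vqslod_cutoff(truth_scores, 100.0))  # -10.0
```

In this toy example, asking for 100% sensitivity drags the cutoff down to the single worst-scoring truth site, which is why a slightly lower target can remove many more false positives.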


Two main things account for differences in annotation values between tools, both linked to the tools' default behaviors:

• The tools downsample to different depths of coverage

• The tools apply different read filters

In both cases, you can end up looking at different sets or numbers of reads, which causes some of the annotation values to be different. It's usually not a cause for alarm. Remember that many of these annotations should be interpreted relatively, not absolutely.

Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihood calculations in the UnifiedGenotyper (especially for SNPs) and in the HaplotypeCaller (for all variants including indels). So, before you post about a missing call in our support forum, please do a little bit of investigation on your own, as follows.

To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.

Here is a checklist of questions you should ask yourself:

• How many overlapping deletions are there at the position?

The genotyper ignores sites if there are too many overlapping deletions. The threshold can be set using the --max_deletion_fraction argument (see the UG's documentation page to find out the default value for this argument), but be aware that increasing it could affect the reliability of your results.

• What do the base qualities look like for the non-reference bases?

Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the UnifiedGenotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.

• What do the mapping qualities look like for the reads with the non-reference bases?

A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.

• Are there a lot of alternate alleles?

By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page to find out the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive, and the cost scales exponentially with the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.

• Are you working with SOLiD data?

SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site? If so, you are probably seeing false positives.
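
The base-quality and mapping-quality questions in the checklist above can be sketched in a few lines of Python. This is only an illustration of the rules described in the text (reads with mapping quality 255 are ignored, a base's quality is capped by its read's mapping quality, and bases below a minimum quality are dropped); it is not the genotyper's code. The `PileupBase` type, the `usable_bases` function and the threshold of 20 used in the example are assumptions made for the sketch, so check the tool documentation for the real default.

```python
from dataclasses import dataclass

UNKNOWN_MAPQ = 255  # mapping quality 255 means "unknown"; such reads are ignored


@dataclass
class PileupBase:
    base: str             # base observed at the site, e.g. "A"
    base_quality: int     # quality assigned to the base by the sequencer
    mapping_quality: int  # quality assigned to the whole read by the aligner


def usable_bases(pileup, min_base_quality):
    """Keep only the bases that pass the filters described above."""
    kept = []
    for b in pileup:
        if b.mapping_quality == UNKNOWN_MAPQ:
            continue  # read is ignored entirely
        # A base's quality is capped by the mapping quality of its read.
        effective_quality = min(b.base_quality, b.mapping_quality)
        if effective_quality >= min_base_quality:
            kept.append(b)
    return kept


pileup = [
    PileupBase("A", 35, 60),   # kept
    PileupBase("A", 10, 60),   # dropped: low base quality
    PileupBase("A", 35, 5),    # dropped: capped by low mapping quality
    PileupBase("G", 35, 255),  # dropped: unknown mapping quality
    PileupBase("G", 30, 40),   # kept
]
kept = usable_bases(pileup, min_base_quality=20)  # 20 is a placeholder threshold
print(len(pileup), "bases in the pileup,", len(kept), "usable")
# 5 bases in the pileup, 2 usable
```

Comparing the two counts at your site of interest is a quick way to see how much of the apparent coverage actually contributes to the call.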

• Specifically for HaplotypeCaller

In addition to the reasons above, there is another reason why the HaplotypeCaller may not call a variant that seems obvious in the original BAM file.

Haplotype Caller performs a local reassembly of the reads in the active region. Please refer here for more details: http://www.broadinstitute.org/gatk/guide/article?id=4148

This reassembly is important, because when mapping reads to the whole genome, it is easy to make an error. When reassembling reads in a much smaller region, such as the active region, the alignment is more likely to be accurate.

The reads you see in the alignment of the original BAM file may suggest a variant should be called. However, after realignment, the reads may no longer support the variant. To see the new alignment of the reads, you can use the -bamout argument. You can then compare the aligned reads from the original BAM file to the newly aligned reads in the -bamout file.
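
As a starting point, the following Python sketch assembles a HaplotypeCaller command line that writes the reassembled reads with -bamout. All file names and the interval are placeholders, and restricting the run to the region of interest with -L is our own assumption to keep the output small; adapt both to your data. The helper name is hypothetical.

```python
def haplotypecaller_bamout_args(reference, input_bam, interval, output_vcf, bamout):
    """Build the argument list for a HaplotypeCaller run that also writes -bamout."""
    return [
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", input_bam,
        "-L", interval,     # assumption: only look at the region in question
        "-o", output_vcf,
        "-bamout", bamout,  # reads as realigned during local reassembly
    ]


# Placeholder paths and interval, for illustration only.
args = haplotypecaller_bamout_args(
    reference="reference.fasta",
    input_bam="sample.bam",
    interval="20:10000000-10001000",
    output_vcf="region.vcf",
    bamout="region.bamout.bam",
)
print(" ".join(args))
```

The list can be passed to `subprocess.run(args, check=True)` or printed and pasted into a shell. Load both the original BAM and the -bamout file in IGV to compare the alignments side by side.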

In the example below, you see the original BAM file on the top, and on the bottom is the BAM file after reassembly. In this case, there seem to be many SNPs present; however, after reassembly, we find there is really a large deletion!