We support three types of interval lists, as mentioned here. Interval lists should preferentially be formatted as Picard-style interval lists, with an explicit sequence dictionary, as this prevents accidental misuse (e.g. hg18 intervals on an hg19 file). Note that this file is 1-based, not 0-based (first position in the genome is position 1).
One relatively easy way to combine your intervals is to use the online tool Galaxy, using the
Get Data -> Upload command to upload your intervals, and the
Operate on Genomic Intervals command to compute the intersection or union of your intervals (depending on your needs).
All analyses done with the GATK typically involve several (though not necessarily all) of the following inputs:
This article describes the corresponding file formats that are acceptable for use with the GATK.
The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. All the standard IUPAC bases are accepted, but keep in mind that non-standard bases (i.e. other than ACGT, such as W for example) will be ignored (i.e. those positions in the genome will be skipped).
Some users have reported having issues with reference files that have been stored or modified on Windows filesystems. The issues manifest as "10" characters (corresponding to encoded newlines) inserted in the sequence, which cause the GATK to quit with an error. If you encounter this issue, you will need to re-download a valid master copy of the reference file, or clean it up yourself.
Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see this article for more information on preparing FASTA reference sequences for use with the GATK.
If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.
Our Best Practice recommendation is that you use a standard GATK reference from the GATK resource bundle.
The only input format for sequence reads that the GATK itself supports is the [Sequence Alignment/Map (SAM)] format. See [SAM/BAM] for more details on the SAM/BAM format as well as Samtools and Picard, two complementary sets of utilities for working with SAM/BAM files.
If you don't find the information you need in this section, please see our FAQs on BAM files.
If you are starting out your pipeline with raw reads (typically in FASTQ format) you'll need to make sure that when you map those reads to the reference and produce a BAM file, the resulting BAM file is fully compliant with the GATK requirements. See the Best Practices documentation for detailed instructions on how to do this.
In addition to being in SAM format, we require the following additional constraints in order to use your file with the GATK:
Below is an example well-formed SAM field header and fields (with @SQ dictionary truncated to show only the first two chromosomes for brevity):
@HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 @SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e @RG ID:ERR000162 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC @RG ID:ERR000252 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC @RG ID:ERR001684 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC @RG ID:ERR001685 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC @PG ID:GATK TableRecalibration VN:v2.2.16 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, defau t_read_group=DefaultReadGroup, default_platform=Illumina, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, except on_if_no_tile=false, pQ=5, maxQ=40, smoothing=137 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:b4eb71ee878d3706246b7c1dbef69299 @PG ID:bwa VN:0.5.5 ERR001685.4315085 16 1 9997 25 35M * 0 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@? XT:A:U XN:i:4 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001685 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???> ERR001689.1165834 117 1 9997 0 * = 9997 0 CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@ RG:Z:ERR001689 OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>??????? ERR001689.1165834 185 1 9997 25 35M = 9997 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT 758A:?>>8?=@@>>?;4<>=??@@==??@?==?8 XT:A:U XN:i:4 SM:i:25 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001689 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>> ERR001688.2681347 117 1 9998 0 * = 9998 0 CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG 5@BA@A6B???A?B??>B@B??>B@B??>BAB??? RG:Z:ERR001688 OQ:Z:=>>>><4><<?><??????????????????????
The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To resort the BAM file please use ReorderSam.
If you don't find the information you need in this section, please see our FAQs on interval lists.
The GATK accept interval files for processing subsets of the genome in Picard-style interval lists. These files typically have an extension such as
'.list' or more explicitly,.interval_list`, and look like this:
@HD VN:1.0 SO:coordinate @SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens @SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens 1 30366 30503 + target_1 1 69089 70010 + target_2 1 367657 368599 + target_3 1 621094 622036 + target_4 1 861320 861395 + target_5 1 865533 865718 + target_6 ...
consisting of aSAM-file-like sequence dictionary (the header), and targets in the form of
<chr> <start> <stop> + <target_name>. These interval lists are tab-delimited. They are also 1-based (first position in the genome is position 1, not position 0). The easiest way to create such a file is to combine your reference file's sequence dictionary (the file stored alongside the reference fasta file with the
.dict extension) and your intervals into one file.
You can also specify a list of intervals formatted as
<chr>:<start>-<stop> (one interval per line). No sequence dictionary is necessary. This file format also uses 1-based coordinates.
Finally, we also accept BED style interval lists. Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1.
The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:
name is the name in the GATK tool (like "eval" in VariantEval),
type is the type of the file, such as VCF or dbSNP, and
file is the path to the file containing the ROD data.
The GATK supports several common file formats for reading ROD data:
Note that we no longer support the PED format. See here for converting .ped files to VCF.
Recently I've been wanting to preform tasks for each interval in a file using the GATK. Are you guys planning to create an IntervalWalker class? (Or is there a workaround?) DiagnoseTargets in gatk-protected seem to work this way, but not in a very straightforward way - also since it's protected I'm not sure how much if any code I could borrow from there.
I have a vcf file of indels, and an interval file of single-base positions I am interested in. I would like to select all of the variants in the file that overlap the positions I'm looking at. Using select variants with the interval list I have returns nothing, because the indels are not contained within any intervals. I don't want to just add an arbitrary amount of padding to my intervals because I don't want to include other nearby variants that don't actually overlap my sites.
Is there a way to do this easily in GATK?
This is two separate questions:
Starting with a vcf file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverages should be excluded, how does one decide what is very high and low. e.g. 5% either side ?
I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script but I'm trying to do this earlier on in the GATK processes. I've tried ...-L 2L -L 2R -L 3L ...etc, -L 2L 2R 3L ....etc and, -L 2L, 2R, 3R...etc but the result is either input error message or chromosome 2L only.
Many thanks and apologies if I've missed anything in the instructions.
this is my interval_list
chr1 762095 762275 LINC00115|NR_024321 chr1 762280 762414 LINC00115|NR_024321 chr1 762420 762565 LINC00115|NR_024321 chr1 777259 777349 LOC643837 chr1 777391 777481 LOC643837 chr1 777482 777642 LOC643837 chr1 783061 783151 LOC643837 chr1 792270 792446 LOC643837 chr1 861266 861496 NM_152486|SAMD11 chr1 865582 865787 NM_152486|SAMD11 chr1 866331 866507 NM_152486|SAMD11
and this is the output from the sample_interval_summary
chr1:762095-762275 ... chr1:762280-762414 ... chr1:762420-762565 ... chr1:777259-777349 ... chr1:783061-783151 ... chr1:792270-792446 ... chr1:861266-861496 ... chr1:865582-865787 ... chr1:866331-866507 ...
why am I missing two exons?
this is my cmd:
java -Xmx32g -jar /local/apps/gatk/2.5-2-gf57256b/GenomeAnalysisTK.jar -I sample.bam -R .../genome.fa -T DepthOfCoverage -o jtn -geneList hg19.tsv -L exons.list --omitDepthOutputAtEachBase --includeDeletions --interval_merging OVERLAPPING_ONLY -l INFO
Thanks for your input!
I am using one of the 1000 Genomes exome data (*.bam format) which is called NA12892, aligned to the GRCh37 build. And I just started to use IndelRealigner tool which requires proper *.interval_list file to work.
However, I am unable to find/generate *.interval_list file compatible with my data. Where can I download/generate *.interval_list (or *.bed) file that are compatible with exome data?
Normally, these files are provided by producers for Library Prep. kits (e.g illumina). But, I couldn't find which interval file should be used in 1000 Genomes data.
Hi, I am using FastaAlternateReferenceMaker and have a set of intervals ordered first by chromosome and then by their start positions. I have tried ordering chromosomes alphabetically(chr1, chr10, chr11,..) as well as numerically (chr1, chr2, chr3...) but the output fasta sequence returned is not in the same order as listed in interval file. I find that even the names target_1, target_2 etc are also not used as fasta headers in the output file. I am stuck with mapping the input intervals with the output fasta sequences. Thanks in advance for all the help, Ramya
I got this error message, when trying to use a file to specify at which positions to emit variants:
ERROR MESSAGE: Couldn't read file /lustre/scratch109/sanger/tc9/agv/wgs/pipeline/union4x.positions because The interval file /lustre/scratch109/sanger/tc9/agv/wgs/pipeline/union4x.positions does not have one of the supported extensions (.bed, .list, .picard, .interval_list, or .intervals). Please rename your file with the appropriate extension. Is there a GATK page describing those 5 file formats? Some of them are unknown to me; e.g. .list.
I asked my question here, but please ignore it: http://gatkforums.broadinstitute.org/discussion/2219/l-option
Thanks a lot.
Also, the error message does not mention support for vcf files, but the documentation does. Are vcf files supported?
GATK Queue implements a Scatter/Gather algorithm to create a set of intervals in order to parallelise data alalysis. If -scatter_gather option is issued, a respective number of interval files will be created and the input BAM files will be processed using these intervals. However a question arises what happens if the point of subsequent analysis is at or near the start/end of an interval? Are all the codes which support -l/-L options robust in the respect to interval positions? Since the input files are not actually spliced, all information is available to the processing program which could make the right decisions so that no artefacts are produced. Are there any restrictions on interval creation? Perhaps it should be at least a few read lengths. Anything else? Thanks in advance!
Hello, I want to know how important it is to have the -L target-intervals.intervals option in UnifiedGenotyper, and if it is recommended in VariantCallRecalibrator too.
I have ran the Unified Genotyper tool with 1 input file at a time or the 2 files I want to compare at once. My command-lines are the following:
java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./1.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./1-gatk.vcf --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-1.intervals java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./2.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./2-gatk.vcf --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-2.intervals java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./1.bam -I ./2.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./1vs2-gatk.vcf --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-1.intervals -L ./intervals-2.intervals