Tagged with #intervals
3 documentation articles | 0 announcements | 16 forum discussions

Created 2014-05-06 21:51:40 | Updated 2016-04-13 00:24:35 | Tags: intervals exome

Comments (43)

The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: no need to include intervals
- Whole exome analysis: you need to provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you need to provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.

Example Use of -L:

-L 20 (for chromosome 20 in b37/b39 build)

-L chr20:1-100 (for chromosome 20 positions 1-100 in hg18/hg19 build)

So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use -L in whole genome analysis. You should be interested in the whole genome!

Nevertheless, in some cases, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere). You can do this with -XL, which does the exact opposite of -L; it excludes the provided intervals.

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

Tool -L? Why / why not
RealignerTargetCreator YES Faster since RTC will only look for regions that need to be realigned within the input interval; no time wasted on the rest.
IndelRealigner NO IR will only try to realign the regions output from RealignerTargetCreator, so there is nothing to be gained by providing the capture targets.
BaseRecalibrator YES This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.
PrintReads NO Output is a bam file; using -L would lead to lost data.
UnifiedGenotyper/Haplotype Caller YES We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.
Next steps NO No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step.

Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

Debugging / troubleshooting

You can go crazy with -L while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval.This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).

Created 2012-08-11 04:16:24 | Updated 2013-01-15 02:59:32 | Tags: faq intervals

Comments (9)

1. What file formats do you support for interval lists?

We support three types of interval lists, as mentioned here. Interval lists should preferentially be formatted as Picard-style interval lists, with an explicit sequence dictionary, as this prevents accidental misuse (e.g. hg18 intervals on an hg19 file). Note that this file is 1-based, not 0-based (first position in the genome is position 1).

2. I have two (or more) sequencing experiments with different target intervals. How can I combine them?

One relatively easy way to combine your intervals is to use the online tool Galaxy, using the Get Data -> Upload command to upload your intervals, and the Operate on Genomic Intervals command to compute the intersection or union of your intervals (depending on your needs).

Created 2012-07-25 16:44:53 | Updated 2016-01-05 01:16:54 | Tags: fastareference intro inputs intervals

Comments (5)

All analyses done with the GATK typically involve several (though not necessarily all) of the following inputs:

  • Reference genome sequence
  • Sequencing reads
  • Intervals of interest
  • Reference-ordered data

This article describes the corresponding file formats that are acceptable for use with the GATK.

1. Reference Genome Sequence

The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. All the standard IUPAC bases are accepted, but keep in mind that non-standard bases (i.e. other than ACGT, such as W for example) will be ignored (i.e. those positions in the genome will be skipped).

Some users have reported having issues with reference files that have been stored or modified on Windows filesystems. The issues manifest as "10" characters (corresponding to encoded newlines) inserted in the sequence, which cause the GATK to quit with an error. If you encounter this issue, you will need to re-download a valid master copy of the reference file, or clean it up yourself.

Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see this article for more information on preparing FASTA reference sequences for use with the GATK.

Important note about human genome reference versions

If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The names and order of the contigs in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.

Our Best Practice recommendation is that you use a standard GATK reference from the GATK resource bundle.

2. Sequencing Reads

The only input format for sequence reads that the GATK itself supports is the [Sequence Alignment/Map (SAM)] format. See [SAM/BAM] for more details on the SAM/BAM format as well as Samtools and Picard, two complementary sets of utilities for working with SAM/BAM files.

If you don't find the information you need in this section, please see our FAQs on BAM files.

If you are starting out your pipeline with raw reads (typically in FASTQ format) you'll need to make sure that when you map those reads to the reference and produce a BAM file, the resulting BAM file is fully compliant with the GATK requirements. See the Best Practices documentation for detailed instructions on how to do this.

In addition to being in SAM format, we require the following additional constraints in order to use your file with the GATK:

  • The file must be binary (with .bam file extension).
  • The file must be indexed.
  • The file must be sorted in coordinate order with respect to the reference (i.e. the contig ordering in your bam must exactly match that of the reference you are using).
  • The file must have a proper bam header with read groups. Each read group must contain the platform (PL) and sample (SM) tags. For the platform value, we currently support 454, LS454, Illumina, Solid, ABI_Solid, and CG (all case-insensitive).
  • Each read in the file must be associated with exactly one read group.

Below is an example well-formed SAM field header and fields (with @SQ dictionary truncated to show only the first two chromosomes for brevity):

@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:249250621    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ     SN:2    LN:243199373    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:a0d9851da00400dec1098a9255ac712e
@RG     ID:ERR000162    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR000252    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001684    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001685    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@PG     ID:GATK TableRecalibration      VN:v2.2.16      CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, defau 
t_read_group=DefaultReadGroup, default_platform=Illumina, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, except on_if_no_tile=false, pQ=5, maxQ=40, smoothing=137       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:b4eb71ee878d3706246b7c1dbef69299
@PG     ID:bwa  VN:0.5.5
ERR001685.4315085       16      1       9997    25      35M     *       0       0       CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT     ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@?     XT:A:U  XN:i:4    X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  RG:Z:ERR001685  NM:i:6  MD:Z:0N0N0N0N1A0A28     OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???>
ERR001689.1165834       117     1       9997    0       *       =       9997    0       CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG     >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@     RG:Z:ERR001689    OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>???????
ERR001689.1165834       185     1       9997    25      35M     =       9997    0       CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT     758A:?>>8?=@@>>?;4<>=??@@==??@?==?8     XT:A:U  XN:i:4    SM:i:25 AM:i:0  X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  RG:Z:ERR001689  NM:i:6  MD:Z:0N0N0N0N1A0A28     OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>>
ERR001688.2681347       117     1       9998    0       *       =       9998    0       CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG     5@BA@A6B???A?B??>B@B??>B@B??>BAB???     RG:Z:ERR001688    OQ:Z:=>>>><4><<?><??????????????????????       

Note about fixing BAM files with alternative sortings

The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To resort the BAM file please use ReorderSam.

3. Intervals of interest

If you don't find the information you need in this section, please see our FAQs on interval lists.

The GATK accept interval files for processing subsets of the genome in Picard-style interval lists. These files typically have an extension such as '.list' or more explicitly,.interval_list`, and look like this:

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6

consisting of aSAM-file-like sequence dictionary (the header), and targets in the form of <chr> <start> <stop> + <target_name>. These interval lists are tab-delimited. They are also 1-based (first position in the genome is position 1, not position 0). The easiest way to create such a file is to combine your reference file's sequence dictionary (the file stored alongside the reference fasta file with the .dict extension) and your intervals into one file.

You can also specify a list of intervals formatted as <chr>:<start>-<stop> (one interval per line). No sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/ contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.

Finally, we also accept BED style interval lists. Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1.

4. Reference Ordered Data (ROD) file formats

The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:

-argumentName:name,type file

Where name is the name in the GATK tool (like "eval" in VariantEval), type is the type of the file, such as VCF or dbSNP, and file is the path to the file containing the ROD data.

The GATK supports several common file formats for reading ROD data:

  • VCF : VCF type, the recommended format for representing variant loci and genotype calls. The GATK will only process valid VCF files; VCFTools provides the official VCF validator. See here for a useful poster detailing the VCF specification.
  • UCSC formated dbSNP : dbSNP type, UCSC dbSNP database output
  • BED : BED type, a general purpose format for representing genomic interval data, useful for masks and other interval outputs. Please note that the bed format is 0-based while most other formats are 1-based.

Note that we no longer support the PED format. See here for converting .ped files to VCF.

If you need additional information on VCF files, please see our FAQs on VCF files here and here.

No articles to display.

Created 2016-04-12 16:37:19 | Updated | Tags: intervals genotypegvcf

Comments (4)

Hi, I have a problem with joint genotpying a large Exome cohort using GATK 3.5. I am trying to call genotypes on the exonic intervals only, but it appears that in a few locations a long deletion spanning outside of the interval results in a reference sequence mismatch, which trips up GenotypeGVCF, leading to an error message as seen below.

Here the relevant region from the interval file, as BED, which I use for the initial HaplotypeCaller in gvcf mode: chr2 27380091 27380444 chr2 27380460 27380648

Here is the relevant snippet from the single vcf that has the long deletion spanning from the first interval almost up to the second:

chr2 27380323 . A <NON_REF> . . END=27380420 GT:DP:GQ:MIN_DP:PL 0/0:31:60:21:0,60,743 chr2 27380421 . TCCACTGACCCCACCCCCGCTCCTCTCCCCTCAAGG T,<NON_REF> 0 . DP=29;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=104400.00 GT:AD:DP:GQ:PL:SB 0/0:27,0,0:27:79:0,79,6544,81,6546,6548:7,20,0,0 chr2 27380461 . T <NON_REF> . . END=27380461 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,595 Since I am calling 18.000 exomes together, I combine them in chunks of 100. The result of CombineGVCF at this region looks like this:

chr2 27380421 . TCCACTGACCCCACCCCCGCTCCTCTCCCCTCAAGG T,<NON_REF> . chr2 27380422 . C *,<NON_REF> . chr2 27380423 . C *,<NON_REF> . chr2 27380424 . A *,<NON_REF> . chr2 27380425 . C *,<NON_REF> . chr2 27380426 . T *,<NON_REF> . chr2 27380427 . G *,<NON_REF> . chr2 27380428 . A *,<NON_REF> . chr2 27380429 . C *,<NON_REF> . chr2 27380430 . C *,<NON_REF> . chr2 27380431 . C *,<NON_REF> . chr2 27380432 . C *,<NON_REF> . chr2 27380433 . A C,*,<NON_REF> . chr2 27380434 . C *,<NON_REF> . chr2 27380435 . C *,<NON_REF> . chr2 27380436 . C *,<NON_REF> . chr2 27380437 . C *,<NON_REF> . chr2 27380438 . C *,<NON_REF> . chr2 27380439 . G C,*,<NON_REF> . chr2 27380440 . C *,<NON_REF> . chr2 27380441 . T C,*,<NON_REF> . chr2 27380442 . C *,<NON_REF> . chr2 27380443 . C *,<NON_REF> . chr2 27380444 . T *,<NON_REF> . chr2 27380460 . T *,<NON_REF> . chr2 27380461 . T <NON_REF> .'

Note the sequence content T at position 27380460. This does not match the actual reference C (second to last base in the call below)

java -jar GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -R GRCh38/GRCh38.fa -T FastaReferenceMaker -L chr2:27380421-27380461 2> /dev/null \>1 chr2:27380421 TCCACTGACCCCACCCCCGCTCCTCTCCCCTCAAGGCCCCT

This then leads to downstream problems during GenotpyeVCFs which errors out with a message:

ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at chr2:27380460, C* vs. T*

This looks to me that the CombineGVCF step somehow mangles the sequence content of the actual deletion into the reference column of the multisample vcf. I have seen this happening at five different occasions in my dataset. Of course I could simply remove those five intervals from my set and be done with it, but I suppose this error will pop up more often with larger datasets, so I am looking for a better solution.

BTW, I noted that the actual deletion is reported with an allele depth of 0 in the initial gvcf ( 0/0:27,0,0). This seems weird to me but I assume that this is just one of the magic tricks that the de-novo assembler in the haplotype caller does. Or maybe that is already the root cause of all this mess.

Thanks for any feedback on this


Created 2016-02-25 15:49:57 | Updated | Tags: haplotypecaller intervals wgs

Comments (2)

I'm currently running my first real use of GATK. I was worried about running HaplotypeCaller on whole geneomes given some of the reports I've seen on these forums about how long it can take to run. In contrast, I was pleasantly surprised with the current GATK it is proceeding well (~7 day estimate on dog wgs). But it seems it could be much faster if I divided it up by chromosome with the -L flag.

I see that the advice is to not use the -L flag for whole genome analysis [1]. But the wording in that link seems open: it is not necessary, but if it would help efficiency it might be worthwhile.

I've found a related question on the forums here [2], but it seems the descrepancy discussed in that thread is suspected to be due to downsampling and not actually the result of a chromosome-by-chromosome use of HaplotypeCaller.

Again, I'm content with a ~7 day run time in order to take proper care of our data. I wouldn't want to sacrifice power or accuracy for a shorter runtime, but if there is really no trade-off, a chromosomal approach would be even better. So I'm curious if there is a downside to partitioning the HaplotypeCaller step by chromosome?

[1] http://gatkforums.broadinstitute.org/gatk/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals [2] http://gatkforums.broadinstitute.org/gatk/discussion/5008/haplotypecaller-on-whole-genome-or-chromosome-by-chromosome-different-results

Created 2015-09-11 15:35:09 | Updated 2015-09-11 15:35:29 | Tags: intervals mutect b37

Comments (15)

I hate to put this same error on the GATK forum again, but I went through many of these errors already posted on the forum, but none of the answers shed light on my issue. I have my bam files aligned to GRCh37-lite and am using the same reference genome downloaded from ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/special_requests

I have next performed GATK best practices for pre-processing of these bams using the same ref genome without throwing any error in the process. Currently I'm running MuTect as java -Xmx56g -jar muTect-1.1.4.jar --analysis_type MuTect --reference_sequence ./resources/b37/human_g1k_v37.fasta --cosmic ./resources/Cosmic.b37.vcf --dbsnp ./resources/dbsnp_138.b37.vcf --intervals ./resources/mirna.1.5flank-interval-list.list --input_file:normal $normal.recal_reads.bam --input_file:tumor $tumor.recal_reads.bam --out $sample.call_stats.out --coverage_file $sample.coverage.wig.txt

And getting this error message:

ERROR MESSAGE: Badly formed genome loc: Contig 'chr1' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?

What more tests should I run to troubleshoot this issue? Also, the interval list is what I created from a .bed file. I have restricted my bam files to a limited bed regions using the same file in a command "samtools view -@8 -b -h -L"

This was the file I was most confused about. Is it possible that this file is causing the error? First few lines of this file are: chr1:15869-18936 chr1:28866-32003 chr1:566205-569293 chr1:1100984-1104078 chr1:1101743-1104832 chr1:1102885-1105967 chr1:1229990-1233050 chr1:1246382-1249446 chr1:1273530-1276588 chr1:3043039-3046099 chr1:3475759-3478854 chr1:5622631-5625703 chr1:5921232-5924301 chr1:6488394-6491456 chr1:8925061-8928149 chr1:9210227-9213336 chr1:10025939-10029016 chr1:10286276-10289361 chr1:12087715-12090779



Thanks a ton for your help!

Created 2015-02-24 23:21:25 | Updated | Tags: commandlinegatk intervals

Comments (3)

Hi, I hope this is a quick question; but does using the 'include intervals' command line option '-L' only include the region specified?

For instance if I have an file that includes reads for chromosomes 1,2,6,X,and Y and I specifiy "-L 6", will the walker only process chromosome 6, or will it include the rest of my data as well?

Thank you for the clarification!

Comments (43)

I am running HC3.3-0 with the following options (e.g. GENOTYPE_GIVEN_ALLELES):

$java7 -Djava.io.tmpdir=tmp -Xmx3900m \
 -jar $jar \
 --analysis_type HaplotypeCaller \
 --reference_sequence $ref \
 --input_file $BAM \
 --intervals $CHROM \
 --dbsnp $dbSNP \
 --out $out \
 -stand_call_conf 0 \
 -stand_emit_conf 0 \
 -A Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \
 -L $allelesVCF \
 -L 20:60000-70000 \
 --interval_set_rule INTERSECTION \
 --genotyping_mode GENOTYPE_GIVEN_ALLELES \
 --alleles $allelesVCF \
 --emitRefConfidence NONE \
 --output_mode EMIT_ALL_SITES \

The file $allelesVCF contains these neighbouring SNPs:

20  60807   .   C   T   118.96  .
20  60808   .   G   A   46.95   .
20  61270   .   A   C   2870.18 .
20  61271   .   T   A   233.60  .

I am unable to call these neighbouring SNPs; despite reads being present in the file $BAM, which shouldn't matter anyway. I also tried adding --interval_merging OVERLAPPING_ONLY to the command line, but that didn't solve the problem. What am I doing wrong? I should probably add GATK breaker/misuser to my CV...

Thank you as always.

P.S. The CommandLineGATK documentation does not say, what the default value for --interval_merging is.

P.P.S. Iterative testing a bit slow, because HC always has to do this step:

HCMappingQualityFilter - Filtering out reads with MAPQ < 20

Created 2013-12-09 13:00:14 | Updated | Tags: developer intervals walker

Comments (4)


Recently I've been wanting to preform tasks for each interval in a file using the GATK. Are you guys planning to create an IntervalWalker class? (Or is there a workaround?) DiagnoseTargets in gatk-protected seem to work this way, but not in a very straightforward way - also since it's protected I'm not sure how much if any code I could borrow from there.

cheers Daniel

Created 2013-08-15 09:45:46 | Updated | Tags: intervals

Comments (0)

Hi. I ran MuTect using a tumor-normal pair to detect variants from whole genome, and I got 'N' number of rejected variants (R). Then, I reran MuTect using the same tumor-normal pair but on selected intervals defined as follow: one base pair before and after the positions of the rejected variants (R). I thought that the same number of rejected variants (N) will be returned, but the number of rejected variants (based on intervals) increases. Intervals are saved in .bed

Did I do something wrong? Thanks!

p/s: I would like to know the reasons of variants being rejected, but the first iteration was run without enabling the extended output flag. So, to save time, i chose to rerun MuTect only on the sites where rejected variants are detected.

Created 2013-07-29 16:59:10 | Updated | Tags: selectvariants intervals

Comments (2)


I have a vcf file of indels, and an interval file of single-base positions I am interested in. I would like to select all of the variants in the file that overlap the positions I'm looking at. Using select variants with the interval list I have returns nothing, because the indels are not contained within any intervals. I don't want to just add an arbitrary amount of padding to my intervals because I don't want to include other nearby variants that don't actually overlap my sites.

Is there a way to do this easily in GATK?

Created 2013-07-22 11:09:51 | Updated | Tags: depthofcoverage snp vcf intervals non-human filter

Comments (3)

Hi team,

This is two separate questions:

  1. Starting with a vcf file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverages should be excluded, how does one decide what is very high and low. e.g. 5% either side ?

  2. I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script but I'm trying to do this earlier on in the GATK processes. I've tried ...-L 2L -L 2R -L 3L ...etc, -L 2L 2R 3L ....etc and, -L 2L, 2R, 3R...etc but the result is either input error message or chromosome 2L only.

Many thanks and apologies if I've missed anything in the instructions.



Created 2013-07-15 14:01:25 | Updated | Tags: findcoveredintervals intervals

Comments (3)

Is there an existing tool to convert a GATK formatted interval list file (such as the one outputted by FindCoveredIntervals) to a BED file?

Created 2013-05-15 09:46:39 | Updated 2013-05-15 09:47:04 | Tags: depthofcoverage intervals

Comments (4)

Hi there,

this is my interval_list

chr1 762095 762275 LINC00115|NR_024321 chr1 762280 762414 LINC00115|NR_024321 chr1 762420 762565 LINC00115|NR024321 chr1 777259 777349 LOC643837 chr1 777391 777481 LOC643837 chr1 777482 777642 LOC643837_ chr1 783061 783151 LOC643837 chr1 792270 792446 LOC643837 chr1 861266 861496 NM_152486|SAMD11 chr1 865582 865787 NM_152486|SAMD11 chr1 866331 866507 NM_152486|SAMD11

and this is the output from the sample_interval_summary

chr1:762095-762275 ... chr1:762280-762414 ... chr1:762420-762565 ... chr1:777259-777349 ... chr1:783061-783151 ... chr1:792270-792446 ... chr1:861266-861496 ... chr1:865582-865787 ... chr1:866331-866507 ...

why am I missing two exons?

this is my cmd:

java -Xmx32g -jar /local/apps/gatk/2.5-2-gf57256b/GenomeAnalysisTK.jar -I sample.bam -R .../genome.fa -T DepthOfCoverage -o jtn -geneList hg19.tsv -L exons.list --omitDepthOutputAtEachBase --includeDeletions --interval_merging OVERLAPPING_ONLY -l INFO

Thanks for your input!


Created 2013-05-12 22:55:38 | Updated | Tags: indelrealigner intervals

Comments (2)


I am using one of the 1000 Genomes exome data (.bam format) which is called NA12892, aligned to the GRCh37 build. And I just started to use IndelRealigner tool which requires proper .interval_list file to work.

However, I am unable to find/generate *.interval_list file compatible with my data. Where can I download/generate .interval_list (or .bed) file that are compatible with exome data?

Normally, these files are provided by producers for Library Prep. kits (e.g illumina). But, I couldn't find which interval file should be used in 1000 Genomes data.

Thank you.

Created 2013-05-08 08:23:44 | Updated | Tags: intervals fastaalternatereferencemaker

Comments (6)

Hi, I am using FastaAlternateReferenceMaker and have a set of intervals ordered first by chromosome and then by their start positions. I have tried ordering chromosomes alphabetically(chr1, chr10, chr11,..) as well as numerically (chr1, chr2, chr3...) but the output fasta sequence returned is not in the same order as listed in interval file. I find that even the names target_1, target_2 etc are also not used as fasta headers in the output file. I am stuck with mapping the input intervals with the output fasta sequences. Thanks in advance for all the help, Ramya

Created 2013-04-22 13:27:13 | Updated 2013-04-22 13:29:47 | Tags: commandlinegatk intervals

Comments (3)

I got this error message, when trying to use a file to specify at which positions to emit variants:

ERROR MESSAGE: Couldn't read file /lustre/scratch109/sanger/tc9/agv/wgs/pipeline/union4x.positions because The interval file /lustre/scratch109/sanger/tc9/agv/wgs/pipeline/union4x.positions does not have one of the supported extensions (.bed, .list, .picard, .interval_list, or .intervals). Please rename your file with the appropriate extension. Is there a GATK page describing those 5 file formats? Some of them are unknown to me; e.g. .list.

I asked my question here, but please ignore it: http://gatkforums.broadinstitute.org/discussion/2219/l-option

Thanks a lot.

Also, the error message does not mention support for vcf files, but the documentation does. Are vcf files supported?

Created 2013-02-25 11:45:46 | Updated | Tags: queue intervals

Comments (4)

GATK Queue implements a Scatter/Gather algorithm to create a set of intervals in order to parallelise data alalysis. If -scatter_gather option is issued, a respective number of interval files will be created and the input BAM files will be processed using these intervals. However a question arises what happens if the point of subsequent analysis is at or near the start/end of an interval? Are all the codes which support -l/-L options robust in the respect to interval positions? Since the input files are not actually spliced, all information is available to the processing program which could make the right decisions so that no artefacts are produced. Are there any restrictions on interval creation? Perhaps it should be at least a few read lengths. Anything else? Thanks in advance!

Created 2013-02-20 03:05:44 | Updated 2013-02-20 03:07:11 | Tags: unifiedgenotyper variantrecalibrator intervals

Comments (1)

Hello, I want to know how important it is to have the -L target-intervals.intervals option in UnifiedGenotyper, and if it is recommended in VariantCallRecalibrator too.

I have ran the Unified Genotyper tool with 1 input file at a time or the 2 files I want to compare at once. My command-lines are the following:

java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./1.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./1-gatk.vcf  --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-1.intervals

java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./2.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./2-gatk.vcf  --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-2.intervals

java -jar -Xmx15g GenomeAnalysisTK.jar -R ./genome.fa -T UnifiedGenotyper -I ./1.bam -I ./2.bam --dbsnp ./dbsnp_137.hg19.vcf -o ./1vs2-gatk.vcf  --min_base_quality_score 25 -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L ./intervals-1.intervals -L ./intervals-2.intervals

Thanks, G.