The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.
- Whole genome analysis: no need to include intervals
- Whole exome analysis: you need to provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you need to provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet
Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.
-L 20 (for chromosome 20 in b37/b39 build)
-L chr20:1-100 (for chromosome 20 positions 1-100 in hg18/hg19 build)
It is not necessary to use -L in whole genome analysis. You should be interested in the whole genome!
Nevertheless, in some cases, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere). You can do this with -XL, which does the exact opposite of -L; it excludes the provided intervals.
By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.
Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.
|Tool||-L?||Why / why not|
|RealignerTargetCreator||YES||Faster since RTC will only look for regions that need to be realigned within the input interval; no time wasted on the rest.|
|IndelRealigner||NO||IR will only try to realign the regions output from RealignerTargetCreator, so there is nothing to be gained by providing the capture targets.|
|BaseRecalibrator||YES||This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.|
|PrintReads||NO||Output is a bam file; using -L would lead to lost data.|
|UnifiedGenotyper/Haplotype Caller||YES||We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.|
|Next steps||NO||No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step.|
The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.
You can go crazy with -L while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval.This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).
I have tried looking for the good discussion on how to calculate the average coverage of exome sequencing after alignment. I found that depthofcoverage is a good tool to get the output, however, I am unable to understand what all the output of DepthOfCoverage means.
My Aim is to calculate the average x coverage or statistics summary of a depth of coverage of 7 samples of exome sequencing after alignment.
So for that I followed the steps:
create an input bam file with list the bam files with path directing to it. file called input_bam.list eg /home/test/Desktop/bam1.bam /home/test/Desktop/bam2.bam /home/test/Desktop/bam3.bam
we have bed files with region and chr with headers chr start stop name
and sorted the file using following command sort -nk3 -nk5 hgTables.txt > genes_refgene_sorted.txt
after executing following command:
java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed -geneList genes_refgene_sorted.txt -dt NONE
MESSAGE: Input file must have contiguous chromosomes. Saw feature chr22:19510547-19512860 followed later by chr18:19993564-19997878 and then chr22:22113947-22221970, for input source: Desktop/genes_refgene_sorted.txt
please suggest if I should sort the file with a different command.
If I use the command without refgene
java -jar ./../GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T DepthOfCoverage -I input_bam.list -o file_base_name_withbedfile --outputFormat table -R humangenome/ucsc/ucsc.hg19.fasta -L Regions.bed
I get the following output files
file_base_name_withbedfile.sample_cumulative_coverage_counts file_base_name_withbedfile.sample_cumulative_coverage_proportions file_base_name_withbedfile.sample_interval_statistics file_base_name_withbedfile.sample_interval_summary file_base_name_withbedfile.sample_statistics file_base_name_withbedfile.sample_summary
I don't understand which output file is the best to answer my question fo depth.
In the last output file -- file_base_name_withbedfile.sample_summary the output looks like sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_15 test 1162396121 1775.69 500 500 343 91.7 Total 1162396121 1775.69 N/A N/A N/A
I don't understand what to make of it, and why there are NA
and in file file_base_name_withbedfile.sample_interval_summary the output looks like the following, I don't understand what to make out of this apart from total coverage over 3 bam files for that location. That means there are total 6638920 reads (or nt) in 3 bam files (for example) in that particular location. what does test granular Q value mean? which column should I use to average x coverage to state that after alignment the exomes have x coverage.
Target total_coverage average_coverage test_total_cvg test_mean_cvg test_granular_Q1 test_granular_median test_granularQ3 test%_above_15 chr1:1716462-1719040 6638920 2574.22 6638920 2574.22 >500 >500 >500 100.0 chr1:1719110-1720851 4192130 2406.50 4192130 2406.50 >500 >500 >500 91.8 chr1:1721604-1722165 1011309 1799.48 1011309 1799.48 >500 >500 >500 99.3 chr1:1724574-1725729 3912540 3384.55 3912540 3384.55 >500 >500 >500 99.9
If this is a redundant question, could anyone direct me to the correct discussion to understand the output.
Thanks in advance.
Hi, I was wondering if there is a set of best practices developed specifically for calling variants from a small region target capture sample. I looked but I could not find a one-stop comprehensive documentation. If not, are there some specific recommendations (such as certain values of parameters for the exome/targetome analysis? For eg., -L.
I am new to Bioinformatics, and would like some advice on changes to the GATK workflow for cancer. I was told that the cancer workflow is different, and see that several different tools are available.
I have Exome data from tumor and normal. I have aligned them, and have BAM files for each sample. I am interested in identifying somatic variants.
The current workflow as I understand it is: -(Non-GATK) Picard Mark Duplicates or Samtools roundup -Indel Realignment (Realigner TargetCreator + Indel Realigner) -Base Quality Score Reacalibration (Base Recalibrator + PrintReads) -UnifiedGenotyper -Annotation using Oncotator (?) -MuTect (identify somatic mutations)
**My questions are:
Thank you, Gaius
Hello, I've asked this question at the workshop in Brussels, and I would like to post it here: I'm working on an exome analysis on trio. I would like to run VQSR of filteration on the data. since this is an exome project, there are not a lot of varients, and therefore, as I understand. the VQSR is not accurate. You suggest to add more data from 1000Genomes or other published data. The families that I'm working on belongs to a very small and specific population, and I'm afraid that adding published data will add a lot of noise. What do you think, should I add more published data? change parameters such as maxGaussians? do hard filteration?
Dear Geraldine and GATK experts,
I have attended the great Brussels workshop, and I am posting here the BQSR question I had.
I have a human exome experiment with 24-multiplexed samples per lane (Nextera libraries) where we first only did one lane of sequencing (~15x) and then added a second lane (summing up to ~30x). From what I understood reading the Best Practices, I probably don't have enough data to run a BQSR on each sample. How should I then do the BQSR step? Should I skip it altogether? Could I estimate the model parameters on one whole lane (all the samples) and then apply it separately to each sample?
And as a separate question: If I could turn back the clock, would it have been better to do 12-multiplexed samples per lane and run these two lanes of sequencing (24 samples in total) for the same amount of reads but giving me more data to do a BQSR step per sample?
Thanks a lot for your help!
I am running some exome samples with UG. I tried this in 2 ways:
Is there any difference in these approaches if I am only concerned about the variants in exome region? I have ran in both ways and on comparison I see that some values are different for the same locus. For eg. PL, MQRankSum, BaseQRankSum. However, the different is not huge but does UG perform slightly differently for making calls with target regions vs without?
PS: I have ran this for UG with multiple samples
I am trying to run UG across just over 2,200 individuals (exome sequencing). I have successfully done this on our computing cluster with just over 1,000 samples without issues (apart from having to get the limit on no. of open files (ulimit) increased).
I got another increase in ulimit to allow me to run UG on the larger set. However, our IO is being pushed over the edge with the 2,200 input samples. I have two questions:
Would appreciate any advice you would have on getting this to run on this size of data. Thanks!
We are running GATK HaplotypeCaller on ~50 whole exome samples. We are interested in rare variants - so we ran GATK in single sample mode instead of multi sample as you recommend, however we would like to take advantage of VQSR. What would you recommend? Can we run VQSR on the output from GATK single sample?
Additionally, we are likely to run extra batches of new exome samples. Should we wait until we have them all before running them through the GATK pipeline?
Many thanks in advance.
Sorry if this is the wrong forum for this question - I just thought someone might have an idea/opinion...
Should I trim/filter exome sequencing reads prior to mapping with BWA and variant calling using GATK? I am currently filtering out reads in which <80% of bases have quality>=Q30 but I lose >20% of my reads this way. Does GATK take quality into account therefore rendering pre-mapping filtering unecessary?
Thanks in advance, Kath
Hi I am trying to run the DepthofCoverage walker from GATK for the first time and I am running into stumbling blocks. I am interested in getting coverage info for genes and I downloaded the Refseq genes list and am filtering and sorting according to specifications. I am using the SortByRef.pl script provided in the website. However, I am getting an error saying:
Can not open temporary file 1045: Too many open files at SortByRef.pl line 68, <$INPUT> line 1021.
I also checked the soft/hard limits for filesystem and it shows that my current system's limit is the following:
ulimit -u :386246 ulimit -S -u: 386246
I am stuck at this point and don't know how to proceed.
Any help is much appreciated.
Hi, I'm working with trios and small-pedigrees (up to six individuals). The VQSR section of the 'best practice' document states that 'in order to achieve the best exome results one needs an exome callset with at least 30 samples', and suggests to add additional samples such as 1000 genomes BAMs.
I' a little confused about two aspects:
1) the addition of 1000G BAMs being suggested in the VQSR section. If we need the 1000G call sets, we'd have to run these through the HaplotypeCaller or UnifiedGenotyper stages? Please forgive the question - I'm not trying to find fault in your perfect document, but please confirm as it would dramatically increase compute time (though only once), and overlaps with my next point of confusion:
2) I can understand how increasing the number of individuals from a consistent cohort, or maybe even from very similar experimental platforms, improves the outcome of the VQSR stage. However, the workshop video comments that the variant call properties are highly dependent on individual experiments (design, coverage, technical, etc). So I can't understand how the overall result is improved when I add variant calls from 30 1000G exomes (with their own typical variant quality distributions) to my trio's sample variant calls (also with their own, but very different to the 1000G's, quality distribution).
Hopefully I'm missing an important point somewhere? Many thanks in advance, K
I have exome data for a few sets of parent-offspring trios, in which offspring have phenotypically related but probably genetically different diseases. Their parents are unaffected so I am particularly interested in identifying de novo mutations. My plan was to analyse each individual separately up to the variant calling phase and then to input three (mother, father, child) analysis-ready BAMs into the UnifiedGenotyper along with a ped file. My questions are:
1) Can you tell me whether the UnifiedGenotyper uses pedigree information in the ped file to call genotypes more accurately? In other words, is this better than calling variants jointly without supplying the ped file? If not, does GATK recommend any external tools for doing this step?
2) It is better to call variants jointly using all the trios (even though they are not related and probably don't share the same disease-causing mutations)?
I found the materials of the BroadE Workshop very helpful, especially the slide on analyzing variant calls using VariantEval, because there is not much documentation for it on GATK site. As an example 62 whole genome sequencing samples from north Europe were evaluated together with 1000G FIN samples, and also the polymorphic and monomorphic sites on the 1000G genotype chip were used as comparator. I would like very much to do the same for our whole exome data, the question is: is there good quality whole exome data that I can use to evaluate our own exome data?
I have checked the NHLBI ESP project Exome Variant Server site, the vcf files can be downloaded doesn't have the genotype data.
Thanks in advance!
I have exome data run on two lanes per library is it better to combine the lanes into one or to run each lane independently through GATK? What are the pros and cons? Many thanks
I have seen the definition of strand bias on this site (below) but I need a little clarification. Does the FS filter (a) highlight instances where reads are only present on a single strand and contain a variant (as may occur toward the end of exome capture regions) or does it (b) specifically look for instances where there are reads on both strands but the variant allele is disproportionately represented on one strand (as might be indicative of a false positive), or does it (c) do both?
I had thought it did (b) but have encountered some disagreement.
** How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls.
Hi all, I have a 4 exome experiment (high coverage). According to best practices I can't apply VQSR, so I must filter on SNP properties. One thing is not clear: should all the option apply? In other words should I select using boolean "and" (i.e. QD < 2.0 & MQ < 40.0 & FS > 60.0 & HaplotypeScore > 13.0 & MQRankSum < -12.5 & ReadPosRankSum < -8.0 for SNP) or "or" (i.e. QD < 2.0 | MQ < 40.0 | FS > 60.0 | HaplotypeScore > 13.0 | MQRankSum < -12.5 | ReadPosRankSum < -8.0 for SNP)?
I am currently working on a Exome sequencing projekt with older single-end SOLiD exomes and new paired-end exomes. In a first attempt (GATK 1.7 and best practices v3 back then) i tried calling and recalibrating all exomes together (at that time 120) without selecting for paired/single-end. As I already had validatet many variants I could check the quality of the calls and got very bad results, especially for InDels (previously called, true positive variants missing). My idee is that the UnifiedGenotyper has Problems mixing paired-end with single-end exomes.
Is there any official recommendation for this problem? My solution right now is to group the exomes in batches (40-50 Exomes) which ran on the same technology.
Also a second Problem/Question: For some individuals exomes where sequenced twice, and for some of these the first run was single-end and the second one was paired. The best practices mentions one should se all available reads for a individual for calling. Do you have any experience on how to handle these cases?
Any help is greatly appreciated!
I'm curious about the experience of the community at large with VQSR, and specifically with which sets of annotations people have found to work well. The GATK team's recommendations are valuable, but my impression is that they have fairly homogenous data types - I'd like to know if anyone has found it useful to deviate from their recommendations.
For instance, I no longer include InbreedingCoefficient with my exome runs. This was spurred by a case where previously validated variants were getting discarded by VQSR. It turned out that these particular variants were homozygous alternate in the diseased samples and homozygous reference in the controls, yielding an InbreedingCoefficient very close to 1. We decided that the all-homozygous case was far more likely to be genuinely interesting than a sequencing/variant calling artifact, so we removed the annotation from VQSR. In order to catch the all-heterozygous case (which is more likely to be an error), we add a VariantFiltration pass for 'InbreedingCoefficient < -0.8' following ApplyRecalibration.
In my case, I think InbreedingCoefficient isn't as useful because my UG/VQSR cohorts tend to be smaller and less diverse than what the GATK team typically runs (and to be honest, I'm still not sure we're doing the best thing). Has anyone else found it useful to modify these annotations? It would be helpful if we could build a more complete picture of these metrics in a diverse set of experiments.