# Tagged with #analyst 24 documentation articles | 0 announcements | 2 forum discussions

Created 2013-07-03 00:53:53 | Updated 2015-01-26 18:11:45 | Tags: analyst readgroup indexing

#### Objective

Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other but this order is recommended.

#### Prerequisites

• Installed Picard tools

#### Steps

1. Sort the aligned reads by coordinate order
2. Mark duplicates
3. Add read group information
4. Index the BAM file

#### Note

You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

### 1. Sort the aligned reads by coordinate order

#### Action

Run the following Picard command, where aligned_reads.bam stands for your input file:

java -jar SortSam.jar \
INPUT=aligned_reads.bam \
OUTPUT=sorted_reads.bam \
SORT_ORDER=coordinate


#### Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

### 2. Mark duplicates

#### Action

Run the following Picard command:

java -jar MarkDuplicates.jar \
INPUT=sorted_reads.bam \
OUTPUT=dedup_reads.bam \
METRICS_FILE=metrics.txt


#### Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also creates a file called metrics.txt that contains metrics regarding duplication of the data.

#### More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.

### 3. Add read group information

#### Action

Run the following Picard command:

java -jar AddOrReplaceReadGroups.jar \
INPUT=dedup_reads.bam \
OUTPUT=addrg_reads.bam \
RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1


#### Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.

### 4. Index the BAM file

#### Action

Run the following Picard command:

java -jar BuildBamIndex.jar \
INPUT=addrg_reads.bam


#### Expected Results

This creates an index file called addrg_reads.bai, which is ready to be used in the Best Practices workflow.

Since Picard tools do not systematically create an index file when they output a new BAM file (unlike GATK tools, which will always output indexed files), it is best to keep the indexing step for last.
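Note that samtools can produce an equivalent index if Picard is unavailable, though it names the output addrg_reads.bam.bai rather than addrg_reads.bai. A minimal sketch:

samtools index addrg_reads.bam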

Created 2013-07-03 00:18:01 | Updated 2013-07-31 15:57:23 | Tags: official analyst bam revert

#### Objective

Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.

#### Prerequisites

• Installed HTSlib

#### Steps

1. Shuffle the reads in the bam file
2. Revert the BAM file to FastQ format
3. Compress the FastQ file

### 1. Shuffle the reads in the bam file

#### Action

Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command:

htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam


#### Expected Result

This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted.

The aligner uses blocks of paired reads to estimate the insert size. If you don't shuffle your original bam, the blocks used for insert size estimation will not be randomly distributed across the genome; rather, they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked.

### 2. Revert the BAM file to FastQ

#### Action

Revert the BAM file to FastQ format by running the following HTSlib command:

htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq


#### Expected Result

This creates an interleaved FastQ file called interleaved_reads.fq containing the now-unmapped paired reads.

Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files.
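For illustration, here is what a pair of records might look like in an interleaved file (read names, sequences and qualities are made up):

@read001/1
ACGTACGTACGTACGT
+
FFFFFFFFFFFFFFFF
@read001/2
TGCATGCATGCATGCA
+
FFFFFFFFFFFFFFFF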

### 3. Compress the FastQ file

#### Action

Compress the FastQ file to reduce its size using the gzip utility:

gzip interleaved_reads.fq


#### Expected Result

This creates a gzipped FastQ file called interleaved_reads.fq.gz. This file is ready to be used as input for the Best Practices workflow.

BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on.
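For example, assuming a BWA version that includes the mem algorithm, the compressed interleaved file can be fed to the aligner directly; the -p flag tells bwa mem to treat the input as interleaved paired-end data (file names are illustrative):

bwa mem -p reference.fa interleaved_reads.fq.gz > aligned_reads.sam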

### 4. Note for advanced users

If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):

htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz


Created 2013-06-17 22:48:43 | Updated 2015-03-30 11:18:39 | Tags: official analyst filtering hardfilters

#### Objective

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

#### Prerequisites

• TBD

#### Steps

1. Extract the SNPs from the call set
2. Determine parameters for filtering SNPs
3. Apply the filter to the SNP call set
4. Extract the Indels from the call set
5. Determine parameters for filtering indels
6. Apply the filter to the Indel call set

### 1. Extract the SNPs from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_variants.vcf \
-L 20 \
-selectType SNP \
-o raw_snps.vcf


#### Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.

### 2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 60.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

• RMSMappingQuality (MQ) 40.0

This is the Root Mean Square of the mapping quality of the reads across all samples.

• MappingQualityRankSumTest (MQRankSum) 12.5

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

• ReadPosRankSumTest (ReadPosRankSum) 8.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 3. Apply the filter to the SNP call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "my_snp_filter" \
-o filtered_snps.vcf


#### Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.
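For example, a minimal sketch of retrieving only the passing calls with SelectVariants (the --excludeFiltered flag drops any record whose FILTER status is not PASS; file names are illustrative):

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V filtered_snps.vcf \
--excludeFiltered \
-o pass_snps.vcf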

### 4. Extract the Indels from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_HC_variants.vcf \
-L 20 \
-selectType INDEL \
-o raw_indels.vcf


#### Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.

### 5. Determine parameters for filtering Indels

Indels matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 200.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

• ReadPosRankSumTest (ReadPosRankSum) 20.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 6. Apply the filter to the Indel call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_indels.vcf \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filterName "my_indel_filter" \
-o filtered_indels.vcf


#### Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

Created 2013-06-17 21:07:46 | Updated 2015-07-27 16:45:16 | Tags: official analyst indel-realignment

#### Objective

Perform local realignment around indels to correct mapping-related artifacts.

#### Prerequisites

• TBD

#### Steps

1. Create a target list of intervals to be realigned
2. Perform realignment of the target intervals

### 1. Create a target list of intervals to be realigned

#### Action

Run the following GATK command (here we assume the input BAM is called dedup_reads.bam):

java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R reference.fa \
-I dedup_reads.bam \
-L 20 \
-known gold_indels.vcf \
-o realignment_targets.list


#### Expected Result

This creates a file called realignment_targets.list containing the list of intervals that the program identified as needing realignment within our target, chromosome 20.

The list of known indel sites (gold_indels.vcf) is used as targets for realignment. Only use it if there is such a list for your organism.

### 2. Perform realignment of the target intervals

#### Action

Run the following GATK command (again assuming the input BAM is called dedup_reads.bam):

java -jar GenomeAnalysisTK.jar \
-T IndelRealigner \
-R reference.fa \
-I dedup_reads.bam \
-targetIntervals realignment_targets.list \
-known gold_indels.vcf \
-o realigned_reads.bam


#### Expected Result

This creates a file called realigned_reads.bam containing all the original reads, but with better local alignments in the regions that were realigned.

Note that here, we didn’t include the -L 20 argument. It's not necessary since the program will only run on the target intervals we are providing.

Created 2013-03-18 20:25:42 | Updated 2013-03-18 20:26:03 | Tags: official varianteval analyst intermediate tooltips

VariantEval accepts two types of modules: stratification and evaluation modules.

• Stratification modules will stratify (group) the variants based on certain properties.
• Evaluation modules will compute certain metrics for the variants

### CpG

CpG is a three-state stratification:

• The locus is a CpG site ("CpG")
• The locus is not a CpG site ("non_CpG")
• The locus is either a CpG or not a CpG site ("all")

A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.

### EvalRod

EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.

### Sample

Sample is an N-state stratification, where N is the number of samples in the eval files.

### Filter

Filter is a three-state stratification:

• The locus passes QC filters ("called")
• The locus fails QC filters ("filtered")
• The locus either passes or fails QC filters ("raw")

### FunctionalClass

FunctionalClass is a four-state stratification:

• The locus is a synonymous site ("silent")
• The locus is a missense site ("missense")
• The locus is a nonsense site ("nonsense")
• The locus is of any functional class ("any")

### CompRod

CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.

### Degeneracy

Degeneracy is a six-state stratification:

• The underlying base position in the codon is 1-fold degenerate ("1-fold")
• The underlying base position in the codon is 2-fold degenerate ("2-fold")
• The underlying base position in the codon is 3-fold degenerate ("3-fold")
• The underlying base position in the codon is 4-fold degenerate ("4-fold")
• The underlying base position in the codon is 6-fold degenerate ("6-fold")
• The underlying base position in the codon is degenerate at any level ("all")

### JexlExpression

JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See the documentation article on using JEXL expressions.

### Novelty

Novelty is a three-state stratification:

• The locus overlaps the knowns comp track (usually the dbSNP track) ("known")
• The locus does not overlap the knowns comp track ("novel")
• The locus either overlaps or does not overlap the knowns comp track ("all")

### CountVariants

CountVariants is an evaluation module that computes the following metrics:

| Metric | Definition |
|--------|------------|
| nProcessedLoci | Number of processed loci |
| nCalledLoci | Number of called loci |
| nRefLoci | Number of reference loci |
| nVariantLoci | Number of variant loci |
| variantRate | Variants per locus rate |
| variantRatePerBp | Number of variants per base |
| nSNPs | Number of SNP loci |
| nInsertions | Number of insertions |
| nDeletions | Number of deletions |
| nComplex | Number of complex loci |
| nNoCalls | Number of no-call loci |
| nHets | Number of het loci |
| nHomRef | Number of hom ref loci |
| nHomVar | Number of hom var loci |
| nSingletons | Number of singletons |
| heterozygosity | Heterozygosity per locus rate |
| heterozygosityPerBp | Heterozygosity per base pair |
| hetHomRatio | Heterozygosity to homozygosity ratio |
| indelRate | Indel rate (insertion count + deletion count) |
| indelRatePerBp | Indel rate per base pair |
| deletionInsertionRatio | Deletion to insertion ratio |

### CompOverlap

CompOverlap is an evaluation module that computes the following metrics:

| Metric | Definition |
|--------|------------|
| nEvalSNPs | Number of eval SNP sites |
| nCompSNPs | Number of comp SNP sites |
| novelSites | Number of eval sites outside of comp sites |
| nVariantsAtComp | Number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same) |
| compRate | Percentage of eval sites at comp sites |
| nConcordant | Number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele) |
| concordantRate | The concordance rate |

#### Understanding the output of CompOverlap

A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):

$ grep -v '##' dbsnp.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10327   rs112750067     T       C       .       .       ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132


Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:

$ grep -v '##' eval_correct_allele.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
1       10327   .       T       C       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059


Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:

$ grep -v '##' eval_incorrect_allele.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
1       10327   .       T       A       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059


Running VariantEval with just the CompOverlap module:

$ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
-R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
-L 1:10327 \
-B:dbsnp,VCF dbsnp.vcf \
-B:eval_correct_allele,VCF eval_correct_allele.vcf \
-B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
-noEV \
-EV CompOverlap \
-o eval.table


We find that the eval.table file contains the following:

$ grep -v '##' eval.table | column -t
CompOverlap  CompRod  EvalRod                JexlExpression  Novelty  nEvalVariants  nCompVariants  novelSites  nVariantsAtComp  compRate      nConcordant  concordantRate
CompOverlap  dbsnp    eval_correct_allele    none            all      1              1              0           1                100.00000000  1            100.00000000
CompOverlap  dbsnp    eval_correct_allele    none            known    1              1              0           1                100.00000000  1            100.00000000
CompOverlap  dbsnp    eval_correct_allele    none            novel    0              0              0           0                0.00000000    0            0.00000000
CompOverlap  dbsnp    eval_incorrect_allele  none            all      1              1              0           1                100.00000000  0            0.00000000
CompOverlap  dbsnp    eval_incorrect_allele  none            known    1              1              0           1                100.00000000  0            0.00000000
CompOverlap  dbsnp    eval_incorrect_allele  none            novel    0              0              0           0                0.00000000    0            0.00000000


As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.

### TiTvVariantEvaluator

TiTvVariantEvaluator is an evaluation module that computes the following metrics:

| Metric | Definition |
|--------|------------|
| nTi | Number of transition loci |
| nTv | Number of transversion loci |
| tiTvRatio | The transition to transversion ratio |
| nTiInComp | Number of comp transition sites |
| nTvInComp | Number of comp transversion sites |
| TiTvRatioStandard | The transition to transversion ratio for comp sites |

Created 2012-11-05 16:20:38 | Updated 2013-01-14 17:35:25 | Tags: official basic analyst intro parallelism performance walkers map-reduce

### Overview

One of the key challenges of working with next-gen sequence data is that input files are usually very large. We can’t just make the program open the files, load all the data into memory and perform whatever analysis is needed on all of it in one go. It’s just too much work, even for supercomputers.

Instead, we make the program cut the job into smaller tasks that the computer can easily process separately. Then we have it combine the results of each step into the final result.

### Map/Reduce

Map/Reduce is the technique we use to achieve this. It consists of three steps formally called filter, map and reduce. Let’s apply it to an example case where we want to find out what is the average depth of coverage in our dataset for a certain region of the genome.

• filter determines what subset of the data needs to be processed in each task. In our example, the program lists all the reference positions in our region of interest.

• map applies the function, i.e. performs the analysis on each subset of data. In our example, for each position in the list, the program looks into the BAM file, pulls out the pileup of bases and outputs the depth of coverage at that position.

• reduce combines the elements in the list of results output by the map function. In our example, the program takes the coverage numbers that were calculated separately for all the reference positions and calculates their average, which is the final result we want.

This may seem trivial for such a simple example, but it is a very powerful method with many advantages. Among other things, it makes it relatively easy to parallelize operations, which makes the tools run much faster on large datasets.
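As a rough command-line analogy (file and region names are made up, and this assumes a reasonably recent samtools), restricting samtools depth to a region is the filter step, emitting one depth value per position is the map step, and averaging those values with awk is the reduce step:

samtools depth -a -r 20:1000000-1001000 my.bam | awk '{ sum += $3; n++ } END { if (n) print sum / n }'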

### Walkers, filters and traversal types

All the tools in the GATK are built from the ground up to take advantage of this method. That’s why we call them walkers: because they “walk” across the genome, getting things done.

Note that even though it’s not included in the Map/Reduce technique’s name, the filter step is very important. It determines what data get presented to the tool for analysis, selecting only the appropriate data for each task and discarding anything that’s not relevant. This is a key part of the Map/Reduce technique, because that’s what makes each task “bite-sized” enough for the computer to handle easily.

Each tool has filters that are tailored specifically for the type of analysis it performs. The filters rely on traversal engines, which are little programs that are designed to “traverse” the data (i.e. walk through the data) in specific ways.

There are three major types of traversal: Locus Traversal, Read Traversal and Active Region Traversal. In our interval coverage example, the tool’s filter uses the Locus Traversal engine, which walks through the data by locus, i.e. by position along the reference genome. Because of that, the tool is classified as a Locus Walker. Similarly, the Read Traversal engine is used, you’ve guessed it, by Read Walkers.

The GATK engine comes packed with many other ways to walk through the genome and get the job done seamlessly, but those are the ones you’ll encounter most often.

Created 2012-10-25 20:40:16 | Updated 2013-09-13 18:04:45 | Tags: official faq basic analyst intro gatk-lite lite

Please note that GATK-Lite was retired in February 2013 when version 2.4 was released. See the announcement here.

You probably know by now that GATK-Lite is a free-for-everyone and completely open-source version of the GATK (licensed under the original MIT license).

But what's in the box? What can GATK-Lite do -- or rather, what can it not do that the full version (let's call it GATK-Full) can? And what does that mean exactly, in terms of functionality, reliability and power?

To really understand the differences between GATK-Lite and GATK-Full, you need some more information on how the GATK works, and how we work to develop and improve it.

### First, you need to understand the two core components of the GATK: the engine and the tools (see picture below).

As explained here, the engine handles all the common work that's related to data access, conversion and traversal, as well as high-performance computing features. The engine is supported by an infrastructure of software libraries. If the GATK were a car, that would be the engine and chassis. What we call the *tools* are attached on top of that, and they provide the various analytical and processing functionalities like variant calling and base or variant recalibration. On your car, that would be headlights, airbags and so on.

### Second is how we work on developing the GATK, and what it means for how improvements are shared (or not) between Lite and Full.

We do all our development work on a single codebase. This means that everything --the engine and all tools-- is on one common workbench. There are not different versions that we work on in parallel -- that would be crazy to manage! That's why the version numbers of GATK-Lite and GATK-Full always match: if the latest GATK-Full version is numbered 2.1-13, then the latest GATK-Lite is also numbered 2.1-13.

The most important consequence of this setup is that when we make improvements to the infrastructure and engine, the same improvements will end up in GATK Lite and in GATK Full. So for the aspects of power, speed and robustness of the GATK that are determined by the engine, there is no difference between them.

For the tools, it's a little more complicated -- but not much. When we "build" the GATK binaries (the .jar files), we put everything from the workbench into the Full build, but we only put a subset into the Lite build. Note that this Lite subset is pretty big -- it contains all the tools that were previously available in GATK 1.x versions, and always will. We also reserve the right to add previews or not-fully-featured versions of the new tools that are in Full, at our discretion, to the Lite build.

### So there are two basic types of differences between the tools available in the Lite and Full builds (see picture below).

1. We have a new tool that performs a brand new function (which wasn't available in GATK 1.x), and we only include it in the Full build.

2. We have a tool that has some new add-on capabilities (which weren't possible in GATK 1.x); we put the tool in both the Lite and the Full build, but the add-ons are only available in the Full build.

Reprising the car analogy, GATK-Lite and GATK-Full are like two versions of the same car -- the basic version and the fully-equipped one. They both have the exact same engine, and most of the equipment (tools) is the same -- for example, they both have the same airbag system, and they both have headlights. But there are a few important differences:

1. The GATK-Full car comes with a GPS (sat-nav for our UK friends), for which the Lite car has no equivalent. You could buy a portable GPS unit from a third-party store for your Lite car, but it might not be as good, and certainly not as convenient, as the Full car's built-in one.

2. Both cars have windows of course, but the Full car has power windows, while the Lite car doesn't. The Lite windows can open and close, but you have to operate them by hand, which is much slower.

### So, to summarize:

• The underlying engine is exactly the same in both GATK-Lite and GATK-Full.
• Most functionalities are available in both builds, performed by the same tools.
• Some functionalities are available in both builds, but they are performed by different tools, and the tool in the Full build is better.
• New, cutting-edge functionalities are only available in the Full build, and there is no equivalent in the Lite build.

We hope this clears up some of the confusion surrounding GATK-Lite. If not, please leave a comment and we'll do our best to clarify further!

Created 2012-10-02 19:21:59 | Updated 2013-09-13 17:29:43 | Tags: official fastareference basic analyst intro inputs

This article describes the steps necessary to prepare your reference file (if it's not one that you got from us). As a complement to this article, see the relevant tutorial.

### Why these steps are necessary

The GATK uses two files to access and safety-check the reference file: a .dict dictionary of the contig names and sizes, and a .fai fasta index file that allows efficient random access to the reference bases. You have to generate both files in order to be able to use a fasta file as reference.

NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.

### Creating the fasta sequence dictionary file

We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.

> java -jar CreateSequenceDictionary.jar R=Homo_sapiens_assembly18.fasta O=Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
Runtime.totalMemory()=2112487424
44.922u 2.308s 0:47.09 100.2%   0+0k 0+0io 2pf+0w


This produces a SAM-style header file describing the contents of our fasta file.

> cat Homo_sapiens_assembly18.dict
@HD     VN:1.0  SO:unsorted
@SQ     SN:chrM LN:16571        UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ     SN:chr1 LN:247249719    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9ebc6df9496613f373e73396d5b3b6b6
@SQ     SN:chr2 LN:242951149    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:b12c7373e3882120332983be99aeb18d
@SQ     SN:chr4 LN:191273063    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:cf37020337904229dca8401907b626c2
@SQ     SN:chr5 LN:180857866    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:031c851664e31b2c17337fd6f9004858
@SQ     SN:chr7 LN:158821424    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:74239c5ceee3b28f0038123d958114cb
@SQ     SN:chr8 LN:146274826    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
@SQ     SN:chr9 LN:140273252    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:ea244473e525dde0393d353ef94f974b
@SQ     SN:chr10        LN:135374737    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4ca41bf2d7d33578d2cd7ee9411e1533
@SQ     SN:chr11        LN:134452384    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:425ba5eb6c95b60bafbf2874493a56c3
@SQ     SN:chr12        LN:132349534    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d17d70060c56b4578fa570117bf19716
@SQ     SN:chr13        LN:114142980    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c4f3084a20380a373bbbdb9ae30da587
@SQ     SN:chr14        LN:106368585    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c1ff5d44683831e9c7c1db23f93fbb45
@SQ     SN:chr15        LN:100338915    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:5cd9622c459fe0a276b27f6ac06116d8
@SQ     SN:chr16        LN:88827254     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3e81884229e8dc6b7f258169ec8da246
@SQ     SN:chr17        LN:78774742     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2a5c95ed99c5298bb107f313c7044588
@SQ     SN:chr18        LN:76117153     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3d11df432bcdc1407835d5ef2ce62634
@SQ     SN:chr20        LN:62435964     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f126cdf8a6e0c7f379d618ff66beb2da
@SQ     SN:chr21        LN:46944323     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
@SQ     SN:chr22        LN:49691432     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2041e6a0c914b48dd537922cca63acb8
@SQ     SN:chr2_random  LN:185571       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:18ceab9e4667a25c8a1f67869a4356ea
@SQ     SN:chr4_random  LN:842648       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9cab2949ccf26ee0f69a875412c93740
@SQ     SN:chr5_random  LN:143687       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:05926bdbff978d4a0906862eb3f773d0
@SQ     SN:chr6_random  LN:1875562      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d62eb2919ba7b9c1d382c011c5218094
@SQ     SN:chr7_random  LN:549659       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:28ebfb89c858edbc4d71ff3f83d52231
@SQ     SN:chr8_random  LN:943810       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:0ed5b088d843d6f6e6b181465b9e82ed
@SQ     SN:chr9_random  LN:1146434      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
@SQ     SN:chr10_random LN:113275       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:50be2d2c6720dabeff497ffb53189daa
@SQ     SN:chr13_random LN:186858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:563531689f3dbd691331fd6c5730a88b
@SQ     SN:chr15_random LN:784346       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bf885e99940d2d439d83eba791804a48
@SQ     SN:chr16_random LN:105485       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:dd06ea813a80b59d9c626b31faf6ae7f
@SQ     SN:chr17_random LN:2617613      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:34d5e2005dffdfaaced1d34f60ed8fc2
@SQ     SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f3814841f1939d3ca19072d9e89f3fd7
@SQ     SN:chr19_random LN:301858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:420ce95da035386cc8c63094288c49e2
@SQ     SN:chr21_random LN:1679693      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:a7252115bfe5bb5525f34d039eecd096
@SQ     SN:chr22_random LN:257318       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4f2d259b82f7647d3b668063cf18378b
@SQ     SN:chrX_random  LN:1719168      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f4d71e0758986c15e5455bf3e14e5d6f


### Creating the fasta index file

We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.

> samtools faidx Homo_sapiens_assembly18.fasta
108.446u 3.384s 2:44.61 67.9%   0+0k 0+0io 0pf+0w


This produces a text file with one record per line for each of the fasta contigs. Each record is of the form: contig, size, location, basesPerLine and bytesPerLine (see the sketch after the listing for how these fields are used). The index file produced above looks like this:

> cat Homo_sapiens_assembly18.fasta.fai
chrM    16571   6       50      51
chr1    247249719       16915   50      51
chr2    242951149       252211635       50      51
chr3    199501827       500021813       50      51
chr4    191273063       703513683       50      51
chr5    180857866       898612214       50      51
chr6    170899992       1083087244      50      51
chr7    158821424       1257405242      50      51
chr8    146274826       1419403101      50      51
chr9    140273252       1568603430      50      51
chr10   135374737       1711682155      50      51
chr11   134452384       1849764394      50      51
chr12   132349534       1986905833      50      51
chr13   114142980       2121902365      50      51
chr14   106368585       2238328212      50      51
chr15   100338915       2346824176      50      51
chr16   88827254        2449169877      50      51
chr17   78774742        2539773684      50      51
chr18   76117153        2620123928      50      51
chr19   63811651        2697763432      50      51
chr20   62435964        2762851324      50      51
chr21   46944323        2826536015      50      51
chr22   49691432        2874419232      50      51
chrX    154913754       2925104499      50      51
chrY    57772954        3083116535      50      51
chr1_random     1663265 3142044962      50      51
chr2_random     185571  3143741506      50      51
chr3_random     749256  3143930802      50      51
chr4_random     842648  3144695057      50      51
chr5_random     143687  3145554571      50      51
chr6_random     1875562 3145701145      50      51
chr7_random     549659  3147614232      50      51
chr8_random     943810  3148174898      50      51
chr9_random     1146434 3149137598      50      51
chr10_random    113275  3150306975      50      51
chr11_random    215294  3150422530      50      51
chr13_random    186858  3150642144      50      51
chr15_random    784346  3150832754      50      51
chr16_random    105485  3151632801      50      51
chr17_random    2617613 3151740410      50      51
chr18_random    4262    3154410390      50      51
chr19_random    301858  3154414752      50      51
chr21_random    1679693 3154722662      50      51
chr22_random    257318  3156435963      50      51
chrX_random     1719168 3156698441      50      51
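As an illustration of how these fields are used, the byte offset of the base at contig:pos (1-based) is location + ((pos - 1) / basesPerLine) * bytesPerLine + (pos - 1) % basesPerLine, using integer division. A hypothetical awk sketch (contig and position are made up):

awk -v contig=chr1 -v pos=1000000 \
'$1 == contig { print $3 + int((pos-1)/$4)*$5 + (pos-1)%$4 }' \
Homo_sapiens_assembly18.fasta.fai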


Created 2012-09-19 18:45:35 | Updated 2013-08-23 22:00:11 | Tags: unifiedgenotyper official variantannotator analyst intermediate

As featured in this forum question.

Two main things account for these kinds of differences, both linked to default behaviors of the tools:

• The tools downsample to different depths of coverage

• The tools apply different read filters

In both cases, you can end up looking at different sets or numbers of reads, which causes some of the annotation values to be different. It's usually not a cause for alarm. Remember that many of these annotations should be interpreted relatively, not absolutely.

Created 2012-08-11 04:50:54 | Updated 2013-09-07 14:18:21 | Tags: official basic analyst ngs

We know this field can be confusing or even overwhelming to newcomers, and getting to grips with a large and varied toolkit like the GATK can be a big challenge. We have produced a presentation that we hope will help you review all the background information that you need to know in order to use the GATK:

• Introduction to NGS Analysis: all you need to know to use the GATK: slides and video

In addition, the following links feature a lot of useful educational material about concepts and terminology related to next-generation sequencing:

Created 2012-08-11 04:36:16 | Updated 2012-10-18 14:57:10 | Tags: official basic analyst developer performance ngs

Imagine a simple question like, "What's the depth of coverage at position A of the genome?"

First, you are given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question is then very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. The algorithm must now take several hours in order to compute this value.

Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.
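For example, with a coordinate-sorted and indexed BAM file, counting the reads that span a single position is a one-liner (position and file name are made up; the index my.bam.bai must exist):

samtools view my.bam 20:10000000-10000000 | wc -l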

This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and with less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. Thus, it is a fundamental rule of working with the GATK, which is the reason for the Central Dogma of the GATK:

#### All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the canonical reference sequences.

Created 2012-08-11 04:16:24 | Updated 2013-01-15 02:59:32 | Tags: official faq basic analyst intervals

### 1. What file formats do you support for interval lists?

We support three types of interval lists, as mentioned here. Interval lists should preferentially be formatted as Picard-style interval lists, with an explicit sequence dictionary, as this prevents accidental misuse (e.g. hg18 intervals on an hg19 file). Note that this file is 1-based, not 0-based (first position in the genome is position 1).
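For illustration, a Picard-style interval list pairs a sequence dictionary header with 1-based, tab-separated five-column records (contig, start, stop, strand, name); the coordinates below are made up:

@HD     VN:1.0  SO:coordinate
@SQ     SN:20   LN:63025520
20      10000000        10100000        +       target_1
20      20000000        20100000        +       target_2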

### 2. I have two (or more) sequencing experiments with different target intervals. How can I combine them?

One relatively easy way to combine your intervals is to use the online tool Galaxy, using the Get Data -> Upload command to upload your intervals, and the Operate on Genomic Intervals command to compute the intersection or union of your intervals (depending on your needs).

Created 2012-08-11 04:08:04 | Updated 2012-10-18 15:00:51 | Tags: official faq basic analyst vcf developer variants

### 1. What file formats do you support for variant callsets?

We support the Variant Call Format (VCF) for variant callsets. No other file formats are supported.

### 2. How can I know if my VCF file is valid?

VCFTools contains a validation tool that will allow you to verify it.
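For example, assuming the Perl tools that ship with VCFtools are installed (file name is illustrative):

vcf-validator my_variants.vcf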

### 3. Are you planning to include any converters from different formats or allow different input formats than VCF?

No, we like VCF and we think it's important to have a good standard format. Multiplying formats just makes life hard for everyone, both developers and analysts.

Created 2012-08-11 03:43:49 | Updated 2013-03-05 17:58:44 | Tags: official faq basic analyst bam

### 1. What file formats do you support for sequencer output?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.

### 2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

### 3. What are the formatting requirements for my BAM file(s)?

All BAM files must satisfy the following requirements:

• It must be aligned to one of the references described here.
• It must be sorted in coordinate order (not by queryname and not "unsorted").
• It must list the read groups with sample names in the header.
• The BAM file must pass Picard validation.

### 4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

### 5. How can I tell if my BAM file is sorted properly?

The easiest way to check is to examine the header of your file with Samtools:

$ samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...


If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

### 6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.

### 7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...


The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

INFO  16:18:26,233 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO  16:18:26,351 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO  16:18:26,351 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
2052
INFO  16:18:26,411 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours
INFO  16:18:26,450 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO  16:18:27,124 GATKRunReport - Uploaded run statistics report to AWS S3


Great! But wait -- where's the result? Last time the result was given on this line:

INFO 21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33


But this time there is no line that says [REDUCE RESULT]! Is something wrong? Not really. The program ran just fine -- but we forgot to give it an output file name. You see, the CountLoci walker is set up to output the result of its calculations to a text file, unlike CountReads, which is perfectly happy to output its result to the terminal screen.

#### Action

So we repeat the command, but this time we specify an output file, like this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt


where -o (lowercase o, not zero) is used to specify the output.

#### Result

You should get essentially the same output on the terminal screen as previously (but notice the difference in the line that contains Program Args -- the new argument is included):

INFO  16:29:15,451 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:29:15,453 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO  16:29:15,453 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  16:29:15,453 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  16:29:15,453 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt
INFO  16:29:15,454 HelpFormatter - Date/Time: 2012/07/25 16:29:15
INFO  16:29:15,454 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:29:15,454 HelpFormatter - ---------------------------------------------------------------------------------
INFO  16:29:15,457 GenomeAnalysisEngine - Strictness is SILENT
INFO  16:29:15,488 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  16:29:15,618 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO  16:29:15,618 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  16:29:15,679 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours
INFO  16:29:15,718 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO  16:29:16,712 GATKRunReport - Uploaded run statistics report to AWS S3


This time however, if we look inside the working directory, there is a newly created file there called output.txt.

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x  9 vdauwera  CHARLES\Domain Users     306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera  CHARLES\Domain Users     204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users    3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users  101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users      20 Apr 10 07:39 exampleFASTA.fasta.fai
-rw-r--r--  1 vdauwera  CHARLES\Domain Users       5 Jul 25 16:29 output.txt


This file contains the result of the analysis:

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% cat output.txt
2052


This means that there are 2052 loci in the reference sequence that are covered by at least one or more reads in the BAM file.

#### Discussion

Okay then, but why not show the full, correct command in the first place? Because this was a good opportunity for you to learn a few of the caveats of the GATK command system, which may save you a lot of frustration later on.

Beyond the common basic arguments that almost all GATK walkers require, most walkers also have specific requirements or options that are important to how they work. You should always check which arguments are required, recommended and/or optional for the walker you want to use before starting an analysis.

Fortunately the GATK is set up to complain (i.e. terminate with an error message) if you try to run it without specifying a required argument. For example, if you try to run this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta


the GATK will spit out a wall of text, including the basic usage guide that you can invoke with the --help option, and more importantly, the following error message:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.0-22-g40f97eb):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR
##### ERROR MESSAGE: Walker requires reads but none were provided.
##### ERROR ------------------------------------------------------------------------------------------


You see the line that says ERROR MESSAGE: Walker requires reads but none were provided? This tells you exactly what was wrong with your command.

So the GATK will not run if a walker does not have all the required inputs. That's a good thing! But in the case of our first attempt at running CountLoci, the -o argument is not required by the GATK to run -- it's just highly desirable if you actually want the result of the analysis!

There will be many other cases of walkers with arguments that are not strictly required, but highly desirable if you want the results to be meaningful.

So, at the risk of getting repetitive, always read the documentation of each walker that you want to use!

Created 2012-07-23 17:05:10 | Updated 2013-03-25 22:18:53 | Tags: official analyst dataprocessingpipeline queue workflow pacbio qscript intermediate

### Introduction

Processing data originating from the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented on numerous occasions. The guidelines we describe in this document are the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (September 2011) for the datasets tested, but there is no guarantee that they will be the best for every dataset or that other pipelines won't supersede them in the future.

The pipeline we propose here is illustrated in a Queue script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

### BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm, relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea of what parameters to use, check the PacBio-specific suggestions given by the BWA author on the BWA manual page. The goal is to account for the known error modes of the Pacific Biosciences RS and to benefit from the long reads for a high-scoring overall match. (For older versions, you can use the filtered_subreads.fasta file and combine it with the base quality scores extracted from the h5 files using the Pacific Biosciences SMRT pipeline python tools.)
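As a rough sketch only (the parameter values below follow the PacBio-oriented suggestions from the BWA manual, but they must be tuned to your data; file names are illustrative):

bwa bwasw -b 5 -q 2 -r 1 -z 10 reference.fa filtered_subreads.fq > aligned_reads.sam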

To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practices to have your BAM file conform to these specifications.

### Best Practices for Variant Calling

Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. the latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according to the type of data you have, with the exception of indel realignment, because that tool has not been adapted for Pacific Biosciences RS data.

### Problems with Variant Calling with Pacific Biosciences

• Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

• Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.

• Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences data instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Bioscience data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.

Created 2012-07-23 16:45:48 | Updated 2014-12-08 17:56:15 | Tags: official inbreedingcoeff phasebytransmission analyst pedigree intermediate phasing methods plink allelefrequency

There are two types of GATK tools that are able to use pedigree (family structure) information:

### Tools that require a pedigree to operate

PhaseByTransmission and CalculateGenotypePosterior will not run without a properly formatted pedigree file. These tools are part of the Genotype Refinement workflow, which is documented here.

### Tools that are able to generate standard variant annotations

The two variant callers (HaplotypeCaller and the deprecated UnifiedGenotyper) as well as VariantAnnotator and GenotypeGVCFs are all able to use pedigree information if you request an annotation that involves population structure (e.g. Inbreeding Coefficient). To be clear though, the pedigree information is not used during the variant calling process; it is only used during the annotation step at the end.

If you already have VCF files that were called without pedigree information, and you want to add pedigree-related annotations (e.g. to use Variant Quality Score Recalibration (VQSR) with the InbreedingCoefficient as a feature annotation), don't panic. Just run the latest version of the VariantAnnotator to re-annotate your variants, requesting any missing annotations, and make sure you pass your PED file to the VariantAnnotator as well. If you forget to provide the pedigree file, the tool will run successfully but pedigree-related annotations may not be generated (this behavior is different in some older versions).
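A minimal sketch of such a re-annotation run (file names are illustrative):

java -jar GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R reference.fa \
-V my_variants.vcf \
-A InbreedingCoeff \
-ped family.ped \
-o reannotated_variants.vcf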

The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.

For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.
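For illustration, a minimal six-column PED file for a trio might look like this (sample names are made up); the columns are Family ID, Individual ID, Paternal ID, Maternal ID, Sex (1=male, 2=female) and Phenotype (-9=missing):

FAM01   father01        0       0       1       -9
FAM01   mother01        0       0       2       -9
FAM01   child01         father01        mother01        1       -9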

Created 2012-07-20 14:22:00 | Updated 2014-01-24 16:00:04 | Tags: official basic analyst forum developer

By default, the forum does not send notification messages about new comments or discussions. If you want to turn on notifications or customize the type of notifications you want to receive (email, popup message etc), you need to do the following:

• Go to your profile page by clicking on your user name (in blue box, top left corner);
• Click on "Edit Profile" (button with silhouette of person, top right corner);
• Select the categories that you want to follow and the type of notification you want to receive.
• Be sure to click on Save Preferences.

To specifically get new GATK announcements, scroll down to "Category Notifications" and tick off the "Announcements" category for email notification for discussions (and comments if you really want to know everything).


Created 2015-06-27 16:56:26 | Updated | Tags: analyst bam gatk

Meaning, does it matter if my intervals file goes chr1, chr10, instead of chr1, chr2? Additionally, will a command line error occur if BAM files are not sorted in coordinate order with respect to the reference? Thank you - Best, Steve

Created 2014-06-10 20:57:36 | Updated | Tags: variantannotator best-practices analyst vcf developer performance walkers java gatk

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read, so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?