Tagged with #indelrealigner
2 documentation articles | 1 announcement | 91 forum discussions



Created 2016-03-08 16:49:04 | Updated 2016-03-23 22:16:45 | Tags: indelrealigner realignertargetcreator indel-realignment

Comments (2)

This tutorial replaces Tutorial#2800 and applies to data types within the scope of the GATK Best Practices variant discovery workflow.

We provide example data and example commands for performing local realignment around small insertions and deletions (indels) against a reference. The resulting BAM reduces false positive SNPs and represents indels parsimoniously. First we use RealignerTargetCreator to identify and create a target intervals list (step 1). Then we perform local realignment for the target intervals using IndelRealigner (step 2).


Jump to a section

  1. Introduction
  2. Create target intervals list using RealignerTargetCreator
  3. Realign reads using IndelRealigner
  4. Some additional considerations
  5. Related resources

1. Introduction and tutorial materials

Why do indel realignment?

Local realignment around indels allows us to correct mapping errors made by genome aligners and make read alignments more consistent in regions that contain indels.

Genome aligners can only consider each read independently, and the scoring strategies they use to align reads relative to the reference limit their ability to align reads well in the presence of indels. Depending on the variant event and its relative location within a read, the aligner may favor alignments with mismatches or soft-clips instead of opening a gap in either the read or the reference sequence. In addition, the aligner's scoring scheme may use arbitrary tie-breaking, leading to different, non-parsimonious representations of the event in different reads.

In contrast, local realignment considers all reads spanning a given position. This makes it possible to achieve a high-scoring consensus that supports the presence of an indel event. It also produces a more parsimonious representation of the data in the region .

This two-step indel realignment process first identifies such regions where alignments may potentially be improved, then realigns the reads in these regions using a consensus model that takes all reads in the alignment context together.

Tools involved

Prerequisites

  • Installed GATK tools
  • Coordinate-sorted and indexed BAM alignment data
  • Reference sequence, index and dictionary
  • An optional VCF file representing population variants, subset for indels

Download example data

  • To download the reference, open ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/ in your browser. Leave the password field blank. Download the following three files (~860 MB) to the same folder: human_g1k_v37_decoy.fasta.gz, .fasta.fai.gz, and .dict.gz. This same reference is available to load in IGV.
  • Click tutorial_7156.tar.gz to download the tutorial data. The data is human paired 2x150 whole genome sequence reads originally aligning at ~30x depth of coverage. The sample is a PCR-free preparation of the NA12878 individual run on the HiSeq X platform. I took the reads aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) and sanitized and realigned the reads (BWA-MEM -M) to the entire genome according to the workflow presented in Tutorial#6483 and marked duplicates using MarkDuplicates according to Tutorial#6747. We expect the alignment to reveal a good proportion of indels given its long reads (~150 bp per read), high complexity (PCR-free whole genome data) and deep coverage depth (30x).

    Tutorial download also contains a known indels VCF from Phase 3 of the 1000 Genomes Project subset for indel-only records in the interval 10:96,000,000-97,000,000. These represent consensus common and low-frequency indels in the studied populations from multiple approaches. The individual represented by our snippet, NA12878, is part of the 1000 Genomes Project data. Because of the differences in technology and methods used by the Project versus our sample library, our library has potential to reveal additional variants.

back to top


2. Create target intervals list using RealignerTargetCreator

For simplicity, we use a single known indels VCF, included in the tutorial data. For recommended resources, see Article#1247.

In the command, RealignerTargetCreator takes a coordinate-sorted and indexed BAM and a VCF of known indels and creates a target intervals file.

java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R human_g1k_v37_decoy.fasta \
    -L 10:96000000-97000000 \
    -known INDEL_chr10_1Mb_b37_1000G_phase3_v4_20130502.vcf \
    -I 7156_snippet.bam \
    -o 7156_realignertargetcreator.intervals

In the resulting file, 7156_realignertargetcreator.intervals, intervals represent sites of extant and potential indels. If sites are proximal, the tool represents them as a larger interval spanning the sites.

Comments on specific parameters

  • We specify the BAM alignment file with -I.
  • We specify the known indels VCF file with -known. The known indels VCF contains indel records only.
  • Three input choices are technically feasible in creating a target intervals list: you may provide RealignerTargetCreator (i) one or more VCFs of known indels each passed in via -known, (ii) one or more alignment BAMs each passed in via -I or (iii) both. We recommend the last mode, and we use it in the example command. We use these same input files again in the realignment step.

    The tool adds indel sites present in the known indels file and indel sites in the alignment CIGAR strings to the targets. Additionally, the tool considers the presence of mismatches and soft-clips, and adds regions that pass a concentration threshold to the target intervals.

    If you create an intervals list using only the VCF, RealignerTargetCreator will add sites of indel only records even if SNPs are present in the file. If you create an intervals list using both alignment and known indels, the known indels VCF should contain only indels. See Related resources.

  • We include -L 10:96000000-97000000 in the command to limit processing time. Otherwise, the tool traverses the entire reference genome and intervals outside these coordinates may be added given our example 7156_snippet.bam contains a small number of alignments outside this region.
  • The tool samples to a target coverage of 1,000 for regions with greater coverage.

The target intervals file

The first ten rows of 7156_realignertargetcreator.intervals are as follows. The file is a text-based one-column list with one interval per row in 1-based coordinates. Header and column label are absent. For an interval derived from a known indel, the start position refers to the corresponding known variant. For example, for the first interval, we can zgrep -w 96000399 INDEL_chr10_1Mb_b37_1000G_phase3_v4_20130502.vcf for details on the 22bp deletion annotated at position 96000399.

10:96000399-96000421
10:96002035-96002036
10:96002573-96002577
10:96003556-96003558
10:96004176-96004177
10:96005264-96005304
10:96006455-96006461
10:96006871-96006872
10:96007627-96007628
10:96008204

To view intervals on IGV, convert the list to 0-based BED format using the following AWK command. The command saves a new text-based file with .bed extension where chromosome, start and end are tab-separated, and the start position is one less than that in the intervals list.

awk -F '[:-]' 'BEGIN { OFS = "\t" } { if( $3 == "") { print $1, $2-1, $2 } else { print $1, $2-1, $3}}' 7156_realignertargetcreator.intervals > 7156_realignertargetcreator.bed

back to top


3. Realign reads using IndelRealigner

In the following command, IndelRealigner takes a coordinate-sorted and indexed BAM and a target intervals file generated by RealignerTargetCreator. IndelRealigner then performs local realignment on reads coincident with the target intervals using consenses from indels present in the original alignment.

java -Xmx8G -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R human_g1k_v37_decoy.fasta \
    -targetIntervals 7156_realignertargetcreator.intervals \
    -known INDEL_chr10_1Mb_b37_1000G_phase3_v4_20130502.vcf \ 
    -I 7156_snippet.bam \
    -o 7156_snippet_indelrealigner.bam

The resulting coordinate-sorted and indexed BAM contains the same records as the original BAM but with changes to realigned records and their mates. Our tutorial's two IGV screenshots show realigned reads in two different loci. For simplicity, the screenshots show the subset of reads that realigned. For screenshots of full alignments for the same loci, see here and here.

Comments on specific parameters

  • The -targetIntervals file from RealignerTargetCreator, with extension .intervals or .list, is required. See section 1 for a description.
  • Specify each BAM alignment file with -I. IndelRealigner operates on all reads simultaneously in files you provide it jointly.
  • Specify each optional known indels VCF file with -known.
  • For joint processing, e.g. for tumor-normal pairs, generate one output file for each input by specifying -nWayOut instead of -o.
  • By default, and in this command, IndelRealigner applies the USE_READS consensus model. This is the consensus model we recommend because it balances accuracy and performance. To specify a different model, use the -model argument. The KNOWNS_ONLY consensus model constructs alternative alignments from the reference sequence by incorporating any known indels at the site, the USE_READS model from indels in reads spanning the site and the USE_SW model additionally from Smith-Waterman alignment of reads that do not perfectly match the reference sequence.

    The KNOWNS_ONLY model can be sufficient for preparing data for base quality score recalibration. It can maximize performance at the expense of some accuracy. This is the case only given the known indels file represents common variants for your data. If you specify -model KNOWNS_ONLY but forget to provide a VCF, the command runs but the tool does not realign any reads.

  • If you encounter out of memory errors, try these options. First, increase max java heap size from -Xmx8G. To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. If this does not help, and you are jointly processing data, then try running indel realignment iteratively on smaller subsets of data before processing them jointly.
  • IndelRealigner performs local realignment without downsampling. If the number of reads in an interval exceeds the 20,000 default threshold set by the -maxReads parameter, then the tool skips the region.
  • The tool has two read filters, BadCigarFilter and MalformedReadFilter. The tool processes reads flagged as duplicate.

Changes to alignment records

For our example data,194 alignment records realign for ~89 sites. These records now have the OC tag to mark the original CIGAR string. We can use the OC tag to pull out realigned reads and instructions for this are in section 4. The following screenshot shows an example pair of records before and after indel realignment. We note seven changes with asterisks, blue for before and red for after, for both the realigned read and for its mate.

Changes to the example realigned record:

  • MAPQ increases from 60 to 70. The tool increases each realigned record's MAPQ by ten.
  • The CIGAR string, now 72M20I55M4S, reflects the realignment containing a 20bp insertion.
  • The OC tag retains the original CIGAR string (OC:Z:110M2I22M1D13M4S) and replaces the MD tag that stored the string for mismatching positions.
  • The NM tag counts the realigned record's mismatches, and changes from 8 to 24.

Changes to the realigned read's mate record:

  • The MC tag updates the mate CIGAR string (to MC:Z:72M20I55M4S).
  • The MQ tag updates to the new mapping quality of the mate (to MQ:i:70).
  • The UQ tag updates to reflect the new Phred likelihood of the segment, from UQ:i:100 to UQ:i:68.

back to top


3. Some additional considerations

RealignerTargetCreator documentation has a -maxInterval cutoff to drop intervals from the list if they are too large. This is because increases in number of reads per interval quadratically increase the compute required to realign a region, and larger intervals tend to include more reads. By the same reasoning, increasing read depth, e.g. with additional alignment files, increases required compute.

Our tutorial's INDEL_chr10_1Mb_b37_1000G_phase3_v4_20130502.vcf contains 1168 indel-only records. The following are metrics on intervals created using the three available options.

               #intervals    avg length     basepair coverage     
VCF only       1161           3.33           3,864         
BAM only        487          15.22           7,412          
VCF+BAM        1151          23.07          26,558         

You can project the genomic coverage of the intervals as a function of the interval density (number of intervals per basepair) derived from varying the known indel density (number of indel records in the VCF). This in turn allows you to anticipate compute for indel realignment. The density of indel sites increases the interval length following a power law (y=ax^b). The constant (a) and the power (b) are different for intervals created with VCF only and with VCF+BAM. For our example data, these average interval lengths are well within the length of a read and minimally vary the reads per interval and thus the memory needed for indel realignment.

back to top


4. Related resources

  • See the Best Practice Workflow and click on the flowchart's Realign Indels icon for best practice recommendations and links including to a 14-minute video overview.
  • See Article#1247 for guidance on using VCF(s) of known variant sites.
  • To subset realigned reads only into a valid BAM, as shown in the screenshots, use samtools view 7088_snippet_indelrealigner.bam | grep 'OC' | cut -f1 | sort > 7088_OC.txt to create a list of readnames. Then, follow direction in blogpost SAM flags down a boat on how to create a valid BAM using FilterSamReads.
  • See discussion on multithreading for options on speeding up these processes. The document titled How can I use parallelism to make GATK tools run faster? gives two charts: (i) the first table relates the three parallelism options to the major GATK tools and (ii) the second table provides recommended configurations for the tools. Briefly, RealignerTargetCreator runs faster with increasing -nt threads, while IndelRealigner shows diminishing returns for increases in scatter-gather threads provided by Queue. See blog How long does it take to run the GATK Best Practices? for a breakdown of the impact of threading and CPU utilization for Best Practice Workflow tools.
  • See DePristo et al's 2011 Nature Genetics technical report for benchmarked effects of indel realignment as well as for the mathematics behind the algorithms.
  • See Tutorial#6517 for instructions on creating a snippet of reads corresponding to a genomic interval. For your research aims, you may find testing a small interval of your alignment and your choice VCF, while adjusting parameters, before committing to processing your full dataset, is time well-invested.
  • The tutorial's PCR-free 2x150 bp reads give enough depth of coverage (34.67 mean and 99.6% above 15) and library complexity to allow us the confidence to use aligner-generated indels in realignment. Check alignment coverage with DepthofCoverage for WGS or DiagnoseTargets for WES.
  • See SelectVariants to subset out indel calls using the -selectType INDEL option. Note this excludes indels that are part of mixed variant sites (see FAQ). Current solutions to including indels from mixed sites involves the use of JEXL expressions, as discussed here. Current solutions to selecting variants based on population allelic frequency (AF), as we may desire to limit our known indels to those that are more common than rare for more efficient processing, are discussed in two forum posts (1,2).
  • See Tutorial#6491 for basic instructions on using the Integrative Genomics Viewer (IGV).


Created 2012-07-23 16:48:55 | Updated 2012-09-30 23:35:55 | Tags: indelrealigner realignertargetcreator

Comments (4)

Realigner Target Creator

For a complete, detailed argument reference, refer to the GATK document page here.


Indel Realigner

For a complete, detailed argument reference, refer to the GATK document page here.


Running the Indel Realigner only at known sites

While we advocate for using the Indel Realigner over an aggregated bam using the full Smith-Waterman alignment algorithm, it will work for just a single lane of sequencing data when run in -knownsOnly mode. Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained and limited. We modify the example arguments from above to reflect the command-lines necessary for known-only/lane-level cleaning.

The RealignerTargetCreator step would need to be done just once for a single set of indels; so as long as the set of known indels doesn't change, the output.intervals file from below would never need to be recalculated.

 java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals \
  -known /path/to/indel_calls.vcf

The IndelRealigner step needs to be run on every bam file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I <lane-level.bam> \
  -R <ref.fasta> \
  -T IndelRealigner \
  -targetIntervals <intervalListFromStep1Above.intervals> \
  -o <realignedBam.bam> \
  -known /path/to/indel_calls.vcf
  --consensusDeterminationModel KNOWNS_ONLY \
  -LOD 0.4

Created 2013-08-21 21:15:21 | Updated 2014-02-08 20:09:15 | Tags: indelrealigner unifiedgenotyper variantrecalibrator haplotypecaller reducereads release-notes

Comments (2)

GATK 2.7 was released on August 21, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history


Reduce Reads

  • Changed the underlying convention of having unstranded reduced reads; instead there are now at least 2 compressed reads at every position, one for each strand (forward and reverse). This allows us to maintain strand information that is useful for downstream filtering.
  • Fixed bug where representative depths were arbitrarily being capped at 127 (instead of the expected 255).
  • Fixed bug where insertions downstream of a variant region weren't triggering a stop to the compression.
  • Fixed bug when using --cancer_mode where alignments were being emitted out of order (and causing the tool to fail).

Unified Genotyper

  • Added --onlyEmitSamples argument that, when provided, instructs that caller to emit only the selected samples into the VCF (even though the calling is performed over all samples present in the provided bam files).
  • FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
  • Added a (very) experimental argument (allSitePLs) that will have the caller emit PLs for all sites (including reference sites). Note that this does not give a fully accurate reference model because it models only SNPs. Full a proper handling of the reference model, please use the Haplotype Caller.

Haplotype Caller

  • Added a still somewhat experimental PCR indel error model to the Haplotype Caller. By default this modeling is turned on and is very useful for removing false positive indel calls associated with PCR slippage around short tandem repeats (esp. homopolymers). Users have the option (with the --pcr_indel_model argument) of turning it off or making it even more aggressive (at the expense of losing some true positives too).
  • Added the ability to emit accurate likelihoods for non-variant positions (i.e. what we call a "reference model" that incorporates indels as well as SNP confidences at every position). The output format can be either a record for every position or use the gVCF style recording of blocks. See the --emitRefConfidence argument for more details; note that this replaces the use of "--output_mode EMIT_ALL_SITES" in the HaplotypeCaller.
  • Improvements to the internal likelihoods that are generated by the Haplotype Caller. Specifically, this tool now uses a tri-state correction like the Unified Genotyper, corrects for overlapping read pairs (from the same underlying fragment), and does not run contamination removal (allele-biased downsampling) by default.
  • Several small runtime performance improvements were added (although we are still hard at work on larger improvements that will allow calling to scale to many samples; we're just not there yet).
  • Fixed bug in how adapter clipping was performed (we now clip only after reverting soft-clipped bases).
  • FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
  • Improved the "dangling tail" recovery in the assembly algorithm, which allows for higher sensitivity in calling variants at the edges of coverage (e.g. near the ends of targets in an exome).
  • Added the ability to run allele-biased downsampling with different per-sample values like the Unified Genotyper (contributed by Yossi Farjoun).

Variant Annotator

  • Fixed bug where only the last -comp was being annotated at a site.

Indel Realigner

  • Fixed bug that arises because of secondary alignments and that was causing the tool not to update the alignment start of the mate when a read was realigned.

Phase By Transmission

  • Fixed bug where multi-allelic records were being completely dropped by this tool. Now they are emitted unphased.

Variant Recalibrator

  • General improvements to the Gaussian modeling, mostly centered around separating the parameters for the positive and negative training models.
  • The percentBadVariants argument has been replaced with the numBad argument.
  • Added mode to not emit (at all) variant records that are filtered out.
  • This tool now automatically orders the annotation dimensions by their standard deviation instead of the order they were specified on the command-line in order to stabilize the training and have it produce optimal results.
  • Fixed bug where the tool occasionally produced bad log10 values internally.

Miscellaneous

  • General performance improvements to the VCF reading code contributed by Michael McCowan.
  • Error messages are much less verbose and "scary."
  • Added a LibraryReadFilter contributed by Louis Bergelson.
  • Fixed the ReadBackedPileup class to represent mapping qualities as ints, not (signed) bytes.
  • Added the engine-wide ability to do on-the-fly BAM file sample renaming at runtime (see the documentation for the --sample_rename_mapping_file argument for more details).
  • Fixed bug in how the GATK counts filtered reads in the traversal output.
  • Added a new tool called Qualify Intervals.
  • Fixed major bug in the BCF encoding (the previous version was producing problematic files that were failing when trying to be read back into the GATK).
  • Picard/sam/tribble/variant jars updated to version 1.96.1534.

Created 2016-05-20 22:32:20 | Updated | Tags: indelrealigner realignertargetcreator queue performance gatk lsf memory

Comments (2)

Hi,

I'm using a QScript to run the GATK best practices on LSF. I'm trying to run the data processing steps now. RealignerTargetCreator and IndelRealigner are using so much memory that my jobs are being killed by LSF. Usually they don't even make it through RealignerTargetCreator, but the ones that do make it have a memory explosion in IndelRealigner and get killed at that step.

I'm testing the pipeline with a very small dataset: 3 bam files of about 10 million reads each. Even when I downsample them to 1 million reads each, the memory still explodes. I've tried requesting up to 32g of memory. Higher memory requests allow the jobs to run longer, but they eventually outgrow even 32g.

A few notes:

  • I'm using GATK/Queue 3.5-0.
  • The "-memLimit" option is working correctly so that my jobs are submitted with the correct memory request. LSF is reporting that the actual memory usage is indeed extremely high.
  • I've validated my bam files with the Picard validation tool.
  • I'm using 1000G_phase1.indels.hg19.vcf and dbsnp_137.hg19.vcf from the GATK resource bundle downloaded in early 2013.

Do you have any idea what could be happening?

Here is an example of the output from LSF:

Sender: LSF System <hpcadmin@c27>
Subject: Job 57251: <RealignerTargetCreator: B11.sorted.rg.downsample.interval_list> Exited

Job <RealignerTargetCreator: B11.sorted.rg.downsample.interval_list> was submitted from host <c15> by user <russellp> in cluster <compbio_cluster1>.
Job was executed on host(s) <c27>, in queue <shared>, as user <russellp> in cluster <compbio_cluster1>.
</home/russellp> was used as the home directory.
</home/russellp/Projects/IPF_resequencing/GATK_testing> was used as the working directory.
Started at Fri May 20 16:10:00 2016
Results reported at Fri May 20 16:11:03 2016

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
sh /home/russellp/Projects/IPF_resequencing/GATK_testing/.queue/tmp/.exec5219091543152698331
------------------------------------------------------------

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 130.

Resource usage summary:

    CPU time   :     79.19 sec.
    Max Memory :      7951 MB
    Max Swap   :     34733 MB

    Max Processes  :         4
    Max Threads    :        29

Here is my QScript:

package qscripts

import org.broadinstitute.gatk.queue.QScript
//import org.broadinstitute.gatk.queue.extensions.picard.MarkDuplicates
import org.broadinstitute.gatk.queue.extensions.gatk._

/**
  * Created by prussell on 3/3/16.
  * Implements GATK best practices
  */
class GatkBestPractices extends QScript {

  /**
    * *********************************************************************
    *                               INPUTS
    * *********************************************************************
    */

  /**
    * Reference genome fasta
    */
  @Input(doc="Reference genome for the bam files", shortName="R", fullName="REF_FASTA", required=true)
  var referenceFile: File = null

  /**
    * Bam files
    */
  @Input(doc="One or more bam files", shortName="I", fullName="INPUT_BAM", required=true)
  var bamFiles: List[File] = Nil

  /**
    * VCF files
    */
  @Input(doc="VCF file(s) with known indels", fullName="KNOWN_INDELS", required=true)
  var knownIndels: List[File] = Nil
  @Input(doc="Database of known variants e.g. dbSNP", fullName="KNOWN_VARIANTS", required=true)
  var knownPolymorphicSites: List[File] = Nil

  /**
    * Output directory
    */
  @Input(doc="Output directory", shortName="od", fullName="OUT_DIR", required=true)
  var outputDirectory : File = null

  /**
    * Output file prefix not including directory
    */
  @Input(doc="Output prefix not including directory", shortName="op", fullName="OUT_PREFIX", required=true)
  var outputPrefix : File = null

  /**
    * Common arguments
    */
  trait CommonArguments extends CommandLineGATK {
    this.reference_sequence = referenceFile
    // TODO other common arguments?
  }

  def script() = {    

    /**
      * Container for the processed bam files
      */
    var processedFiles = Seq.empty[File]

    /**
      * Data processing
      */
    for(bam <- bamFiles) {

      /**
        * *********************************************************************
        * LOCAL REALIGNMENT AROUND INDELS
        * https://www.broadinstitute.org/gatk/guide/article?id=38
        * https://www.broadinstitute.org/gatk/guide/article?id=2800
        * *********************************************************************
        */

      /**
        * Local realignment around indels
        * Step 1 of 2: RealignerTargetCreator: Define intervals to target for local realignment
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
        */
      val realignerTargetCreator = new RealignerTargetCreator with CommonArguments
      realignerTargetCreator.input_file +:= bam
      realignerTargetCreator.known = knownIndels
      realignerTargetCreator.out = swapExt(bam, "bam", "interval_list")
      realignerTargetCreator.maxIntervalSize = int2intOption(500) // Default 500
      realignerTargetCreator.minReadsAtLocus = int2intOption(4) // Default 4
      realignerTargetCreator.mismatchFraction = double2doubleOption(0.0) // Default 0.0
      realignerTargetCreator.windowSize = int2intOption(10) // Default 10
      add(realignerTargetCreator)

      /**
        * Local realignment around indels
        * Step 2 of 2: IndelRealigner: Perform local realignment of reads around indels
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php
        */
      val indelRealigner = new IndelRealigner with CommonArguments
      indelRealigner.targetIntervals = realignerTargetCreator.out
      indelRealigner.input_file +:= bam
      indelRealigner.knownAlleles = knownIndels
      indelRealigner.out = swapExt(bam, "bam", "realign.bam")
      indelRealigner.consensusDeterminationModel = null
      indelRealigner.LODThresholdForCleaning = double2doubleOption(5.0) // Default 5.0
      indelRealigner.nWayOut = null
      indelRealigner.entropyThreshold = double2doubleOption(0.15) // Default 0.15
      indelRealigner.maxConsensuses = int2intOption(30) // Default 30
      indelRealigner.maxIsizeForMovement = int2intOption(3000) // Default 3000
      indelRealigner.maxPositionalMoveAllowed = int2intOption(200) // Default 200
      indelRealigner.maxReadsForConsensuses = int2intOption(120) // Default 120
      indelRealigner.maxReadsForRealignment = int2intOption(20000) // Default 20000
      indelRealigner.maxReadsInMemory = int2intOption(150000) // Default 150000
      indelRealigner.noOriginalAlignmentTags = false
      add(indelRealigner)

      /**
        * *********************************************************************
        * BASE QUALITY SCORE RECALIBRATION
        * https://www.broadinstitute.org/gatk/guide/article?id=44
        * https://www.broadinstitute.org/gatk/guide/article?id=2801
        * *********************************************************************
        */

      /**
        * Base quality score recalibration
        * Step 1 of 3: BaseRecalibrator: Generate base recalibration table to compensate for systematic errors in basecalling confidences
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php
        */

      // Generate the first pass recalibration table file
      val baseRecalibratorBefore = new BaseRecalibrator with CommonArguments
      baseRecalibratorBefore.input_file +:= indelRealigner.out
      baseRecalibratorBefore.out = swapExt(indelRealigner.out, "realign.bam", "base_recalibrator_first_pass.out")
      baseRecalibratorBefore.knownSites = knownPolymorphicSites
      baseRecalibratorBefore.indels_context_size = int2intOption(3) // Default 3
      baseRecalibratorBefore.maximum_cycle_value = int2intOption(500) // Default 500
      baseRecalibratorBefore.mismatches_context_size = int2intOption(2) // Default 2
      baseRecalibratorBefore.solid_nocall_strategy = null
      baseRecalibratorBefore.solid_recal_mode = null
      baseRecalibratorBefore.list = false
      baseRecalibratorBefore.lowMemoryMode = false
      baseRecalibratorBefore.no_standard_covs = false
      baseRecalibratorBefore.sort_by_all_columns = false
      baseRecalibratorBefore.binary_tag_name = null
      baseRecalibratorBefore.bqsrBAQGapOpenPenalty = double2doubleOption(40.0) // Default 40.0
      baseRecalibratorBefore.deletions_default_quality = int2byteOption(45) // Default 45
      baseRecalibratorBefore.insertions_default_quality = int2byteOption(45) // Default 45
      baseRecalibratorBefore.low_quality_tail = int2byteOption(2) // Default 2
      baseRecalibratorBefore.mismatches_default_quality = int2byteOption(-1) // Default -1
      baseRecalibratorBefore.quantizing_levels = int2intOption(16) // Default 16
      baseRecalibratorBefore.run_without_dbsnp_potentially_ruining_quality = false
      add(baseRecalibratorBefore)

      // Generate the second pass recalibration table file
      val baseRecalibratorAfter = new BaseRecalibrator with CommonArguments
      baseRecalibratorAfter.BQSR = baseRecalibratorBefore.out
      baseRecalibratorAfter.input_file +:= indelRealigner.out
      baseRecalibratorAfter.out = swapExt(indelRealigner.out, "realign.bam", "base_recalibrator_second_pass.out")
      baseRecalibratorAfter.knownSites = knownPolymorphicSites
      baseRecalibratorAfter.indels_context_size = int2intOption(3) // Default 3
      baseRecalibratorAfter.maximum_cycle_value = int2intOption(500) // Default 500
      baseRecalibratorAfter.mismatches_context_size = int2intOption(2) // Default 2
      baseRecalibratorAfter.solid_nocall_strategy = null
      baseRecalibratorAfter.solid_recal_mode = null
      baseRecalibratorAfter.list = false
      baseRecalibratorAfter.lowMemoryMode = false
      baseRecalibratorAfter.no_standard_covs = false
      baseRecalibratorAfter.sort_by_all_columns = false
      baseRecalibratorAfter.binary_tag_name = null
      baseRecalibratorAfter.bqsrBAQGapOpenPenalty = double2doubleOption(40.0) // Default 40.0
      baseRecalibratorAfter.deletions_default_quality = int2byteOption(45) // Default 45
      baseRecalibratorAfter.insertions_default_quality = int2byteOption(45) // Default 45
      baseRecalibratorAfter.low_quality_tail = int2byteOption(2) // Default 2
      baseRecalibratorAfter.mismatches_default_quality = int2byteOption(-1) // Default -1
      baseRecalibratorAfter.quantizing_levels = int2intOption(16) // Default 16
      baseRecalibratorAfter.run_without_dbsnp_potentially_ruining_quality = false
      add(baseRecalibratorAfter)

      /**
        * Base quality score recalibration
        * Step 2 of 3: AnalyzeCovariates: Create plots to visualize base recalibration results
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_AnalyzeCovariates.php
        */
      val analyzeCovariates = new AnalyzeCovariates with CommonArguments
      analyzeCovariates.beforeReportFile = baseRecalibratorBefore.out
      analyzeCovariates.afterReportFile = baseRecalibratorAfter.out
      analyzeCovariates.plotsReportFile = new File(outputDirectory.getAbsolutePath + "/" + outputPrefix + "_BQSR.pdf")
      analyzeCovariates.intermediateCsvFile = new File(outputDirectory.getAbsolutePath + "/" + outputPrefix + "_BQSR.csv")
      analyzeCovariates.ignoreLMT = false
      add(analyzeCovariates)

      /**
        * Base quality score recalibration
        * Step 3 of 3: PrintReads: Write out sequence read data (for filtering, merging, subsetting etc)
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php
        */
      val printReads = new PrintReads with CommonArguments
      printReads.input_file +:= bam
      printReads.BQSR = baseRecalibratorAfter.out
      printReads.out = swapExt(bam, "bam", "recalibrated.bam")

      processedFiles +:= printReads.out

    }

    /**
      * *********************************************************************
      *                        VARIANT DISCOVERY
      * https://www.broadinstitute.org/gatk/guide/bp_step.php?p=2
      * *********************************************************************
      */

    /**
      * Variant discovery
      * Step 1 of 6: HaplotypeCaller: Call germline SNPs and indels via local re-assembly of haplotypes
      * https://www.broadinstitute.org/gatk/guide/article?id=2803
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php
      * https://www.broadinstitute.org/gatk/guide/article?id=3893
      */
    // TODO

    /**
      * Variant discovery
      * Step 2 of 6: CombineGVCFs: Combine per-sample gVCF files produced by HaplotypeCaller into a multi-sample gVCF file
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineGVCFs.php
      */
    // TODO

    /**
      * Variant discovery
      * Step 3 of 6: GenotypeGVCFs: Perform joint genotyping on gVCF files produced by HaplotypeCaller
      * https://www.broadinstitute.org/gatk/guide/article?id=3893
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php
      */
    // TODO

    /**
      * Variant discovery
      * Step 4 of 6: VariantFiltration: Filter variant calls based on INFO and FORMAT annotations
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_filters_VariantFiltration.php
      */
    // TODO

    /**
      * Variant discovery
      * Step 5 of 6: VariantRecalibrator: Build a recalibration model to score variant quality for filtering purposes
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php
      * https://www.broadinstitute.org/gatk/guide/article?id=39
      * https://www.broadinstitute.org/gatk/guide/article?id=2805
      *
      */
    // TODO

    /**
      * Variant discovery
      * Step 6 of 6: ApplyRecalibration: Apply a score cutoff to filter variants based on a recalibration table
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_ApplyRecalibration.php
      * https://www.broadinstitute.org/gatk/guide/article?id=2806
      */
    // TODO

    /**
      * *********************************************************************
      *                        CALLSET REFINEMENT
      * https://www.broadinstitute.org/gatk/guide/bp_step.php?p=3
      * *********************************************************************
      */

    /**
      * Callset refinement
      * Step 1 of 8: CalculateGenotypePosteriors: Calculate genotype posterior likelihoods given panel data
      * https://www.broadinstitute.org/gatk/guide/article?id=4727
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CalculateGenotypePosteriors.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 2 of 8: VariantFiltration: Filter variant calls based on INFO and FORMAT annotations
      * https://www.broadinstitute.org/gatk/guide/article?id=4727
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_filters_VariantFiltration.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 3 of 8: VariantAnnotator: Annotate variant calls with context information
      * https://www.broadinstitute.org/gatk/guide/article?id=4727
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 4 of 8: SelectVariants: Select a subset of variants from a larger callset
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 5 of 8: CombineVariants: Combine variant records from different sources
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 6 of 8: VariantEval: General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv ratios, and a lot more)
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_varianteval_VariantEval.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 7 of 8: VariantsToTable: Extract specific fields from a VCF file to a tab-delimited table
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 8 of 8: GenotypeConcordance (Picard version): Genotype concordance
      * https://broadinstitute.github.io/picard/command-line-overview.html#GenotypeConcordance
      */
    // TODO

  }

}

Created 2016-04-14 09:22:34 | Updated | Tags: indelrealigner

Comments (1)

First i have created the target creator for my indel realignment the command used for trarget creation is

java -jar Tools/GenomeAnalysisTK.jar -T RealignerTargetCreator -R Reference/BWA/Homo_sapiens_assembly38.fasta -I output/sample1/BWA_alignment_RG.bam -o output/sample1/BWA_Realignement.list

this command works fine

but while running the indel Realigner i stuck with some errors

Comannd used for realignment

java -jar Tools/GenomeAnalysisTK.jar -T IndelRealigner -R Reference/BWA/Homo_sapiens_assembly38.fasta -targetIntervals output/sample1/BWA_Realignement.list -I output/sample1/BWA_alignment_RG.bam -o output/sample1/BWA_Realignment.bam

Error while running Realignment

INFO 13:01:24,584 HelpFormatter - --------------------------------------------------------------------------------- INFO 13:01:24,587 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 INFO 13:01:24,587 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:01:24,588 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:01:24,593 HelpFormatter - Program Args: -T IndelRealigner -R Reference/BWA/Homo_sapiens_assembly38.fasta -targetIntervals output/sample1/BWA_Realignement.list -I output/sample1/BWA_alignment_RG.bam -o output/sample1/BWA_Realignment.bam INFO 13:01:24,596 HelpFormatter - Executing as bioinformatics@BIF-Server on Linux 3.19.0-56-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03. INFO 13:01:24,597 HelpFormatter - Date/Time: 2016/04/14 13:01:24 INFO 13:01:24,597 HelpFormatter - --------------------------------------------------------------------------------- INFO 13:01:24,598 HelpFormatter - --------------------------------------------------------------------------------- INFO 13:01:24,999 GenomeAnalysisEngine - Strictness is SILENT INFO 13:01:25,153 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 13:01:25,159 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 13:01:25,303 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.14 INFO 13:01:25,440 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 13:01:25,443 GenomeAnalysisEngine - Done preparing for traversal INFO 13:01:25,443 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 13:01:25,444 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 13:01:25,444 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime INFO 13:01:26,051 ReadShardBalancer$1 - Loading BAM index data INFO 13:01:26,052 ReadShardBalancer$1 - Done loading BAM index data INFO 13:01:55,449 ProgressMeter - chr1:150513415 1200014.0 30.0 s 25.0 s 4.7% 10.7 m 10.2 m INFO 13:02:25,452 ProgressMeter - chr2:84816103 2715517.0 60.0 s 22.0 s 10.4% 9.6 m 8.6 m INFO 13:02:55,455 ProgressMeter - chr3:31182497 3932473.0 90.0 s 22.0 s 16.2% 9.2 m 7.7 m INFO 13:03:25,457 ProgressMeter - chr4:75787095 5388526.0 120.0 s 22.0 s 23.8% 8.4 m 6.4 m INFO 13:03:55,459 ProgressMeter - chr5:179729704 6861376.0 2.5 m 21.0 s 32.9% 7.6 m 5.1 m INFO 13:04:25,460 ProgressMeter - chr7:36686822 8129203.0 3.0 m 22.0 s 39.4% 7.6 m 4.6 m INFO 13:04:55,462 ProgressMeter - chr8:109486775 9556787.0 3.5 m 21.0 s 46.6% 7.5 m 4.0 m INFO 13:05:25,464 ProgressMeter - chr10:69240749 1.0932107E7 4.0 m 21.0 s 54.2% 7.4 m 3.4 m INFO 13:05:55,465 ProgressMeter - chr11:125485282 1.2536117E7 4.5 m 21.0 s 60.1% 7.5 m 3.0 m INFO 13:06:25,467 ProgressMeter - chr13:95020893 1.4115976E7 5.0 m 21.0 s 67.5% 7.4 m 2.4 m INFO 13:06:55,469 ProgressMeter - chr15:89155504 1.5506828E7 5.5 m 21.0 s 74.2% 7.4 m 114.0 s INFO 13:07:25,471 ProgressMeter - chr17:50976077 1.7117276E7 6.0 m 21.0 s 79.0% 7.6 m 95.0 s INFO 13:07:55,472 ProgressMeter - chr20:87839 1.8696835E7 6.5 m 20.0 s 84.3% 7.7 m 72.0 s INFO 13:08:25,475 ProgressMeter - chrX:56565071 2.0179893E7 7.0 m 20.0 s 91.1% 7.7 m 40.0 s INFO 13:08:55,486 ProgressMeter - HLA-DRB1*15:01:01:01:9686 2.1461918E7 7.5 m 20.0 s 100.0% 7.5 m 0.0 s

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.4-46-gbc02625):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@f53070c is malformed: the BAM file has a read with no stored bases (i.e. it uses '*') which is not supported in the GATK; see the --filter_bases_not_stored argument. Offender: MI71I:00039:10107

Created 2016-03-23 14:48:23 | Updated | Tags: indelrealigner contamination next-generation-sequencing

Comments (5)

Hello,

It is usually suggested to run individually Tumor normal pair files with indel realignment before we continur for checking for contamination. I have two questions.

  1. in my case for each patient I have multiple samples, at least 3 tumor and 1 normal. Shall I run indel realignment on all of them at the same time? What do you suggest?

  2. the output of the above command will be a single bam file unless we state the -nWayout option, but still for the contamination we have to use the tumor and the normal vcf files separately. Therefore, we have to use the nWayout option in the indel realignment. Is that correct?

Thank you.


Created 2016-02-16 23:00:55 | Updated | Tags: indelrealigner

Comments (8)

Hello, I am trying to realign indels, but the first step on indel realignment seems to skip over creating the interval file.

My bam files are created using:

bwa mem -M -t 16 -c 3 -R "@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1" $RefSeq M1_pair1.fastq.gz M1_pair2.fastq.gz | samtools view -Sbh - > M1test7.bam;

samtools sort M1test7.bam M1test7.sorted; samtools index M1test7.sorted.bam;

Then, I try to run the realignment: java -Xmx4g -jar $GATK -T RealignerTargetCreator -R $RefSeq -o realignment_targetsM1test7.interval -I M1test7.sorted.bam; java -Xmx4g -Djava.io.tmpdir=./tmp -jar $GATK -T IndelRealigner -R $RefSeq -I M1test7.sorted.bam -targetIntervals realignment_targetsM1test7.interval -o M1test7realigned.bam;

However, I get several errors on step one (creating the interval files) INFO 12:49:08,323 HelpFormatter - Program Args: -T RealignerTargetCreator -R /bigdata/judelsonlab/mmatson/projects/refSeqs/ensemblEditedCNV/PinfestansEnsemblSorted.fa -o realignment_targetsM1test7.interval -I M1test7.sorted.bam INFO 12:49:08,328 HelpFormatter - Executing as mmatson@c22 on Linux 3.10.0-229.1.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_17-b02. INFO 12:49:08,329 HelpFormatter - Date/Time: 2016/02/16 12:49:08 INFO 12:49:08,330 HelpFormatter - --------------------------------------------------------------------------------- INFO 12:49:08,331 HelpFormatter - --------------------------------------------------------------------------------- INFO 12:49:09,103 GenomeAnalysisEngine - Strictness is SILENT INFO 12:49:10,130 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 12:49:10,145 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:49:10,782 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.64 INFO 12:49:11,125 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 12:49:11,489 GenomeAnalysisEngine - Done preparing for traversal INFO 12:49:11,492 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 12:49:11,494 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 12:49:11,496 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 12:49:13,199 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:330) at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:195) at org.broadinstitute.gatk.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:329) at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.initializeReferenceSequence(LocusReferenceView.java:150) at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.(LocusReferenceView.java:126) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:90) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)

The last error before the file exits is

ERROR MESSAGE: Couldn't read file /bigdata/judelsonlab/mmatson/projects/fastqFiles/bsaProgeny/realignment_targetsM1test7.interval because The interval file does not exist.
ERROR ------------------------------------------------------------------------------------------

Error: Unable to access jarfile FixMateInformation

So, I can't seem to determine why my interval file is not being created. I have tried deleting my reference sequence and re indexing and creating a new dict file as well. Any help would be appreciated.

Mike


Created 2016-02-10 12:06:58 | Updated 2016-02-11 00:13:27 | Tags: indelrealigner

Comments (11)

Hi

I'm using IndelRealigner, and it seems to take long time, from 5 to 8h, I have a strange sensation that I may not put the right commands but I'm not sure. Previously of what I show below I have Marked duplicates with MarkDuplicates.

/home/Programas/samtools-0.1.19/samtools index /local/ktroule/SNPiR/Panc047/accepted_hits_RG_Reordered_deduplicated.bam

java -Xmx20g -jar /home/Programas/GATK-3.1-1-g07a4bf8/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /local/Referencias/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa -I /local/SNPiR/Panc042/accepted_hits_RG_Reordered_deduplicated.bam --unsafe ALLOW_N_CIGAR_READS --known /local/Referencias/HG19/Mills_and_1000G_gold_standard.indels.hg19.vcf -o /local/SNPiR/Panc042/Panc042_indel.intervals

java -Xmx20g -jar /home/Programas/GATK-3.1-1-g07a4bf8/GenomeAnalysisTK.jar -T IndelRealigner --targetIntervals /local/SNPiR/Panc042/Panc042_indel.intervals  -R /local/Referencias/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa --unsafe ALLOW_N_CIGAR_READS -I /local/SNPiR/Panc042/accepted_hits_RG_Reordered_deduplicated.bam -o /local/SNPiR/Panc042/accepted_hits_RG_Reordered_deduplicated_Realigned.bam

This is not really about an error, but about If the program takes that long and about if the parameters I have used are the rights ones, as I have seen that few ones use the downsampling.

Finally, if I have not read wrongly InderRealigner does not support multithread, at least not directly from the program.

thanks for yor time.


Created 2016-01-28 15:08:47 | Updated | Tags: indelrealigner variantrecalibrator vqsr haplotypecaller best-practices

Comments (3)

The release notes for 3.5 state "Added new MQ jittering functionality to improve how VQSR handles MQ". My understanding is that in order to use this, we will need to use the --MQCapForLogitJitterTransform argument in VariantRecalibrator. I have a few questions on this: 1) Is the above correct, i.e. the new MQ jittering functionality is only used if --MQCapForLogitJitterTransform is set to something other than the default value of zero? 2) Is the use of MQCapForLogitJitterTransform recommended? 3) If we do use MQCapForLogitJitterTransform, the tool documentation states "We recommend to either use --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or use a MQCap=max+10 to take that into account". Is one of these to be preferred over the other? Given that it seems that sites that have been realigned can have values up to 70, but sites that have not can have values no higher than 60, it seems to me that the ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller option might be preferred, but I would like to check before running.


Created 2016-01-15 16:27:38 | Updated | Tags: indelrealigner bqsr commandlinegatk knownsites

Comments (2)

Dear GATK team,

I'd like to learn what files I should use for indel realignment and BQSR from hg38 bundle? (I read the manual on this topic -- https://broadinstitute.org/gatk/guide/article?id=1247 -- but just would like to be sure):

1) Am I right that for indel realignment I should use Mills_and_1000G_gold_standard.indels.hg38.vcf and 1000G_phase1.snps.high_confidence.hg38.vcf.gz ?

2) Am I right that for BQSR I should use Mills_and_1000G_gold_standard.indels.hg38.vcf , 1000G_phase1.snps.high_confidence.hg38.vcf.gz , and dbsnp_144.hg38.vcf ?

3) Are there any other files with known sites I should use for indel realignment and BQSR?


Created 2015-12-22 15:20:19 | Updated | Tags: indelrealigner dcov downsample-to-coverage

Comments (6)

I ran the IndelRealigner tool with this option: --downsample_to_coverage 100000. i see this output line, which leads me to believe the downsample option was recognized: INFO 09:42:26,159 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 100000

yet, i also see multiple lines like this in the output: INFO 09:43:59,101 IndelRealigner - Not attempting realignment in interval gi|224589819:117115017-117313719:44370-44521 because there are too many reads.

i thought the downsampling would alleviate excluding regions where too many reads are present? i do believe the regions are being excluded because the IndelRealigner tool completes quickly, as compare to runs i've made when i've set the max to align to 100000.

to be explicit, i want all regions included in processing, and i'm willing to down sample to make this happen.

thanks for the help.


Created 2015-12-21 19:25:36 | Updated | Tags: indelrealigner depthofcoverage

Comments (1)

indelrealigner rejects a region because there are too many reads. --maxReadsForRealignment was set to 100000 for this run. when i run the depthofcoverage tool, it show that the max coverage in the region that was rejected is ~95000. many positions are far less, like around 1000. i don't understand why this region would be rejected since no position has a depth of coverage over the max i set with --maxReadsForRealignment.

thanks.


Created 2015-12-17 19:38:46 | Updated | Tags: indelrealigner unifiedgenotyper haplotypecaller gatk

Comments (3)

Does anyone know of an effective way to determine haplotypes or phasing data for SNPs and STRs? I understand that STRs are inherently difficult for aligners; however, I'm trying to determine haplotypes for a large number of STRs (including the flanking region information...SNPs) on a large number of samples. So, manual verification is not really an option. We've developed an in-house perl script that calls STRs accurately; however, it currently does not include flanking region information.

Any help is greatly appreciated.


Created 2015-12-05 15:18:04 | Updated | Tags: indelrealigner

Comments (8)

Dear GATK team, I'm re-running an analysis pipeline with the new GATK version 3.5 on human samples. I have already analyzed my data with the older 3.1 version without any problems. I am having troubles now with the IndelRealigner step.

ERROR
ERROR MESSAGE: Input files reads and reference have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more informa

tion. Error details: Found contigs with the same name but different lengths or MD5s:

ERROR contig reads is named chr20 with length 63025520 and MD5 0dec9660ec1efaaf33281c0d5ea2560f
ERROR contig reference is named chr20 with length 63025520 and MD5 1ef908e47ac040f0e94ede396c59f074.
ERROR reads contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]

This happens for all my samples.

The reason I am re-running all these steps is because I want to use the HC in GVCF mode and this was not possible with my older GATK version. I didn't want to mix versions so I started the whole pipeline again. Is this really necesary? Or can I just do the HC in GVCF mode on the 3.1 generated bam files? Thanks a lot!!


Created 2015-12-03 06:53:58 | Updated | Tags: indelrealigner nwayout

Comments (1)

I'm using the GenomeAnalysisTK-3.4-0.jar version to co-realign paired tumor-normal BAMs. The -Input is set to a list file contains the paired tumor-normal BAMs directory. For example: /pipiline-test/Normal/rmdup-realn/Normal.rmdup.bam /pipline-test/Tumor/rmdup-realn/Tumor.rmdup.bam And the -nWayOut is set to .realn.bam. The problem is that the output files are generated in the same folder where the shell script is in. Is there a way I can get the output realn.bam to be in the same folder with the preview rmdup.bam?


Created 2015-11-01 02:05:09 | Updated | Tags: indelrealigner

Comments (2)

Hi,

I am new to the GATK forum and I am not sure It's the right place I am posting to so all my apologies if I am in the wrong place. However, I wanted just to report a little typo that occurs here:

https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php

Actually, within the usage example box, the "known" argument is written as follow:

--known indels.vcf

Well, if you execute it as it is, it produces the ERROR MESSAGE: Argument with name 'known' isn't defined.

In fact, and according to the same link, for IndelRealigner you should use either --knownAlleles or -known instead of the --known argument which work well for RealignerTargetCreator.

All the best.

A.


Created 2015-10-24 16:10:33 | Updated | Tags: indelrealigner mrnaseq memory

Comments (3)

I have 2 samples of RNAseq with 10 million PE reads for each and read length of 81 bp. indelRealigner gave me this error

ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java

I increased the heap size to 78g. One sample was done successful but the 2nd sample are still giving the same error. I tried to use down sampling but still the same error

Here is my code:

java -Xmx78g -jar $GATK/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R $gatk_ref \ -I $sample \ -targetIntervals gatk.intervals \ -nWayOut '.realigned.bam' \ -known $indels \ -model USE_SW \ -LOD 0.4 \ -dcov 1000

Any advice will be truly appreciated. Thank you


Created 2015-09-03 08:51:02 | Updated | Tags: indelrealigner malformedreadfilter best-practices

Comments (10)

I recently came across(blog post) a scenario in which a subset of my reads had triggered the MalformedReadFilter during indel realignment. I'm curious to know about the definition of "malformed", as whilst invalid, I wouldn't have described incorrect mate names and alignment positions for unmapped reads as "malformed" and capable of causing job crashes...

Not a bug - I'm just curious to know of any implementation details or reasoning about the MalformedReadFilter.


Created 2015-08-13 15:47:04 | Updated | Tags: indelrealigner haplotypecaller

Comments (10)

If HaplotypeCaller re-assembles all reads in a region Why is it recommend to run IndelRealigner first?


Created 2015-08-11 15:38:09 | Updated | Tags: indelrealigner

Comments (10)

Greetings,

Recently, I was following the best practices for DNA variant calling pipeline on a few WXS bam files. (The size is about 8GB for average.) However, I have a memory usage problem for indelrealigner. I set 4g, 16g, 32g, 64g as maximum heap size for Java, but I had the same error message below:

INFO 19:31:43,543 ProgressMeter - chr10:132448064 6.4744467E7 4.9 h 4.5 m 58.2% 8.4 h 3.5 h

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.4-0-g7e26428):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
ERROR ------------------------------------------------------------------------------------------

The syntaxes I used is : java -Xmx4, 16, 32, 64G -jar $GATK \ -T IndelRealigner \ -R GRCh38.fa \ -I C50.bam \ -known 1000G_phase3.indels.vcf \ -targetIntervals C500.intervals \ -model KNOWNS_ONLY \ --maxReadsInMemory 5000, 10000, 20000, 50000, 100000, 150000 \ -o C500.indelrealign.bam

No matter how I changed the Xmx and maxReadsInMemory, I always got the same error message. Only if I changed the Xmx to 160G, it would finally work.

My question is whether it is possible for me to reduce the memory usage, and which parameter I should focus on in order to do that.


Created 2015-07-22 11:04:35 | Updated | Tags: indelrealigner

Comments (23)

Hi, I'm having trouble with the second part of the indel realignment protocol. I'm trying to run it for 100bp Illumina paired-end genome resequencing data. This is for a non-model plant so I don't have a set of known indels. RealignerTargetCreator seems to run okay for all my 12 datasets with the following command line:

java -Xmx20G -jar /usr/local/bin/JAVA_JARS/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T RealignerTargetCreator -I mapping.bam -R reference.fasta -o mapping.intervals

But the second part fails with the following command line:

java -Xmx50G -jar /usr/local/bin/JAVA_JARS/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T IndelRealigner -I mapping.bam -R reference.fasta -targetIntervals mapping.intervals -o mapping_realigned.bam

with the following error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:330) at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:195) at org.broadinstitute.gatk.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:329) at org.broadinstitute.gatk.tools.walkers.indels.ReadBin.getReference(ReadBin.java:108) at org.broadinstitute.gatk.tools.walkers.indels.IndelRealigner.clean(IndelRealigner.java:696) at org.broadinstitute.gatk.tools.walkers.indels.IndelRealigner.cleanAndCallMap(IndelRealigner.java:580) at org.broadinstitute.gatk.tools.walkers.indels.IndelRealigner.map(IndelRealigner.java:552) at org.broadinstitute.gatk.tools.walkers.indels.IndelRealigner.map(IndelRealigner.java:148) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

I have validated the bam files with picard tools ValidateSamFile and found no errors. I have created .dict and .fai files for the reference with the methods recommended in the GATK documentation and a .bai index for the bams. My processing protocol up to this step has been:

  1. Quality trimming with trimmomatic, discard broken pairs.
  2. mapping with bwa mem using the following command line: bwa mem -M -t 25 reads_1P.fq.gz reads_2P.fq.gz | samtools view -Sb - | samtools sort - mapping.bam
  3. merging bams for each individual with picard tools MergeSamFiles (there were two readsets with different insert sizes for each of the twelve individuals) with the following command line: java -Xmx40G -jar MergeSamFiles.jar INPUT=mapping1.bam INPUT=mapping2.bam OUTPUT=mapping_merged.bam SORT_ORDER=coordinate ASSUME_SORTED=true MERGE_SEQUENCE_DICTIONARIES=TRUE USE_THREADING=true (I also tried the next steps without merging and got the same error from IndelRealigner so I don't think this is causing a problem)
  4. Add read groups with picard tools AddOrReplaceReadGroups with the following command line java -Xmx40G -jar AddOrReplaceReadGroups.jar I=mapping_merged.bam O=mapping_merged_RG.bam RGSM=Ind1 RGPL=illumina RGPU=1 RGLB=1 MAX_RECORDS_IN_RAM=3000000 VALIDATION_STRINGENCY=SILENT

I've tried remaking the indexes and even tried a few versions of java. I'm now out of ideas. Any help would be much appreciated.

All the best, Owen


Created 2015-07-21 20:57:38 | Updated | Tags: indelrealigner

Comments (11)

Mate information seems to be assigned incorrectly when the supplementary alignment is on the same chromosome as one of the non-supplementary alignments, and the primary alignments are on different chromosomes. For example:

(The first 9 columns of a SAM line for read HS36_15753:4:1106:13106:57119#25)

HS36_15753:4:1106:13106:57119#25 2209 12 2870282 60 37H38M = 2870346 0 # supplementary, second read; mate info, columns 7-9) matches the other part of the second read (with flag 161) HS36_15753:4:1106:13106:57119#25 81 12 2870346 60 75M hs37d5 32110385 0 # first read in a pair HS36_15753:4:1106:13106:57119#25 161 hs37d5 32110385 60 45M30S 12 2870346 0 # first part of the second read

after realignment becomes:

HS36_15753:4:1106:13106:57119#25 2209 12 2870282 60 37H38M = 2870346 139 # mate coord is correct but insert size is changed to 139 HS36_15753:4:1106:13106:57119#25 81 12 2870346 60 75M = 2870282 -139 # first read; mate info says the supplementary alignment is the mate HS36_15753:4:1106:13106:57119#25 161 hs37d5 32110385 60 45M30S 12 2870346 0 # no change

When I run Picard FixMateInformation I get back the original alignments:

HS36_15753:4:1106:13106:57119#25 2209 12 2870282 60 37H38M = 2870346 0 HS36_15753:4:1106:13106:57119#25 81 12 2870346 60 75M hs37d5 32110385 0 HS36_15753:4:1106:13106:57119#25 161 hs37d5 32110385 60 45M30S 12 2870346 0

I noticed this problem because it causes downstream problems with duplicate marking (duplicates with the exact 3 alignments are not getting marked as duplicate)

Thanks


Created 2015-05-30 13:45:41 | Updated | Tags: indelrealigner picard reads

Comments (5)

Hello,

I'm quite new to SNP calling. I am trying to setup a pipeline which includes GATK IndelRealigner as a final step. My bam file (before realignment) is a little over 1GB. After running the indel realigner however, it's reduced to 18MB! I'm assuming its throwing out way too many reads or something has gone wrong.

I'm calling the indel realigner with the default options as follows:

java -Xmx16g -jar $GATK_DIR/GenomeAnalysisTK.jar \
   -T IndelRealigner \
   -R /path/to/my/ref \
   -I input.bam.intervals \
   -targetIntervals input.bam.intervals \
   -o realn.bam \

I am generating the read groups using AddOrReplaceReadGroups.jar (from picard tools) and interval file using GATK RealignerTargetCreator with default options.

My bam file was generated off the raw reads of experiment SRA181417 fetched from SRA (after cleaning adapters using cutadapt, mapping to reference using bwa-mem, and removing duplicate reads using picard tools)

I have tried this on other reads and do not have the same issue. Can anyone comment on why indel realigner could be throwing out so many reads.

Thank you


Created 2015-05-13 12:41:10 | Updated | Tags: indelrealigner bam error

Comments (2)

Hello, I run into a problem after the pre-processing, it seems that extra contigs where added to my bam file compared to the reference I used, which make the indel realigner step impossible to do. I have checked the headers of my file and the reference is the same but my bam file as a hundreds of additional contigs. Not sure what happen. The steps to get the bam where:

  • Aligned with bwa mem
  • Transform to bam and sort (Samtools)
  • Dedup (picard)
  • Add read group (picard)
  • Index bam (samtools)
  • Run Realigner target creator When I check the header of my bam file it still show the right contigs but when running it complains of difference (additional) compare to my reference. I am currently re-testing the whole pipeline on a single sample but if you have any pointer to what could cause this, maybe a problem with the bam formating? I am running GATK 3.3.0-g37228af Java 1.7 I have attached the ouput log from the command. Thanks,

Julien

PS: I attended your workshop in Cambridge!


Created 2015-05-05 09:47:41 | Updated | Tags: indelrealigner realignertargetcreator

Comments (3)

I was just wondering what you guys thought of my realignment intervals length distribution. This is 30Mb from a single diploid sample without prior indel position information. Approximately 60,000 events , i.e. one every fifty bases seems like a lot. How indicative of true indels is the data from TargetCreator and IndelRealigner? Guess I'll have to check with the ug-vcf calls... Across the genome, distribution of 'all' events is uniform. Does multi-sample realignment improve the accuracy or efficiency of the realignment process ?


Created 2015-04-27 06:58:38 | Updated | Tags: indelrealigner bqsr

Comments (1)

Hi,

I have gone through the Realignment step and found Re aligner will change the CIGAR of alignment in bam file. Most of the structaral variant detection tool dependent upon CIGAR field. So my question is it right to consider re calibrated bam file, does it has any advantage for SV Detection over Raw Sorted Bam file..?


Created 2015-03-29 01:40:41 | Updated | Tags: indelrealigner fixmisencodedquals fix-misencoded-quality-scores

Comments (15)

Dear GATK team,

I have two input fasta files from exome-seq. One is coded with Q64 and the other is coded with Q33 quality scores. I want to combine the two input fasta files and run bwa+GATK.

How do I combine them for IndelRealigner? I suppose that IndelRealigner needs all reads from both Q64 and Q33. Can I do IndelRealigner separately and then join them? Will this cause problems?

I have searched for many posts but can't find my answers. Please help me.

Thanks, Woody


Created 2015-03-26 14:47:13 | Updated | Tags: indelrealigner

Comments (1)

I have been trying to run IndelRealigner with the following commands ($tumorPfx etc are file names)

java -d64 -jar $gatkJar -R $hgReference -T IndelRealigner -rf BadCigar -I $tumorPfx.bam -known $G1000_Mills -known $G100\ 0_Phase1_Indels -targetIntervals $tumorSample.intervals -o $tumorPfx.realn.bam

and have been getting the following output and error:

ERROR stack trace

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:635)

On this website, I found a similar error with somebody trying to run HaplotypeCaller, albeit with Index and Size = 3, where it was remarked that it may be due to a bug or a java version issue. Is the same thing going on here?

Thank you, Max


Created 2015-03-15 21:46:07 | Updated | Tags: indelrealigner malformedbam

Comments (1)

Hi Team,

I get an error with gatk in variant calling steps, using BAM file from realignment step. The error indicated something wrong with the bai file. So I tried to create it new. But then this comes up, saying there is something wrong with the bam (see below) This bam was created with IndelRealigner (no errors)

Thanks! Alexander

$ picard 1 BuildBamIndex INPUT=B57.3.bam
[Sun Mar 15 22:37:46 CET 2015] picard.sam.BuildBamIndex INPUT=B57.3.bam    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sun Mar 15 22:37:46 CET 2015] Executing as kaktus42@soroban on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
[Sun Mar 15 22:41:19 CET 2015] picard.sam.BuildBamIndex done. Elapsed time: 3,55 minutes.
Runtime.totalMemory()=855638016
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.FileTruncatedException: Premature end of file
    at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:382)
    at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:127)
    at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:252)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:404)
    at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:380)
    at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:366)
    at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:199)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:660)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:634)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:527)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:501)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:287)
    at htsjdk.samtools.BAMIndexer.createIndex(BAMIndexer.java:271)
    at picard.sam.BuildBamIndex.doWork(BuildBamIndex.java:138)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Created 2015-02-16 08:45:57 | Updated | Tags: indelrealigner bqsr

Comments (1)

If I am not having information about known variants. Is it fine if I skip the BQSR step after indelRealignment step??


Created 2015-01-26 20:35:50 | Updated | Tags: indelrealigner bed

Comments (1)

'just asking for confirmation: if I run IndelRealigner with option -L my.capture.bed does indelrealigner:

  • keep all the reads but only realign them in region of the bed ?
  • or only keep the reads in the given region (smaller BAM)?

Thanks


Created 2015-01-14 00:52:00 | Updated 2015-01-14 00:56:38 | Tags: indelrealigner

Comments (1)

Hi GATK team,

Though I have read the seminar slides for GATK indelrealignment, I still have no idear about how GATK does that for us, is there anyone can suggest? Say, give me a reference, or a brief ìntroduction.

Actually, what I care most is whether indelrealignment takes base quality into consideration.Thank you very much.

bless~


Created 2014-12-26 21:09:34 | Updated | Tags: indelrealigner compression

Comments (1)

Hi GATK team, my jobs are currently running and I'm a little bit lazy to try this later: I saw that the .interval files produced by RealignerTargetCreator can be quite large. Can I use a ".interval.gz" extension on the command line of RealignerTargetCreator ? Can I use this *.gz file with IndelRealigner ?


Created 2014-12-08 15:37:48 | Updated | Tags: indelrealigner

Comments (4)

Hi. I am using IndelRealigner for local indel realignment. The bam used as input is 6.6GB, while the realigned bam is 22GB.

Did I miss anything there?

The pipeline I used is as below:

echo "Patient ${sample}: @create intervals for local realignment"
sudo java -Djava.io.tmpdir=${out_dir}/tmpdir \
    -Xmx${maxMem} -Xms${minMem} \
    -jar ${gatk} \
    -T RealignerTargetCreator \
    -I ${out_dir}/${input_next} \
    -o ${out_dir}/${input_next}.forRealigner.intervals \
    -R ${reference} \
    -L ${intervals} \
    --interval_padding 200 \
    -rf ${reads_filter} \
    -known ${kg_mills} \
    -known ${kg_indels} \
    -nt ${maxDataThread} \
    --allow_potentially_misencoded_quality_scores \
    2>${out_dir}/logs/${sample_prefix}_createIntervals.err

echo "Patient ${sample}: @local realignment"
sudo java -Djava.io.tmpdir=${out_dir}/tmpdir \
    -Xmx${maxMem} -Xms${minMem} \
    -jar $gatk \
    -T IndelRealigner \
    -I ${out_dir}/${input_next} \
    -o ${out_dir}/${sample_prefix}.dedup.realigned.bam \
    -R ${reference} \
    -targetIntervals ${out_dir}/${input_next}.forRealigner.intervals \
    -rf ${reads_filter} \
    -known ${kg_mills} \
    -known ${kg_indels} \
    -compress 0 \
    -LOD 0.4 \
    --consensusDeterminationModel USE_READS \
    --allow_potentially_misencoded_quality_scores \
    2> ${out_dir}/logs/${sample_prefix}_realignment.err

Thanks.


Created 2014-11-18 15:48:00 | Updated | Tags: indelrealigner

Comments (1)

Hello, I want to use gatk indelrealigner in parallel mode. I'm looking for a qscript, because -nt and -nct flags will not work. I use the actual queue version 3.3-0.

Joern


Created 2014-11-11 10:24:07 | Updated 2014-11-11 10:44:58 | Tags: indelrealigner

Comments (5)

Hi,

Recently I experienced a slightly annoying problem with IndelRealigner loosing some reads. It is usually just few reads missing from the output, but when I compare the output and input and extract the reads taht are missing after the IndelRealigner job, I cannot see what is wrong with them. An example of one such read is below:

M01823:187:000000000-AB050:1:1109:16397:19623 69 8 64405501 0 * = 64405501 0 TTTGCTTTCAAAAATACCTGTGCAGGTGGAGGTGTGCGTCTGCGTCTAACGGTGTGCGGTGCGAATTTCGACGATCGTTGCATTAACTTGCGAAACCCCTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAAATAAAACAAACAAAACGAACTACTACAGACAACGACAAAAACCAAAAAACAACATATAAACAAATAAACGAGCAACACAACACAAATAAAAGAGCAAGCACTACAC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG885+3355<,8,,=,,,,,,3,,=4::?,,7,,7,*2+14<**/*/21/0+++2++2/+++++21222;/++2+++1:68**20++++ RG:Z:140919_M01823_0188_000000000-AB050_AACCCCTC-TGTTCTCT_L001 AS:i:0 XS:i:0

It's pair has been kept, but this read was removed.

It is a bit of nuisance, as in our workflow we check the number of reads in the files after various steps for sanity, so varying number of reads introduces problems. I would be grateful if you could adviice why some reads get ommitted by IndelRealigner so I could modufy our workflow accordingly. Or could it be a bug?

Thank you, Dalia


Created 2014-09-18 11:58:28 | Updated | Tags: indelrealigner nwayout

Comments (5)

Hello, I'm trying to realign approximately 115 bam files. I am able to do this with the -o command, but this results in an impressively large bam file that I cannot fix in Picard (FixMateInformation and SortSam). Unfortunately these are corrections that need to happen before the downstream GATK snp discovery. So I tried the -nWayOut command, to get an individual realigned bam file for each input, but this returns a stack trace ERROR that includes something about an unavailable reader id. I've pasted it below.

INFO 14:06:32,838 ProgressMeter - scaffold_0:4430818 1.17606954E8 59.5 m 30.0 s 0.4% 9.8 d 9.7 d INFO 14:07:32,840 ProgressMeter - scaffold_0:4474066 1.18707144E8 60.5 m 30.0 s 0.4% 9.8 d 9.8 d INFO 14:08:32,841 ProgressMeter - scaffold_0:4505563 1.1980727E8 61.5 m 30.0 s 0.4% 9.9 d 9.9 d INFO 14:09:32,843 ProgressMeter - scaffold_0:4506325 1.20407434E8 62.5 m 31.0 s 0.4% 10.1 d 10.0 d INFO 14:09:55,236 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: No such reader id is available at org.broadinstitute.gatk.engine.datasources.reads.SAMDataSource$SAMResourcePool.getReaderID(SAMDataSource.java:809) at org.broadinstitute.gatk.engine.datasources.reads.SAMDataSource.getReaderID(SAMDataSource.java:430) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReaderIDForRead(GenomeAnalysisEngine.java:803) at org.broadinstitute.gatk.utils.sam.NWaySAMFileWriter.addAlignment(NWaySAMFileWriter.java:158) at org.broadinstitute.gatk.tools.walkers.indels.ConstrainedMateFixingManager.writeRead(ConstrainedMateFixingManager.java:356) at org.broadinstitute.gatk.tools.walkers.indels.ConstrainedMateFixingManager.addRead(ConstrainedMateFixingManager.java:261) at org.broadinstitute.gatk.tools.walkers.indels.ConstrainedMateFixingManager.addRead(ConstrainedMateFixingManager.java:237) at org.broadinstitute.gatk.tools.walkers.indels.IndelRealigner.emit(IndelRealigner.java:492) at org.broadinstitute.gatk.tools.walkers.indels.IndelRealigner.map(IndelRealigner.java:529) at org.broadinstitute.gatk.tools.walkers.indels.IndelRealigner.map(IndelRealigner.java:146) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No such reader id is available
ERROR ------------------------------------------------------------------------------------------

I'm not sure if it's the number of bam files submitted or if it is something with the specific position when the error occurs. Any help in this matter would be greatly appreciated!

Christen


Created 2014-09-15 21:28:59 | Updated | Tags: indelrealigner unifiedgenotyper indels gatk variant-calling

Comments (3)

I used GATK 3.2-2 to do indel realignment and multisample calling with UnifiedGenotyper (I cannot use HaplotypeCaller because it is incompatible with the type of data I am analyzing).

Among the ~96 samples sequenced, I have one sample with two nearby variant calls, an indel and a missense, that we checked by Sanger. The missense is real but we found no trace of the indel in the Sanger traces. When I look at indel call in the multisample vcf, it has good allele depth and GQ, but suspiciously has the same AD as the missense call. Additionally, when I look at the bam in IGV, I see no evidence for the indel whatsoever and the variant is not called in any other samples in this project.

indel: chr13:101755523 CA>C GT:AD:DP:GQ:PL 0/1:55,56:111:99:1388,0,1538

missense: chr13:101755530 A>G GT:AD:DP:GQ:PL 0/1:55,56:111:99:2170,0,2125

I went back and recalled just this one sample (single sample calling)… which resulted in the correct variants, i.e. the indel was not called at all, but the SNP, which does validate, is called.

I understand that this is not an easy region to call because of the 7xA repeat, but it’s not obvious to me why this happens only in multisample mode and I'd like to have a better understanding of what is going on.


Created 2014-08-09 01:34:03 | Updated | Tags: indelrealigner commandlinegatk

Comments (3)

Dear GATK help team,

I have a cut chromosome file (cur 17) in which I have processed through sorting, alignment, adding headers, and even through the realignertargetcreator. Yet, when I would like to call my indels from the Indel realigned. I have received an error. ERROR MESSAGE: Badly formed genome loc: Parameters to GenomeLocParser are incorrect:The contig index 0 is bad, doesn't equal the contig index 17 of the contig from a string chr17

I have cut these chromosomes and processed the chr17 first since that is my region of interest, and did it since I thought it might save memory issues.

I am currently a newbie, have check the forum for help, yet only found one similar post with no solution. Please help-- stuck at this stage. My code for the index realigned is the following: java -jar /Users/yotsukurasohiya/build/softwares/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T IndelRealigner -R /Volumes/Pegasus/broadref/ucsc.hg19.fasta -I /Volumes/Pegasus/tmp/mardup.pregatk.bam -targetIntervals 2_target_intervals.list -known /Volumes/Pegasus/broadref/Mills_and_1000G_gold_standard.indels.hg19.vcf -known /Volumes/Pegasus/broadref/dbsnp_138.hg19.vcf -known /Volumes/Pegasus/broadref/1000G_phase1.snps.high_confidence.hg19.vcf -o 2_realigned_reads.bam

The heads that I have added are through picard softwares addorreplacereadgroups. SO=coordinate CREATE_INDEX=true SM=temp PL=Illumina PU=barcode LB=bar ID=id


Created 2014-07-17 18:23:59 | Updated | Tags: indelrealigner baserecalibrator

Comments (2)

Hi

For input file of known indels in IndelRealigner, I would like to ask what is the pros and cons for using Mills_and_1000G_gold_standard.indels.b37.vcf vs 1000G_phase1.indels.b37.vcf ? Both are available from the resource bundle.

A similar question is which of the above 2 files would you suggest for input for known Indel sites in the BaseRecalibrator? Is it the Mills_and_1000G_gold_standard.indels.b37.vcf as the workflow use gold.standard.indel.vcf? Which should I choose between dbsnp_138.b37.vcf and 1000G_phase1.snps.high_confidence.b37.vcf for the other known site inputs? Is it dbsnp_138.b37.vcf that contains more variant?

Thanks

Kin


Created 2014-07-04 19:05:39 | Updated | Tags: indelrealigner nwayout

Comments (7)

I'm putting together an in-house pipeline for processing of paired tumor-normal BAMs. I've set it up to make a single intervals file for each pair and then want to have separate BAMs for BQSR, so I thought I'd use IndelRealigner with -nWayOut. However, the realigned BAMs are output to whatever directory I've run the script from, rather than the same directory as the BAM files I provided to -I.

I tried to create a .map file with full pathnames to both the input and output files, but received the error:

##### ERROR MESSAGE: Bad input: Input-output bam filename map does not contain an entry for the input file 3528.dedup.bam

The corresponding entry in the .map file had the format:

/users/home/project/cocleaned/3528.dedup.bam /users/home/project/cocleaned/3528.realigned.bam

Is there a way to specify an output directory in conjunction with -nWayOut? If not, I can probably hackishly search for the realigned files, but I thought I'd check.


Created 2014-06-11 15:10:28 | Updated | Tags: indelrealigner realignertargetcreator indels resources

Comments (3)

Hello there,

Would you please let us know how "IndelRealigner" makes use of "known" resources? I assume it already has the intervals of interest for realignment from the "RealignerTargetCreator". So it's not clear why it needs the resources again. Would it use them for making any sort of decision to reject or accept realigned indels?

Also, please let us know what happens (algorithmically) if we don't provide the same resources used in "RealignerTargetCreator" to the program.

Thank you Amin Zia


Created 2014-06-10 16:31:35 | Updated | Tags: indelrealigner

Comments (2)

Hi,

I am using GATK 3.1-1 on Ion torrent data. I am using human reference as provided by Ion Torrent people. I used picard to remove the duplicates. RealignerTargetCreator / IndelRealigner tools (with indel.vcf files from 1000G_phase1 and Mills) runs without any error on TMAP produced .bam files. But I am not sure whether the tools are doing what they are suppose to do? The stdout with informative lines are below.

Can anyone help me interpreting - 219901 reads (97.07% of total) failing DuplicateReadFilter (Does it mean - none of the reads were duplicate?) 1539 reads (0.68% of total) failing MappingQualityZeroFilter (Does it mean 1539 reads are at multiple location or low quality score?)

INFO 17:57:10,596 ProgressMeter - chr2:108035321 3.29e+08 30.0 s 0.0 s 11.5% 4.3 m 3.8 m INFO 17:57:40,597 ProgressMeter - chr4:55195129 7.26e+08 60.0 s 0.0 s 24.1% 4.2 m 3.2 m INFO 17:58:10,599 ProgressMeter - chr6:75886121 1.12e+09 90.0 s 0.0 s 36.8% 4.1 m 2.6 m INFO 17:58:40,600 ProgressMeter - chr8:136959289 1.52e+09 120.0 s 0.0 s 49.4% 4.0 m 2.0 m INFO 17:59:10,601 ProgressMeter - chr11:127358689 1.92e+09 2.5 m 0.0 s 62.8% 4.0 m 88.0 s INFO 17:59:40,602 ProgressMeter - chr15:60257869 2.31e+09 3.0 m 0.0 s 76.5% 3.9 m 55.0 s INFO 18:00:10,604 ProgressMeter - chr20:41377801 2.74e+09 3.5 m 0.0 s 89.2% 3.9 m 25.0 s INFO 18:00:40,605 ProgressMeter - chrM:16485 3.06e+09 4.0 m 0.0 s 100.0% 4.0 m 0.0 s INFO 18:01:00,391 ProgressMeter - done 3.10e+09 4.3 m 0.0 s 100.0% 4.3 m 0.0 s INFO 18:01:00,392 ProgressMeter - Total runtime 259.80 secs, 4.33 min, 0.07 hours INFO 18:01:00,451 MicroScheduler - 221440 reads were filtered out during the traversal out of approximately 226527 total reads (97.75%) INFO 18:01:00,451 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter INFO 18:01:00,452 MicroScheduler - -> 0 reads (0.00% of total) failing BadMateFilter INFO 18:01:00,452 MicroScheduler - -> 219901 reads (97.07% of total) failing DuplicateReadFilter INFO 18:01:00,452 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 18:01:00,452 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 18:01:00,452 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 18:01:00,453 MicroScheduler - -> 1539 reads (0.68% of total) failing MappingQualityZeroFilter INFO 18:01:00,453 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 18:01:00,453 MicroScheduler - -> 0 reads (0.00% of total) failing Platform454Filter INFO 18:01:00,466 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

Thank you for the help, SK


Created 2014-04-30 08:27:24 | Updated | Tags: indelrealigner realignertargetcreator

Comments (5)

Hi there,

I'm running GATK (version 3.1-1-g07a4bf8). I already ran RealignerTargerCreator and IndelRealigner without any problem on my computer. However, today, RealignerTargetcreator no longer works and I got this error message:

ERROR stack trace

java.lang.ExceptionInInitializerError at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(GenomeAnalysisEngine.java:167) at org.broadinstitute.sting.gatk.CommandLineExecutable.(CommandLineExecutable.java:57) at org.broadinstitute.sting.gatk.CommandLineGATK.(CommandLineGATK.java:66) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:106) Caused by: java.lang.NullPointerException at org.reflections.Reflections.scan(Reflections.java:220) at org.reflections.Reflections.scan(Reflections.java:166) at org.reflections.Reflections.(Reflections.java:94) at org.broadinstitute.sting.utils.classloader.PluginManager.(PluginManager.java:79) ... 4 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Yesterday, I updated my setting from Ubuntu 13.10 to 14.04 and java from 1.7 to "1.8.0_05". Reading the error stack trace (notably "Caused by: java.lang.NullPointerException"), could it be that this last java update is causing this error?

But after I upgraded java, I have been successfully using 2 programs running with java: picard (MarkDuplicates.jar or BuildBamIndex.jar tools) or Qualimap_v0.8.

Would you have any idea to help me solve this issue? Many thanks !

Fabrice


Created 2014-04-23 15:38:54 | Updated | Tags: indelrealigner realignertargetcreator mouse

Comments (2)

Gidday,

I've run the IndelRealigner on my mouse WGS *bam files with known site data from the Sanger MGP, and now I'm trying to figure out how "well" it worked.

The list created by RealignerTargetCreator contains 6547185 intervals

Parsing the output realigned.bam file for reads that had an "OC" tag added (as suggested in http://www.broadinstitute.org/gatk/events/3391/GATKw1310-BP-2-Realignment.pdf) shows that 1648299 reads were actually realigned.

I used the default settings, which means that

1) -model was USE_READS - and from what I've read, this is the correct option to use, given that Smith-Waterman modelling doesn't give greatly improved results;

2) -LOD was 5.0 - but for my data, which is mouse whole-genome sequence at average 10x coverage, this may be too high and I might be losing true positives.

I've tried randomly picking out candidate intervals from the intervals and OC-tagged reads from the realigned.bam file to check, but I was wondering if there's a more empirical way of checking how good the realignment was (I realise there's "no formal measure" as per the presentation but I'm finding it hard to make a judgement call!).

My feeling from looking at the intervals or realigned reads is that the low coverage is a major issue in terms of identifying "true" indels, so preferably we'd go for specificity over sensitivity.

Thanks for any advice/suggestions in advance!


Created 2014-04-17 19:41:42 | Updated 2014-04-17 19:42:38 | Tags: indelrealigner indel-vcf-gatk reordersam

Comments (6)

Hi. I'm unable to use the IndelRealigner java jar. My previous steps were;

1) Convert Bowtie2 paired-end Illumina Reads .sam to .bam

2) Use bedtools to extract pairs that fall within the Hg19 exome.

3) Convert the new .bam to .sam

4) Sort the new .sam via SortSam.jar

5) Mark duplicates via MarkDuplicates.jar

6) Use AddOrReplaceReadGroups.jar

7 ) Use ReorderSam.jar

8) Use RealignerTargetCreator

Untill this far everything went well. Now I'm trying the following command; java -jar GenomeAnalysisTK.jar -T IndelRealigner -R [.fasta] -l [ReorderedSam.bam] -targetIntervals [aligner.intervals] -o output.bam (Also when applying -known and an .vcf file Im producing the same error):

ERROR MESSAGE: Unable to match: GATK_7-PicardReorderSam.bam to a logging level, make sure it's a valid level (DEBUG, INFO, WARN, ERROR, FATAL, OFF)

I hope you can help me, because I can't find anything related on google.


Created 2014-04-17 11:34:39 | Updated 2014-04-17 16:09:45 | Tags: indelrealigner bqsr knownsites mouse

Comments (5)

Hello again,

More fun with mouse known site data! I'm using the Sanger MGP v3 known indel/known SNP sites for the IndelRealigner and BQSR steps.

I'm working with whole-genome sequence; however, the known sites have been filtered for the following contigs (example from the SNP vcf):

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18-r572
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ)
##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus)
##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2)
##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5)
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4) [0]">
##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]">
##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps [3]">
##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality) []">
##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4) [0]">
##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4) [200]">
##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4) [5]">
##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4) [5]">
##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ) [20]">
##FILTER=<ID=Qual,Description="Minimum value of the QUAL field [10]">
##FILTER=<ID=RefN,Description="Reference base is N []">
##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered [2]">
##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]">
##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB) [0]">

When I was trying to use these known sites at the VariantRecalibration step, I got a lot of walker errors saying that (I paraphrase) "it's dangerous to use this known site data on your VCF because the contigs of your references do not match".

However, if you look at the GRCm38_68.fai it DOES include the smaller scaffolds which are present in my data.

So, my question is: how should I filter my bam files for the IndelRealigner and downstream steps? I feel like the best option is to filter on the contigs present in the known site vcfs, but obviously that would throw out a proportion of my data.

Thanks very much!


Created 2014-04-10 16:24:48 | Updated | Tags: indelrealigner realignertargetcreator bqsr knownsites mouse indel-realignment

Comments (5)

Hello,

I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.

I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:

INFO 15:12:09,747 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02 INFO 15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list INFO 15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09 INFO 15:12:09,758 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,759 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF INFO 15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT INFO 15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 INFO 15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file INFO 15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation INFO 15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files INFO 15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy INFO 15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] I've checked the indel interval lists for my samples and they do all appear to contain different intervals.

However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:

`##### ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
ERROR ------------------------------------------------------------------------------------------`

Which means that the SNP vcf (which has the same format as the indel vcf) is not used by BQSR.

My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.

Thanks very much!


Created 2014-03-28 15:59:45 | Updated 2014-03-28 16:15:25 | Tags: indelrealigner

Comments (10)

Hello,

i have a ION PGM sequencing, i follow the best practices to do the variant calling

my command line:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:IONTORRENT\tLB:lib1\tPU:unit1' /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa 1.fq.gz > 1_aligned_reads.sam

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/SortSam.jar INPUT=1_aligned_reads.sam OUTPUT=1_sorted_reads.bam SORT_ORDER=coordinate

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/MarkDuplicates.jar INPUT=1_sorted_reads.bam OUTPUT=1_dedup_reads.bam METRICS_FILE=1_metrics.txt

java -jar /home/horus/Instaladores/picard-tools-1.110/picard-tools-1.110/BuildBamIndex.jar INPUT=1_dedup_reads.bam

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_target_intervals.list

java -jar /home/horus/Instaladores/GenomeAnalysisTK-3.1-1/GenomeAnalysisTK.jar -T IndelRealigner -R /home/horus/Escritorio/PGM/primirna/references/hg19usar.fa -I 1_dedup_reads.bam -targetIntervals 1_target_intervals.list -known /home/horus/Escritorio/PGM/primirna/references/Mills_and_1000G_gold_standard.indels.b37.vcf_nuevo -o 1_realigned_reads.bam

after the IndelRealigner step, i check the sorted_reads.bam and the bam i get with the IndelRealigner on IGV...

in the position that show the image, after the realignment only 5 reads are keep, the question is why all the reads that have the variant in the reverse strand are gone?

i don't understend, these reads are placed somewhere else in the alignment?


Created 2014-03-17 11:56:12 | Updated | Tags: indelrealigner realignertargetcreator targetintervals

Comments (4)

Hello, I ran the GATK RealignerTargetCreator command and got an output file with some lines in non-interval format (about 6%). For example: the line chrM:125-346 versus the line chrM:7684 When I ran the next IndelRealigner command it failed with the following error message, indicating the line without interval: "##### ERROR MESSAGE: Invalid argument value 'targetIntervals' at position 10.

ERROR Invalid argument value 'GRC13283077_var_list' at position 11."

What is the reason for this output? What can I do about it? Please advice.

Thanks, Lily


Created 2014-02-05 11:33:30 | Updated | Tags: indelrealigner unifiedgenotyper java error

Comments (1)

Hi All We are running into some random weirdness when running jobs using SGE, GATK version 2.7-2-g6bda569, pretty much all GATK tools - but mostly IndelRealigner abd UnifiedGenotyper, we often get the following error:-

ERROR MESSAGE: Couldn't read file /scratch/project/pipelines/novorecal.bam because java.io.FileNotFoundException: /scratch/project/pipelines/novorecal.bam (No such file or directory)

This also happens for supplied reference genomes and vcf files. The GATK tool cant find them.

These "missing" files do exist, and have often even been created by the previous tool/step in the pipeline.

When we re-run the pipeline on a failed sample, it works. So we end up having to re-run our pipeline on the same set of samples multiple times and are beginning to find this very frustrating. These errors seem to be random, I cant find any pattern, and as I mentioned, when we re-run the pipeline on a failed run, it work without a hitch.

Has anyone experienced this? And if so, any recommendations?

Please help

Steve


Created 2014-01-21 16:05:25 | Updated | Tags: indelrealigner

Comments (5)

Dear GATK team,

I have noticed that sometimes, when I run IndelRealigner independently on a tumor BAM file and its matching normal BAM file, I get different haplotypes, leading to false somatic variant or indel calls. Is there a way to ensure that both BAM files are realigned the same way ? For instance, merging both BAM files together, running IndelRealigner and splitting resulting BAM file by read group ?

Best regards

David


Created 2014-01-13 17:19:24 | Updated | Tags: indelrealigner

Comments (1)

Hi I'm realigning around indels with GATK v2.7.4 and in one case It's causing in one instance to align my 100 bases read long as only insertions (100I)

This might be related to another thread here but the error is slightly different.

I've uploaded the test bam files before and after realignment to your ftp

Here are the command I used to realign:

java -Xmx8G -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -I test_sorted.bam -o test_sorted.intervals -R ucsc.hg19.fasta -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf

java -Xmx8G -jar GenomeAnalysisTK.jar -T IndelRealigner -I test_sorted.bam --out test_sorted_realigned.bam -targetIntervals test_sorted.intervals -R ucsc.hg19.fasta -known 1000G_phase1.indels.hg19.vcf -known Mills_and_1000G_gold_standard.indels.hg19.vcf

Thanks


Created 2013-12-19 08:32:02 | Updated | Tags: indelrealigner

Comments (1)

Hello,

Could anybody explain briefly the main steps in IndelRealigner? My understand is sort of as follow, but I am not sure if it is true.

Step1: for each interval, find all the associated reads which have overlap with the interval;

Step2: get the "leftmost" and "rightmost" points of those reads. all the known Indels in the [leftmost, rightmost] interval will be deemed as confirmed indels.

Step3: traverse those reads, find ones which have and only have one indel and left-align the indels. All these indels will be deemed as candidate indels.

Step4: filter all the candidate indels and find ones which should really be good enough to add to confirmed indel list. (How?)

Step5: realign around the indels found from Step2 (KnownIndels) and Step4 (Filtered out Indels). (How?)

Did I understand the main steps correctly? How to do the Step 4 and 5?

Thanks very much.


Created 2013-12-17 19:29:51 | Updated | Tags: indelrealigner realignertargetcreator realignment

Comments (4)

Hi

I'm indel realigning with version 2.4-9 using generic commands such as:

java -Xmx4g -jar /path/to/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R /path/to/reference.fasta \ -I /path/to/input.bam \ -o /path/to/realigner.intervals

java -Xmx4g -jar /path/to/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R /path/to/reference.fasta \ -I /path/to/sample-level.bam \ -targetIntervals /path/to/realigner.intervals.from.rtc \ -o /path/to/realigned.bam \ -model USE_SW \ -LOD 0.4

In most cases this is working fine, but in a few cases it is introducing artefacts that subsequently cause the bam file to fail Picard's ValidateSamFile, and in a couple of cases it introduces errors that can't be fixed by CleanSam and/or FixMateInformation.

Here's an example of an error that can be fixed by CleanSam:

Before indel realignment:

HS24_08564:7:2311:4630:19372#87 69 AAKM01002546 471 0 * = 471 0 HS24_08564:7:2311:4630:19372#87 137 AAKM01002546 471 25 100M = 471 0

After indel realignment:

HS24_08564:7:2311:4630:19372#87 69 AAKM01002546 471 0 * = 471 0 HS24_08564:7:2311:4630:19372#87 137 AAKM01002546 471 35 91M1D9M = 471 0

Here's an example of an error that can't be fixed:

Before indel realignment:

HS24_10061:6:1312:10172:98346#54 69 AAKM01002280 649 0 * = 649 0 HS24_10061:6:1312:10172:98346#54 137 AAKM01002280 649 37 100M = 649 0

After indel realignment:

HS24_10061:6:1312:10172:98346#54 69 AAKM01002280 649 0 * = 649 0 HS24_10061:6:1312:10172:98346#54 137 AAKM01002280 649 47 91M2D9M0S = 649 0

Is this a known bug? Any chance of a fix?

Thanks!

Richard


Created 2013-10-03 21:56:17 | Updated | Tags: indelrealigner vcf ensembl

Comments (12)

Hello

I am working with canine genomes and donwloaded the reference file (ftp://ftp.ensembl.org/pub/release-73/fasta/canis_familiaris/dna/Canis_familiaris.CanFam3.1.73.dna.toplevel.fa.gz) and variation VCF file (ftp://ftp.ensembl.org/pub/release-73/variation/vcf/canis_familiaris/Canis_familiaris.vcf.gz) from Ensembl. I was able to index the reference files and perform the alignment and Mark duplicate steps of the pipeline for the canine genomes.

I extracted SNPs that were marked either deleteions or duplications from the variation VCF file using grep -E "^#|deletion|insertion" Canis_familiaris.vcf > Canis_familiaris.indels.vcf to use in the Indel Realigner steps. The header of the file is as follows: `##fileformat=VCFv4.1

fileDate=20130826

##source=ensembl;version=73;url=http://e73.ensembl.org/canis_familiaris
##reference=ftp://ftp.ensembl.org/pub/release-73/fasta/canis_familiaris/dna/
##INFO=<ID=TSA,Number=0,Type=String,Description="Type of sequence alteration. Child of term sequence_alteration as defined by the sequence ontology project.">
##INFO=<ID=E_MO,Number=0,Type=Flag,Description="Multiple_observations.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_ESP,Number=0,Type=Flag,Description="Exome_Sequencing_Project.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_1000G,Number=0,Type=Flag,Description="1000Genomes.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_HM,Number=0,Type=Flag,Description="HapMap.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_Freq,Number=0,Type=Flag,Description="Frequency.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=E_C,Number=0,Type=Flag,Description="Cited.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status">
##INFO=<ID=dbSNP_131,Number=0,Type=Flag,Description="Variants (including SNPs and indels) imported from dbSNP [remapped from build CanFam2.0]">
1   8252    rs8962840   C   CT  .   .   dbSNP_131;TSA=insertion
1   9702    rs8457244   CA  C   .   .   dbSNP_131;TSA=deletion
1   18289   rs8444620   GT  G   .   .   dbSNP_131;TSA=deletion
1   36381   rs8471229   GT  G   .   .   dbSNP_131;TSA=deletion
1   52452   rs8469855   AT  A   .   .   dbSNP_131;TSA=deletion
1   55767   rs8285977   G   GA  .   .   dbSNP_131;TSA=insertion
1   60065   rs8723538   TC  T   .   .   dbSNP_131;TSA=deletion
1   62067   rs8395051   TA  T   .   .   dbSNP_131;TSA=deletion`

However, when I run the filrst step of Indel Realignment -T RealignerTargetCreator -nt 16 -R canFam3.1_bwa075/Canis_familiaris.CanFam3.1.73.dna.toplevel.fa -I AF23.rg.prmdup.bam -l INFO -S SILENT -o AF23.indel.intervals --known canFam3.1_bwa075/Canis_familiaris.indels.vcf

I get the error message: `##### ERROR A USER ERROR has occurred (version 2.7-2-g6bda569):

ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file`

I guess it is because the VCF file is missing the standard header (#CHROM POS ID ...). I was wondering if there was any way I could add the header line to the VCF file to be able to use in GATK? Would it also be possible to use the original VCF with all variations in the Indel Realignment steps or would you recommend I subset it for insertions/deletions as I did in this case?

Thank you very much for your help and suggestions!!!

Yours sincerely Shanker Swaminathan


Created 2013-09-30 14:20:57 | Updated | Tags: indelrealigner

Comments (15)

Hi,

first of all, sorry, if I'm not correct in this forum, but I didn't find a place where it realy seemed correct.

On the last data I've got I ren into information like: "INFO 22:14:12,352 IndelRealigner - Not attempting realignment in interval chrY:27782626-27782640 because there are too many reads."

I've read about the option: "--maxReadsForRealignment" and set it to 100000. I had a look at the bam file and saw, that at some points I've got more then 150000 reads. The average reading depth is about 50. I'm quite new to the filed of NGS and thus am not sure what that means.

Exactly how does the program handle this? Sincer there only small parts, which are actually to high covered. Are only those few bases not realigend? Or is everything in the step skipped? As an example: chrY:27782626-27782640 is only an interval of 14bp. The whole y-chromsom is covered by only two steps shown in the log file. What is not aligned? One part of the y-chromosom or just those 14bp?

Besides this technical questions: is it save to just increase the "--maxReadsForRealignment" until all intervals are realigned, or is there normaly a general failure (eg. PCR) behind this, so that it seems more save to ignore those intervals?

If some other users have encountered the same issue it would be nice to hear how the have handled this.

Thanks alot


Created 2013-09-12 14:38:40 | Updated 2013-09-12 14:39:06 | Tags: indelrealigner indels

Comments (2)

I have ~2,000 whole-genome-sequencing bams that I need to move through an alignment/deduping/recalibration pipeline. I'll be using bwa-mem for the initial alignment, and I'm wondering if you would still recommend carrying out the local realignment step (downstream) for these bams, given bwa-mem does some local/gapped alignment on its own. This is an important question for our group because the local realignment (using RealignerTargetCreator & IndelRealigner) for these bams will take a significant amount of time / computational resources. If I want to call short indels using UnifiedGenotyper (rather than HC, due to number of samples), does bwa-mem for primary alignment reduce the need for local realignment downstream? Any feedback/thoughts would be much appreciated - thanks!


Created 2013-09-09 13:28:50 | Updated | Tags: indelrealigner printreads

Comments (6)

Is it possible to adjust the compression level of the BAM-files (like i picard etc). To optimize speed/disk long term disk usage, this would be advantageous when producing temporary/final BAM files, e.g. using the indelrealigner and printreads tools, respectively.


Created 2013-08-05 19:36:35 | Updated | Tags: indelrealigner haplotypecaller

Comments (3)

Hello again,

we are having trouble trying to understand a insertion called by HaplotypeCaller (HC).

First we can't see any read confirming the insertion in IGV.

Second is that HC is telling us that it has a coverage of 126 reads and only 12 confirming the insertion but it calls as a homozygous insertion. Could you help us to find a direction to understand what is happening please?

Please note that the insertion seq is on the reference genome (b37) and it ends at the start position called by HC.

19 42489516 . A ATGTTCCGAGTCTCCAAGGGGTTGTCGTGC 167.77 . AC=2;AF=1.00;AN=2;DP=126;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;QD=0.05 GT:AD:GQ:PL 1/1:0,12:99:3848,330,0


Created 2013-07-09 02:16:17 | Updated | Tags: indelrealigner realignertargetcreator realignment

Comments (5)

I am doing exome sequencing in 700 individuals from a species with a large genome and I would like to use GATK to realign around indels. I am using a reduced reference, which is still about 3Gb. I tested out the target creator, but it is taking 5 days for 12 individuals when each is done individually and this time frame is not feasible for all 700 individuals. I tried to run more in parallel (~30 individuals), but there are RAM limitations on our 250G server. I am currently testing out the program by running all 12 test samples as input for the same run and the time estimate is very long (on the order of several hundred weeks). Based on a preliminary run I have also included a vcf file with likely indels to try and speed the process. Can you suggest another way in which I can make the time frame for all 700 individuals more reasonable? Otherwise we will not be able to use this tool.


Created 2013-07-08 21:08:28 | Updated | Tags: indelrealigner unifiedgenotyper baserecalibrator

Comments (3)

Hi !

I've got some question about some of the GATK tools and practices (and hope its okay to post them into a single thread)

  1. Is there any additional background information about the UnifiedGenotyper available , especially in case of multi sample calling ? How this works in a bit more detailed way ? So far I could only find the slides from the last GATK Workshop. But if I remember this correctly, during the presentation it was mentioned one could ask if more information is required (unfortunately I wasn't at the workshop ;))

  2. What's the definition of the Base Score Accuracy (Base Quality Score Recalibration Plots) ? Am I correct that this specifies how well the observed quality scores match the expected (empirical) quality scores ? I think I read it somewhere but couldn't find it any more.

  3. I've read that the way to validate(check what and how much was done) the Realign around Indels step, is to count/search for the OC Tags in the alignment file. Is there any fast way to do so ? Or do I have to convert the BAM into a SAM and count by running through lines of the alignment ?

  4. Fortunately there exist the "Recommended sets of known sites per tool" which I used so far. But is there any explanation why those sets are recommended ?

Tanks a lot !


Created 2013-06-27 04:13:31 | Updated | Tags: indelrealigner realignertargetcreator unifiedgenotyper

Comments (7)

Dear Developers,

I Run Unified genotyper for identify the indels and deletion. I used Realigner target creater got forIndelRealigner.intervals. used this output for Indel Realigner and get the output bam file which i gave as input toUnified Genotyper. I got result as

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BRCA1

gi|262359905|ref|NG_005905.2| 48155 . G A 387.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.466;DP=47;Dels=0.00;FS=2.721;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=29.31;MQ0=0;MQRankSum=-4.812;QD=8.25;ReadPosRankSum=0.535 GT:AD:DP:GQ:PL 0/1:25,22:45:99:416,0,676

But if i run without these steps (Realigner target creater, Indel Realigner) i am getting the same output. Also i used the same dataset and run it in samtools and got more SNP.

I am looking for a deletion in the dataset. I would like to get the result from GATK Could you please suggest on the same.

Thanks Sridhar


Created 2013-05-30 11:57:14 | Updated | Tags: indelrealigner baserecalibrator printreadswalker

Comments (2)

Hi, I am doing some whole genome sequence on 2 samples in which each sample was run on 12 lanes of a SOLiD 5500 machine. These are at fairly high coverage of ~40x each. My plan was align each lane independently then merge all 12 lane for each sample into 1 large bam file, then do the post-processing. I did this and was able to do the indel realign on both samples but have been having trouble with the base recalibration step, in which when apply the base recalibration PrintReads crashes with an error saying that there is not enough memory available. I have tried changing the tmp directory that is used and any other trick that I have been able to find on the forum.

I was wondering if an alternate and suitable approach would be to perform all of the post-processing on each individual lane first, then merge all of the lanes together after that. Would doing that have any adverse affect on the downstream analysis, i.e snps, cnvs, translocations, etc.

Thanks


Created 2013-05-22 15:47:54 | Updated | Tags: indelrealigner realignertargetcreator

Comments (4)

Hi

I've followed the suggested protocol for local realignment - first using RealignerTargetCreator and then IndelRealigner, but have unexpected results.

Let's call the two BAMs I'm realigning "normal" and "tumour" or N and T for short. Once realigned, I've split the resulting NT BAM file (using readgroup tags, although I see from the docs that it can create separate files natively) back into the original N and T BAM files and discovered something odd. I was expecting the pre-realignment N and T files to contain the same number of reads as the post-realignment files, only the coordinates that reads are mapped to would be different.

However, I notice that post-realignment files contain significantly fewer reads because unaligned reads and reads not aligned to the autosomes or sex chromosomes have been removed. However, these reads alone do not account for the difference; large numbers of reads aligned to the 24 chromosomes are now missing.

Can you tell me more about the reads that are removed? I suspect it to be an alignment quality issue, but cannot find direct reference to this behaviour in the documentation. I'm currently keeping both my pre and post-realignment bam files, but ultimately there will be space constraints and I'll have to choose and would like to make the most informed decision possible.

Regards Chris


Created 2013-05-12 22:55:38 | Updated | Tags: indelrealigner intervals

Comments (2)

Hello,

I am using one of the 1000 Genomes exome data (.bam format) which is called NA12892, aligned to the GRCh37 build. And I just started to use IndelRealigner tool which requires proper .interval_list file to work.

However, I am unable to find/generate *.interval_list file compatible with my data. Where can I download/generate .interval_list (or .bed) file that are compatible with exome data?

Normally, these files are provided by producers for Library Prep. kits (e.g illumina). But, I couldn't find which interval file should be used in 1000 Genomes data.

Thank you.


Created 2013-05-02 16:33:13 | Updated | Tags: indelrealigner

Comments (18)

Hi, I've 32 exomes in a merged BAM file, I have read groups to identify each exome, with id, sample, library and platform all set (I hope) correctly, what I can't understand why I have a discrepancy in the logs generated about the number of reads traversed by RealignmentTargetcreator Vs IndelRealigner:

From my RealignmentTargetcreator run I got:

INFO 18:24:17,286 ProgressMeter - Total runtime 5394.91 secs, 89.92 min, 1.50 hours INFO 18:24:17,286 MicroScheduler - 389,565,880 reads were filtered out during traversal out of 2,752,553,629 total (14.15%) INFO 18:24:17,287 MicroScheduler - -> 24573133 reads (0.89% of total) failing BadMateFilter INFO 18:24:17,287 MicroScheduler - -> 229871090 reads (8.35% of total) failing DuplicateReadFilter INFO 18:24:17,287 MicroScheduler - -> 135121273 reads (4.91% of total) failing MappingQualityZeroFilter INFO 18:24:17,298 MicroScheduler - -> 384 reads (0.00% of total) failing UnmappedReadFilter

Vs this from IndelRealigner

INFO 11:47:21,707 MicroScheduler - 0 reads were filtered out during traversal out of 200,100 total (0.00%)

Why have so few of the reads RealignmentTarget creator reported been traversed by IndelRealigner?

My command for IndelRealigner was:

GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner --maxReadsInMemory 1000000 --maxReadsForRealignment 1000000 -known /data/GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf -known /data/GATK_bundle/hg19/1000G_phase1.indels.hg19.vcf -I Merged_dedup.bam -R /data/GATK_bundle/hg19/ucsc.hg19.fasta -targetIntervals forIndelRealigner.intervals -o Merged_dedup_realigned.bam


Created 2013-04-21 22:26:42 | Updated | Tags: indelrealigner unifiedgenotyper

Comments (5)

Hi, I am calling indel in pooled samples using this command: java -jar -Xmx2g /PATH/2.1.13/GenomeAnalysisTK.jar -l INFO -T UnifiedGenotyper -I pool1.bam -I pool2.bam --out INDEL.vcf -R /reference.fa -glm INDEL

Currently i donot have any information of already known indels. 1.Do i need to first realign (RealignerTargetCreator and IndelRealigner) and then call indels even for pooled data?

  1. How different will this be for calling indel on individual sample?

Looking forward for your suggesions. with thanks sasha


Created 2013-04-19 15:53:12 | Updated | Tags: indelrealigner snp bisulfite

Comments (6)

I have Bisulfite- treated sequence mapped using Bismark and Bowtie2 and I'd like to call SNPs and INDELs from it. I have used Bis-SNP to call SNPs but it doesn't call indels , can I use GATK to call indels from the mapped data? Do u have any support to Bisulfite data? Another question please, the data is a mix from 6 different people do u have any support fro pooled data? Thanks for your help.

Comments (5)

Hello,

I am having trouble calling variants using Haplotype Caller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices Pre-Processing pipeline. The pre-processed data worked fine with UnifiedGenotyper but with HaplotypeCaller, though it runs without errors and seems to walk across the genome, only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference - when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.

Any help or guidance would be greatly appreciated!

Command Used for HaplotypeCaller:

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG

Commands Used for pre-processing (run in sequence using a Perl script):

java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf

java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf

java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam

samtools index exome.realigned.dedup

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate

java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam


Created 2013-03-21 23:28:13 | Updated | Tags: indelrealigner realignertargetcreator

Comments (5)

Hi, I got errors when ran GATK RealignerTargetCreator and IndelRealigner in v2.4.9. I've checked many related discussions and comments. First, I got an error like "we encountered an extremely high quality score of 69" with option -S LENIENT and the GATK program stalled. So I added "--fix_misencoded_quality_scores", and then I got different error message "ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '0'" now. I tried older versions of GATK and both java 1.6 and 1.7. I'm hoping that you can help this. Please let me know if I'm missing something. Thanks!

--Giltae


Created 2013-03-18 20:59:06 | Updated | Tags: indelrealigner realignertargetcreator

Comments (7)

Hi,

I am trying to decide between two approaches for performing realignment around indels. I have ~600 samples that have been aligned to a very fragmented draft genome assembly. What is best:

  1. take each sample and create a list of targets, followed by realignment on each sample.
  2. combine all samples into one large bam file and create a list of targets, followed by realignment on the same large bam file.

Also, would there be any advantages in terms of speed with either approach?

Cheers,

Steve


Created 2013-03-03 19:16:19 | Updated | Tags: indelrealigner realignertargetcreator

Comments (23)

Hi,

I have downloaded newest version of GATK (version 2.4-3) this week and tried to perform local realignment for my targeted sequencing data. Reference genome, SNP and indel data files were downloaded from resource bundle. However, I encountered two issues when I was doing the realignment.

First, in the step of RealignerTargetCreator. With the same command line, if I run it under version 2.4-3, I got an error message "MESSAGE: -49" (no other detail information provided); if I run it under an older version 2.3-9, it ran very well with no errors.

Second, in the step of IndelRealigner. I got error message "MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '13'". However, reference genome was downloaded from the bundle. I am not sure how to fix this issue.

I hope someone can help me with these issues. Let me know if more info is needed.

Thanks!


Created 2013-02-15 10:07:49 | Updated | Tags: indelrealigner

Comments (1)

HI I have used indelrealinger sucessfully in the past but I am now getting the following error. Please let me know if you have any suggestions.

[Fri Feb 15 09:09:37 GMT 2013] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/exports/home/fturner/vet_roslin_ark_genomics/reference_gen omes/chicken/Gallus_gallus.WASHUC2.69.dna.toplevel.fa OUTPUT=/exports/home/fturner/vet_roslin_ark_genomics/reference_genomes/chicken/dict129865 4176074562895.tmp TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 TMP_DIR=/tmp/fturner VERBOSITY=INFO QUIET=false VALIDATION_STRI NGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false [Fri Feb 15 09:09:37 GMT 2013] Executing as fturner@eddie327 on Linux 2.6.32-220.23.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0 _07-b10 [Fri Feb 15 09:09:52 GMT 2013] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.24 minutes. Runtime.totalMemory()=442957824

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.UnsupportedOperationException: Cannot enable index memory mapping for a SAM text reader at net.sf.samtools.SAMTextReader.enableIndexMemoryMapping(SAMTextReader.java:105) at net.sf.samtools.SAMFileReader.enableIndexMemoryMapping(SAMFileReader.java:228) at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource$SAMReaders.(SAMDataSource.java:754) at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource$SAMResourcePool.createNewResource(SAMDataSource.java:729) at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource$SAMResourcePool.getAvailableReaders(SAMDataSource.java:700) at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource.(SAMDataSource.java:232) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.createReadsDataSource(GenomeAnalysisEngine.java:884) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:717) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:207) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:104) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:227) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:89)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 1.1-23-g8072bd9):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
ERROR
ERROR MESSAGE: Cannot enable index memory mapping for a SAM text reader

Created 2013-02-14 19:41:01 | Updated 2013-02-15 06:36:36 | Tags: indelrealigner input bam

Comments (4)

Hello, I am a first-time user of GATK and have spent some time now on trying to get the input bam files in the appropriate format. To run IndelRealigner, I have added ReadGroups, Reordered and Index my bam file with the respective Picard-Tools.

My command-line is the following:

java -Djava.io.tmpdir='pwd'/tmp -jar GenomeAnalysisTK.jar -I ./add_read_groups_reorder_index.bam -R ./genome.fa -T IndelRealigner -targetIntervals ./gatk.intervals -o ./*.bam -known ./Mills-1000G-indels.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4

I get the following message:

SAM/BAM file /home/gp53/tophat2-merge-ctl-1st-2nd-readgroups-reorder-index.bam is malformed: SAM file doesn't have any read groups defined in the header.

My reads are paired-end aligned with TopHat2 I will appreciate your help on this. Thanks, G.


Created 2013-02-14 12:37:20 | Updated | Tags: indelrealigner

Comments (1)

Hi,

When doing an indel realignment with GATK, the 'MD' field in the SAM/BAM record gets dropped for realigned reads. Is it possible to recompute them directly with GATK? I know that 'samtools calmd' does this, but are there alternative options?


Created 2013-02-04 17:31:24 | Updated | Tags: indelrealigner

Comments (3)

Does anyone know of any known issues with the indelrealigner? The GATK is calling 1000s of SNPs on one genome I have due to bad realignments. It appears the target finder identifies regions of the genome that are essentially perfectly aligned, but when the realigner gets to these areas it remits the reads as a new alignment that is a train-wreck compared to what the re-aligner started with. Am going to investigate further but thought I would check if this rings any bells. I am using the flag -model USE_READS and have no known indels to work with.


Created 2012-12-20 01:21:50 | Updated | Tags: indelrealigner softclip

Comments (1)

Hi, Some aligners produce Smith-Waterman alignments and may soft clip bases from a read when there are indels or mismatches near the ends of the reads. I was wondering if you include these bases in the realignment process? And if not whether you might consider making it an option? Thanks, Colin


Created 2012-11-28 11:47:29 | Updated | Tags: indelrealigner picard microscheduler baserecalibration createtarget

Comments (2)

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired end reads, my protocol until now is (in brief, omitting options etc.)

bwa aln R1.fastq bwa aln R2.fastq bwa sampe R1.sai R2.sai picard/CleanSam.jar picard/SortSam.jar picard/MarkDuplicates.jar picard/AddOrReplaceReadGroups.jar picard/BuildBamIndex.jar GATK -T RealignerTargetCreator -known dbsnp.vcf GATK -T IndelRealigner -known dbsnp.vcf GATK -T BaseRecalibrator -knownSites dbsnp.vcf GATK -T PrintReads

A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

I have 85767226 paired end = 171534452 sequences in fastQ file

BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.

MarkDuplicates reports:

Read 165619516 records. 2 pairs never matched. Marking 20272927 records as duplicates. Found 2919670 optical duplicate clusters.

so nearly 6 million reads seem to miss.

CreateTargets MicroScheduler reports

35915555 reads were filtered out during traversal out of 166579875 total (21.56%) -> 428072 reads (0.26% of total) failing BadMateFilter -> 16077607 reads (9.65% of total) failing DuplicateReadFilter -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

so nearly 5 million reads seem to miss

The Realigner MicroScheduler reports

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which appears a miracle to me since 1) there are even more reads now than input sequences, 2) all those crappy reads reported by CreateTargets do not appear.

From Base recalibration MicroScheduler, I get

41397379 reads were filtered out during traversal out of 171703265 total (24.11%) -> 16010068 reads (9.32% of total) failing DuplicateReadFilter -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.

I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?

Thanks for any comments!


Created 2012-11-27 15:38:23 | Updated | Tags: indelrealigner realignertargetcreator

Comments (17)

I can't seem to run the IndelRealigner on reads that contain colons, ":" in the reference scaffold names. The RealignerTargetCreator step works correctly and generates the interval table, but the second, IndelRealigner, step fails. When I look at the generated interval table, I see the interval delimiter is a colon, which I imagine is the problem.

Unfortunately, I have a set of human references that have a colon in every scaffold name, so changing this would be a massive undertaking.

I believe this problem could be solved if you searched for the colon delimiter from the end of the interval string instead of from the beginning, so I'm hoping this a real simple fix.

Thanks!


Created 2012-11-26 14:38:52 | Updated 2012-12-02 05:20:34 | Tags: indelrealigner igv

Comments (16)

Hi. I am getting VERY odd results with some Streptococcus equi sequence. The BAM files from BWA align well in IGV, but when I run them through your pipeline there are many local errors where it seems that a single indel has been incorrectly multiplied up - somehow. You need to see the IGV screenshot.!

The bottom is a BAM file from BWA and the top is the final one from the GATK pipeline.


Created 2012-11-15 16:08:05 | Updated 2012-11-15 22:59:14 | Tags: indelrealigner known

Comments (8)

Hi, For both IndelRealigner/RealignerTargetCreator, there is an option for known indel sites as below:

-known /path/to/indels.vcf

However, from the bundle files collection such as from hg19, there are several vcf files:

1000G_indels_for_realignment.hg19.vcf
1000G_omni2.5.hg19.sites.vcf
1000G_omni2.5.hg19.vcf
dbsnp_132.hg19.excluding_sites_after_129.vcf
dbsnp_132.hg19.vcf
hapmap_3.3.hg19.sites.vcf
hapmap_3.3.hg19.vcf
indels_mills_devine.hg19.sites.vcf
indels_mills_devine.hg19.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf

amongst them, just based on the names, 1000G_indels_for_realignment.hg19.vcf and indels_mills_devine.hg19.sites.vcf look like the files supposed to use for IndelRealigner/RealignerTargetCreator, Could you clarify the exact files for this purpose?

Since for old version, I have used 1000G_phase1.indels.hg19.vcf and Mills_and_1000G_gold_standard.indels.hg19.sites.vcf. and I compared the new and old files, quite different now.

Thanks

Mike


Created 2012-11-12 22:55:36 | Updated 2013-01-07 20:06:43 | Tags: indelrealigner bqsr best-practices baserecalibration

Comments (1)

Hello,

I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:

I have been running GATK v1.6.2 on several samples. It seems the way I had initially had run GATK for indel-realignment and quality re-calibration steps are reversed. For example, in order of processing, I ran:

  • MarkDuplicates
  • Count Covariates
  • Table Recalibration
  • Realigner Target Creator
  • Indel Realigner

What are the consequences and effect to SNP and INDEL calling if the GATK steps are ran as above?. I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at BaseQuality scores when making calls.

Any help on this would be appreciated.

Mike


Created 2012-11-08 12:31:03 | Updated 2012-11-08 18:05:56 | Tags: indelrealigner realignertargetcreator unifiedgenotyper variants file-size

Comments (1)

HI

I am using the following set of commands on GATK2.1.13 to generate a VCF file

echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T RealignerTargetCreator  -o my.intervals -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Realignment Done at `date`"
echo "Starting IndelRealigner at `date`"

echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T IndelRealigner -targetIntervals my.intervals -o myrealignedBam.bam  -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Realignment done at `date`"
echo "Starting UnifiedGenotyper at `date`"
echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -l INFO -R human_g1k_v37.fasta -T UnifiedGenotyper    -I myrealignedBam.bam    -o mygatk_vcf.vcf    --output_mode EMIT_ALL_SITES -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Gentoypxing complete at `date`"

When i do a 'mpileup' for B2_with_ReadGroup.ddup.sorted.bam , I get a devcent 10 MB VCF file. But on the last ste of the above pipeline, my " mygatk_vcf.vcf " is goinging into 81GBs !!

Do you know what is wrong ?


Created 2012-10-31 07:52:07 | Updated 2012-10-31 17:35:25 | Tags: indelrealigner nwayout

Comments (1)

Hi, I've run into what appears to be a bug in handling output in IndelRealigner. When specifying --nWayOut everything works, but when I add --disable_bam_indexing, it appears to be expecting --out instead?

##### ERROR A USER ERROR has occurred (version 2.1-13-g1706365): 
....
##### ERROR MESSAGE: Value for argument with name '--out' (-o) is missing.

Created 2012-10-19 17:50:54 | Updated 2012-10-19 18:17:30 | Tags: indelrealigner arrayindexoutofboundsexception

Comments (7)

Hi everyone,

I'm using IndelRealigner to do a local realignment in the standard GATK workflow. I have used this pipeline before with success, but am now met with this error. I could not find any other examples of this, so I am posting as per the instructions in the error.

Cheers,

A.B.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException 
at org.broadinstitute.sting.utils.sam.AlignmentUtils.createIndelString(AlignmentUtils.java:710) 
at org.broadinstitute.sting.utils.sam.AlignmentUtils.leftAlignIndel(AlignmentUtils.java:603) 
at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.determineReadsThatNeedCleaning(IndelRealigner.java:912) 
at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.clean(IndelRealigner.java:681) 
at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.cleanAndCallMap(IndelRealigner.java:547) 
at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:519) 
at org.broadinstitute.sting.gatk.walkers.indels.IndelRealigner.map(IndelRealigner.java:114) 
at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:104) 
at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:52) 
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:71) 
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265) 
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) 
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) 
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.1-11-g13c0244):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

~


Created 2012-09-25 19:30:15 | Updated 2012-09-25 19:32:13 | Tags: indelrealigner

Comments (10)

I got this when I ran the IndelRealigner. The output bam is empty.

INFO  15:19:35,568 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours
INFO  15:19:36,910 GATKRunReport - Uploaded run statistics report to AWS S3

It didn't initialize. The sample is aligned to specific region of the genome and I did use -L option. For whole genome alignment of the same sample, I don't have any problems. Do you know why?


Created 2012-09-25 08:04:12 | Updated 2012-09-26 14:31:52 | Tags: indelrealigner

Comments (3)

Hi,

Apologies if this has been reported but I can't find it in the forum.

We're in the process of upgrading to GATK v2 but have been using v1.5 and have just noticed a few cases where IndelRealigner suddenly ended without warning or any report of an error. See example below where it ended with only ~50% of the BAM file processed. I'm wondering if it's a memory issue if multiple samples were being run concurrently. But more importantly with no alert it makes it tricky for us to identify when this happens. Is this something that's been fixed in later versions e.g. GATK 2.1 i.e. will Indelrealigner report an error when it finishes but the sample has not been processed to completion?

INFO  16:21:03,120 TraversalEngine -      8:90782004        3.17e+07    3.3 h        6.3 m     47.8%         6.9 h     3.6 h
INFO  16:21:33,939 TraversalEngine -      8:99615949        3.18e+07    3.3 h        6.3 m     48.1%         6.9 h     3.6 h
INFO  16:22:04,047 TraversalEngine -     8:110498944        3.19e+07    3.3 h        6.2 m     48.5%         6.9 h     3.5 h
INFO  16:22:24,484 TraversalEngine - Total runtime 11978.49 secs, 199.64 min, 3.33 hours
INFO  16:22:24,509 TraversalEngine - 0 reads were filtered out during traversal out of 32137673 total (0.00%)

Thank you in advance.

Best regards, Maria


Created 2012-09-17 17:42:03 | Updated 2012-09-17 18:02:25 | Tags: indelrealigner

Comments (3)

After I ran "IndelRealigner" tool, I saw the following message in the end of the run log, is it normal that 0 reads were filtered out during this step?

-------
INFO  07:01:50,692 TraversalEngine - 0 reads were filtered out during traversal out of 1529770054 total (0.00%) 
-------

JH


Created 2012-09-09 02:52:32 | Updated 2012-09-09 02:53:52 | Tags: indelrealigner bqsr selectvariants

Comments (2)

My current workflow for analysing mouse exome-sequencing (based on v4 of Best Practices) can require me to use slightly different VCFs as --knownSites or --known parameters in BQSR, indel realignment etc. Basically, I have a "master" VCF that I subset using SelectVariants. The choice of subset largely depends on the strain of the mice being sequenced but also on other things such as AF'. It'd be great to be able to do this on-the-fly in conjunction with--known' in tools that required knownSites rather than having to create project-specific (or even tool-specific) VCFs.

Is there a way to do this that I've overlooked? Is this a feature that might be added to GATK?


Created 2012-08-03 05:30:36 | Updated 2012-08-03 05:32:30 | Tags: indelrealigner realignertargetcreator mergeintervallists

Comments (1)

Hi all - I'm using GATK realigner which can take several hours on my samples. I'm trying to optimize my pipeline by dividing this up by chromosome for each node in my cluster. I can call RealignerTargetCreator using the -L parameter for each chromosome which results in a bunch of interval files. Now, I either want to call IndelRealigner using the -L parameter for each chromsome then merge the resulting BAM files, or merge the interval files into one then call IndelRealigner.

1) I don't see a way to merge interval files using GATK. Is this possible?

or

2) Can I call IndelRealigner and process each chromosome separately then merge the resulting BAM files together?


Created 2012-07-31 10:02:07 | Updated 2012-07-31 10:02:07 | Tags: indelrealigner baserecalibrator

Comments (2)

Dear GATK Team,

I have recently downloaded the GATK Bundle to get the human reference genome and its associated annotations.

After the mapping step on my lane BAM files, I am planning on using IndelRealigner and BaseRecalibrator as it is explained in the "Best Practices v4".

I am always confused about which annotation file I should use for my analysis.

For the Indel realignment, in the command line arguments of RealignerTargetCreator, one have to set the '--known' switch to indicate known indel sites.

--known:indels,vcf Mills_and_1000G_gold_standard.indels.b37.sites.vcf --known:dbsnp,vcf dbsnp_135.b37.vcf

But in the annotations folder, you can also find 'dbsnp_135.b37.excluding_sites_after_129.vcf' for dbsnp (version before 1000K genomes). Depending on which one I use the target intervals files are pretty different. So I am really wondering which one should be used in my case ? Or is there any other factor that could drive me to the better choice ?

I have a similar dilemna with base recalibration, "dbsnp_135.b37.vcf" or "dbsnp_135.b37.excluding_sites_after_129.vcf" in the '-knownSites' switch ?

Thanks a lot, Best,

Anthony


Created 2012-07-26 05:07:14 | Updated 2012-10-19 16:51:54 | Tags: indelrealigner baserecalibrator knownsites

Comments (20)

Dear GATK team,

Thanks a lot for the new GATK version and GATK forum!

I am trying to use GATK for yeast strains. I do not have files of known sites of SNPs/indels. I understand that the BaseRecalibrator must get such a file. Do you suggest to skip calibration and realignment, or is there another way to go here?


Created 2012-07-24 14:17:40 | Updated 2012-07-24 14:19:56 | Tags: indelrealigner leftalignindels leftalignvariants

Comments (5)

Dear all,

I was browsing through some of the less used functions in the GATK documentation, hence the following question: Does the LeftAlignIndels function do something additional that is not happening with IndelRealigner? In other words, do you recommend to run LeftAlignIndels on top of the indel realignment?

Best regards, Sophia