Tagged with #realignertargetcreator
3 documentation articles | 0 announcements | 34 forum discussions


Comments (0)

This article is part of the workflow documentation describing the Best Practices for Variant Discovery in DNAseq data. See http://www.broadinstitute.org/gatk/guide/best-practices for the full workflow.

The algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads.

Comments (109)

Realigner Target Creator

For a complete, detailed argument reference, refer to the GATK document page here.


Indel Realigner

For a complete, detailed argument reference, refer to the GATK document page here.


Running the Indel Realigner only at known sites

While we advocate for using the Indel Realigner over an aggregated bam using the full Smith-Waterman alignment algorithm, it will work for just a single lane of sequencing data when run in -knownsOnly mode. Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained and limited. We modify the example arguments from above to reflect the command-lines necessary for known-only/lane-level cleaning.

The RealignerTargetCreator step would need to be done just once for a single set of indels; so as long as the set of known indels doesn't change, the output.intervals file from below would never need to be recalculated.

 java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals \
  -known /path/to/indel_calls.vcf

The IndelRealigner step needs to be run on every bam file.

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I <lane-level.bam> \
  -R <ref.fasta> \
  -T IndelRealigner \
  -targetIntervals <intervalListFromStep1Above.intervals> \
  -o <realignedBam.bam> \
  -known /path/to/indel_calls.vcf
  --consensusDeterminationModel KNOWNS_ONLY \
  -LOD 0.4
No posts found with the requested search criteria.
Comments (1)

Gidday,

I've run the IndelRealigner on my mouse WGS *bam files with known site data from the Sanger MGP, and now I'm trying to figure out how "well" it worked.

The list created by RealignerTargetCreator contains 6547185 intervals

Parsing the output realigned.bam file for reads that had an "OC" tag added (as suggested in http://www.broadinstitute.org/gatk/events/3391/GATKw1310-BP-2-Realignment.pdf) shows that 1648299 reads were actually realigned.

I used the default settings, which means that

1) -model was USE_READS - and from what I've read, this is the correct option to use, given that Smith-Waterman modelling doesn't give greatly improved results;

2) -LOD was 5.0 - but for my data, which is mouse whole-genome sequence at average 10x coverage, this may be too high and I might be losing true positives.

I've tried randomly picking out candidate intervals from the intervals and OC-tagged reads from the realigned.bam file to check, but I was wondering if there's a more empirical way of checking how good the realignment was (I realise there's "no formal measure" as per the presentation but I'm finding it hard to make a judgement call!).

My feeling from looking at the intervals or realigned reads is that the low coverage is a major issue in terms of identifying "true" indels, so preferably we'd go for specificity over sensitivity.

Thanks for any advice/suggestions in advance!

Comments (3)

Hi,

I'm trying to use the RealignerTargetCreator as a test with 1 know file; 1000G_phase1.indels.b37.vcf. At first the contigs didn't match with my .BAM file (chr1/chr2 vs 1/2), so I adjusted that. Now when running it for the first time; java - jar [path to genomeAnalysisTK.jar] -T RealignerTargetCreator -R [.fasta] -I [.bam] -o [.intervals] -known [path to 1000g_phase1_adjusted.indels.b37.vcf] it gives the following error: "I/O error loading or writing tribble index file for [path to 1000g]". When running it the second time, I get the following error; "Problem detecting index type", because the .idx file is not correctly created.

What am I doing wrong?

Comments (5)

Hello,

I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.

I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:

INFO 15:12:09,747 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02 INFO 15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list INFO 15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09 INFO 15:12:09,758 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,759 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF INFO 15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT INFO 15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 INFO 15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file INFO 15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation INFO 15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files INFO 15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy INFO 15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] I've checked the indel interval lists for my samples and they do all appear to contain different intervals.

However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:

`##### ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
ERROR ------------------------------------------------------------------------------------------`

Which means that the SNP vcf (which has the same format as the indel vcf) is not used by BQSR.

My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.

Thanks very much!

Comments (6)

Hello All,

I am running RealignerTargetCreator using GATK version GenomeAnalysisTK-1.2-4-gd9ea764 and I am getting the following error: `

ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR contig reads = scaffold69676_size1796 / 3149
ERROR contig reference = scaffold69676_size1796 / 1758.
ERROR reads contigs = [scaffold1_size320545, scaffold2_size291774, scaffold3_size284740..........`

I already checked that I am using the right Reference FASTA file and the correct .bam file, that I have used for alignment before. Therefore, I am clueless why I am getting this error? I would appreciate your help regarding this problem. Any suggestion is welcome?

Thanks, Namrata

Comments (4)

Hello, I ran the GATK RealignerTargetCreator command and got an output file with some lines in non-interval format (about 6%). For example: the line chrM:125-346 versus the line chrM:7684 When I ran the next IndelRealigner command it failed with the following error message, indicating the line without interval: "##### ERROR MESSAGE: Invalid argument value 'targetIntervals' at position 10.

ERROR Invalid argument value 'GRC13283077_var_list' at position 11."

What is the reason for this output? What can I do about it? Please advice.

Thanks, Lily

Comments (1)

Hi there,

I've encountered a NullPointerException when running RealignerTargetCreator. I wasn't able to find any hint that that's a known problem, so I'm posting it here (sorry, should I have overlooked something).

INFO  21:41:18,635 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  21:41:18,638 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15 
INFO  21:41:18,638 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  21:41:18,638 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  21:41:18,645 HelpFormatter - Program Args: -l INFO -R hg18/hg18.fasta -I aln/hiseq.wholegenome.cov30.stampy.sorted.markdup.bam -T RealignerTargetCreator -nt 12 -o aln/hiseq.wholegenome.cov30.stampy.indels.intervals 
INFO  21:41:18,645 HelpFormatter - Date/Time: 2014/02/24 21:41:18 
INFO  21:41:18,645 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  21:41:18,646 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  21:41:18,733 GenomeAnalysisEngine - Strictness is SILENT 
INFO  21:41:18,842 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  21:41:18,853 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:18,923 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07 
INFO  21:41:18,956 MicroScheduler - Running the GATK in parallel mode with 12 total threads, 1 CPU thread(s) for each of 12 data thread(s), of 24 processors available on this machine 
INFO  21:41:19,038 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  21:41:19,823 GenomeAnalysisEngine - Done preparing for traversal 
INFO  21:41:19,823 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  21:41:19,825 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  21:41:19,910 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,916 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  21:41:19,919 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,928 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  21:41:19,937 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,944 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  21:41:19,944 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,951 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  21:41:19,952 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,958 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  21:41:19,958 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,964 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  21:41:19,964 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,975 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  21:41:19,976 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,980 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00 
INFO  21:41:19,980 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,984 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00 
INFO  21:41:19,985 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:19,989 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00 
INFO  21:41:19,990 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  21:41:20,010 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 
INFO  21:41:49,933 ProgressMeter -   chr1:26116497        2.60e+07   30.0 s        1.0 s      0.8%        59.0 m    58.5 m 
INFO  21:42:19,937 ProgressMeter -   chr1:57885373        5.77e+07   60.0 s        1.0 s      1.9%        53.2 m    52.2 m 
INFO  21:42:49,941 ProgressMeter -   chr1:88146121        8.80e+07   90.0 s        1.0 s      2.9%        52.4 m    50.9 m 
INFO  21:43:03,098 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
java.lang.NullPointerException
        at org.broadinstitute.sting.utils.locusiterator.LocusIteratorByState.getLocation(LocusIteratorByState.java:218)
        at org.broadinstitute.sting.utils.locusiterator.LocusIteratorByState.lazyLoadNextAlignmentContext(LocusIteratorByState.java:294)
        at org.broadinstitute.sting.utils.locusiterator.LocusIteratorByState.hasNext(LocusIteratorByState.java:233)
        at net.sf.picard.util.PeekableIterator.advance(PeekableIterator.java:70)
        at net.sf.picard.util.PeekableIterator.next(PeekableIterator.java:57)
        at org.broadinstitute.sting.gatk.executive.WindowMaker$WindowMakerIterator.advance(WindowMaker.java:200)
        at org.broadinstitute.sting.gatk.executive.WindowMaker$WindowMakerIterator.hasNext(WindowMaker.java:169)
        at org.broadinstitute.sting.gatk.datasources.providers.LocusView.advance(LocusView.java:180)
        at org.broadinstitute.sting.gatk.datasources.providers.LocusView.hasNextLocus(LocusView.java:148)
        at org.broadinstitute.sting.gatk.datasources.providers.AllLocusView.advance(AllLocusView.java:127)
        at org.broadinstitute.sting.gatk.datasources.providers.AllLocusView.hasNext(AllLocusView.java:85)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$MapDataIterator.hasNext(TraverseLociNano.java:167)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
        at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.sting.gatk.executive.ShardTraverser.call(ShardTraverser.java:98)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

Please let me know if there is more information I can provide.

Best, Tobi

Comments (4)

Hi

I'm indel realigning with version 2.4-9 using generic commands such as:

java -Xmx4g -jar /path/to/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R /path/to/reference.fasta \ -I /path/to/input.bam \ -o /path/to/realigner.intervals

java -Xmx4g -jar /path/to/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R /path/to/reference.fasta \ -I /path/to/sample-level.bam \ -targetIntervals /path/to/realigner.intervals.from.rtc \ -o /path/to/realigned.bam \ -model USE_SW \ -LOD 0.4

In most cases this is working fine, but in a few cases it is introducing artefacts that subsequently cause the bam file to fail Picard's ValidateSamFile, and in a couple of cases it introduces errors that can't be fixed by CleanSam and/or FixMateInformation.

Here's an example of an error that can be fixed by CleanSam:

Before indel realignment:

HS24_08564:7:2311:4630:19372#87 69 AAKM01002546 471 0 * = 471 0 HS24_08564:7:2311:4630:19372#87 137 AAKM01002546 471 25 100M = 471 0

After indel realignment:

HS24_08564:7:2311:4630:19372#87 69 AAKM01002546 471 0 * = 471 0 HS24_08564:7:2311:4630:19372#87 137 AAKM01002546 471 35 91M1D9M = 471 0

Here's an example of an error that can't be fixed:

Before indel realignment:

HS24_10061:6:1312:10172:98346#54 69 AAKM01002280 649 0 * = 649 0 HS24_10061:6:1312:10172:98346#54 137 AAKM01002280 649 37 100M = 649 0

After indel realignment:

HS24_10061:6:1312:10172:98346#54 69 AAKM01002280 649 0 * = 649 0 HS24_10061:6:1312:10172:98346#54 137 AAKM01002280 649 47 91M2D9M0S = 649 0

Is this a known bug? Any chance of a fix?

Thanks!

Richard

Comments (9)

Dear all can anybody help me with this error while running Realigntargetcreator the run failed to pass through many filters any suggestion why

Run summary

INFO 12:07:39,628 HelpFormatter - -------------------------------------------------------------------------------- 
INFO 12:07:39,631 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-2-g6bda569, Compiled 2013/08/28 16:30:29 
INFO 12:07:39,631 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO 12:07:39,631 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO 12:07:39,635 HelpFormatter - Program Args: -T RealignerTargetCreator -R /home/sab/ref/human_hg19.fa -I /home/sab/pipeline/A_sorted.bam -o /home/sab/pipeline/A_sorted.IndelRealigner.intervals 
INFO 12:07:39,636 HelpFormatter - Date/Time: 2013/12/03 12:07:39 
INFO 12:07:39,636 HelpFormatter - -------------------------------------------------------------------------------- 
INFO 12:07:39,636 HelpFormatter - -------------------------------------------------------------------------------- 
INFO 12:07:39,697 GenomeAnalysisEngine - Strictness is SILENT 
INFO 12:07:39,789 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO 12:07:39,798 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO 12:07:39,820 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 
INFO 12:07:39,906 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO 12:07:40,400 GenomeAnalysisEngine - Done preparing for traversal 
INFO 12:07:40,400 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO 12:07:40,401 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining 
INFO 12:08:10,406 ProgressMeter - chr1:85262337 8.53e+07 30.0 s 0.0 s 2.8% 18.2 m 17.7 m 
INFO 12:08:40,407 ProgressMeter - chr1:180404225 1.80e+08 60.0 s 0.0 s 5.8% 17.2 m 16.2 m 
INFO 12:09:10,409 ProgressMeter - chr2:26489345 2.76e+08 90.0 s 0.0 s 8.9% 16.8 m 15.3 m 
INFO 12:09:40,413 ProgressMeter - chr2:126159101 3.75e+08 120.0 s 0.0 s 12.1% 16.5 m 14.5 m 
INFO 12:10:10,415 ProgressMeter - chr2:226503317 4.76e+08 2.5 m 0.0 s 15.4% 16.3 m 13.8 m 
INFO 12:10:40,417 ProgressMeter - chr3:79985305 5.72e+08 3.0 m 0.0 s 18.5% 16.2 m 13.2 m 
INFO 12:11:10,418 ProgressMeter - chr3:178616501 6.71e+08 3.5 m 0.0 s 21.7% 16.1 m 12.6 m 
INFO 12:11:40,424 ProgressMeter - chr4:82438805 7.73e+08 4.0 m 0.0 s 25.0% 16.0 m 12.0 m 
INFO 12:12:10,426 ProgressMeter - chr4:190799281 8.81e+08 4.5 m 0.0 s 28.5% 15.8 m 11.3 m 
INFO 12:12:40,427 ProgressMeter - chr5:95948005 9.78e+08 5.0 m 0.0 s 31.6% 15.8 m 10.8 m 
INFO 12:13:10,429 ProgressMeter - chr6:6479181 1.07e+09 5.5 m 0.0 s 34.5% 15.9 m 10.4 m 
INFO 12:13:40,430 ProgressMeter - chr6:92259205 1.15e+09 6.0 m 0.0 s 37.3% 16.1 m 10.1 m 
INFO 12:14:10,432 ProgressMeter - chr7:12978229 1.25e+09 6.5 m 0.0 s 40.3% 16.1 m 9.6 m 
INFO 12:14:40,433 ProgressMeter - chr7:102060237 1.34e+09 7.0 m 0.0 s 43.1% 16.2 m 9.2 m 
INFO 12:15:10,435 ProgressMeter - chr8:37320269 1.43e+09 7.5 m 0.0 s 46.2% 16.2 m 8.7 m 
INFO 12:15:40,436 ProgressMeter - chr8:134665297 1.53e+09 8.0 m 0.0 s 49.3% 16.2 m 8.2 m 
INFO 12:16:10,437 ProgressMeter - chr9:94340989 1.63e+09 8.5 m 0.0 s 52.8% 16.1 m 7.6 m 
INFO 12:16:40,439 ProgressMeter - chr10:51925797 1.73e+09 9.0 m 0.0 s 56.0% 16.1 m 7.1 m 
INFO 12:17:10,440 ProgressMeter - chr11:10504845 1.83e+09 9.5 m 0.0 s 59.0% 16.1 m 6.6 m 
INFO 12:17:40,442 ProgressMeter - chr11:102575141 1.92e+09 10.0 m 0.0 s 62.0% 16.1 m 6.1 m 
INFO 12:18:10,443 ProgressMeter - chr12:60381241 2.01e+09 10.5 m 0.0 s 65.0% 16.2 m 5.7 m 
INFO 12:18:40,445 ProgressMeter - chr13:27611641 2.11e+09 11.0 m 0.0 s 68.2% 16.1 m 5.1 m 
INFO 12:19:10,446 ProgressMeter - chr14:15346101 2.20e+09 11.5 m 0.0 s 71.6% 16.1 m 4.6 m 
INFO 12:19:40,456 ProgressMeter - chr15:9565601 2.31e+09 12.0 m 0.0 s 74.8% 16.0 m 4.0 m 
INFO 12:20:10,457 ProgressMeter - chr16:5380553 2.42e+09 12.5 m 0.0 s 78.0% 16.0 m 3.5 m 
INFO 12:20:40,459 ProgressMeter - chr17:7484705 2.51e+09 13.0 m 0.0 s 81.0% 16.0 m 3.0 m 
INFO 12:21:10,460 ProgressMeter - chr18:12533677 2.59e+09 13.5 m 0.0 s 83.8% 16.1 m 2.6 m 
INFO 12:21:40,462 ProgressMeter - chr19:34306113 2.69e+09 14.0 m 0.0 s 87.0% 16.1 m 2.1 m 
INFO 12:22:10,463 ProgressMeter - chr20:55319685 2.77e+09 14.5 m 0.0 s 89.6% 16.2 m 100.0 s 
INFO 12:22:40,465 ProgressMeter - chr22:41605677 2.87e+09 15.0 m 0.0 s 92.8% 16.2 m 70.0 s 
INFO 12:23:10,466 ProgressMeter - chrX:84912589 2.97e+09 15.5 m 0.0 s 95.8% 16.2 m 40.0 s 
INFO 12:23:40,468 ProgressMeter - chrY:30850957 3.07e+09 16.0 m 0.0 s 99.1% 16.1 m 8.0 s 
INFO 12:23:47,504 ProgressMeter - done 3.10e+09 16.1 m 0.0 s 100.0% 16.1 m 0.0 s 
INFO 12:23:47,505 ProgressMeter - Total runtime 967.10 secs, 16.12 min, 0.27 hours 
INFO 12:23:47,591 MicroScheduler - 1390162 reads were filtered out during the traversal out of approximately 5682566 total reads (24.46%) 
INFO 12:23:47,591 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO 12:23:47,592 MicroScheduler - -> 51548 reads (0.91% of total) failing BadMateFilter 
INFO 12:23:47,592 MicroScheduler - -> 352814 reads (6.21% of total) failing DuplicateReadFilter 
INFO 12:23:47,592 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
INFO 12:23:47,592 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter 
INFO 12:23:47,592 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
INFO 12:23:47,592 MicroScheduler - -> 985800 reads (17.35% of total) failing MappingQualityZeroFilter 
INFO 12:23:47,593 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter 
INFO 12:23:47,593 MicroScheduler - -> 0 reads (0.00% of total) failing Platform454Filter 
INFO 12:23:47,593 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter 
INFO 12:23:49,381 GATKRunReport - Uploaded run statistics report to AWS S3

Regards

Comments (11)

Dear GATK team,

I'm trying to get genotype calls for whole genome data sequenced on Illumina HiSeq. I have ~1.2B reads from one individual and would like to test the effect of varying the number of reads used as input on GATK calls. I first divide the unsorted reads into parcels of ~200M. Then I use either 1 parcel or 2 parcels or 3 parcels and so on as input to the GATK pipeline to simulate having different numbers of reads. When I ran this process on hg18 aligned data, the vcf files increased in size as number of input parcels increased, as expected. However, when I ran the same process on hg19 aligned data, the vcf files sizes were all similar and the loci and read numbers in all the files were similar. The input files to UnifiedGenotyper varied in size as expected so the problem is likely at the last step.

The only difference between the hg18 and hg19 runs was that for hg18, I used the -L target.intervals option at the RealignerTargetCreator step (and later on whenever required) which solved the filter N cigar problem. Somehow that didn't work for the hg19 run and so I had to use the filter N cigar option. Could this affect the genotyping step?

A diagram of the workflow is attached.

Thank you! Stephanie

Comments (3)

root@GR0001:~# java -Xmx2g -Djava.io.tmpdir=/nG/Data/1265/vcf1/node1/9/AnalysisTemp/ -jar /nG/Process/Tools/GATK/GenomeAnalysisTK-2.7-1-g42d771f/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 3 -L 9 -R '/nG/Reference/CommonName/dog/FASTA/chrAll.fa' -I '/nG/Data/1265/vcf1/node1/9/Databind/9.bam' -o '/nG/Data/1265/vcf1/node1/9/Databind/9.bam.intervals' INFO 11:10:17,730 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:10:17,732 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/21 23:02:55 INFO 11:10:17,733 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 11:10:17,733 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 11:10:17,737 HelpFormatter - Program Args: -T RealignerTargetCreator -nt 3 -L 9 -R /nG/Reference/CommonName/dog/FASTA/chrAll.fa -I /nG/Data/1265/vcf1/node1/9/Databind/9.bam -o /nG/Data/1265/vcf1/node1/9/Databind/9.bam.intervals INFO 11:10:17,738 HelpFormatter - Date/Time: 2013/08/28 11:10:17 INFO 11:10:17,738 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:10:17,738 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:10:17,879 GenomeAnalysisEngine - Strictness is SILENT INFO 11:10:18,189 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 11:10:18,198 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 11:10:18,325 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.13 INFO 11:10:24,257 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-1-g42d771f):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Couldn't read file /root/9 because The interval file 9 does not have one of the supported extensions (.bed, .list, .picard, .interval_list, or .intervals). Please rename your file with the appropriate extension. If 9 is NOT supposed to be a file, please move or rename the file at location /root/9
ERROR ------------------------------------------------------------------------------------------

I can't understand why call this error. please check it. Is it a bug?

update When I try to with chromosome 9 (numerical), I have the problem. There is not problem for other chromosome number (like 1,2,3,4,5,6,7,8,10,11...)

Comments (1)

Hello

I am currently trying to run the RealignerTargetCreator on some bam files which were aligned to hg19 howver am getting this error `ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:

ERROR contig known = chrM / 16571
ERROR contig reference = chrM / 16569.`

After some initial investigation I found that the supplied hg19 reference genome which was being used for mapping was using rCRS mtDNA. other then realigning to a different build of hg19 is there any way to easily fix this problem through GATK?

Comments (1)

Hi, I am working on exome capture data for barley (1.3Gbp). I am interested in variant calling to find out SNPs in my sample. I have used SAMTools SNP calling and things get done in ~1 hr whereas GATK (inspite of its several steps to prepare the BAM for variant caller) takes forever. I understand my reference is large and since its an exome capture the targeted region is only 60 Mbp of 1.3Gbp. RealignerTargetCreator is the step it takes forever to locate for sites where indel realignment is required. Do someone have any suggestions to speed it up? Or try any other variant caller? I have tried downsampling my BAM with samtools view -s and that too takes as the log says 8 more days to finish :(

Here is my sample command:

java -Xmx20g -XX:MaxPermSize=40G -jar /software/production/gatk/2.3.9/x86_64/GenomeAnalysisTK.jar -T RealignerTargetCreator -I in.bam -R ref.fa -o out.bam

Thanks, D

Comments (1)

Hi, I ran the exact same command line twice with the exact same files and parameters and the output file is different: gatk -T RealignerTargetCreator -nt 8 -R /home/ngs/data/tools/gatk/hg/broad_bundle_hg19_v2.2/ucsc.hg19.fasta -I $HOME/jobout/nogroupid_$JOB_ID//JFP0435_02_R2.JFP.lane5.120817FCA_sorted_remdup.bam --known /home/ngs/data/tools/gatk/hg/broad_bundle_hg19_v2.2//1000G_phase1.indels.hg19.vcf --known /home/ngs/data/tools/gatk/hg/broad_bundle_hg19_v2.2//Mills_and_1000G_gold_standard.indels.hg19.vcf --filter_mismatching_base_and_quals -o $HOME/jobout/nogroupid_$JOB_ID//JFP0435_02_R2.JFP.lane5.120817FCA_sorted_remdup.intervals

rem :Gatk vers 2.3.

Size of the intervals files are different and when I run a 'diff' it show some differences, not huge but I wonder if it's due to the algorithm:

here is the diff result:

158138c158138 < chr2:905714-905956 --- > chr2:905685-905956 452144c452144 < chr3:195511953-195512064 --- > chr3:195511916-195512064 461418c461418 < chr4:9241955-9242059 --- > chr4:9241966-9242059 605566,605567c605566,605569 < chr5:21481723-21482150 < chr5:21482294-21482726 --- > chr5:21481723-21481740 > chr5:21481909-21482150 > chr5:21482294-21482563 > chr5:21482690-21482726 605569c605571,605573 < chr5:21484233-21484258 --- > chr5:21483481-21483649 > chr5:21483821-21484114 > chr5:21484233-21484265 615246c615250 < chr5:34189680-34190048 --- > chr5:34189714-34190048 615248a615253 > chr5:34191846-34192088 909440,909441c909445 < chr7:100643452-100643460 < chr7:100643595-100643794 --- > chr7:100643452-100643794 1008760c1008764 < chr8:86572070-86572117 --- > chr8:86572070-86572085 1008763c1008767 < chr8:86573453-86573764 --- > chr8:86573453-86573707 1478683c1478687 < chr14:19553519-19553562 --- > chr14:19553519-19553559 1478828c1478832 < chr14:20019712-20019990 --- > chr14:20019712-20019951 1994633c1994637 < chrUn_gl000212:6736-6842 --- > chrUn_gl000212:6736-6843

Kind regards

Didier

Comments (4)

I started with BWA-MEM to do alignment, used Picard to process the .SAM files (converted to bam, reorder, addorreplacegroup, etc). The GATK version I'm using is version 2.5-2-gf57256b, I cannot run 2.6 because the server only has Java 6 and I cannot upgrade it to Java 7.

I got a huge stack of error message when I run this command line (RealignerTargetCrator):

java -Xmx2g -jar $CLASSPATH/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R /Volumes/files/Users/user1/GATK_ref/hg19.fasta \ -I sorted_Deduped_reorder_grp.bam \ -o ./GATK/forIndelRealigner.intervals>

The error messages are these (sorry, a lot): I don't know why GATK needs to connect to window server? what permission problem? I am using a Mac OS X built server (remote). Thank you

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.InternalError: Can't connect to window server - not enough permissions. at java.lang.ClassLoader$NativeLibrary.load(Native Method) at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1827) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1724) at java.lang.Runtime.loadLibrary0(Runtime.java:823) at java.lang.System.loadLibrary(System.java:1045) at sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:50) at java.security.AccessController.doPrivileged(Native Method) at java.awt.Toolkit.loadLibraries(Toolkit.java:1605) at java.awt.Toolkit.(Toolkit.java:1627) at sun.awt.AppContext$2.run(AppContext.java:240) at sun.awt.AppContext$2.run(AppContext.java:226) at java.security.AccessController.doPrivileged(Native Method) at sun.awt.AppContext.initMainAppContext(AppContext.java:226) at sun.awt.AppContext.access$200(AppContext.java:112) at sun.awt.AppContext$3.run(AppContext.java:306) at java.security.AccessController.doPrivileged(Native Method) at sun.awt.AppContext.getAppContext(AppContext.java:287) at com.sun.jmx.trace.Trace.out(Trace.java:180) at com.sun.jmx.trace.Trace.isSelected(Trace.java:88) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.isTraceOn(DefaultMBeanServerInterceptor.java:1830) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:929) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:916) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:312) at com.sun.jmx.mbeanserver.JmxMBeanServer$2.run(JmxMBeanServer.java:1195) at java.security.AccessController.doPrivileged(Native Method) at com.sun.jmx.mbeanserver.JmxMBeanServer.initialize(JmxMBeanServer.java:1193) at com.sun.jmx.mbeanserver.JmxMBeanServer.(JmxMBeanServer.java:225) at com.sun.jmx.mbeanserver.JmxMBeanServer.(JmxMBeanServer.java:170) at com.sun.jmx.mbeanserver.JmxMBeanServer.newMBeanServer(JmxMBeanServer.java:1401) at javax.management.MBeanServerBuilder.newMBeanServer(MBeanServerBuilder.java:93) at javax.management.MBeanServerFactory.newMBeanServer(MBeanServerFactory.java:311) at javax.management.MBeanServerFactory.createMBeanServer(MBeanServerFactory.java:214) at javax.management.MBeanServerFactory.createMBeanServer(MBeanServerFactory.java:175) at sun.management.ManagementFactory.createPlatformMBeanServer(ManagementFactory.java:302) at java.lang.management.ManagementFactory.getPlatformMBeanServer(ManagementFactory.java:504) at org.broadinstitute.sting.gatk.executive.MicroScheduler.(MicroScheduler.java:222) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.(LinearMicroScheduler.java:70) at org.broadinstitute.sting.gatk.executive.MicroScheduler.create(MicroScheduler.java:169) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.createMicroscheduler(GenomeAnalysisEngine.java:443) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:272) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.5-2-gf57256b):
Comments (1)

Hi,

I am currently running GATK-2.6.4 with Java-1.7. I am trying to run RTC. It seems to get almost to the end and then I get the following error which I believe is referring to the -o argument:

INFO 06:12:01,578 ProgressMeter - done 3.10e+09 92.2 m 1.0 s 100.0% 92.2 m 0.0 s INFO 06:12:01,578 ProgressMeter - Total runtime 5530.95 secs, 92.18 min, 1.54 hours INFO 06:12:01,580 MicroScheduler - 40 reads were filtered out during the traversal out of approximately 112022776 total reads (0.00%) INFO 06:12:01,581 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter INFO 06:12:01,582 MicroScheduler - -> 0 reads (0.00% of total) failing BadMateFilter INFO 06:12:01,583 MicroScheduler - -> 8 reads (0.00% of total) failing DuplicateReadFilter INFO 06:12:01,583 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 06:12:01,584 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 06:12:01,584 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 06:12:01,585 MicroScheduler - -> 32 reads (0.00% of total) failing MappingQualityZeroFilter INFO 06:12:01,585 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter INFO 06:12:01,586 MicroScheduler - -> 0 reads (0.00% of total) failing Platform454Filter INFO 06:12:01,586 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter INFO 06:12:02,411 GATKRunReport - Uploaded run statistics report to AWS S3 /local/scratch/1373342599.1240274.shell: line 7: -o: command not found

My RTC command is as follows:

java -Xmx2g -jar GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R human_g1k_v37.fasta \ -I aln.sorted.rg.markdup.bam -o forIndelRealigner.intervals \ -known 1000G_phase1.indels.b37.vcf \ -known Mills_and_1000G_gold_standard.indels.b37.sites.vcf \

Is there a new way to specify an outfile in this version of GATK?

Thanks very much!

Comments (5)

I am doing exome sequencing in 700 individuals from a species with a large genome and I would like to use GATK to realign around indels. I am using a reduced reference, which is still about 3Gb. I tested out the target creator, but it is taking 5 days for 12 individuals when each is done individually and this time frame is not feasible for all 700 individuals. I tried to run more in parallel (~30 individuals), but there are RAM limitations on our 250G server. I am currently testing out the program by running all 12 test samples as input for the same run and the time estimate is very long (on the order of several hundred weeks). Based on a preliminary run I have also included a vcf file with likely indels to try and speed the process. Can you suggest another way in which I can make the time frame for all 700 individuals more reasonable? Otherwise we will not be able to use this tool.

Comments (7)

Dear Developers,

I Run Unified genotyper for identify the indels and deletion. I used Realigner target creater got forIndelRealigner.intervals. used this output for Indel Realigner and get the output bam file which i gave as input toUnified Genotyper. I got result as

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BRCA1

gi|262359905|ref|NG_005905.2| 48155 . G A 387.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.466;DP=47;Dels=0.00;FS=2.721;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=29.31;MQ0=0;MQRankSum=-4.812;QD=8.25;ReadPosRankSum=0.535 GT:AD:DP:GQ:PL 0/1:25,22:45:99:416,0,676

But if i run without these steps (Realigner target creater, Indel Realigner) i am getting the same output. Also i used the same dataset and run it in samtools and got more SNP.

I am looking for a deletion in the dataset. I would like to get the result from GATK Could you please suggest on the same.

Thanks Sridhar

Comments (4)

Hi

I've followed the suggested protocol for local realignment - first using RealignerTargetCreator and then IndelRealigner, but have unexpected results.

Let's call the two BAMs I'm realigning "normal" and "tumour" or N and T for short. Once realigned, I've split the resulting NT BAM file (using readgroup tags, although I see from the docs that it can create separate files natively) back into the original N and T BAM files and discovered something odd. I was expecting the pre-realignment N and T files to contain the same number of reads as the post-realignment files, only the coordinates that reads are mapped to would be different.

However, I notice that post-realignment files contain significantly fewer reads because unaligned reads and reads not aligned to the autosomes or sex chromosomes have been removed. However, these reads alone do not account for the difference; large numbers of reads aligned to the 24 chromosomes are now missing.

Can you tell me more about the reads that are removed? I suspect it to be an alignment quality issue, but cannot find direct reference to this behaviour in the documentation. I'm currently keeping both my pre and post-realignment bam files, but ultimately there will be space constraints and I'll have to choose and would like to make the most informed decision possible.

Regards Chris

Comments (13)

Hi,

I want to run RealignerTargetCreator with this command line :

qsub -b Y -N RTC -q bigmem.q "/usr/local/java/latest/bin/java -Xmx36g -jar /home/sabotf/sources/GenomeAnalysisTK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /data/projects/assembling-glab/PacBio_test/XL/filtered_subreads_XL.fasta -o /data/projects/assembling-glab/mappingResults/Tog5681Clean_vs_CG14_XL/output.intervals -I /data/projects/assembling-glab/mappingResults/Tog5681Clean_vs_CG14_XL/rmdup.bam" but this return this error :

`##### ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 2.3-9-ge5ebf34):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
ERROR ------------------------------------------------------------------------------------------`

I tried with -Xmx4g, then,-Xmx12g, then -Xmx48g, and always the same error. I don't know what to do ... any idea ? thanks

Comments (1)

Hello,

I am using Realigner target creator of GATK toolbox but it keeps on giving me this error:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.3-6-gebbba25):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
** ERROR MESSAGE: SAM/BAM file Sample_27.sorted.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups**
ERROR ------------------------------------------------------------------------------------------

How could I solve this problem.

Thanks a lot in advance.

Comments (5)

Hello,

I am having trouble calling variants using Haplotype Caller on simulated exome reads. I have been able to call reasonable-looking variants on the exome (simulated with dwgsim) with HaplotypeCaller before running it through the Best Practices Pre-Processing pipeline. The pre-processed data worked fine with UnifiedGenotyper but with HaplotypeCaller, though it runs without errors and seems to walk across the genome, only outputs a VCF header. I have tried calling variants with and without using -L to provide the exome regions (as recommended in this forum post: http://gatkforums.broadinstitute.org/discussion/1681/expected-file-size-haplotype-caller) but this hasn't made a difference - when we run the command with the pre-processed BAMs, we only get a VCF header. Everything has been tested with both 2.4-7 and 2.4-9.

Any help or guidance would be greatly appreciated!

Command Used for HaplotypeCaller:

java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I exome.realigned.dedup.recal.bam -o exome.raw.vcf -D dbsnp_137.hg19.vcf -stand_emit_conf 10 -rf BadCigar -L Illumin_TruSeq.bed --logging_level DEBUG

Commands Used for pre-processing (run in sequence using a Perl script):

java -Xmx16g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -R ucsc.hg19.fasta -I exome.bam -o exome.intervals -known dbsnp_137.hg19.vcf

java -Xmx4g -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I exome.bam -o exome.realigned.bam -targetIntervals intervals.bam -known dbsnp_137.hg19.vcf

java -Xmx16g -jar MarkDuplicates.jar I=exome.realigned.bam METRICS_FILE=exome.dups O=exome.realigned.dedup.bam

samtools index exome.realigned.dedup

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -o exome.recal_data.grp -knownSites dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov ContextCovariate -cov CycleCovariate -cov QualityScoreCovariate

java -Xmx4g -jar GenomeAnalysisTK.jar -T PrintReads -nct 8 -R ucsc.hg19.fasta -I exome.realigned.dedup.bam -BQSR exome.recal_data.grp -baq CALCULATE_AS_NECESSARY -o exome.realigned.dedup.recal.bam

Comments (5)

Hi, I got errors when ran GATK RealignerTargetCreator and IndelRealigner in v2.4.9. I've checked many related discussions and comments. First, I got an error like "we encountered an extremely high quality score of 69" with option -S LENIENT and the GATK program stalled. So I added "--fix_misencoded_quality_scores", and then I got different error message "ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '0'" now. I tried older versions of GATK and both java 1.6 and 1.7. I'm hoping that you can help this. Please let me know if I'm missing something. Thanks!

--Giltae

Comments (3)

Hi,

I am trying to decide between two approaches for performing realignment around indels. I have ~600 samples that have been aligned to a very fragmented draft genome assembly. What is best:
1. take each sample and create a list of targets, followed by realignment on each sample.
2. combine all samples into one large bam file and create a list of targets, followed by realignment on the same large bam file.

Also, would there be any advantages in terms of speed with either approach?

Cheers,

Steve

Comments (11)

Hi,

I am currently working with a project where we have sequenced a library of approximately 70 bps insert sizes using 2x100 paired-end seq. While this can seem unnecessary, it can improve base qualities a lot.

I have used SeqPrep (https://github.com/jstjohn/SeqPrep) which strips adaptors and merges reads that overlap, in our case the entire read most of the times. This also boosts the base qualities, if a base was sequenced twice, the quality improves quite a bit. This way, base qualities can stretch up to 70 and over (probability of error 0.0001 x 0.0001 if both reads had Q40 at that base, it merged qual = 80). No funny business there. :)

However, this does not seem to play nicely with GATK. The realignment crashes (see below) saying the the base quals must be erroneous. In my case, but they are correct. Can I force GATK to work with these BQs? (--validation_strictness LENIENT didn't help as you can see below :)

cheers Daniel Klevebring

   INFO  13:04:07,408 HelpFormatter - -------------------------------------------------------------------------------- 
   INFO  13:04:07,411 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-7-g5e89f01, Compiled 2013/03/06 01:01:28 
   INFO  13:04:07,411 HelpFormatter - Copyright (c) 2010 The Broad Institute 
   INFO  13:04:07,411 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
   INFO  13:04:07,416 HelpFormatter - Program Args: -T RealignerTargetCreator -I /scratch/3041404/P394_102.prmdup.bam -R /bubo/proj/b2010040/private/GoldenPath/hg19/GATK_resource_bundle/human_g1k_v37_clean.fasta -o /scratch/3041404/P394_102.realn.intervals --intervals /bubo/proj/b2010040/private/GoldenPath/NG_design/1000G_REF_picard_custom_design_target_regions_HG19.bed.interval_list --validation_strictness LENIENT 
   INFO  13:04:07,416 HelpFormatter - Date/Time: 2013/03/13 13:04:07 
   INFO  13:04:07,416 HelpFormatter - -------------------------------------------------------------------------------- 
   INFO  13:04:07,416 HelpFormatter - -------------------------------------------------------------------------------- 
   INFO  13:04:08,461 GenomeAnalysisEngine - Strictness is LENIENT 
   INFO  13:04:08,632 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
   INFO  13:04:08,640 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
   INFO  13:04:08,655 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
   INFO  13:04:09,782 IntervalUtils - Processing 39772003 bp from intervals 
   INFO  13:04:10,001 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files 
   INFO  13:04:10,262 GenomeAnalysisEngine - Done creating shard strategy 
   INFO  13:04:10,262 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
   INFO  13:04:10,263 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
   INFO  13:04:18,482 GATKRunReport - Uploaded run statistics report to AWS S3 
   ##### ERROR ------------------------------------------------------------------------------------------
   ##### ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01): 
   ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
   ##### ERROR Please do not post this error to the GATK forum
   ##### ERROR
   ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
   ##### ERROR Visit our website and forum for extensive documentation and answers to 
   ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
   ##### ERROR
   ##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/3041404/P394_102.prmdup.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 70; please see the GATK --help documentation for options related to this error
   ##### ERROR ------------------------------------------------------------------------------------------
Comments (5)

Hello,

I try to run the realigner target creator with the data divided on chromosomes. It seems to run smooth but the runtime for chromosome 1 seems to never end. I have checked my bam files and as expected the one with chr 1 is a little bit bigger than chr 2 but nothing proportional to the differences in predicted runtime. The version i use is GATK 2.3.0 and this is how i run it:

java -Xmx12g -jar programs/GenomeAnalysisTK-2.3-0/GenomeAnalysisTK.jar -l INFO -T VariantRecalibrator -R Homo_sapiens.GRCh37.57_dna_concat.fa -recalFile allchr_varrecal_BOTH_comb_ref.intervals -rscriptFile 15_allchr_varrecal_BOTH_comb_ref.intervals.plots.R -tranchesFile allchr_varrecal_BOTH_comb_ref.intervals.tranches -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0 dbsnp_135.b37.vcf -resource:mills,VCF,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ --mode BOTH -nt 8 -input allchr_real_recal_resrt_raw_BOTH_comb_ref.vcf -L Agilent_SureSelect.V4.GRCh37.70_targets_nochr.bed.pad100.interval_list --pedigreeValidationType SILENT --pedigree my_fam.fam

predicted runtime looks like:

INFO 15:11:33,778 ProgressMeter - 1:33587200 3.36e+07 2.3 h 4.1 m 1.1% 8.8 d 8.7 d INFO 15:11:33,778 ProgressMeter - 11:73495886 1.82e+09 2.3 h 4.5 s 61.0% 3.7 h 87.4 m INFO 15:11:33,778 ProgressMeter - 10:6623707 1.68e+09 2.3 h 4.9 s 54.5% 4.2 h 114.3 m INFO 15:11:33,778 ProgressMeter - 11:572298 1.82e+09 2.3 h 4.5 s 58.7% 3.9 h 96.4 m INFO 15:11:33,779 ProgressMeter - 11:122078003 1.82e+09 2.3 h 4.5 s 62.6% 3.6 h 81.7 m INFO 15:11:53,780 ProgressMeter - 11:10589211 1.82e+09 2.3 h 4.5 s 59.0% 3.9 h 95.3 m INFO 15:11:53,780 ProgressMeter - 11:105031869 1.82e+09 2.3 h 4.5 s 62.1% 3.7 h 83.9 m INFO 15:11:53,781 ProgressMeter - 11:46994477 1.82e+09 2.3 h 4.5 s 60.2% 3.8 h 90.8 m INFO 15:12:03,779 ProgressMeter - 11:8677862 1.82e+09 2.3 h 4.5 s 58.9% 3.9 h 95.7 m INFO 15:12:03,779 ProgressMeter - 11:130690257 1.82e+09 2.3 h 4.5 s 62.9% 3.6 h 81.1 m INFO 15:12:13,779 ProgressMeter - 11:84816026 1.82e+09 2.3 h 4.5 s 61.4% 3.7 h 86.4 m INFO 15:12:13,779 ProgressMeter - 10:17039143 1.68e+09 2.3 h 4.9 s 54.8% 4.2 h 113.3 m INFO 15:12:23,780 ProgressMeter - 11:18556991 1.82e+09 2.3 h 4.5 s 59.3% 3.9 h 94.6 m INFO 15:12:23,780 ProgressMeter - 11:113365440 1.82e+09 2.3 h 4.5 s 62.3% 3.7 h 83.2 m INFO 15:12:23,782 ProgressMeter - 11:55284794 1.82e+09 2.3 h 4.5 s 60.4% 3.8 h 90.1 m INFO 15:12:33,780 ProgressMeter - 1:33783808 3.38e+07 2.3 h 4.1 m 1.1% 8.8 d 8.7 d INFO 15:12:33,780 ProgressMeter - 12:3311608 1.95e+09 2.3 h 4.2 s 63.1% 3.6 h 80.5 m INFO 15:12:43,779 ProgressMeter - 11:93292307 1.82e+09 2.3 h 4.6 s 61.7% 3.7 h 85.8 m INFO 15:12:43,780 ProgressMeter - 10:24841443 1.68e+09 2.3 h 4.9 s 55.1% 4.2 h 112.5 m INFO 15:12:43,780 ProgressMeter - 11:19408689 1.82e+09 2.3 h 4.6 s 59.3% 3.9 h 94.8 m INFO 15:12:53,965 ProgressMeter - 11:26512308 1.82e+09 2.3 h 4.6 s 59.5% 3.9 h 94.0 m INFO 15:12:53,965 ProgressMeter - 11:121738354 1.82e+09 2.3 h 4.6 s 62.6% 3.7 h 82.6 m INFO 15:12:53,965 ProgressMeter - 11:63468191 1.82e+09 2.3 h 4.6 s 60.7% 3.8 h 89.4 m INFO 15:13:03,781 ProgressMeter - 12:11960820 1.95e+09 2.3 h 4.3 s 63.4% 3.6 h 79.8 m INFO 15:13:13,780 ProgressMeter - 11:101551879 1.82e+09 2.3 h 4.6 s 61.9% 3.7 h 85.1 m INFO 15:13:13,780 ProgressMeter - 10:32238974 1.68e+09 2.3 h 4.9 s 55.3% 4.2 h 111.8 m INFO 15:13:13,780 ProgressMeter - 11:27204534 1.82e+09 2.3 h 4.6 s 59.5% 3.9 h 94.1 m INFO 15:13:23,966 ProgressMeter - 11:71529440 1.82e+09 2.3 h 4.6 s 61.0% 3.8 h 88.8 m INFO 15:13:23,968 ProgressMeter - 11:34170083 1.82e+09 2.3 h 4.6 s 59.8% 3.9 h 93.4 m INFO 15:13:23,977 ProgressMeter - 11:129889715 1.82e+09 2.3 h 4.6 s 62.9% 3.7 h 81.9 m INFO 15:13:33,781 ProgressMeter - 1:33980416 3.40e+07 2.3 h 4.1 m 1.1% 8.8 d 8.7 d

Any suggestions?

Regards,

Måns

Comments (23)

Hi,

I have downloaded newest version of GATK (version 2.4-3) this week and tried to perform local realignment for my targeted sequencing data. Reference genome, SNP and indel data files were downloaded from resource bundle. However, I encountered two issues when I was doing the realignment.

First, in the step of RealignerTargetCreator. With the same command line, if I run it under version 2.4-3, I got an error message "MESSAGE: -49" (no other detail information provided); if I run it under an older version 2.3-9, it ran very well with no errors.

Second, in the step of IndelRealigner. I got error message "MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '13'". However, reference genome was downloaded from the bundle. I am not sure how to fix this issue.

I hope someone can help me with these issues. Let me know if more info is needed.

Thanks!

Comments (7)

Hi there,

I get an error when I try to run GATK with the following command:

java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I  merged_bam_files_indexed_markduplicate.bam -o reads.intervals

However I get this error:

SAM/BAM file SAMFileReader{/merged_bam_files_indexed_markduplicate.bam} is malformed: Read HWI-ST303_0093:5:5:13416:34802#0 is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK.  Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem

It suggest that it a header issue however my bam file has a header:

samtools view -h merged_bam_files_indexed_markduplicate.bam | grep ^@RG
@RG     ID:test1      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test   CN:japan
@RG     ID:test2      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test    CN:japan

when I grep the read within the error:

HWI-ST303_0093:5:5:13416:34802#0        99      1       1090    29      23S60M17S       =       1150    160     TGTTTGGGTTGAAGATTGATACTGGAAGAAGATTAGAATTGTAGAAAGGGGAAAACGATGTTAGAAAGTTAATACGGCTTACTCCAGATCCTTGGATCTC        GGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGFGGGGGGGGGDGFGFGGGGGFEDFGEGGGDGEG?FGGDDGFFDGGEDDFFFFEDG?E        MD:Z:60 PG:Z:MarkDuplicates     RG:Z:test1      XG:i:0  AM:i:29 NM:i:0  SM:i:29 XM:i:0  XO:i:0  XT:A:M

Following Picard solution:

java -XX:MaxDirectMemorySize=4G -jar picard-tools-1.85/AddOrReplaceReadGroups.jar I= test.bam O= test.header.bam SORT_ORDER=coordinate RGID=test RGLB=test  RGPL=Illumina RGSM=test/ RGPU=HWI-ST303  RGCN=japan CREATE_INDEX=True 

I get this error after 2 min.:

Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 12247781, Read name HWI-ST303_0093:5:26:10129:50409#0, MAPQ should be 0 for unmapped read.`

Any recommendation on how to solve this issue ?

My plan is to do the following to resolve the issue:

picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=LENIANT
samtools  index test_markduplicate.bam

I see a lot of messages like below but the command still running:

Ignoring SAM validation error: ERROR: Record (number), Read name HWI-ST303_0093:5:5:13416:34802#0, RG ID on SAMRecord not found in header: test1

while running the command

then try the GATK RealignerTargetCreator

I already tried to do the following

java -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I  merged_bam_files_indexed_markduplicate.bam -o reads.intervals --validation_strictness LENIENT

But I still got the same error

N.B: the same command run with no issue with GATK version (1.2)

My pipeline in short: mapping the paired end reads with

bwa aln -q 20 ref.fa read > files.sai
bwa sampe ref.fa file1.sai file2.sai read1 read2 > test1.sam
samtools view -bS test1.sam | samtools sort - test
samtools  index test1.bam
samtools merge -rh RG.txt test test1.bam test2.bam

RG.txt

@RG     ID:test1      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test   CN:japan
@RG     ID:test2      PL:Illumina     PU:HWI-ST303    LB:test     PI:75   SM:test    CN:japan

samtools  index test.bam
picard/MarkDuplicates.jar I=test.bam O=test_markduplicate.bam M=test.matrix AS=true VALIDATION_STRINGENCY=SILENT
samtools  index test_markduplicate.bam
Comments (1)

I've the following queries on running RealignerTargetCreator module in GATK1.4.

1) Is it recommended to provide the target capture BED file to RealignerTargetCreator in case of targeted/exome experiments? Without the bed file, the tool is taking long time (~6-7 hrs). What's the optimal way here?

2) Does running mark duplicates before or after 'RealignerTargetCreator' have any effect on the # of snps/indels? What is recommended?

Look forward to your comments. Raj

Comments (17)

I can't seem to run the IndelRealigner on reads that contain colons, ":" in the reference scaffold names. The RealignerTargetCreator step works correctly and generates the interval table, but the second, IndelRealigner, step fails. When I look at the generated interval table, I see the interval delimiter is a colon, which I imagine is the problem.

Unfortunately, I have a set of human references that have a colon in every scaffold name, so changing this would be a massive undertaking.

I believe this problem could be solved if you searched for the colon delimiter from the end of the interval string instead of from the beginning, so I'm hoping this a real simple fix.

Thanks!

Comments (1)

HI

I am using the following set of commands on GATK2.1.13 to generate a VCF file

echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T RealignerTargetCreator  -o my.intervals -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Realignment Done at `date`"
echo "Starting IndelRealigner at `date`"

echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -I B2_with_ReadGroup.ddup.sorted.bam -R human_g1k_v37.fasta -T IndelRealigner -targetIntervals my.intervals -o myrealignedBam.bam  -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Realignment done at `date`"
echo "Starting UnifiedGenotyper at `date`"
echo `java -Xmx20g -jar /usr/bin/GenomeAnalysisTK.jar -l INFO -R human_g1k_v37.fasta -T UnifiedGenotyper    -I myrealignedBam.bam    -o mygatk_vcf.vcf    --output_mode EMIT_ALL_SITES -et NO_ET -K /root/sandbox/saket.kumar_iitb.ac.in.key`
echo "Gentoypxing complete at `date`"

When i do a 'mpileup' for B2_with_ReadGroup.ddup.sorted.bam , I get a devcent 10 MB VCF file. But on the last ste of the above pipeline, my " mygatk_vcf.vcf " is goinging into 81GBs !!

Do you know what is wrong ?

Comments (20)

java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam -T RealignerTargetCreator -o exampleTarget.intervals

I used the example files, and I got no error report, but in the output, there is no data, that is to say. The exampleTarget.intervals is empty.

Why is this happening?

And when I add --known dbsnp_135.hg19.vcf( that is in the bundle), it will get an error

ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR contig known = chr1 / 249250621
ERROR contig reference = chr1 / 100000.
Comments (1)

Hello,

before I only used BWA and as you described in the best pratice I performed the realign step. Now I want to integrate in my pipeline Stampy associated with BWA.

Do you think, I should make the realign step ?

Thanks !

Comments (4)

Hi all,

We're doing some analysis on quite big data and time is an issue, so I did a bit of scaling testing on a subset of the data before beginning. The results were unexpected.

When I run GATK RealignerTargetCreator with -nt 8 and give it 8 cores to work with, it actually takes about 2.5 times LONGER than if I just run it single-threaded. I don't mean that the user or CPU time goes up - the real, walltime goes up. In the -nt 8 case, the 8 cores would have been on a single node of our cluster with shared memory.

I tried testing on two different kinds of subsets of the data and both performed worse when multithreaded. I first tried restricting the input data by genomic region, ie just analysing chr22. When multithreading didn't seem to be working as expected in this test, I thought that maybe GATK was trying to parallelise over genomic regions, so I instead tried testing on a single lane of input data (a 9.6G bam file spread over the whole genome). This also ran more slowly when multithreaded.

So my question is: should I use -nt 8 in my real analysis even though it was a bad option in testing? Is it possible that multithreading will be bad for small amounts of data, but good in the large-data case? Or, does this indicate that I'm doing something wrong when trying to run RealignerTargetCreator multithreaded?

I really would like to use the fastest option for the real data as it will be very big. Any help much appreciated.

Thanks, Clare

Comments (1)

Hi all - I'm using GATK realigner which can take several hours on my samples. I'm trying to optimize my pipeline by dividing this up by chromosome for each node in my cluster. I can call RealignerTargetCreator using the -L parameter for each chromosome which results in a bunch of interval files. Now, I either want to call IndelRealigner using the -L parameter for each chromsome then merge the resulting BAM files, or merge the interval files into one then call IndelRealigner.

1) I don't see a way to merge interval files using GATK. Is this possible?

or

2) Can I call IndelRealigner and process each chromosome separately then merge the resulting BAM files together?