Tagged with #splitncigarreads
0 documentation articles | 1 announcement | 8 forum discussions

No articles to display.

Created 2014-03-04 06:37:01 | Updated 2014-03-17 22:48:07 | Tags: readbackedphasing reducereads release-notes genotypegvcfs combinegvcfs splitncigarreads

Comments (2)

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.


  • This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

Haplotype Caller

  • Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
  • Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
  • Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
  • Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
  • Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.


  • This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at all at once to GenotypeGVCFs.


  • This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.


  • This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

Unified Genotyper

  • Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

Variant Recalibrator

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Reduce Reads

  • Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

Variant Annotator

  • Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
  • Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
  • In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

Combine Variants

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Select Variants

  • Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
  • Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.


  • Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
  • It no longer strips out the AD annotation from genotypes.
  • AC/AF/AN counts are updated after fixing genotypes.
  • Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

Indel Realigner

  • Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

Read Backed Phasing

  • Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.


  • Thanks to Phillip Dexheimer for several Queue related fixes and patches.
  • Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
  • Providing an empty BAM list no results in a user error.
  • Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
  • Logger output now goes to STDERR instead of STDOUT.
  • Picard, Tribble, and Variant jars updated to version 1.107.1683.

Created 2016-04-06 21:06:40 | Updated | Tags: unifiedgenotyper haplotypecaller rnaseq splitncigarreads

Comments (1)

Hi, I work on plant species. I am using GATK on variant discovery in RNAseq data. I am not able to decide whether I should use option --filter_reads_with_N_cigar or
-U ALLOW_N_CIGAR_READS. What do you suggest?

I Would like to count the number of reads affected by N's in the CIGAR? Could you please suggest any tool?

Secondly, I am using Haplotype Caller for variant discovery. It is running very slow (on 12 CPU).
Is it okay to use UnifiedGenotyper instead of Haplotype Caller on RNAseq data?


Created 2016-02-23 22:55:06 | Updated 2016-02-23 22:57:24 | Tags: splitncigarreads mq mutect2 gatk3-5 star

Comments (1)

I am calling variants in RNA-seq data. From the guide posted here, I know that Split'N'Trim is part of the workflow which has ALLOW_N_CIGAR_READS.

I am wondering if I simply use Mutect2 --filter_reads_with_N, would I be losing out on anything during downstream analysis?

I do not have to reassign MQ to unique mappers as I have set it to 60 (instead of 255) in STAR. By chance, does TopHat2 preserve MQ of each read bases? Ideally I'd like to filter variants using MQ down the road.

Created 2015-09-18 06:11:03 | Updated | Tags: splitncigarreads

Comments (3)

Hi, GATK team: I got in trouble while using SplitNCigarReads in GATK.

INFO 12:57:51,153 ProgressMeter - chrM:16557 5.8048662E7 88.2 m 91.0 s 98.7% 89.3 m 70.0 s INFO 12:58:21,164 ProgressMeter - chrM:16557 5.8352949E7 88.7 m 91.0 s 98.7% 89.8 m 71.0 s INFO 12:58:51,305 ProgressMeter - chrM:16557 5.9379541E7 89.2 m 90.0 s 98.7% 90.4 m 71.0 s INFO 12:59:21,317 ProgressMeter - chrM:16557 6.0129818E7 89.7 m 89.0 s 98.7% 90.9 m 72.0 s INFO 12:59:51,328 ProgressMeter - chrM:16557 6.0608243E7 90.2 m 89.0 s 98.7% 91.4 m 72.0 s INFO 13:00:21,338 ProgressMeter - chrM:16557 6.0883019E7 90.7 m 89.0 s 98.7% 91.9 m 72.0 s INFO 13:05:07,709 ProgressMeter - chrM:16557 6.0883019E7 95.4 m 94.0 s 98.7% 96.7 m 76.0 s INFO 13:05:46,225 ProgressMeter - chrM:16557 6.1509252E7 95.9 m 93.0 s 98.7% 97.2 m 77.0 s INFO 13:06:16,236 ProgressMeter - chrM:16557 6.200168E7 96.6 m 93.0 s 98.7% 97.9 m 77.0 s INFO 13:06:46,246 ProgressMeter - chrM:16557 6.200168E7 97.1 m 93.0 s 98.7% 98.4 m 78.0 s INFO 13:07:54,627 ProgressMeter - chrM:16557 6.200168E7 98.2 m 95.0 s 98.7% 99.5 m 78.0 s INFO 13:10:39,906 ProgressMeter - chrM:16557 6.200168E7 101.0 m 97.0 s 98.7% 102.3 m 81.0 s INFO 13:11:10,129 ProgressMeter - chrM:16557 6.200168E7 101.5 m 98.0 s 98.7% 102.8 m 81.0 s INFO 13:11:40,288 ProgressMeter - chrM:16557 6.200168E7 102.0 m 98.0 s 98.7% 103.3 m 81.0 s INFO 13:13:13,893 ProgressMeter - chrM:16557 6.200168E7 103.5 m 100.0 s 98.7% 104.9 m 83.0 s

the logs show that it seems stopped at 13:13:13 ,after a long time (7days~~~), still no update. I' ve search for this , it seems that there are a lot of reads map to this area(chrM:16557 nearby). how can i solve this?

sincerely, Xin

Created 2015-07-21 13:25:52 | Updated | Tags: error splitncigarreads code-exception

Comments (11)

I am following the documentation of "Best Practices for Variant Discovery in RNAseq data" and I am at the SplitNCigarReads -step. All of the previous steps has worked out fine. (NOTE:I did the mapping of my reads with TopHat instead of STAR).

When I run the following: java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R reference.fa -I reordered_acc_reference.bam -o split.bam -fixMisencodedQuals -U ALLOW_N_CIGAR_READS

I get the following error message:

ERROR stack trace

java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:330) at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:195) at org.broadinstitute.gatk.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:329) at org.broadinstitute.gatk.tools.walkers.rnaseq.OverhangFixingManager$Splice.initialize(OverhangFixingManager.java:365) at org.broadinstitute.gatk.tools.walkers.rnaseq.OverhangFixingManager.addSplicePosition(OverhangFixingManager.java:171) at org.broadinstitute.gatk.tools.walkers.rnaseq.SplitNCigarReads.splitReadBasedOnCigar(SplitNCigarReads.java:280) at org.broadinstitute.gatk.tools.walkers.rnaseq.SplitNCigarReads.splitNCigarRead(SplitNCigarReads.java:233) at org.broadinstitute.gatk.tools.walkers.rnaseq.SplitNCigarReads.reduce(SplitNCigarReads.java:210) at org.broadinstitute.gatk.tools.walkers.rnaseq.SplitNCigarReads.reduce(SplitNCigarReads.java:118) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:251) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:240) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:279) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102) at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Code exception (see stack trace for error itself)

Can anyone help me understand what is wrong?

Created 2014-08-28 12:01:56 | Updated 2014-08-28 12:15:37 | Tags: splitncigarreads

Comments (5)


I am using RNA-seq data for variant calling. I am stuck at the step of "SplitNCigarReads".

The command that, I am using is:

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R hg19.fa -I Mydata.sort.MarkDup.RG.bam -o Mydata.sort.MarkDup.RG.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

Now, the error that I get is:

INFO  13:52:00,688 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:52:00,693 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03 
INFO  13:52:00,694 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  13:52:00,694 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  13:52:00,705 HelpFormatter - Program Args: -T SplitNCigarReads -R hg19.fa -I Mydata.sort.MarkDup.RG.bam -o Mydata.sort.MarkDup.RG.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS 
INFO  13:52:00,716 HelpFormatter - Executing as abhishek@mb06 on Linux 3.12.21-gentoo-r1 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_55-b13. 
INFO  13:52:00,719 HelpFormatter - Date/Time: 2014/08/28 13:52:00 
INFO  13:52:00,723 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:52:00,726 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  13:52:03,508 GenomeAnalysisEngine - Strictness is SILENT 
INFO  13:52:03,794 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  13:52:03,817 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  13:52:03,988 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.16 
INFO  13:52:04,627 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  13:52:04,666 GenomeAnalysisEngine - Done preparing for traversal 
INFO  13:52:04,678 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  13:52:04,683 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime 
INFO  13:52:05,021 ReadShardBalancer$1 - Loading BAM index data 
INFO  13:52:05,032 ReadShardBalancer$1 - Done loading BAM index data 
INFO  13:52:16,601 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.2-2-gec30cee): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Badly formed genome loc: Parameters to GenomeLocParser are incorrect:The stop position 0 is less than start 4 in contig chrM
##### ERROR ------------------------------------------------------------------------------------------

Kindly help in fixing this error.

Thank you

Created 2014-08-22 21:55:59 | Updated | Tags: splitncigarreads

Comments (3)

Dear GATK developers,

Thanks for this nice and practical tool with great supports! I tried to find relevant information in forum for this error reported by splitNCigarReads walker but couldn't find any information. So I hope I am not bulking the forum. Here is the error

MESSAGE: Badly formed genome loc: Parameters to GenomeLocParser are incorrect:The stop position 69 is less than start 70 in contig JH375787.1

The same reference has been used as in previous steps for RG and MarkDup. I get this error for some of my samples not all.

It's the command if it can help: java -jar /sw/apps/bioinfo/GATK/3.1.1//GenomeAnalysisTK.jar -T SplitNCigarReads -R /home/nimar/b2013097/private/nobackup/data/Reference/Reference.fa -I Sample_50T.merged.Reordered.sort.RG.MarkDup-filtered-unmapped.bam -U ALLOW_N_CIGAR_READS -o Sample_50T.merged.Reordered.sort.RG.MarkDup-filtered-unmapped.SplitN.bam Looking forward to your comments

Created 2014-07-03 20:24:55 | Updated | Tags: splitncigarreads

Comments (9)

I am working on RNAseq data. I expect SplitNCigarReads can handel spliced reads, but it complains about N in CIGAR. I just wonder whether SplitNCigarReads is ready to handle spliced reads or still under construction. I am using version 3.1-1-g07a4bf8.

The command line is "java -jar $GATKjar -T SplitNCigarReads -I a.bam -R $REFfa -o a.splitN.bam".

The error message is as follows: MESSAGE: Unsupported CIGAR operator N in read DHT4KXP1:231:C4DW4ACXX:3:1301:3574:77410 at chrM:673. Perhaps you are trying to use RNA-Seq data? While we are currently actively working to support this data type unfortunately the GATK cannot be used with this data in its current form. You have the option of either filtering out all reads with operator N in their CIGAR string (please add --filter_reads_with_N_cigar to your command line) or assume the risk of processing those reads as they are including the pertinent unsafe flag (please add -U ALLOW_N_CIGAR_READS to your command line).

According to the error message, it seems SplitNCigarReads still cannot handle spliced reads. Or, I missed something? Thanks!

Created 2014-06-27 13:41:25 | Updated 2014-06-27 13:47:40 | Tags: fixmisencodedquals splitncigarreads splitncigarreads-error allowpotentiallymisencodedquals

Comments (2)

Just as a check I'm posting this.


Got the following error with gatk3.1-1 SplitNCigarReads:

ERROR MESSAGE: SAM/BAM file SAMFileReader{samplePE.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 66; please see the GATK --help documentation for options related to this error


I first looked around the GATK --help/forum and found two options:

-fixMisencodedQuals, --fix_misencoded_quality_scores Fix mis-encoded base quality scores
-allowPotentiallyMisencodedQuals, --allow_potentially_misencoded_quality_scores Ignore warnings about base quality score encoding

in short: The fixMisencodedQuals one corrects old Phred+64 scores from before Illumina 1.8+ also see fastq wiki entry.

The allowPotentiallyMisencodedQuals ignores these misencoded quals, letting your program run.

So i checked the Quality score distribution:

java -jar picard-tools-1.102/QualityScoreDistribution.jar I=samplePE.bam CHART=qsd.pdf O=qsd.log

and got:

33 80506
36 814
37 1146
38 7253
39 6131
40 3456
... ...
70 181897
71 198819
72 264951

This is valid Illumina 1.8+ so i used allowPotentiallyMisencodedQuals.


Check the quality score distribution with picard-tools-1.102/QualityScoreDistribution.jar and compare with fastq wiki entry, then decide what to use.

Remaining Questions

Is this correct? And why give the errors on qual (>64) >=66 <=74, this looks like a bit too much validation?