ClipReads

This tool provides simple, powerful read clipping capabilities to remove low quality strings of bases, sections of reads, and reads containing user-provided sequences.

Category BAM Processing and Analysis Tools


Introduction

It allows the user to clip bases in reads with poor quality scores, that match particular sequences, or that were generated by particular machine cycles.

Quality score based clipping
Clips bases from the read from
$$\operatorname{argmax}_x \left\{ \sum_{i = x + 1}^{l} \left( \text{qTrimmingThreshold} - \text{qual}_i \right) \right\}$$
to the end of the read, where l is the read length and qual_i is the quality score of the base at machine cycle i. This approach is borrowed directly from BWA. Walk through the read from the end (in machine cycle order) to the beginning, calculating the running sum of qTrimmingThreshold - qual. While we do this, we track the maximum value of this sum where the delta > 0. After the loop, clipPoint is either -1 (don't do anything) or the clipping index in the read (from the end).
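The walk described above can be sketched in a few lines of Python. This is an illustration only, not the ClipReads source: the function name `quality_clip_point` is hypothetical, qualities are assumed to be Phred+33 encoded, and ties between equal sums are resolved in favor of the shorter clip, which the text does not specify.

```python
def quality_clip_point(quals, q_trimming_threshold):
    """Return the index (in machine cycle order) where clipping starts, or -1.

    Bases from the returned index to the end of the read would be clipped.
    Hypothetical helper for illustration; not part of GATK.
    """
    running_sum = 0
    best_sum = 0      # only sums greater than zero can set a clip point
    clip_point = -1   # -1 means "don't clip anything"
    # Walk from the last machine cycle back to the first.
    for i in range(len(quals) - 1, -1, -1):
        running_sum += q_trimming_threshold - quals[i]
        if running_sum > best_sum:
            best_sum = running_sum
            clip_point = i
    return clip_point


# Qualities of the read from the "Example clipping" section, as stored in the
# BAM record (Phred+33 encoding is an assumption).
stored = "#################4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA"
# Flag 16 marks a reverse-strand read, so machine cycle order is the reverse
# of the stored order.
cycle_quals = [ord(c) - 33 for c in reversed(stored)]
clip_point = quality_clip_point(cycle_quals, 10)
clipped = len(cycle_quals) - clip_point if clip_point >= 0 else 0
print(clipped)  # 17
```

With a threshold of 10, the sketch selects the 17 Q2 bases at the end of the read in machine cycle order, which are the first 17 bases of the stored reverse-strand record.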
Cycle based clipping
Clips machine cycles from the read. Accepts a string of ranges of the form start1-end1,start2-end2, etc. For each start/end pair, removes bases in machine cycles from start to end, inclusive. These are 1-based values (positions). For example, 1-5,10-12 clips the first 5 bases, and then three bases at cycles 10, 11, and 12.
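As a rough illustration of the range format, the following Python sketch turns a range string into the set of cycles it covers. The helper names `parse_cycles` and `cycles_to_clip` are hypothetical, and truncating ranges at the read length is an assumption rather than documented ClipReads behavior.

```python
def parse_cycles(spec):
    """Parse 'start1-end1,start2-end2' into 1-based inclusive (start, end) pairs."""
    ranges = []
    for part in spec.split(","):
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    return ranges


def cycles_to_clip(spec, read_length):
    """Return the sorted 0-based cycle indices covered by the ranges in spec."""
    clipped = set()
    for start, end in parse_cycles(spec):
        # Ranges are inclusive; cycles past the end of the read are ignored
        # (an assumption made for this sketch).
        for cycle in range(start, min(end, read_length) + 1):
            clipped.add(cycle - 1)
    return sorted(clipped)


print(cycles_to_clip("1-5,10-12", 76))  # [0, 1, 2, 3, 4, 9, 10, 11]
```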
Sequence matching
Clips bases from the reads that exactly match one of a number of base sequences. This employs an exact match algorithm, clipping only bases whose sequence exactly matches SEQ.
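A minimal sketch of exact matching, assuming a plain substring search over the read bases as written. The function name `find_exact_matches` is hypothetical, and whether ClipReads also considers reverse complements or how it treats overlapping hits is not stated here, so the sketch does neither anything special nor anything authoritative.

```python
def find_exact_matches(read_bases, sequences):
    """Return sorted half-open (start, end) intervals of exact matches."""
    hits = []
    for seq in sequences:
        start = read_bases.find(seq)
        while start != -1:
            hits.append((start, start + len(seq)))
            # Continue one base further so overlapping matches are reported too.
            start = read_bases.find(seq, start + 1)
    return sorted(hits)


print(find_exact_matches("AACCCCCGT", ["CCCCC"]))  # [(2, 7)]
```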

Input

Any number of BAM files.

Output

A new BAM file containing all of the reads from the input BAMs with the user-specified clipping operation applied to each read.

Summary output

     Number of examined reads              13
     Number of clipped reads               13
     Percent of clipped reads              100.00
     Number of examined bases              988
     Number of clipped bases               126
     Percent of clipped bases              12.75
     Number of quality-score clipped bases 126
     Number of range clipped bases         0
     Number of sequence clipped bases      0
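The percentages in this summary follow directly from the counts. A small Python check, using a hypothetical `percent` helper and assuming two-decimal formatting:

```python
def percent(part, whole):
    """Format part/whole as a percentage with two decimals."""
    if whole == 0:
        return "0.00"
    return "%.2f" % (100.0 * part / whole)


print(percent(13, 13))    # 100.00
print(percent(126, 988))  # 12.75
```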
     

Example clipping

Suppose we are given this read:
     314KGAAXX090507:1:19:1420:1123#0        16      chrM    3116    29      76M     *       *       *
          TAGGACCCGGGCCCCCCTCCCCAATCCTCCAACGCATATAGCGGCCGCGCCTTCCCCCGTAAATGATATCATCTCA
          #################4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA
     
If we are clipping reads with -QT 10 and -CR WRITE_NS, we get:
     314KGAAXX090507:1:19:1420:1123#0        16      chrM    3116    29      76M     *       *       *
          NNNNNNNNNNNNNNNNNTCCCCAATCCTCCAACGCATATAGCGGCCGCGCCTTCCCCCGTAAATGATATCATCTCA
          #################4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA
     
Whereas with -CR WRITE_Q0S:
     314KGAAXX090507:1:19:1420:1123#0        16      chrM    3116    29      76M     *       *       *
          TAGGACCCGGGCCCCCCTCCCCAATCCTCCAACGCATATAGCGGCCGCGCCTTCCCCCGTAAATGATATCATCTCA
          !!!!!!!!!!!!!!!!!4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA
     
Or -CR SOFTCLIP_BASES:
     314KGAAXX090507:1:19:1420:1123#0        16      chrM    3133    29      17S59M  *       *       *
          TAGGACCCGGGCCCCCCTCCCCAATCCTCCAACGCATATAGCGGCCGCGCCTTCCCCCGTAAATGATATCATCTCA
          #################4?6/?2135;;;'1/=/<'B9;12;68?A79@,@==@9?=AAA3;A@B;A?B54;?ABA
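The three records above differ only in which fields change. The following Python sketch mimics that for the simple case shown here: a read whose CIGAR is a single match block, clipped at the start of the stored record. The function name `clip_leading` is hypothetical, `!` is assumed to encode Q0 (Phred+33), and the code is an illustration rather than the ClipReads implementation.

```python
def clip_leading(bases, quals, pos, n, representation):
    """Clip the first n stored bases of a read whose CIGAR is a single M block.

    Returns (bases, quals, pos, cigar) after applying the representation.
    """
    length = len(bases)
    cigar = "%dM" % length
    if representation in ("WRITE_NS", "WRITE_NS_Q0S"):
        bases = "N" * n + bases[n:]
    if representation in ("WRITE_Q0S", "WRITE_NS_Q0S"):
        quals = "!" * n + quals[n:]
    if representation == "SOFTCLIP_BASES":
        # Soft clipping keeps bases and qualities but shifts the alignment start.
        cigar = "%dS%dM" % (n, length - n)
        pos += n
    return bases, quals, pos, cigar


print(clip_leading("ACGTACGT", "########", 3116, 3, "SOFTCLIP_BASES"))
# ('ACGTACGT', '########', 3119, '3S5M')
```

Applied with n = 17 to the example read, the same logic would give position 3133 and CIGAR 17S59M, as in the SOFTCLIP_BASES record.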
     

Examples

     -T ClipReads -I my.bam -I your.bam -o my_and_your.clipped.bam -R Homo_sapiens_assembly18.fasta \
     -XF seqsToClip.fasta -X CCCCC -CT "1-5,11-15" -QT 10
 

ClipReads specific arguments

Name                  Type                    Default value  Summary
Required
--out                 StingSAMFileWriter      stdout         Write BAM output here
Optional
--clipRepresentation  ClippingRepresentation  WRITE_NS       How should we actually clip the bases?
--clipSequence        String[]                NA             Remove sequences within reads matching this sequence
--clipSequencesFile   String                  NA             Remove sequences within reads matching the sequences in this FASTA file
--cyclesToTrim        String                  NA             String indicating machine cycles to clip from the reads
--outputStatistics    PrintStream             NA             Write output statistics to this file
--qTrimmingThreshold  int                     -1             If provided, the Q-score clipper will be applied

Additional capabilities

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals. This capability is available to all GATK walkers.

Argument details

--clipRepresentation / -CR ( ClippingRepresentation with default value WRITE_NS )

How should we actually clip the bases? The value of this argument determines how ClipReads applies clips to the reads. This can range from writing Ns over the clipped bases to hard clipping away the bases from the BAM.
The --clipRepresentation argument is an enumerated type (ClippingRepresentation), which can have one of the following values:

WRITE_NS
Clipped bases are changed to Ns
WRITE_Q0S
Clipped bases are changed to have Q0 quality score
WRITE_NS_Q0S
Clipped bases are changed to have both an N base and a Q0 quality score
SOFTCLIP_BASES
Change the read's cigar string to soft clip (S, see sam-spec) away the bases. Note that this can only be applied to cases where the clipped bases occur at the start or end of a read.
HARDCLIP_BASES
WARNING: THIS OPTION IS STILL UNDER DEVELOPMENT AND IS NOT SUPPORTED. Change the read's cigar string to hard clip (H, see sam-spec) away the bases. Hard clipping, unlike soft clipping, actually removes bases from the read, reducing the resulting file's size but introducing an irreversible (i.e., lossy) operation. Note that this can only be applied to cases where the clipped bases occur at the start or end of a read.
REVERT_SOFTCLIPPED_BASES
Turn all soft-clipped bases into matches

--clipSequence / -X ( String[] )

Remove sequences within reads matching this sequence. Clips bases from the reads matching the provided SEQ. Can be provided any number of times on the command line.

--clipSequencesFile / -XF ( String )

Remove sequences within reads matching the sequences in this FASTA file. Reads the sequences in the provided FASTA file, and clips any bases that exactly match any of the sequences in the file.

--cyclesToTrim / -CT ( String )

String indicating machine cycles to clip from the reads. Clips machine cycles from the read. Accepts a string of ranges of the form start1-end1,start2-end2, etc. For each start/end pair, removes bases in machine cycles from start to end, inclusive. These are 1-based values (positions). For example, 1-5,10-12 clips the first 5 bases, and then three bases at cycles 10, 11, and 12.

--out / -o ( StingSAMFileWriter with default value stdout )

Write BAM output here. The output SAM/BAM file will be written here.

--outputStatistics / -os ( PrintStream )

Write output statistics to this file. If provided, ClipReads will write summary statistics about the clipping operations applied to the reads to this file.

--qTrimmingThreshold / -QT ( int with default value -1 )

If provided, the Q-score clipper will be applied. If a value > 0 is provided, then the quality score based read clipper will be applied to the reads using this quality score threshold.



GATK version 2.3-9-ge5ebf34 built at 2013/01/11 22:47:55.