ReduceReads

Reduces the BAM file using read-based compression that keeps only the information essential for variant calling.

Category: Sequence Data Processing Tools

Traversal: ReadWalker

PartitionBy: CONTIG


Overview

This walker generates reduced versions of BAM files that still follow the BAM specification and contain all the information necessary for the GSA variant calling pipeline. Several options let you tune how much compression is achieved. The default values have been shown to reduce a typical whole exome BAM file 100x. The higher the coverage, the greater the savings in file size and the greater the performance gain for downstream tools.

Input

The BAM file to be compressed.

Output

The compressed (reduced) BAM file.

Examples

 java -Xmx4g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T ReduceReads \
   -I myData.bam \
   -o myData.reduced.bam
 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by ReduceReads.

Downsampling settings

This tool overrides the engine's default downsampling settings.

  • Mode: BY_SAMPLE
  • To coverage: 40

Command-line Arguments

Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

ReduceReads specific arguments

This table summarizes the command-line arguments that are specific to this tool. For details, see the argument details below the table.

Name                            Type                              Default value  Summary
Optional
--context_size                  int                               10
--dont_compress_read_names      boolean                           false
--dont_hardclip_low_qual_tails  boolean                           false
--dont_simplify_reads           boolean                           false
--dont_use_softclipped_bases    boolean                           false
--downsample_coverage           int                               250
--hard_clip_to_interval         boolean                           false
-known                          List[RodBinding[VariantContext]]  []             Input VCF file(s) with known SNPs
-mindel                         double                            0.05
--minimum_mapping_quality       int                               20
--minimum_tail_qualities        byte                              2
-minqual                        byte                              15
-noclip_ad                      boolean                           false
--out                           StingSAMFileWriter                NA             An output file created by the walker. Will overwrite contents if file exists
Advanced
-min_pvalue                     double                            0.01
-minvar                         double                            0.05

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

--context_size / -cs ( int with default value 10 )

The number of bases to keep around mismatches (potential variation).
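
As a rough illustration of what a context size means, the sketch below keeps `context_size` bases on each side of every mismatch position and merges windows that overlap or touch. The helper name `context_windows`, the symmetric window, the 1-based inclusive coordinates and the merging step are all assumptions made for illustration; none of them is taken from the ReduceReads source.

```python
def context_windows(mismatch_positions, context_size=10):
    """Return merged (start, end) windows, inclusive, around mismatch positions.

    Hypothetical helper: assumes a symmetric window and 1-based coordinates.
    """
    windows = []
    for pos in sorted(mismatch_positions):
        start = max(1, pos - context_size)
        end = pos + context_size
        if windows and start <= windows[-1][1] + 1:
            # Overlaps or touches the previous window: extend it.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


print(context_windows([100, 115, 200]))
```

With the default of 10, mismatches at 100 and 115 fall into one merged window, while the mismatch at 200 gets a window of its own.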

--dont_compress_read_names / -nocmp_names ( boolean with default value false )

Do not compress read names. By default, ReduceReads compresses read names to numbers while guaranteeing uniqueness; reads with similar names will still have similar compressed names. Note: if you scatter/gather, there is no guarantee that read name uniqueness will be maintained, so in that case we recommend not compressing.
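
The scatter/gather caveat can be pictured with a toy name compressor. This sketch assumes names are replaced by integers in order of first appearance; the class `NameCompressor` is hypothetical and the scheme ReduceReads actually uses may differ.

```python
class NameCompressor:
    """Toy model: map each distinct read name to the next unused integer."""

    def __init__(self):
        self._ids = {}

    def compress(self, read_name):
        # Reads sharing a name (for example, mates) get the same number.
        if read_name not in self._ids:
            self._ids[read_name] = len(self._ids) + 1
        return str(self._ids[read_name])


# Single process: distinct names stay distinct.
single = NameCompressor()
assert single.compress("readA") != single.compress("readB")

# Scatter/gather: every shard keeps its own counter, so two different
# reads can receive the same compressed name once the shards are merged.
shard1, shard2 = NameCompressor(), NameCompressor()
print(shard1.compress("readA"), shard2.compress("readB"))
```

Both shards hand out the same first number, which is the kind of collision that not compressing read names avoids.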

--dont_hardclip_low_qual_tails / -noclip_tail ( boolean with default value false )

Do not hard clip the low quality tails of the reads. This option overrides the minimum tail quality argument (--minimum_tail_qualities).

--dont_simplify_reads / -nosimplify ( boolean with default value false )

Do not simplify reads. Simplification strips away all extra information from a read, that is, anything other than bases, quals and read group.

--dont_use_softclipped_bases / -no_soft ( boolean with default value false )

Do not use high quality soft-clipped bases. By default, ReduceReads will hard clip away any low quality soft-clipped bases left by the aligner and use the high quality soft-clipped bases in its traversal algorithm to identify variant regions. The minimum quality for soft-clipped bases is the same as the minimum base quality to consider (-minqual).

--downsample_coverage / -ds ( int with default value 250 )

The number of reads emitted per sample in a variant region can be downsampled for better compression. This level of downsampling happens only after the region has been evaluated, so it can be combined with engine-level downsampling. A value of 0 turns downsampling off.
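
A minimal sketch of per-sample downsampling within one variant region is given here. It assumes uniform random sampling without replacement; the function name, the `seed` parameter and the sample labels are hypothetical, and the tool's actual selection strategy is not described on this page.

```python
import random


def downsample_region(reads_by_sample, downsample_coverage=250, seed=0):
    """Keep at most `downsample_coverage` reads per sample; 0 disables downsampling."""
    rng = random.Random(seed)
    kept = {}
    for sample, reads in reads_by_sample.items():
        if downsample_coverage == 0 or len(reads) <= downsample_coverage:
            kept[sample] = list(reads)
        else:
            # Assumption: reads are chosen uniformly at random.
            kept[sample] = rng.sample(reads, downsample_coverage)
    return kept


region = {
    "sampleA": [f"read{i}" for i in range(1000)],
    "sampleB": ["r1", "r2"],
}
kept = downsample_region(region)
print({sample: len(reads) for sample, reads in kept.items()})
```

The limit is applied to each sample independently, so a low-coverage sample is left untouched while a deep one is cut down to the limit.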

--hard_clip_to_interval / -clip_int ( boolean with default value false )

Optionally hard clip all incoming reads to the desired intervals. The hard clips will happen exactly at the interval border.
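
Clipping exactly at the interval border amounts to intersecting the read's aligned span with the interval. The sketch below assumes 1-based, inclusive coordinates and a hypothetical helper name; it only computes the surviving span and does not model CIGAR or base/quality updates.

```python
def clip_to_interval(read_start, read_end, interval_start, interval_end):
    """Return the (start, end) span left after clipping, or None if nothing remains.

    Coordinates are assumed to be 1-based and inclusive.
    """
    start = max(read_start, interval_start)
    end = min(read_end, interval_end)
    return (start, end) if start <= end else None


print(clip_to_interval(95, 194, 100, 150))
print(clip_to_interval(10, 20, 100, 150))
```

A read that hangs over both borders is trimmed to the interval itself, and a read that lies entirely outside the interval has nothing left.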

-known / --known_sites_for_polyploid_reduction ( List[RodBinding[VariantContext]] with default value [] )

Input VCF file(s) with known SNPs. Any number of VCF files representing known SNPs to be used for the polyploid-based reduction. These could be, for example, dbSNP and/or official 1000 Genomes SNP calls. Non-SNP variants in these files will be ignored. If provided, the polyploid ("het") compression will work only when a single SNP from the known set is present in a consensus window (otherwise there will be no reduction); if not provided, polyploid compression will be triggered anywhere a single SNP is present in a consensus window. -known binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3.

-min_pvalue / --minimum_alt_pvalue_to_trigger_variant ( double with default value 0.01 )

Minimum p-value from binomial distribution of mismatches in a site to trigger a variant region. Any site with a value falling below this will be considered consensus and reduced (otherwise we will try to trigger polyploid compression). Note that this value is used only in regions with low coverage.
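
One way to read this threshold is as a one-sided binomial test on the number of mismatching reads at a site. The sketch below assumes the null hypothesis is a heterozygous site (each read mismatches with probability 0.5) and takes the p-value as the probability of seeing at most the observed number of mismatches. That choice of null, and both function names, are assumptions for illustration; the page does not specify how the p-value is computed.

```python
from math import comb


def binomial_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def triggers_variant_region(mismatches, coverage, min_pvalue=0.01):
    """True if the site's p-value is at or above the threshold.

    Assumed null hypothesis: mismatching reads occur with probability 0.5.
    """
    return binomial_cdf(mismatches, coverage) >= min_pvalue


print(triggers_variant_region(1, 20))  # 1 mismatch among 20 reads
print(triggers_variant_region(6, 20))  # 6 mismatches among 20 reads
```

Under these assumptions, a single mismatch in 20 reads gives a p-value far below 0.01 and the site is treated as consensus, while 6 mismatches in 20 reads give a p-value above 0.01 and would trigger a variant region.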

-mindel / --minimum_del_proportion_to_trigger_variant ( double with default value 0.05 )

Minimum proportion of indels in a site to trigger a variant region. Anything below this will be considered consensus.

--minimum_mapping_quality / -minmap ( int with default value 20 )

The minimum mapping quality to be considered for the consensus synthetic read. Reads that have mapping quality below this threshold will not be counted towards consensus, but are still counted towards variable regions.

--minimum_tail_qualities / -mintail ( byte with default value 2 )

Reads have notoriously low quality bases on the tails (left and right). Consecutive bases at the tails with quality at or lower than this threshold will be hard clipped off before entering the reduce reads algorithm.
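
The tail rule can be sketched as a scan inward from each end of the read that stops at the first base above the threshold. The function name and the list-of-integers representation of qualities are assumptions for illustration.

```python
def clip_low_quality_tails(quals, min_tail_quality=2):
    """Return the (start, end) slice bounds left after clipping both tails.

    Consecutive bases from either end with quality <= min_tail_quality are dropped.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] <= min_tail_quality:
        start += 1
    while end > start and quals[end - 1] <= min_tail_quality:
        end -= 1
    return start, end


quals = [2, 2, 30, 30, 2, 30, 1, 2]
start, end = clip_low_quality_tails(quals)
print(quals[start:end])
```

Only the runs at the two ends are removed; the low quality base in the middle of the read is kept because it is not part of a tail.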

-minqual / --minimum_base_quality_to_consider ( byte with default value 15 )

The minimum base quality to be considered for the consensus synthetic read. Bases with quality below this threshold will not be counted towards consensus, but are still counted towards variable regions.
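
To show how the base quality and mapping quality thresholds interact, the sketch below picks a consensus base at a single site from the observations that pass both filters. The majority vote, the tuple layout and the function name are assumptions; this page does not describe how the consensus synthetic read is actually built, and the sketch does not model the separate counting towards variable regions.

```python
from collections import Counter


def consensus_base(pileup, min_mapping_quality=20, min_base_quality=15):
    """Majority base among observations that pass both quality thresholds.

    `pileup` is a list of (base, base_quality, mapping_quality) tuples.
    Returns None when no observation passes.
    """
    counts = Counter(
        base
        for base, base_qual, map_qual in pileup
        if base_qual >= min_base_quality and map_qual >= min_mapping_quality
    )
    return counts.most_common(1)[0][0] if counts else None


pileup = [("A", 30, 60), ("A", 35, 60), ("C", 10, 60), ("C", 30, 5), ("C", 8, 3)]
print(consensus_base(pileup))
```

Although "C" is observed three times, every "C" fails at least one threshold, so only the two "A" observations count towards the consensus.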

-minvar / --minimum_alt_proportion_to_trigger_variant ( double with default value 0.05 )

Minimum proportion of mismatches in a site to trigger a variant region. Anything below this will be considered consensus and reduced (otherwise we will try to trigger polyploid compression). Note that this value is used only in regions with high coverage.
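
The proportion test can be sketched as a simple ratio check per site. The function name is hypothetical, and the coverage cutoff that separates "high coverage" from "low coverage" is not stated on this page, so it is left out of the sketch.

```python
def exceeds_alt_proportion(mismatches, coverage, min_alt_proportion=0.05):
    """True if the mismatch fraction at a site reaches the threshold."""
    if coverage == 0:
        return False
    return mismatches / coverage >= min_alt_proportion


print(exceeds_alt_proportion(3, 100))   # 3% of reads mismatch
print(exceeds_alt_proportion(10, 100))  # 10% of reads mismatch
```

With the default of 0.05, a site where 3% of the reads mismatch is treated as consensus, while a site where 10% mismatch is not.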

-noclip_ad / --dont_hardclip_adaptor_sequences ( boolean with default value false )

Do not hard clip adaptor sequences. Note: You don't have to turn this on for reads that are not mate paired. The program will behave correctly in those cases.

--out / -o ( StingSAMFileWriter )

An output file created by the walker. Will overwrite contents if file exists.



GATK version 2.5-2-gdb4546e built at 2013/05/01 09:32:36.