Reduces BAM files using read-based compression that keeps only the information essential for variant calling
This tool will generate reduced versions of the BAM files that still follow the BAM specification and contain all the information necessary to call variants according to the GATK Best Practices recommendations. Some options allow you to tune how much compression you want to achieve. The default values have been shown to reduce a typical whole exome BAM file by 100x. The higher the coverage, the greater the savings in file size and the better the performance of the downstream tools.
The BAM file to be compressed
The compressed (reduced) BAM file.
```
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -R ref.fasta \
    -T ReduceReads \
    -I myData.bam \
    -o myData.reduced.bam
```
These Read Filters are automatically applied to the data by the Engine before processing by ReduceReads.
This tool applies the following downsampling settings by default.
The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).
This table summarizes the command-line arguments that are specific to this tool. For details, see the list further down below the table.
| Argument | Type | Default | Summary |
| --- | --- | --- | --- |
| -known | List[RodBinding[VariantContext]] |  | Input VCF file(s) with known SNPs |
| --out | StingSAMFileWriter | NA | An output file created by the walker. Will overwrite contents if file exists |
| --context_size | int | 10 | The number of bases to keep around mismatches (potential variation) |
| --downsample_coverage | int | 250 | Downsample the number of reads emitted per sample in a variant region for better compression |
| -mindel | double | 0.05 | Minimum proportion of indels in a site to trigger a variant region |
| --minimum_mapping_quality | int | 20 | The minimum mapping quality to be considered for the consensus synthetic read |
| -minqual | byte | 15 | The minimum base quality to be considered for the consensus synthetic read |
| --cancer_mode | boolean | false | Enable multi-sample reduction for cancer analysis |
| --dont_compress_read_names | boolean | false | Do not compress read names |
| --dont_hardclip_low_qual_tails | boolean | false | Do not hard clip the low quality tails of the reads |
| --dont_simplify_reads | boolean | false | Do not simplify reads |
| --dont_use_softclipped_bases | boolean | false | Do not use high quality soft-clipped bases |
| --hard_clip_to_interval | boolean | false | Hard clip all incoming reads to the desired intervals |
| -noclip_ad | boolean | false | Do not hard clip adaptor sequences |
| -min_pvalue | double | 0.01 | Minimum p-value from binomial distribution of mismatches in a site to trigger a variant region |
| -minvar | double | 0.05 | Minimum proportion of mismatches in a site to trigger a variant region |
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
Enable multi-sample reduction for cancer analysis. Generally, this tool is not meant to be run for more than 1 sample at a time. The one valid exception brought to our attention by colleagues is the specific case of tumor/normal pairs in cancer analysis. To prevent users from unintentionally running the tool in a less than ideal manner, we require them to explicitly enable multi-sample analysis with this argument.
The number of bases to keep around mismatches (potential variation).
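To make the windowing idea concrete, here is a minimal sketch (not the tool's actual implementation) of retaining every base within `context_size` positions of a mismatch:

```python
def positions_to_keep(mismatch_positions, read_length, context_size=10):
    """Hypothetical sketch of the --context_size behavior: every base
    within context_size positions of a mismatch is kept at full
    fidelity; everything else is eligible for consensus compression."""
    keep = set()
    for pos in mismatch_positions:
        lo = max(0, pos - context_size)
        hi = min(read_length, pos + context_size + 1)
        keep.update(range(lo, hi))
    return keep

# A single mismatch at position 50 of a 100 bp read keeps positions 40..60.
kept = positions_to_keep([50], 100)
print(min(kept), max(kept), len(kept))  # 40 60 21
```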
Do not compress read names. By default, ReduceReads will compress read names to numbers and guarantee uniqueness, and reads with similar names will still have similar compressed names. Note: if you scatter/gather, there is no guarantee that read name uniqueness will be maintained -- in this case we recommend not compressing.
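One simple scheme with these properties (a hypothetical sketch, not ReduceReads' actual encoding) is to assign each distinct name a sequential integer, so uniqueness is preserved and reads sharing a name share a compressed name:

```python
def compress_read_names(names):
    """Hypothetical sketch of read name compression: each distinct name
    maps to a small sequential integer string, preserving uniqueness
    within a single pass over the file."""
    table = {}
    compressed = []
    for name in names:
        if name not in table:
            table[name] = str(len(table) + 1)
        compressed.append(table[name])
    return compressed

print(compress_read_names(["HWI-1:1:1", "HWI-1:1:2", "HWI-1:1:1"]))
# ['1', '2', '1']
```

A scheme like this is exactly why scatter/gather is problematic: each shard would restart its counter at 1, so compressed names can collide across shards.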
Do not hard clip the low quality tails of the reads. This option overrides the minimum tail quality argument.
Do not simplify reads. Simplification strips away all extra information from the read -- anything other than bases, quals and read group.
Do not use high quality soft-clipped bases. By default, ReduceReads will hard clip away any low quality soft-clipped bases left by the aligner and use the high quality soft-clipped bases in its traversal algorithm to identify variant regions. The minimum quality for soft-clipped bases is the same as the minimum base quality to consider (minqual).
Downsample the number of reads emitted per sample in a variant region for better compression. This level of downsampling only happens after the region has been evaluated, therefore it can be combined with the engine level downsampling. A value of 0 turns downsampling off.
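The post-evaluation cap can be sketched as follows (a hypothetical illustration of the `--downsample_coverage` semantics, not the tool's actual sampling code):

```python
import random

def downsample_region(reads, target=250, seed=0):
    """Hypothetical sketch of --downsample_coverage: applied only after
    a region has already been classified as variant, so it composes
    with any engine-level downsampling. A target of 0 disables it."""
    if target == 0 or len(reads) <= target:
        return list(reads)
    rng = random.Random(seed)  # seeded here only to keep the example reproducible
    return rng.sample(reads, target)

reads = [f"read{i}" for i in range(1000)]
print(len(downsample_region(reads)))      # 250
print(len(downsample_region(reads, 0)))   # 1000
```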
Hard clip all incoming reads to the desired intervals. The hard clips will happen exactly at the interval border.
Input VCF file(s) with known SNPs. Any number of VCF files representing known SNPs to be used for the polyploid-based reduction. Could be e.g. dbSNP and/or official 1000 Genomes SNP calls. Non-SNP variants in these files will be ignored. If provided, the polyploid ("het") compression will work only when a single SNP from the known set is present in a consensus window (otherwise there will be no reduction); if not provided then polyploid compression will be triggered anywhere there is a single SNP present in a consensus window. -known binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
Minimum p-value from binomial distribution of mismatches in a site to trigger a variant region. Any site with a value falling below this will be considered consensus and reduced (otherwise we will try to trigger polyploid compression). Note that this value is used only in regions with low coverage.
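The p-value being thresholded comes from a binomial model of mismatch counts at a site. A minimal sketch of an exact binomial upper-tail computation is below; the per-base error rate and the tail direction here are illustrative assumptions, not values taken from the tool:

```python
from math import comb

def binomial_tail(k, n, p):
    """Exact upper-tail probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# e.g. the tail probability of seeing 3+ mismatches in 10 reads under
# an assumed 1% per-base error model (low-coverage site)
pv = binomial_tail(3, 10, 0.01)
print(pv)
```

The tool compares a p-value like this against `-min_pvalue` at low-coverage sites; high-coverage sites use the `-minvar`/`-mindel` proportion thresholds instead.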
Minimum proportion of indels in a site to trigger a variant region. Anything below this will be considered consensus.
The minimum mapping quality to be considered for the consensus synthetic read. Reads that have mapping quality below this threshold will not be counted towards consensus, but are still counted towards variable regions.
Minimum tail quality. Reads have notoriously low quality bases on the tails (left and right). Consecutive bases at the tails with quality at or lower than this threshold will be hard clipped off before entering the reduce reads algorithm.
The minimum base quality to be considered for the consensus synthetic read. Reads that have base quality below this threshold will not be counted towards consensus, but are still counted towards variable regions.
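Taken together with the mapping quality threshold above, the two gates can be sketched like this (a hypothetical illustration of the documented semantics, with the tool's default thresholds):

```python
def counts_toward_consensus(mapping_quality, base_quality,
                            min_mq=20, min_bq=15):
    """Hypothetical sketch of the consensus quality gates: a base
    contributes to the consensus synthetic read only if both its read's
    mapping quality (--minimum_mapping_quality) and its base quality
    (-minqual) meet the thresholds. Failing bases are still counted as
    evidence when deciding whether a region is variable."""
    return mapping_quality >= min_mq and base_quality >= min_bq

print(counts_toward_consensus(30, 20))  # True
print(counts_toward_consensus(10, 20))  # False (mapping quality too low)
```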
Minimum proportion of mismatches in a site to trigger a variant region. Anything below this will be considered consensus and reduced (otherwise we will try to trigger polyploid compression). Note that this value is used only in regions with high coverage.
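At high-coverage sites the `-minvar` and `-mindel` proportion thresholds drive the decision. A minimal sketch of that trigger, under the assumption that the two proportions are checked independently:

```python
def site_triggers_variant_region(mismatches, indels, depth,
                                 minvar=0.05, mindel=0.05):
    """Hypothetical sketch of the -minvar/-mindel triggers at a
    high-coverage site: the site opens a variant region when the
    fraction of mismatching reads reaches minvar or the fraction of
    indel-bearing reads reaches mindel; below both, it is consensus."""
    if depth == 0:
        return False
    return (mismatches / depth >= minvar) or (indels / depth >= mindel)

print(site_triggers_variant_region(2, 0, 100))   # 2% mismatches -> False
print(site_triggers_variant_region(10, 0, 100))  # 10% mismatches -> True
```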
Do not hard clip adaptor sequences. Note that it is not necessary to turn this on for reads that are not mate paired. The program will behave correctly by default in those cases.
An output file created by the walker. Will overwrite contents if file exists.
GATK version 2.8-1-g2a26ec9 built at 2013/12/06 16:54:02.