Reduces the BAM file using read-based compression that keeps only the information essential for variant calling.
This walker will generate reduced versions of the BAM files that still follow the BAM spec and contain all the information necessary for the GSA variant calling pipeline. Several options allow you to tune how much compression you want to achieve. The default values have been shown to reduce a typical whole-exome BAM file 100x. The higher the coverage, the bigger the savings in file size and in the performance of downstream tools.
The BAM file to be compressed
The compressed (reduced) BAM file.
java -Xmx4g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T ReduceReads \
   -I myData.bam \
   -o myData.reduced.bam
These Read Filters are automatically applied to the data by the Engine before processing by ReduceReads.
This tool overrides the engine's default downsampling settings.
The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).
This table summarizes the command-line arguments that are specific to this tool. For details, see the full list below the table.
|Argument name||Type||Default value||Summary|
|-known||List[RodBinding[VariantContext]]||||Input VCF file(s) with known SNPs|
|--out||StingSAMFileWriter||NA||An output file created by the walker. Will overwrite contents if file exists|
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
The number of bases to keep around mismatches (potential variation).
Do not compress read names. By default, ReduceReads will compress read names to numbers, guaranteeing uniqueness; reads with similar names will still have similar compressed names. Note: if you scatter/gather, there is no guarantee that read name uniqueness will be maintained -- in that case we recommend not compressing.
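For illustration only (this is not the tool's actual implementation), the sketch below shows one way read names can be mapped to compact numeric names while preserving uniqueness within a single run; the class and method names are hypothetical.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: assign each distinct read name a compact numeric alias,
// so paired reads (which share a name) keep matching compressed names.
public class ReadNameCompressorSketch {
    private final Map<String, Integer> nameToId = new LinkedHashMap<>();
    private int nextId = 0;

    public String compress(String readName) {
        Integer id = nameToId.get(readName);
        if (id == null) {
            id = nextId++;
            nameToId.put(readName, id);
        }
        return Integer.toString(id);
    }

    public static void main(String[] args) {
        ReadNameCompressorSketch c = new ReadNameCompressorSketch();
        // Mates share a name, so they receive the same compressed name.
        System.out.println(c.compress("HWI-ST1234:8:1101:1234:5678"));  // 0
        System.out.println(c.compress("HWI-ST1234:8:1101:1234:5678"));  // 0
        System.out.println(c.compress("HWI-ST1234:8:1101:9999:1111"));  // 1
    }
}

In a sketch like this, separate scatter/gather jobs would each start numbering from zero, which is why read name uniqueness cannot be guaranteed across shards.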
Do not hard clip the low-quality tails of the reads. This option overrides the minimum tail quality argument.
Do not simplify reads (i.e. do not strip away all extra information from the read -- anything other than bases, quals and read group).
Do not use high-quality soft-clipped bases. By default, ReduceReads will hard clip away any low-quality soft-clipped bases left by the aligner and use the high-quality soft-clipped bases in its traversal algorithm to identify variant regions. The minimum quality for soft-clipped bases is the same as the minimum base quality to consider (minqual).
The number of reads emitted per sample in a variant region can be downsampled for better compression. This level of downsampling only happens after the region has been evaluated, so it can be combined with engine-level downsampling. A value of 0 turns downsampling off.
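For intuition only (not the tool's actual code), the following minimal sketch caps the number of reads kept per sample in a region by random subsampling; the class and method names are hypothetical, and a cap of 0 is treated as "off" as described above.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: keep at most 'cap' reads per sample in a variant region.
// A cap of 0 is treated as "downsampling off" (all reads kept).
public class RegionDownsamplerSketch {
    static List<String> downsample(List<String> readsForSample, int cap, Random rng) {
        if (cap == 0 || readsForSample.size() <= cap) {
            return readsForSample;
        }
        List<String> shuffled = new ArrayList<>(readsForSample);
        Collections.shuffle(shuffled, rng);
        return shuffled.subList(0, cap);
    }

    public static void main(String[] args) {
        List<String> reads = new ArrayList<>();
        for (int i = 0; i < 500; i++) reads.add("read" + i);
        System.out.println(downsample(reads, 250, new Random(42)).size()); // 250
        System.out.println(downsample(reads, 0, new Random(42)).size());   // 500 (off)
    }
}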
Optionally hard clip all incoming reads to the desired intervals. The hard clips will happen exactly at the interval border.
Input VCF file(s) with known SNPs. Any number of VCF files representing known SNPs to be used for the polyploid-based reduction. These could be, e.g., dbSNP and/or official 1000 Genomes SNP calls. Non-SNP variants in these files will be ignored. If provided, the polyploid ("het") compression will work only when a single SNP from the known set is present in a consensus window (otherwise there will be no reduction); if not provided, polyploid compression will be triggered anywhere there is a single SNP present in a consensus window. -known binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3.
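As a rough sketch of the decision described above (not the actual GATK code), the triggering logic can be expressed as follows; the class and method names are hypothetical.

// Hypothetical sketch of the decision described above: whether a consensus
// window qualifies for polyploid ("het") compression.
public class HetCompressionDecisionSketch {
    static boolean canHetCompress(int snpsInWindow, int knownSnpsInWindow, boolean knownSitesProvided) {
        if (knownSitesProvided) {
            // With -known: reduce only when a single SNP in the window comes from the known set.
            return knownSnpsInWindow == 1;
        }
        // Without -known: reduce anywhere a single SNP is present in the window.
        return snpsInWindow == 1;
    }

    public static void main(String[] args) {
        System.out.println(canHetCompress(1, 1, true));   // true: one known SNP in the window
        System.out.println(canHetCompress(2, 0, true));   // false: no known SNP, so no reduction
        System.out.println(canHetCompress(1, 0, false));  // true: single SNP, no known set provided
    }
}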
Minimum p-value from the binomial distribution of mismatches at a site to trigger a variant region. Any site with a value falling below this threshold will be considered consensus and reduced (otherwise we will try to trigger polyploid compression). Note that this value is used only in regions with low coverage.
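To make this threshold concrete, here is a hedged illustration (not necessarily the exact statistic ReduceReads computes): under an assumed heterozygous model with an alt fraction of 0.5, the binomial probability of seeing at most k mismatches among n bases is compared to the threshold, and a value below it marks the site as consensus. The 0.05 threshold below is only an example.

// Hypothetical illustration of the thresholding described above. The heterozygous
// model with alt fraction 0.5 is an assumption for illustration only.
public class BinomialTriggerSketch {
    static double binomialCdf(int n, int k, double p) {
        double cdf = 0.0;
        for (int i = 0; i <= k; i++) {
            cdf += binomialCoefficient(n, i) * Math.pow(p, i) * Math.pow(1.0 - p, n - i);
        }
        return cdf;
    }

    static double binomialCoefficient(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) c *= (double) (n - k + i) / i;
        return c;
    }

    public static void main(String[] args) {
        double minPValue = 0.05;                 // example threshold, not the tool's default
        double pval = binomialCdf(8, 1, 0.5);    // 1 mismatch among 8 bases at a low-coverage site
        System.out.printf("p-value = %.4f -> %s%n", pval,
                pval < minPValue ? "consensus (reduce)" : "candidate variant region");
    }
}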
Minimum proportion of indels at a site to trigger a variant region. Anything below this will be considered consensus.
The minimum mapping quality to be considered for the consensus synthetic read. Reads that have mapping quality below this threshold will not be counted towards consensus, but are still counted towards variable regions.
Reads notoriously have low-quality bases on their tails (left and right). Consecutive bases at the tails with quality at or below this threshold will be hard clipped off before entering the ReduceReads algorithm.
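A minimal sketch of this tail-clipping idea is shown below (not the tool's actual implementation); the qualities are plain Phred values rather than an actual SAM record, and the names are hypothetical.

// Hypothetical sketch: hard clip consecutive low-quality bases from both read tails.
public class TailClipSketch {
    static int[] clippedBounds(int[] quals, int minTailQuality) {
        int start = 0;
        int end = quals.length; // exclusive
        while (start < end && quals[start] <= minTailQuality) start++;
        while (end > start && quals[end - 1] <= minTailQuality) end--;
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        int[] quals = { 2, 2, 30, 35, 40, 38, 2, 2 };
        int[] bounds = clippedBounds(quals, 2);
        // Keeps indices [2, 6): the two leading and two trailing bases
        // (quality <= 2) are clipped; interior bases are untouched.
        System.out.println("keep [" + bounds[0] + ", " + bounds[1] + ")");
    }
}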
The minimum base quality to be considered for the consensus synthetic read. Reads that have base quality below this threshold will not be counted towards consensus, but are still counted towards variable regions.
Minimum proportion of mismatches at a site to trigger a variant region. Anything below this will be considered consensus and reduced (otherwise we will try to trigger polyploid compression). Note that this value is used only in regions with high coverage.
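As simple arithmetic for intuition (the 0.05 threshold below is only an example, not the tool's default), a high-coverage site's mismatch proportion is compared directly against this setting, whereas low-coverage regions use the binomial p-value described earlier.

// Hypothetical illustration: compare a high-coverage site's mismatch proportion
// against the trigger threshold (example values only).
public class MismatchProportionSketch {
    public static void main(String[] args) {
        int coverage = 200;       // high-coverage site
        int mismatches = 7;
        double threshold = 0.05;  // example value
        double proportion = (double) mismatches / coverage;  // 0.035
        System.out.println(proportion >= threshold
                ? "candidate variant region"
                : "consensus (reduce)");                      // prints: consensus (reduce)
    }
}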
Do not hard clip adaptor sequences. Note: you don't have to turn this on for reads that are not mate-paired; the program will behave correctly in those cases.
An output file created by the walker. Will overwrite contents if file exists.
GATK version 2.5-2-gdb4546e built at 2013/05/01 09:32:36.