Tagged with #read filters
27 documentation articles | 0 announcements | 5 forum discussions



Created 2012-08-11 05:16:06 | Updated 2012-10-18 15:16:57 | Tags: official developer downsampling intermediate
Comments (15)

1. Introduction

Reads can be filtered out of traversals either by pileup size, through one of our downsampling methods, or by read property, through our read filtering mechanism. Both techniques are described below.

2. Downsampling

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Defaults

The GATK's default downsampler exhibits the following properties:

  • The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.

  • The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.

  • The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.

Customizing

From the command line:

  • To disable the downsampler, specify -dt NONE.

  • To change the default coverage per-sample, specify the desired coverage to the -dcov option.
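As an illustration, an invocation limiting per-sample coverage to 200 reads might look like the following (the walker, reference, and file names here are placeholders, not taken from this article):

```
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R reference.fasta \
    -I input.bam \
    -dcov 200 \
    -o output.vcf
```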

To modify the walker's default behavior:

  • Add the @Downsample annotation to the top of your walker. Override the downsampling type by changing by=<value>, and override the downsampling depth by changing toCoverage=<value>.
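For example, a walker annotated as sketched below would downsample to 200 reads per sample. This is a fragment based on the description above, not a compilable unit: the @Downsample annotation and DownsampleType enum are GATK engine types, and MyLocusWalker is a hypothetical walker name.

```java
// Hypothetical walker overriding the engine's downsampling defaults.
// by= takes a DownsampleType value; toCoverage= takes the per-sample depth.
@Downsample(by = DownsampleType.BY_SAMPLE, toCoverage = 200)
public class MyLocusWalker extends LocusWalker<Integer, Long> {
    // walker implementation...
}
```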

Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already-established pileup, a single-base locus, and a collection of incoming reads whose alignment start is the locus position + 1, the outline of the algorithm is as follows:

For each sample:

  • Select reads with the next alignment start.

  • While the number of existing reads + the number of incoming reads is greater than the target sample size:

    Walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.

  • If we have n slots available, where n >= 1, randomly select n of the incoming reads and add them to the pileup.

  • Otherwise, we have zero slots available. Choose the read in the existing pileup with the smallest alignment start, throw it out, and add one randomly selected incoming read.
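The per-sample outline above can be sketched in self-contained Java. This is a simplified illustration only: reads are modelled as bare alignment-start positions, whereas the real GATK downsampler operates on SAMRecords and handles more cases.

```java
import java.util.*;

/**
 * Simplified sketch of the per-sample downsampling outline above.
 * Reads are modelled as integer alignment-start positions.
 */
public class DownsamplerSketch {
    private final int targetSize;
    private final Random random;
    // The existing pileup, kept ordered by alignment start.
    private final List<Integer> pileup = new ArrayList<>();

    public DownsamplerSketch(int targetSize, long seed) {
        this.targetSize = targetSize;
        this.random = new Random(seed);
    }

    /** Accept the reads sharing the next alignment start, evicting as needed. */
    public void addIncoming(List<Integer> incoming) {
        List<Integer> pending = new ArrayList<>(incoming);
        // While over budget, walk backward through groups of reads sharing an
        // alignment start and evict one random read from each group of size > 1.
        while (pileup.size() + pending.size() > targetSize && evictFromDuplicateGroup()) { }
        int slots = targetSize - pileup.size();
        Collections.shuffle(pending, random);
        if (slots >= 1) {
            // Randomly select up to 'slots' of the incoming reads.
            pileup.addAll(pending.subList(0, Math.min(slots, pending.size())));
        } else if (!pending.isEmpty()) {
            // Zero slots: drop the read with the smallest alignment start
            // and admit one randomly selected incoming read.
            pileup.remove(0);
            pileup.add(pending.get(0));
        }
        pileup.sort(null);
    }

    /** Evict one random read from the rearmost group with > 1 identical start. */
    private boolean evictFromDuplicateGroup() {
        for (int i = pileup.size() - 1; i > 0; i--) {
            if (pileup.get(i).equals(pileup.get(i - 1))) {
                pileup.remove(i - random.nextInt(2));
                return true;
            }
        }
        return false;  // every remaining alignment start is unique
    }

    public List<Integer> pileup() { return pileup; }
}
```

Note how memory use tracks the target size rather than the raw depth: the pileup never grows beyond targetSize reads, matching the property listed under "Defaults".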

3. Read filtering

To selectively filter out reads before they reach your walker, implement one or more net.sf.picard.filter.SamRecordFilter classes and attach them to your walker as follows:

@ReadFilters({Platform454Filter.class, ZeroMappingQualityReadFilter.class})

4. Command-line arguments for read filters

You can add command-line arguments for filters with the @Argument tag, just as with walkers. Here's an example of our new max read length filter:

import net.sf.picard.filter.SamRecordFilter;
import net.sf.samtools.SAMRecord;
import org.broadinstitute.sting.commandline.Argument;

public class MaxReadLengthFilter implements SamRecordFilter {
    @Argument(fullName = "maxReadLength", shortName = "maxRead", doc="Discard reads with length greater than the specified value", required=false)
    private int maxReadLength = Integer.MAX_VALUE;  // keep everything unless the argument is supplied

    // Returning true discards the read.
    public boolean filterOut(SAMRecord read) { return read.getReadLength() > maxReadLength; }
}

Adding this filter to the top of your walker using the @ReadFilters attribute will add a new optional command-line argument, maxReadLength, which will filter out reads longer than maxReadLength before your walker is called.

Note that when you specify a read filter on the command line, you need to strip the Filter part of its name off. E.g. in the example above, if you want to use MaxReadLengthFilter, you need to call it like this:

--read_filter MaxReadLength

5. Adding filters dynamically using command-line arguments

The --read_filter argument allows you to apply whatever read filters you'd like to your dataset before the reads reach your walker. To add the MaxReadLength filter above to PrintReads, you'd add these command-line parameters:

--read_filter MaxReadLength --maxReadLength 76

You can add as many filters as you like by using multiple copies of the --read_filter parameter:

--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead
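Conceptually, stacked filters compose as predicates: a read is discarded as soon as any one filter votes to remove it. The following is a plain-Java sketch of that composition; the Read record and the two lambdas are illustrative stand-ins, not GATK's actual engine or filter classes.

```java
import java.util.*;
import java.util.function.Predicate;

/**
 * Sketch of how stacked read filters compose: a read is discarded
 * if ANY filter votes to filter it out.
 */
public class FilterChainSketch {
    // Minimal stand-in for a read: length and mapping quality only.
    public record Read(int length, int mappingQuality) {}

    /** Keep only the reads that no filter rejects. */
    public static List<Read> applyFilters(List<Read> reads, List<Predicate<Read>> filterOut) {
        // OR together the "filter out" votes, then keep the complement.
        Predicate<Read> rejected = filterOut.stream().reduce(r -> false, Predicate::or);
        return reads.stream().filter(rejected.negate()).toList();
    }

    public static void main(String[] args) {
        Predicate<Read> maxReadLength = r -> r.length() > 76;              // like MaxReadLength
        Predicate<Read> zeroMappingQuality = r -> r.mappingQuality() == 0; // like ZeroMappingQualityRead
        List<Read> reads = List.of(new Read(50, 60), new Read(100, 60), new Read(50, 0));
        List<Read> kept = applyFilters(reads, List.of(maxReadLength, zeroMappingQuality));
        System.out.println(kept.size());  // 1 — only the 50bp, MQ60 read survives
    }
}
```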

Created 2012-07-23 23:57:04 | Updated 2012-07-23 23:57:04 | Tags: unmappedreadfilter gatkdocs
Comments (4)

A new tool has been released!

Check out the documentation at UnmappedReadFilter.


Created 2012-07-23 23:56:52 | Updated 2012-07-23 23:56:52 | Tags: singlereadgroupfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at SingleReadGroupFilter.


Created 2012-07-23 23:56:46 | Updated 2012-07-23 23:56:46 | Tags: samplefilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at SampleFilter.


Created 2012-07-23 23:56:42 | Updated 2012-07-23 23:56:42 | Tags: reassignmappingqualityfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at ReassignMappingQualityFilter.


Created 2012-07-23 23:56:39 | Updated 2012-07-23 23:56:39 | Tags: readstrandfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at ReadStrandFilter.


Created 2012-07-23 23:56:37 | Updated 2012-07-23 23:56:37 | Tags: readnamefilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at ReadNameFilter.


Created 2012-07-23 23:56:35 | Updated 2012-07-23 23:56:35 | Tags: readgroupblacklistfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at ReadGroupBlackListFilter.


Created 2012-07-23 23:56:25 | Updated 2012-07-23 23:56:25 | Tags: platformfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at PlatformFilter.


Created 2012-07-23 23:56:24 | Updated 2012-07-23 23:56:24 | Tags: platform454filter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at Platform454Filter.


Created 2012-07-23 23:56:24 | Updated 2012-07-23 23:56:24 | Tags: platformunitfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at PlatformUnitFilter.


Created 2012-07-23 23:56:19 | Updated 2012-07-23 23:56:19 | Tags: notprimaryalignmentfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at NotPrimaryAlignmentFilter.


Created 2012-07-23 23:56:18 | Updated 2012-07-23 23:56:18 | Tags: nooriginalqualityscoresfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at NoOriginalQualityScoresFilter.


Created 2012-07-23 23:56:17 | Updated 2012-07-23 23:56:17 | Tags: missingreadgroupfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at MissingReadGroupFilter.


Created 2012-07-23 23:56:13 | Updated 2012-07-23 23:56:13 | Tags: maxreadlengthfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at MaxReadLengthFilter.


Created 2012-07-23 23:56:13 | Updated 2012-07-23 23:56:13 | Tags: maxinsertsizefilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at MaxInsertSizeFilter.


Created 2012-07-23 23:56:12 | Updated 2012-07-23 23:56:12 | Tags: mateunmappedfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at MateUnmappedFilter.


Created 2012-07-23 23:56:09 | Updated 2012-07-23 23:56:09 | Tags: matesamestrandfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at MateSameStrandFilter.


Created 2012-07-23 23:56:09 | Updated 2012-07-23 23:56:09 | Tags: mappingqualityzerofilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at MappingQualityZeroFilter.


Created 2012-07-23 23:56:07 | Updated 2012-07-23 23:56:07 | Tags: mappingqualityunavailablefilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at MappingQualityUnavailableFilter.


Created 2012-07-23 23:56:05 | Updated 2012-07-23 23:56:05 | Tags: malformedreadfilter gatkdocs
Comments (9)

A new tool has been released!

Check out the documentation at MalformedReadFilter.


Created 2012-07-23 23:56:05 | Updated 2012-07-23 23:56:05 | Tags: mappingqualityfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at MappingQualityFilter.


Created 2012-07-23 23:55:38 | Updated 2012-07-23 23:55:38 | Tags: failsvendorqualitycheckfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at FailsVendorQualityCheckFilter.


Created 2012-07-23 23:55:33 | Updated 2012-07-23 23:55:33 | Tags: duplicatereadfilter gatkdocs
Comments (12)

A new tool has been released!

Check out the documentation at DuplicateReadFilter.


Created 2012-07-23 23:54:58 | Updated 2012-07-23 23:54:58 | Tags: badmatefilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at BadMateFilter.


Created 2012-07-23 23:54:57 | Updated 2012-07-23 23:54:57 | Tags: badcigarfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at BadCigarFilter.


Created 2012-07-23 23:54:45 | Updated 2012-07-23 23:54:45 | Tags: addaberrantinserttagfilter gatkdocs
Comments (0)

A new tool has been released!

Check out the documentation at AddAberrantInsertTagFilter.


Created 2015-08-18 12:42:10 | Updated | Tags: haplotypecaller variantfiltration
Comments (5)

Hi, I am new to using HC. I am just using the HapMap sample NA12878 to validate my pipeline. I ran the following command on a BAM file that I generated, and I noticed that certain SNPs were filtered out, most of which I can understand the rationale for.

java -Xms454m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/home/tx/tmpYdAKAC \
    -jar /usr/local/share/gatk/3.4-46-gbc02625/GenomeAnalysisTK.jar \
    -R /usr/local/share/genomes/Hsapiens/GRCh37/seq/GRCh37.fa \
    -I /home/INDELanalysis/validationrun/NA12878-sort-11_0_15994663-prep.bam \
    --dbsnp /usr/local/share/genomes/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz \
    -L 11:1628982-1651500 \
    -T HaplotypeCaller \
    -o /home/INDELanalysis/validationrun/NA12878-11_0_15994663-raw.vcf.gz

I ran the same analysis using Samtools and the following SNP was in the final filtered output

Samtools output 11 1651228 rs71454096 C G 117 PASS AC=1;AN=2;BQB=0.61207;BaseQRankSum=-0.458;DB;DP=21;DP4=11,0,5,3;FS=7.225;GC=71.29;HOB=0.5;HRun=3;ICB=1;MQ=56.97;MQ0=0;MQ0F=0;MQB=0.395294;MQRankSum=-1.091;MQSB=0.798516;QD=5.57;RPB=0.902014;ReadPosRankSum=-1.162;SGB=-0.651104;VDB=0.0103095;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCg/gGg|A53G|237|KRTAP5-5|protein_coding|CODING|ENST00000399676|1|G) GT:DP:PL 0/1:19:150,0,158

However, when I run it through HC with the above settings, this variant is filtered out of the analysis. When I ran HC with --emitRefConfidence GVCF, I got the following output for this variant:

HC output 11 1651228 rs71454096 C G,<NON_REF> 0 . DB;DP=20;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.22 GT:AD:DP:GQ:PGT:PID:PL:SB 0/0:10,0,0:10:34:0|1:1651120_G_T:0,34,442,34,442,442:10,0,0,0

I am a little confused as to why it got a QUAL score of 0. Looking nearby, I also noticed that there were some sizeable INDELs:

11 1651198 . GAGGCTGTGGGGGCTGTGGCTCCGGCTGTGC G,<NON_REF> 0 . DP=25;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.38 GT:AD:DP:GQ:PL:SB 0/0:14,0,0:14:37:0,37,585,42,589,594:12,2,0,0
11 1651199 rs71025763 A AGGCTGTGGCTCC,<NON_REF> 0 . DB;DP=18;MLEAC=0,0;MLEAF=0.00,0.00;MQ=57.74 GT:AD:DP:GQ:PL:SB 0/0:6,0,0:6:28:0,28,332,28,332,332:5,1,0,0

However, neither of these INDELs was called either. Can anyone shed some light on why nothing is called in this region? If I look at site 11:1651228 in the BAM file, I can see that most of the calls were made on the forward strand, so there is likely some strand bias here; in fact, HC called the genotype 0/0 with only 10 reads contributing to this call, so I presume the remaining reads were filtered out. However, when I look at the pileup for this position, I don't understand why over half the reads were filtered:

11 1651228 C CCCCGGGGCCCGGCCCGGCGG BBFIFB<FFIIFFFFFFFBBB

Any advice or help would be much appreciated.


Created 2015-07-30 17:47:43 | Updated | Tags: overclippedreadfilter
Comments (2)

Hello,

I have 51bp reads and am trying to filter out those where about 20-30 bases have been soft-clipped. The command I have been using is:

java -Xmx8g -jar $GATK_JAR -R $REFERENCE -T PrintReads -rf OverclippedRead --filter_is_too_short_value 40 -o /dev/stdout --disable_bam_indexing -I snippet.bam

Using GATK version v3.4-46, human reference hg19.

The problem I'm encountering is that this doesn't actually filter out any reads. I'm looking at many reads with 27 soft-clipped bases and 24 matches, but those are included in the output. Also the log is unambiguous:

-> 0 reads (0.00% of total) failing OverclippedReadFilter MicroScheduler - 0 reads were filtered out during the traversal out of approximately 2474 total reads (0.00%)

Thanks.


Created 2014-07-31 14:05:07 | Updated | Tags: haplotypecaller best-practices
Comments (4)

Hi,

I am wondering how HaplotypeCaller in GATK 3.1 handles multimappers. I thought I read that they are passed over for variant calling but stay in the realigned, recalibrated BAM for completeness' sake, like marking duplicates rather than removing them, but I cannot find supporting info on the website or from farther afield.

Presumably it is the NotPrimaryAlignmentFilter, but there is no info posted on that yet. I know I can output a BAM from HC with haplotype info in it, but can I get just the reads used in variant calls? Or should I trim the BAM myself to retain the reads I want used? I do this for duplicates (marked and removed), but for multimappers I would like to know how you define them so I can do the same. The reason is coverage estimation: using bamtobed or similar means I count all realigned, recalibrated reads, which includes multimappers and skews my results.

Thanks,

Bruce.


Created 2014-03-03 06:56:01 | Updated | Tags:
Comments (3)

We've called variants from our own exome data sets. When compared with the protocol of the 1000 Genomes Project, we found that they just marked duplicates and replicates after local realignment and BQSR, rather than removing them beforehand, which is what we did. Since we want to use the CHB and CHS samples as controls, will the distinct calling strategies affect the final output a lot?


Created 2013-06-27 13:07:18 | Updated | Tags:
Comments (1)

Are read filters applied before any downsampling?