Tagged with #filtering
0 documentation articles | 0 events or announcements | 4 forum discussions



Hello,

I am calling SNPs and indels on a non-mammalian genome and do not have an empirical truth set for either. Which annotations would you recommend for filtering out low-confidence calls?

Thanks for your input! ~Mika

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of very high QUAL scores, not a single call has PASS in the FILTER field; all are ".". If I use SelectVariants to filter the VCF on high QUAL or DP values, or a combination of the two, the FILTER field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.
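For readers landing on this tag, here is a minimal sketch of what hard filtering on annotations amounts to when VQSR is not an option. It is not GATK code: the function names and the `HardFilter` label are hypothetical, and the thresholds are simply the ones quoted in the VariantFiltration command later on this page, not values tuned for any particular dataset. It also assumes that a record lacking an annotation should not fail on that annotation. The point it illustrates is that the FILTER column only changes when a filtering step writes to it; selecting records by QUAL or DP leaves "." untouched.

```python
# Hypothetical hard-filter sketch (not GATK itself).
# Thresholds are copied from the VariantFiltration command quoted on this page.
HARD_FILTERS = {
    "QD": lambda v: v < 2.0,
    "ReadPosRankSum": lambda v: v < -8.0,
    "MQ": lambda v: v < 40.0,
    "FS": lambda v: v > 60.0,
    "MQRankSum": lambda v: v < -12.5,
    "HaplotypeScore": lambda v: v > 13.0,
}


def parse_info(info):
    """Turn 'QD=0.43;FS=50.5;DB' into {'QD': '0.43', 'FS': '50.5'}."""
    parsed = {}
    for item in info.split(";"):
        if "=" in item:  # skip flag-style entries without a value
            key, value = item.split("=", 1)
            parsed[key] = value
    return parsed


def failed_filters(info):
    """Return the annotation names whose threshold is violated."""
    annotations = parse_info(info)
    failed = []
    for key, fails in HARD_FILTERS.items():
        if key not in annotations:
            continue  # assumption: a missing annotation never fails a record
        try:
            value = float(annotations[key])
        except ValueError:
            continue  # non-numeric value, leave it alone
        if fails(value):
            failed.append(key)
    return failed


def apply_hard_filters(vcf_line, name="HardFilter"):
    """Write PASS or the filter name into the FILTER column (column 7)."""
    if vcf_line.startswith("#"):
        return vcf_line  # header lines are passed through unchanged
    cols = vcf_line.rstrip("\n").split("\t")
    cols[6] = name if failed_filters(cols[7]) else "PASS"
    return "\t".join(cols)


record = "chrM\t410\t.\tA\tT\t64750.20\t.\tQD=32.38;FS=0.000;MQ=56.04;HaplotypeScore=7.3762"
print(apply_hard_filters(record).split("\t")[6])
```

A record that trips any one threshold gets the label instead of PASS, so the FILTER column afterwards distinguishes the two groups, which a plain selection step does not do.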

Dear GATK team, I have done raw SNP and indel calling with UnifiedGenotyper using the command line below.

java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -glm BOTH -R ucsc.hg19.fasta -T UnifiedGenotyper -I ERR031029.marked.realigned.fixed.recal.bam -I ERR031030.marked.realigned.fixed.recal.bam -D dbsnp_135.hg19.vcf -o ERR031030.raw.snps.indels.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000

After that, I did SNP filtering using the following command lines.

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.snpsonly.vcf -selectType SNP

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.indelsonly.vcf -selectType INDEL

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T VariantRecalibrator -R ucsc.hg19.fasta -input ERR031030.snpsonly.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_phase1.indels.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.hg19.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ -mode SNP -recalFile ERR031030.snp.recal.vcf -tranchesFile ERR031030.snp.tranches.vcf -rscriptFile ERR031030.plots.R

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T ApplyRecalibration -input ERR031030.snpsonly.vcf -tranchesFile ERR031030.snp.tranches.vcf -recalFile ERR031030.snp.recal.vcf -o ERR031030.snps.filtered.vcf

java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T VariantFiltration --variant ERR031030.snps.filtered.vcf -o ERR031030.final.filtered.vcf --filterName "Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5"  --filterExpression "HaplotypeScore > 13.0"

The filtered SNP VCF file was produced; however, it seems to contain a problem.

chrM    311     .       T       C       429.19  Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5;VQSRTrancheSNP99.90to100.00   AC=1;AF=0.250;AN=4;BaseQRankSum=-13.010;DP=2000;Dels=0.00;FS=50.500;HaplotypeScore=382.2016;MLEAC=1;MLEAF=0.250;MQ=50.86;MQ0=0;MQRankSum=1.458;QD=0.43;ReadPosRankSum=-10.687;VQSLOD=-6.143e+02;culprit=HaplotypeScore  GT:AD:DP:GQ:PL  0/0:634,353:949:99:0,232,7697   0/1:463,521:945:99:459,0,4190
chrM    410     .       A       T       64750.20        PASS    AC=4;AF=1.00;AN=4;DP=2000;Dels=0.00;FS=0.000;HaplotypeScore=7.3762;MLEAC=4;MLEAF=1.00;MQ=56.04;MQ0=0;QD=32.38;VQSLOD=2.27;culprit=HaplotypeScore        GT:AD:DP:GQ:PL  1/1:0,998:998:99:32010,2926,0   1/1:0,999:999:99:32767,2912,0
chrM    711     .       G       A       62989.20        PASS    AC=4;AF=1.00;AN=4;BaseQRankSum=2.500;DP=2000;Dels=0.00;FS=3.751;HaplotypeScore=8.7084;MLEAC=4;MLEAF=1.00;MQ=56.74;MQ0=1;MQRankSum=-0.107;QD=31.49;ReadPosRankSum=-2.169;VQSLOD=2.46;culprit=HaplotypeScore        GT:AD:DP:GQ:PL  1/1:0,998:972:99:30899,2808,0   1/1:3,997:972:99:32117,2830,0
chrM    1121    .       T       C       16719.20        Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5;VQSRTrancheSNP99.90to100.00   AC=4;AF=1.00;AN=4;BaseQRankSum=-0.239;DP=2000;Dels=0.00;FS=2.141;HaplotypeScore=22.9003;MLEAC=4;MLEAF=1.00;MQ=21.32;MQ0=703;MQRankSum=-1.627;QD=8.36;ReadPosRankSum=-0.027;VQSLOD=-4.195e+00;culprit=HaplotypeScore     GT:AD:DP:GQ:PL  1/1:3,985:986:99:9547,976,0     1/1:4,983:983:99:7199,739,0
chrM    2489    .       A       C       34.19   LowQual;Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5       AC=1;AF=0.250;AN=4;BaseQRankSum=-17.321;DP=2000;Dels=0.00;FS=180.208;HaplotypeScore=18.7245;MLEAC=1;MLEAF=0.250;MQ=46.52;MQ0=31;MQRankSum=3.365;QD=0.03;ReadPosRankSum=-4.198   GT:AD:DP:GQ:PL  0/1:278,719:950:64:64,0,4623    0/0:309,688:950:99:0,263,6065

In the FILTER column, most of the filtered SNPs show Nov28filters rather than PASS or LowQual. What is wrong here? Is there a problem with my command lines? Thank you so much for your reply.
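One reading of the records above, offered as a sketch and not as a confirmed diagnosis: in VariantFiltration, `--filterName` supplies the label written to the FILTER column and `--filterExpression` supplies the condition. In the command quoted above, the long string starting with "Nov28filters" was given as the name, and the only condition was "HaplotypeScore > 13.0". If that reading is right, the label should appear on exactly the records whose HaplotypeScore exceeds 13.0. The short check below tests this against the five records shown; the tuple layout and the function name are made up for illustration.

```python
# (position, long "Nov28filters ..." label present in FILTER?, HaplotypeScore)
# Values are read off the five VCF records quoted above.
records = [
    (311, True, 382.2016),
    (410, False, 7.3762),
    (711, False, 8.7084),
    (1121, True, 22.9003),
    (2489, True, 18.7245),
]


def expression_fires(haplotype_score, threshold=13.0):
    """Condition actually passed to --filterExpression in the quoted command."""
    return haplotype_score > threshold


# True if the label appears on exactly the records where the expression fires.
consistent = all(expression_fires(hs) == labelled for _, labelled, hs in records)
print(consistent)  # True
```

For these five records the label tracks HaplotypeScore alone, which is what one would expect if the QD, ReadPosRankSum, MQ, FS and MQRankSum conditions were never evaluated because they sat inside the name string.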

Dear all, I have a set of 48 exomes which were analysed according to the best practices (using GATK-2.2-3 and HaplotypeCaller). According to the VQSR output, I have this first level of "uncertainty":

##FILTER=<ID=VQSRTrancheBOTH90.00to99.00,Description="Truth sensitivity tranche level for BOTH model at VQS Lod: -1.3455 <= x < 2.62">

that sets FILTER=PASS for variants with VQSLOD >= 2.62. I also have an external validation of some SNPs: 3 out of 20 have a VQSLOD lower than 2.62 (1.24, 1.37 and 1.69). Now the question: should I trust the validation and set the filter to, say, VQSLOD >= 1.2, or keep the GATK filter? What is your experience with this?
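To make the trade-off concrete, here is a small sketch of the arithmetic using only the numbers in this post. It assumes the other 17 validated SNPs have VQSLOD >= 2.62, which the post implies but does not state outright; their actual scores are unknown, so the function refuses cutoffs above 2.62. The function names are hypothetical.

```python
# Tranche bounds from the ##FILTER header line quoted above.
TRANCHE_LOW, TRANCHE_HIGH = -1.3455, 2.62

# The three validated SNPs reported as falling below the PASS cutoff.
below_pass = [1.24, 1.37, 1.69]
n_validated = 20  # assumption: the remaining 17 all have VQSLOD >= 2.62


def in_tranche(vqslod):
    """True if the score falls in the 90.00to99.00 tranche."""
    return TRANCHE_LOW <= vqslod < TRANCHE_HIGH


def fraction_passing(cutoff):
    """Fraction of the 20 validated SNPs kept at a given VQSLOD cutoff."""
    if cutoff > TRANCHE_HIGH:
        raise ValueError("scores of the other 17 SNPs are not known above 2.62")
    n_high = n_validated - len(below_pass)
    n_low_kept = sum(1 for score in below_pass if score >= cutoff)
    return (n_high + n_low_kept) / n_validated


print(all(in_tranche(score) for score in below_pass))
print(fraction_passing(2.62), fraction_passing(1.2))
```

Under that assumption, all three SNPs sit inside the 90.00to99.00 tranche, 17 of 20 validated SNPs pass at the 2.62 cutoff, and all 20 pass at 1.2. What the sketch cannot show is how many false positives the lower cutoff would also let through, which is the other half of the question.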

Thanks

d