I am new to NGS analysis and have been following this pipeline recommended in many of the posts in the online forums. I have 3 samples(1 child, 2 parents) and completed analysis till generation of VCF files using HaplotypeCaller. I would like to find de novo mutations in the child, Is it a good idea to proceed for de novo mutations identification after annotation or before annotation? Please advise me on how to proceed with this ?
Note: This question has been asked in another forum but I didn't get suggestions on the GATK functions.
I have a question regarding the SNP call by GATK3.2 vs the eye observation in IGV; both use hg19: We have three samples, in the IGV, I see the following genotypes from BAM file (after realign and recalibration; before HaplotypeCaller): chr10:17659149 Sample 1: 9Gs and 11Ts Sample 2: 6Gs and 6Ts Sample 3: 18Gs
But when I check the vcf produced by GATK, it shows: chr10 17659149 rs7895850 C G,T 1509.20 PASS AC=4,2;AF=0.667,0.333;AN=6;DB;DP=41;FS=0.000;MLEAC=4,2;MLEAF=0.667,0.333;MQ=60.00;MQ0=0;POSITIVE_TRAIN_SITE;QD=25.48;VQSLOD=6.71;culprit=MQ GT:AD:DP:GQ:PL 1/1:0,14,0:14:41:493,41,0,493,41,493 2/2:0,11,0:11:33:429,429,429,33,33,0 1/1:0,16,0:16:48:704,48,0,704,48,704
If you look at the GT field, the corresponding genotypes are sample1 as G, sample2 as T, sample3 as G. They are quite different from the IGV for sample 1 and 2. I am wondering if you have any idea about this?
I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools:
bcftools norm -m +any
I remember from another thread that UG treats SNPs and InDels separately:
But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA?
I only had a minor issue despite tabix indexing my bgzipped known alleles file:
##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz
Feel free to delete the original post of this question.
I am experimenting with
JEXL expressions in order to do some hard filtering on variants. I wonder if there is a method to tell the filter expression to operate on the current sample (assuming here a single sample VCF file) without knowing the sample name a priori e.g.
Works just fine to sample heterozygous positions from a VCF where I know the sample will be called "Sample1", but can I refer to a sample name programmatically, e.g. something like:
vc.getGenotype( sample() ).isHet()
Sorry if this is a really dumb question. (BTW I realise you could use a
--genotypeFilterExpression "isHet == 1" to do the same thing, but I want to annotate the VCF in the
FORMAT/FILTER field, rather than the INFO field for downstream variant selection operations.
Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).
Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with
bwa mem) is this:
"QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"and we also filter out heterozygous positions using "isHet == 1".
--output_mode EMIT_ALL_SITES \and
--emitRefConfidence GVCF \
Does this sound like a reasonable thing to do? What options should I use in step 8 in order for
HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using
--output_mode EMIT_ALL_CONFIDENT_SITES \ and
--emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).
currently, I am working on 454 sequencing data. I am using GATK to call SNPs and indels, which works relatively well for the SNPs with the recommended filtering options. However, I was interested to raise sensitivity, which is why I started experimenting with the QD value. I restarted the whole GATK pipeline (including the genotyping) with the filtering option "QD<0.0" instead of "QD<2.0". As I did not change any other parameters, I was just expecting to see more SNPs being called. However, I came across a list of SNPs, which did contain some new SNPs, but which was also lacking one of the previously called SNPs (it was not even present in the raw HaplotypeCaller output). How could this have happened?
This is an extract from the vcf file containing the SNP that was previously called:
4 106156187 rs17253672 C T 490.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.341;ClippingRankSum=1.022;DB;DP=161;FS=4.193;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.231;QD=3.05;ReadPosRankSum=1.347 GT:AD:DP:GQ:PL 0/1:10,12:22:99:519,0,262
I am very curious to hear whether anyone might have an explanation for my findings. Many thanks in advance!
I don't see any information from this post http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk,
I was wondering if it's possible to get the number of forward/reverse reads in the final VCF outputted by HaplotypeCaller?
Hi! I have worked some time on a mRNAseq set, single-end. Its a high quality set and lots of biological replicates (200+).
My question is, how could I best contribute to the methodology used for SNPs call in mRNAseq? What do we need tested to improve this method?
For exome-sequencing, I was wondering how I should decide the "-mode" parameter? should I use "SNP" or "Indel" or both? I want to get both information in the final results, should I use "BOTH" -mode?
I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ?
I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.
Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.
I have used
UnifiedGenotyper to call SNPS. I found some SNPs that has been reported from low quality reads in chromosome X and chromosome Y. Is it possible not to take low quality reads into account while calling SNPs using
UnifiedGenotyper? Or, do I need to do quality filtering of BAM files before hand ?