I am currently working on two different projects and interested in finding common and rare variants.
Project1 - Organism : Influenza virus Number of samples : 18 Sequencing type: Exome sequencing Alignment tool : BWA Analysis ready BAM files :
Organism : Human Number of samples : 24 Sequencing type: DNAseq (paired-end) Analysis ready BAM files :
What is the criteria to select the variant caller?
I have performed variant calling analysis for 24 samples using GATK pipeline. I need some clarifications on following things
1) If I generate single VCF file for each of the 24 samples individually and then generate a single VCF file containing all 24 samples,
Are there any differences between them in the output VCF?
The reason why I am asking this is, I have family level information and also symptom level information for those 24 samples.
Family level information for those 24 samples
FamilyA : Sample1, Sample2, Sample3
FamilyB : Sample4, Sample5, Sample6
FamilyH : Sample22, Sample23, Sample24
Symptom level information for those 24 samples
Joint pain : Sample1, Sample 4, Sample 14, Sample 15, Sample,16, Sample17
Bleeding : Sample2, Sample5, Sample6
Symptom X : …..
I would like to know whether the samples that are grouped together in the above scenario have any common genetic variants among them. In other words, are there 'secondary' variants elsewhere in the exome (other than the X gene) that are common amongst patients that suffer from the same symptoms.
case1: I am comparing individual VCF file (sample2.vcf, sample5.vcf and sample6.vcf) and filtering the common variants
case2: I am extracting just the sample2, sample5, and sample6 from the single VCF file with all 25 samples in it
As the above example, I would like to find common variants at the family level as well.
Hi everyone, I have a set of high-quality SNPs that were jointly called off a merged, realigned BAM. I then created a VCF for each sample using SelectVariants, re-ran UnifiedGenotyper with EMIT_ALL_SITES, and thinned the resulting VCF to no-call sites so I could mask them. Next, I produced a FASTA file for each sample with Ns at no-call sites and SNPs in their appropriate positions with FastaAlternateReferenceMaker. How can I use GATK to specify contigs where no reads were supported for a given sample and use this information to avoid outputting these regions via the -L flag with FastaAlternateReferenceMaker? Apologies if this is trivial, but I haven't found a clear solution.
I typed the following command for finding snps and indels: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R chr21/chr21.fa -I chr21/alignments/human38chr21.sorted.bam -o chr21/humanoutput.raw.snps.indels.vcf
I even got the vcf file but it contains only the header and does not contain the data lines.What maybe wrong? I hae attached the file containing the stack trace.
Is removing SNPs close to indel boundaries still recommended - having performed GATK local realignment and HaplotypeCaller - or do these processes mean that SNPs near indels are now called more accurately? If boundary SNPs should be excluded, what distance from the indel edge is recommended? (I don't expect there to be an established distance but an approximation would be good.)
Kindly help me in fixing following error while trying to build SNP recalibration model,
$ java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Reference/hg19.fa -input raw_variants.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /Reference/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /Reference/1000G_omni2.5.hg19.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /Reference/1000G_phase1.snps.high_confidence.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /Reference/dbsnp_138.hg19.vcf -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R
Thanks and Best Regards, Adnan
I am new to NGS analysis and have been following this pipeline recommended in many of the posts in the online forums. I have 3 samples(1 child, 2 parents) and completed analysis till generation of VCF files using HaplotypeCaller. I would like to find de novo mutations in the child, Is it a good idea to proceed for de novo mutations identification after annotation or before annotation? Please advise me on how to proceed with this ?
Note: This question has been asked in another forum but I didn't get suggestions on the GATK functions.
I have a question regarding the SNP call by GATK3.2 vs the eye observation in IGV; both use hg19: We have three samples, in the IGV, I see the following genotypes from BAM file (after realign and recalibration; before HaplotypeCaller): chr10:17659149 Sample 1: 9Gs and 11Ts Sample 2: 6Gs and 6Ts Sample 3: 18Gs
But when I check the vcf produced by GATK, it shows: chr10 17659149 rs7895850 C G,T 1509.20 PASS AC=4,2;AF=0.667,0.333;AN=6;DB;DP=41;FS=0.000;MLEAC=4,2;MLEAF=0.667,0.333;MQ=60.00;MQ0=0;POSITIVE_TRAIN_SITE;QD=25.48;VQSLOD=6.71;culprit=MQ GT:AD:DP:GQ:PL 1/1:0,14,0:14:41:493,41,0,493,41,493 2/2:0,11,0:11:33:429,429,429,33,33,0 1/1:0,16,0:16:48:704,48,0,704,48,704
If you look at the GT field, the corresponding genotypes are sample1 as G, sample2 as T, sample3 as G. They are quite different from the IGV for sample 1 and 2. I am wondering if you have any idea about this?
I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools:
bcftools norm -m +any
I remember from another thread that UG treats SNPs and InDels separately:
But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA?
I only had a minor issue despite tabix indexing my bgzipped known alleles file:
##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz
Feel free to delete the original post of this question.
I am experimenting with
JEXL expressions in order to do some hard filtering on variants. I wonder if there is a method to tell the filter expression to operate on the current sample (assuming here a single sample VCF file) without knowing the sample name a priori e.g.
Works just fine to sample heterozygous positions from a VCF where I know the sample will be called "Sample1", but can I refer to a sample name programmatically, e.g. something like:
vc.getGenotype( sample() ).isHet()
Sorry if this is a really dumb question. (BTW I realise you could use a
--genotypeFilterExpression "isHet == 1" to do the same thing, but I want to annotate the VCF in the
FORMAT/FILTER field, rather than the INFO field for downstream variant selection operations.
Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).
Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with
bwa mem) is this:
"QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"and we also filter out heterozygous positions using "isHet == 1".
--output_mode EMIT_ALL_SITES \and
--emitRefConfidence GVCF \
Does this sound like a reasonable thing to do? What options should I use in step 8 in order for
HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using
--output_mode EMIT_ALL_CONFIDENT_SITES \ and
--emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).
currently, I am working on 454 sequencing data. I am using GATK to call SNPs and indels, which works relatively well for the SNPs with the recommended filtering options. However, I was interested to raise sensitivity, which is why I started experimenting with the QD value. I restarted the whole GATK pipeline (including the genotyping) with the filtering option "QD<0.0" instead of "QD<2.0". As I did not change any other parameters, I was just expecting to see more SNPs being called. However, I came across a list of SNPs, which did contain some new SNPs, but which was also lacking one of the previously called SNPs (it was not even present in the raw HaplotypeCaller output). How could this have happened?
This is an extract from the vcf file containing the SNP that was previously called:
4 106156187 rs17253672 C T 490.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.341;ClippingRankSum=1.022;DB;DP=161;FS=4.193;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.231;QD=3.05;ReadPosRankSum=1.347 GT:AD:DP:GQ:PL 0/1:10,12:22:99:519,0,262
I am very curious to hear whether anyone might have an explanation for my findings. Many thanks in advance!
I don't see any information from this post http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk,
I was wondering if it's possible to get the number of forward/reverse reads in the final VCF outputted by HaplotypeCaller?
Hi! I have worked some time on a mRNAseq set, single-end. Its a high quality set and lots of biological replicates (200+).
My question is, how could I best contribute to the methodology used for SNPs call in mRNAseq? What do we need tested to improve this method?
For exome-sequencing, I was wondering how I should decide the "-mode" parameter? should I use "SNP" or "Indel" or both? I want to get both information in the final results, should I use "BOTH" -mode?
I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ?
I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.
Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.
I have used
UnifiedGenotyper to call SNPS. I found some SNPs that has been reported from low quality reads in chromosome X and chromosome Y. Is it possible not to take low quality reads into account while calling SNPs using
UnifiedGenotyper? Or, do I need to do quality filtering of BAM files before hand ?