Tagged with #snps
0 documentation articles | 0 announcements | 17 forum discussions


No articles to display.

No articles to display.


Created 2016-05-24 21:40:42 | Updated | Tags: unifiedgenotyper haplotypecaller gatk snps

Comments (1)

Hi,

I am currently working on two different projects and interested in finding common and rare variants.

Project1 - Organism : Influenza virus Number of samples : 18 Sequencing type: Exome sequencing Alignment tool : BWA Analysis ready BAM files :

  • BAM files are generated using BWA,
  • then sorted the bam
  • deduplicated bam file using picard (markduplicates) GATK variant caller : Unifiedgenotyper or HaplotypeCaller (Which one to be used?)

Project2 :
Organism : Human Number of samples : 24 Sequencing type: DNAseq (paired-end) Analysis ready BAM files :

  • BAM files are generated using BWA,
  • then sorted the bam
  • and finally deduplicated and recalibrated_reads.bam GATK variant caller : Unifiedgenotyper or HaplotypeCaller (Which one to be used?)

What is the criteria to select the variant caller?


Created 2016-05-24 21:03:25 | Updated | Tags: vcf snps multisample

Comments (1)

Dear All,

I have performed variant calling analysis for 24 samples using GATK pipeline. I need some clarifications on following things

1) If I generate single VCF file for each of the 24 samples individually and then generate a single VCF file containing all 24 samples,

  • Are there any differences between them in the output VCF?

  • if yes, what are the differences?

The reason why I am asking this is, I have family level information and also symptom level information for those 24 samples.

Family level information for those 24 samples

FamilyA : Sample1, Sample2, Sample3

FamilyB : Sample4, Sample5, Sample6

….

FamilyH : Sample22, Sample23, Sample24

Symptom level information for those 24 samples

Joint pain : Sample1, Sample 4, Sample 14, Sample 15, Sample,16, Sample17

Bleeding : Sample2, Sample5, Sample6

Symptom X : …..

For instance,

I would like to know whether the samples that are grouped together in the above scenario have any common genetic variants among them. In other words, are there 'secondary' variants elsewhere in the exome (other than the X gene) that are common amongst patients that suffer from the same symptoms.

  • I want to find common variants for the bleeding symptom, does the common variants differ between the case1 and case2 or not?

case1: I am comparing individual VCF file (sample2.vcf, sample5.vcf and sample6.vcf) and filtering the common variants

case2: I am extracting just the sample2, sample5, and sample6 from the single VCF file with all 25 samples in it

As the above example, I would like to find common variants at the family level as well.


Created 2016-05-10 22:25:04 | Updated 2016-05-10 22:29:05 | Tags: fasta snps fastaalternatereferencemaker

Comments (0)

Hi everyone, I have a set of high-quality SNPs that were jointly called off a merged, realigned BAM. I then created a VCF for each sample using SelectVariants, re-ran UnifiedGenotyper with EMIT_ALL_SITES, and thinned the resulting VCF to no-call sites so I could mask them. Next, I produced a FASTA file for each sample with Ns at no-call sites and SNPs in their appropriate positions with FastaAlternateReferenceMaker. How can I use GATK to specify contigs where no reads were supported for a given sample and use this information to avoid outputting these regions via the -L flag with FastaAlternateReferenceMaker? Apologies if this is trivial, but I haven't found a clear solution.

Thanks


Created 2016-04-06 19:37:38 | Updated | Tags: haplotypecaller indels snps

Comments (4)

Hi,

I typed the following command for finding snps and indels: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R chr21/chr21.fa -I chr21/alignments/human38chr21.sorted.bam -o chr21/humanoutput.raw.snps.indels.vcf

I even got the vcf file but it contains only the header and does not contain the data lines.What maybe wrong? I hae attached the file containing the stack trace.


Created 2016-03-23 15:49:08 | Updated | Tags: indels snps

Comments (1)

Is removing SNPs close to indel boundaries still recommended - having performed GATK local realignment and HaplotypeCaller - or do these processes mean that SNPs near indels are now called more accurately? If boundary SNPs should be excluded, what distance from the indel edge is recommended? (I don't expect there to be an established distance but an approximation would be good.)


Created 2016-01-17 12:15:32 | Updated 2016-01-17 12:16:21 | Tags: vqsr snps calibration-model

Comments (1)

Hello everyone,

Kindly help me in fixing following error while trying to build SNP recalibration model,

$ java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Reference/hg19.fa -input raw_variants.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /Reference/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /Reference/1000G_omni2.5.hg19.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /Reference/1000G_phase1.snps.high_confidence.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /Reference/dbsnp_138.hg19.vcf -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid argument value '-resource:hapmap known=false training=true truth=true prior=15.0' at positi

on 6.

ERROR Invalid argument value '/Reference/hapmap_3.3.hg19.vcf' at position 7.
ERROR Invalid argument value '-resource:omni known=false training=true truth=true prior=12.0' at position 8.
ERROR Invalid argument value '/Reference/1000G_omni2.5.hg19.vcf' at position 9.
ERROR Invalid argument value '-resource:1000G known=false training=true truth=false prior=10.0' at position 10.
ERROR Invalid argument value '/Reference/1000G_phase1.snps.high_confidence.hg19.vcf' at position 11.
ERROR Invalid argument value '-resource:dbsnp known=true training=false truth=false prior=2.0' at position 12.
ERROR Invalid argument value '/Reference/dbsnp_138.hg19.vcf' at position 13.
ERROR ------------------------------------------------------------------------------------------

Thanks and Best Regards, Adnan


Created 2015-05-08 07:52:31 | Updated 2015-05-08 07:54:39 | Tags: snps denovo

Comments (5)

Hi,

I am new to NGS analysis and have been following this pipeline recommended in many of the posts in the online forums. I have 3 samples(1 child, 2 parents) and completed analysis till generation of VCF files using HaplotypeCaller. I would like to find de novo mutations in the child, Is it a good idea to proceed for de novo mutations identification after annotation or before annotation? Please advise me on how to proceed with this ?

Note: This question has been asked in another forum but I didn't get suggestions on the GATK functions.

Thanks


Created 2015-03-31 20:22:27 | Updated | Tags: igv snps

Comments (2)

Hi All,

I have a question regarding the SNP call by GATK3.2 vs the eye observation in IGV; both use hg19: We have three samples, in the IGV, I see the following genotypes from BAM file (after realign and recalibration; before HaplotypeCaller): chr10:17659149 Sample 1: 9Gs and 11Ts Sample 2: 6Gs and 6Ts Sample 3: 18Gs

But when I check the vcf produced by GATK, it shows: chr10 17659149 rs7895850 C G,T 1509.20 PASS AC=4,2;AF=0.667,0.333;AN=6;DB;DP=41;FS=0.000;MLEAC=4,2;MLEAF=0.667,0.333;MQ=60.00;MQ0=0;POSITIVE_TRAIN_SITE;QD=25.48;VQSLOD=6.71;culprit=MQ GT:AD:DP:GQ:PL 1/1:0,14,0:14:41:493,41,0,493,41,493 2/2:0,11,0:11:33:429,429,429,33,33,0 1/1:0,16,0:16:48:704,48,0,704,48,704

If you look at the GT field, the corresponding genotypes are sample1 as G, sample2 as T, sample3 as G. They are quite different from the IGV for sample 1 and 2. I am wondering if you have any idea about this?


Created 2015-01-13 01:21:15 | Updated | Tags: unifiedgenotyper indels snps genotype-given-alleles

Comments (1)

I have a small UG question to ask and a minor bug to report, but I will give the whole context. I have cohorts at low and high coverage. I will probably call my low coverage data with UG and my high coverage data with HC. I will probably run VQSR per cohort. I will then create a union set of variants and recall all samples with GENOTYPE_GIVEN_ALLELES. I will probably use UG and HC for the low and high coverage recalling, respectively. Prior to generating the union set I will need to merge the UG outputted SNPs and InDels at the same position. I can merge SNPs and InDels with bcftools:

bcftools norm -m +any

I remember from another thread that UG treats SNPs and InDels separately:

http://gatkforums.broadinstitute.org/discussion/3375/combine-snp-and-indel-vcf

But I just checked that UG can do SNP and InDels calls at the same position, when an alleles file with merged SNPs and InDels is provided, when I use GENOTYPE_GIVEN_ALLELES. I expected it to fail, but it seems to work fine. SNPs and InDels are emitted at the same site/record/line. For one record and 3 samples (HOMREF, HET, HOMALT) the genotypes are correct. Can I trust the PLs? Can I also trust QUAL and other calculated values? Or is UG not meant to be used for calling SNPs and InDels at the same site with GGA?

I only had a minor issue despite tabix indexing my bgzipped known alleles file:

##### ERROR MESSAGE: An index is required, but none found., for input source: alleles.vcf.gz

Feel free to delete the original post of this question.


Created 2015-01-11 23:30:38 | Updated | Tags: selectvariants variantfiltration jexl snps

Comments (2)

I am experimenting with JEXL expressions in order to do some hard filtering on variants. I wonder if there is a method to tell the filter expression to operate on the current sample (assuming here a single sample VCF file) without knowing the sample name a priori e.g.

vc.getGenotype("Sample1").isHet()

Works just fine to sample heterozygous positions from a VCF where I know the sample will be called "Sample1", but can I refer to a sample name programmatically, e.g. something like: vc.getGenotype( sample() ).isHet()

Sorry if this is a really dumb question. (BTW I realise you could use a genotype_filter e.g. --genotypeFilterExpression "isHet == 1" to do the same thing, but I want to annotate the VCF in the FORMAT/FILTER field, rather than the INFO field for downstream variant selection operations.


Created 2015-01-09 16:47:35 | Updated | Tags: bqsr haplotypecaller variantfiltration vcf bam workflows snps gatk3

Comments (3)

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

  • 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
  • 02- BAM: MarkDuplicates
  • 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
  • 04- BAM: Calling variants using HaplotypeCaller
  • 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
  • 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
  • 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
  • 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).


Created 2014-12-12 10:42:37 | Updated | Tags: haplotypecaller snps genotyping qd

Comments (9)

Hey,

currently, I am working on 454 sequencing data. I am using GATK to call SNPs and indels, which works relatively well for the SNPs with the recommended filtering options. However, I was interested to raise sensitivity, which is why I started experimenting with the QD value. I restarted the whole GATK pipeline (including the genotyping) with the filtering option "QD<0.0" instead of "QD<2.0". As I did not change any other parameters, I was just expecting to see more SNPs being called. However, I came across a list of SNPs, which did contain some new SNPs, but which was also lacking one of the previously called SNPs (it was not even present in the raw HaplotypeCaller output). How could this have happened?

This is an extract from the vcf file containing the SNP that was previously called:

4 106156187 rs17253672 C T 490.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.341;ClippingRankSum=1.022;DB;DP=161;FS=4.193;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.231;QD=3.05;ReadPosRankSum=1.347 GT:AD:DP:GQ:PL 0/1:10,12:22:99:519,0,262

I am very curious to hear whether anyone might have an explanation for my findings. Many thanks in advance!


Created 2014-07-10 18:31:00 | Updated | Tags: haplotypecaller reads snps forward reverse

Comments (3)

I don't see any information from this post http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk,

I was wondering if it's possible to get the number of forward/reverse reads in the final VCF outputted by HaplotypeCaller?


Created 2013-10-24 11:31:46 | Updated 2013-10-24 11:38:15 | Tags: snp rnaseq snps mrnaseq

Comments (2)

Hi! I have worked some time on a mRNAseq set, single-end. Its a high quality set and lots of biological replicates (200+).

My question is, how could I best contribute to the methodology used for SNPs call in mRNAseq? What do we need tested to improve this method?


Created 2013-08-13 04:31:16 | Updated | Tags: variantrecalibrator indels snps mode

Comments (3)

For exome-sequencing, I was wondering how I should decide the "-mode" parameter? should I use "SNP" or "Indel" or both? I want to get both information in the final results, should I use "BOTH" -mode?


Created 2013-03-26 11:34:41 | Updated | Tags: unifiedgenotyper snp snps

Comments (3)

I was wondering, in the vcf output from UnifiedGenotyper what metrics go into the variant quality score, QUAL ?

I'm assuming that depth, DP is one of them but I can't find further information. Apologies if I've missed it.

Attached are some plots of the depth and quality distribution for variant calls, and also of the relationship between depth and quality. I'm slightly worried that the bimodal quality distribution indicates an error, otherwise I've just attached the graphs for general interest.


Created 2013-02-13 07:39:15 | Updated | Tags: unifiedgenotyper snp snps

Comments (1)

I have used UnifiedGenotyper to call SNPS. I found some SNPs that has been reported from low quality reads in chromosome X and chromosome Y. Is it possible not to take low quality reads into account while calling SNPs using UnifiedGenotyper? Or, do I need to do quality filtering of BAM files before hand ?