A biallelic site is a specific locus in a genome that contains two observed alleles, counting the reference as one, and therefore allowing for one variant allele. In practical terms, this is a site where, across multiple samples in a cohort, there is evidence for a single non-reference allele. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion at position 7. Sample 4 matches the reference. This is considered a biallelic site because there are only two possible alleles: a deletion, or the reference allele G.
            1 2 3 4 5 6 7 8 9
Reference:  A T A T A T G C G
Sample 1 :  A T A T A T - C G
Sample 2 :  A T A T A T - C G
Sample 3 :  A T A T A T - C G
Sample 4 :  A T A T A T G C G
A multiallelic site is a specific locus in a genome that contains three or more observed alleles, again counting the reference as one, and therefore allowing for two or more variant alleles. This is a site where, across multiple samples in a cohort, there is evidence for two or more non-reference alleles. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion or a SNP at the 7th position. Sample 4 matches the reference. This is considered a multiallelic site because there are four possible alleles: a deletion, the reference allele G, a C (SNP), or a T (SNP). True multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.
            1 2 3 4 5 6 7 8 9
Reference:  A T A T A T G C G
Sample 1 :  A T A T A T - C G
Sample 2 :  A T A T A T C C G
Sample 3 :  A T A T A T T C G
Sample 4 :  A T A T A T G C G
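The two definitions above reduce to counting distinct alleles at a locus, with the reference always counted as one. A minimal Python sketch of that rule follows. The function name `classify_site` and the "monomorphic" label for a site with no variant allele are illustrative choices, not GATK terminology or API.

```python
def classify_site(ref_allele, sample_alleles):
    """Classify one locus by the number of distinct alleles observed."""
    # The reference always counts as one allele, even if no sample carries it.
    alleles = {ref_allele} | set(sample_alleles)
    if len(alleles) == 1:
        return "monomorphic"  # illustrative label: no variant allele seen
    return "biallelic" if len(alleles) == 2 else "multiallelic"


# Position 7 from the two toy examples; "-" stands for the deletion.
biallelic_example = classify_site("G", ["-", "-", "-", "G"])
multiallelic_example = classify_site("G", ["-", "C", "T", "G"])
print(biallelic_example, multiallelic_example)
```

The set makes the count independent of how many samples carry each allele, which matches the definitions: only the number of distinct alleles matters.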
Hi, I ran GenotypeGVCFs (on RNA-seq data) on several samples and I'm getting several positions that I'm having a hard time interpreting. They look like this example:
chr6 135674438 . C A,* 10623.61 PASS AC=1,1;AF=0.167,0.167;AN=6;DP=860;FS=2.692;MLEAC=1,1;MLEAF=0.167,0.167;MQ=60.00;QD=26.76;SOR=0.390 GT:AD:DP:GQ:PL 1/2:2,0,223:401:99:10634,6990,5967,1315,0,4631 0/0:23,0,0:23:69:0,69,672,69,672,672 0/0:5,0,0:5:0:0,0,98,0,98,98 ./.:0,0,0:0 ./.:0,0,0:0 ./.:0,0,0:0 ./.:0,0,0:0 ./.:0,0,0:0
What I don't understand is where the "A" ALT allele comes from.
All the samples except the first (the one with GT = 1/2) are either homozygous for the REF allele or do not cover this site. The first sample is indicated to have 2 reads with the REF allele, 0 reads with the "A" allele, and 223 reads with the "*" allele, which I guess is all other possible alleles?
So my question is why is the "A" allele even indicated if it is not supported by any of the samples?
Thanks a lot
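For readers following the record above, the AD values are listed in allele order: REF first, then each ALT in the order given. A small Python sketch that pairs them up for the first sample is shown here. The helper name `allele_depths` is hypothetical, and the parsing assumes the FORMAT keys shown in the record.

```python
def allele_depths(ref, alt, sample_field, format_keys="GT:AD:DP:GQ:PL"):
    """Pair each allele (REF, then ALTs in order) with its AD read count."""
    # Sample values are colon-separated in the same order as the FORMAT keys.
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))
    alleles = [ref] + alt.split(",")
    depths = [int(d) for d in values["AD"].split(",")]
    return values["GT"], dict(zip(alleles, depths))


# First sample of the chr6:135674438 record quoted in the question.
gt, depths = allele_depths(
    "C", "A,*", "1/2:2,0,223:401:99:10634,6990,5967,1315,0,4631"
)
print(gt, depths)
```

Laid out this way, the reads reported for this sample are split between the REF allele and the second ALT, with none assigned to the first ALT, which is exactly the situation the question describes.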
I'm running HaplotypeCaller (latest nightly build) with the -A StrandAlleleCountsBySample parameter to get strand-specific read counts (SAC). For variants with more than the default maximum of 6 alt alleles there is a problem with the SAC field:
2 47641559 . TAAAAAAAAAAA T,TA,TAA,TAAA,TAAAA,TAAAAAA,<NON_REF> 1308.73 . BaseQRankSum=0.434;ClippingRankSum=0.768;DP=105;ExcessHet=3.0103;MLEAC=0,0,0,0,0,1,1;MLEAF=0.00,0.00,0.00,0.00,0.00,0.500,0.500;MQRankSum=-1.704;RAW_MQ=378000.00;ReadPosRankSum=1.971 GT:AD:DP:GQ:PL:SAC:SB 6/7:3,0,0,3,4,5,16,9
So there are 9 reads originating from an allele other than the listed alt alleles (= NON_REF), but the SAC field is missing these reads. This gets especially annoying if one of the NON_REF alleles is selected as most likely when combining the sample with others in GenotypeGVCFs.
11 108141955 . CTTTT C,CT,CTT,CTTT,ATTTT,TTTTT,<NON_REF> 1552.73 . BaseQRankSum=-0.227;DP=704;ExcessHet=3.0103;MLEAC=0,0,0,1,0,0,0;MLEAF=0.00,0.00,0.00,0.500,0.00,0.00,0.00;MQ=60.02;MQRankSum=-0.254;ReadPosRankSum=1.249 GT:AD:DP:GQ:PL:SAC:SB 0/4:431,5,4,27,127,4,3,3
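To make the two records above easier to read, the genotype indices can be resolved to allele strings and their AD counts. This is a rough Python sketch; `resolve_genotype` is a hypothetical helper, and it assumes a fully called genotype (no "." entries) and one AD value per allele, REF included.

```python
def resolve_genotype(ref, alt, gt, ad):
    """Return (allele, AD count) for each allele index in a called genotype."""
    alleles = [ref] + alt.split(",")
    depths = [int(x) for x in ad.split(",")]
    if len(depths) != len(alleles):
        raise ValueError("AD must have one value per allele, REF included")
    # Treat phased ("|") and unphased ("/") genotypes the same way.
    indices = [int(i) for i in gt.replace("|", "/").split("/")]
    return [(alleles[i], depths[i]) for i in indices]


# GT and AD values copied from the two records in the post.
first_call = resolve_genotype(
    "TAAAAAAAAAAA", "T,TA,TAA,TAAA,TAAAA,TAAAAAA,<NON_REF>",
    "6/7", "3,0,0,3,4,5,16,9",
)
second_call = resolve_genotype(
    "CTTTT", "C,CT,CTT,CTTT,ATTTT,TTTTT,<NON_REF>",
    "0/4", "431,5,4,27,127,4,3,3",
)
print(first_call)
print(second_call)
```

In the first record the genotype 6/7 resolves to the sixth ALT and `<NON_REF>`, so the 9 reads attributed to `<NON_REF>` belong to a called allele rather than to background.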
Is there some way to make the QD/FS fields in the VCF support multiallelic sites? I want to filter the VCF by QD/FS for RNA data.
1) The QD/FS fields are NOT declared as 'A' in the VCF header. Per the VCF specification: if the field has one value per alternate allele, then this value should be 'A'.
2) There is no way to have GATK output the alleles of a multiallelic site separately. Currently, all alleles of a multiallelic site share the same QD/FS value.
Best Regards. Wang Yugui
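The consequence of point 2 can be sketched in Python: because QD and FS carry a single value per record, a filter on them keeps or drops a multiallelic record as a whole. The function names and the thresholds (`min_qd=2.0`, `max_fs=30.0`) are illustrative assumptions for this sketch, not values taken from the post.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value if value else True
    return fields


def passes_site_filters(info, min_qd=2.0, max_fs=30.0):
    """Apply site-level QD/FS cutoffs; return None if either is absent."""
    fields = parse_info(info)
    if "QD" not in fields or "FS" not in fields:
        return None  # cannot evaluate, e.g. a record without QD
    # One QD and one FS per record: every ALT allele shares the verdict.
    return float(fields["QD"]) >= min_qd and float(fields["FS"]) <= max_fs


# INFO string from a multiallelic record (ALT = A,*) quoted earlier on this page.
info = ("AC=1,1;AF=0.167,0.167;AN=6;DP=860;FS=2.692;MLEAC=1,1;"
        "MLEAF=0.167,0.167;MQ=60.00;QD=26.76;SOR=0.390")
site_passes = passes_site_filters(info)
print(site_passes)
```

Note that AC and AF in the same INFO string do carry one value per ALT allele, while QD and FS do not, which is the asymmetry the post is pointing at.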
Is GATK UnifiedGenotyper able to call multi-allelic positions in a single pooled sample? Our case is a pool of 13 samples, and we use UG with ploidy set to 26. If I understand the supplementary material of the original publications correctly, UG will never be able to call three alleles at a single position in single-sample calling. Or does this not hold for high-ploidy analysis?
If needed, we can call multiple pools together, but this becomes computationally intensive.
In summary, we would like to be able to make a call such as 14xG, 6xA, 6xT.
Also, how does UG take noise (sequencing errors) into account when genotyping? For example, when 3% of reads are aberrant at a position, this could correspond to ~1/26.
Thanks for any guidelines,
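The arithmetic behind this question can be written out as a short Python sketch. It only computes the read fractions expected under ideal, unbiased sampling from the pool, and the variable names are illustrative; it says nothing about how UG itself models errors.

```python
ploidy = 26  # 13 diploid samples pooled together

# The example call from the question: 14 copies of G, 6 of A, 6 of T.
counts = {"G": 14, "A": 6, "T": 6}

# Expected fraction of reads per allele if every chromosome copy
# contributes equally to the sequenced pool.
expected_fraction = {a: c / ploidy for a, c in counts.items()}

# Fraction contributed by an allele present on a single chromosome copy.
single_copy_fraction = 1 / ploidy

for allele, fraction in expected_fraction.items():
    print(f"{allele}: {fraction:.3f}")
print(f"single copy: {single_copy_fraction:.4f}")
```

A single allele copy in the pool corresponds to roughly 3.8% of reads, so a 3% rate of aberrant reads sits very close to the signal expected from a real one-copy variant.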
I would like to know if GATK can call tri-allelic variants in one single sample? I am asking this because I am interested in clonal mosaicism and then looking at tri-allelic variants might be a way to look into that...
Thanks in advance, João Fadista