When trying to call SNPs/indels with GATK v3.3, I ran into a problem: how can I get the following datasets?
- True sites training resource: HapMap
- True sites training resource: Omni
- Non-true sites training resource: 1000G
- Known sites resource, not used in training: dbSNP
- Known and true sites training resource: Mills
Does GATK provide the corresponding VCF files, such as "hapmap.vcf", "omni.vcf", "1000G.vcf", "dbsnp.vcf" and "mills.vcf"?
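For context, these resources are distributed in the GATK resource bundle under versioned names such as hapmap_3.3.b37.vcf, 1000G_omni2.5.b37.vcf, 1000G_phase1.snps.high_confidence.b37.vcf, dbsnp_138.b37.vcf and Mills_and_1000G_gold_standard.indels.b37.vcf (the exact names depend on the bundle release). A sketch of how they are passed to VariantRecalibrator in SNP mode, with all input/output names as placeholders:

```shell
# Sketch only: resource file names follow the b37 GATK bundle and may
# differ in your bundle version; raw_variants.vcf and the output names
# are placeholders.
java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R human_g1k_v37.fasta \
    -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS \
    -mode SNP \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches
```

The Mills resource is used analogously in a separate -mode INDEL run.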
I have 11 exome-sequenced samples which I'd like to analyze with VQSR, and therefore need to borrow exomes from the 1000 Genomes Project. To find samples that are similar to mine and were sequenced in a similar way, I've narrowed the selection by gender (my 11 are female), super population (EUR), instrument platform (Illumina), library layout (paired end), analysis group (exome) and quality control (not withdrawn). I'm also trying to match read depths and read lengths to my samples, but I'm having some difficulty with that. I'd like your advice on the following points, regarding their importance for VQSR:
1) I have limited options for finding 1000G exomes that match my samples in both read depth and read length at the same time. Which of the two do you think is more important if I loosen the restriction on one of them? To explain further: eight of my samples have a read length (rl) of 90 bp and a read count (rc) of ~30 million; the other three have rl 100 bp and rc ~45 M. I've considered running these two subsets separately, each with its own set of added 1000G exomes, if that would be advisable. But almost all 1000G exomes with rl 90 have rc in the range 45-80 M, while rather few 1000G exomes with rl 100 have rc ~45 M. If it's OK to loosen either the rl or the rc restriction, there's no problem finding enough 1000G exomes, but then I'd be grateful for your advice on whether to run the subsets separately or not.
2) Instrument model: should I perhaps restrict to, e.g., the Illumina HiSeq 2000 rather than allowing any Illumina platform?
With many thanks in advance.
I have a human exome experiment on which I am using hg19 resources (reference, targets, dbSNP, ... the whole shebang). I want to add some 1000Genomes exomes to this experiment, but the available BAMs are from GRCh37.
Is there a tool to port the BAMs from GRCh37 to hg19, and to continue with that? Maybe LiftOver?
Or do you rather recommend re-processing the 1000 Genomes BAMs on hg19? Would that mean regenerating FASTQs and re-doing the whole map/MarkDup/IndelReal/BQSR chain?
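If re-processing is the answer, the regenerate-and-remap step described above could be sketched as follows (a sketch under the assumption that Picard, BWA and samtools are available; all file names are placeholders):

```shell
# Sketch, not a definitive pipeline: revert the GRCh37 BAM to FASTQ,
# then remap against hg19. MarkDuplicates, IndelRealigner and BQSR
# would follow on the new BAM.
java -jar picard.jar SamToFastq \
    INPUT=grch37_sample.bam \
    FASTQ=sample_R1.fastq \
    SECOND_END_FASTQ=sample_R2.fastq

bwa mem -M -t 8 hg19.fasta sample_R1.fastq sample_R2.fastq \
    | samtools sort -o sample.hg19.sorted.bam -
```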
For now, I have worked with the original BAMs, but renamed all the classical chromosomes from "1" to "chr1" etc., and dropped the mitochondrial chromosome and all other contigs (I removed those contigs from the resources as well, to avoid GATK's complaints about missing contigs). How bad do you think that is, given the differences you know of between GRCh37 and hg19?
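For what it's worth, the renaming step described above can be sketched with sed and samtools reheader (a sketch under the assumption that only the @SQ header lines need rewriting; file names are placeholders):

```shell
# Rewrite GRCh37-style contig names (1..22, X, Y) in @SQ header lines
# to hg19-style chr-prefixed names; MT and the other contigs are
# handled separately, as described above.
rename_contigs() {
  sed -E 's/\tSN:([0-9]+|X|Y)\t/\tSN:chr\1\t/'
}

# Demonstration on a single header line:
printf '@SQ\tSN:1\tLN:249250621\n' | rename_contigs

# In practice, apply it to the real header:
#   samtools view -H grch37.bam | rename_contigs > new_header.sam
#   samtools reheader new_header.sam grch37.bam > renamed.bam
```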
Thanks a lot for your help!
I'm trying to use RealignerTargetCreator as a test with one known file, 1000G_phase1.indels.b37.vcf. At first the contigs didn't match those in my BAM file (chr1/chr2 vs 1/2), so I adjusted that. Now, running it for the first time with

java -jar [path to GenomeAnalysisTK.jar] -T RealignerTargetCreator -R [.fasta] -I [.bam] -o [.intervals] -known [path to 1000g_phase1_adjusted.indels.b37.vcf]

it gives the following error: "I/O error loading or writing tribble index file for [path to 1000g]". When running it a second time, I get a different error, "Problem detecting index type", because the .idx file was not created correctly.
What am I doing wrong?
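A usual culprit with this pair of errors is a stale or half-written Tribble index next to the VCF, often because GATK could not write into that directory. A minimal sketch of the fix (the file name is the one from the question; adjust the path to yours):

```shell
# Remove the broken/partial index so GATK can rebuild it on the next
# run, and make sure the directory holding the VCF is writable by the
# user running GATK.
rm -f 1000g_phase1_adjusted.indels.b37.vcf.idx
chmod u+w "$(dirname 1000g_phase1_adjusted.indels.b37.vcf)"
```

Rerunning the same RealignerTargetCreator command afterwards should regenerate the .idx cleanly.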