GATK 3.1 was released on March 18, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
--pair_hmm_implementation VECTOR_LOGLESS_CACHING. Please see the 3.1 Version Highlights for more details about expected speed ups and some background on the collaboration that made these possible.
Hi everyone, I have a set of high-quality SNPs that were jointly called off a merged, realigned BAM. I then created a VCF for each sample using SelectVariants, re-ran UnifiedGenotyper with EMIT_ALL_SITES, and thinned the resulting VCF to no-call sites so I could mask them. Next, I produced a FASTA file for each sample with Ns at no-call sites and SNPs in their appropriate positions with FastaAlternateReferenceMaker. How can I use GATK to specify contigs where no reads were supported for a given sample and use this information to avoid outputting these regions via the -L flag with FastaAlternateReferenceMaker? Apologies if this is trivial, but I haven't found a clear solution.
I am trying to generate a neutral genome between two chicken breeds to reduce the alignment bias when I count the SNP count ratio between breeds.
I want to replace every SNP on the reference genome to N or IUPAC ambiguity codes if the SNP is different between two breeds. (for example, use R if Breed1 has A and Breed2 has G on certain position).
I tried two different ways but both seems not working. First, Use both bam files from each breed as input for HaplotypeCaller (or merged bam files then HaplotypeCaller), filtered SNPs and run FastaAlternateReferenceMaker giving --use_IUPAC_sample argument -> --use_IUPAC_sample takes only one of the samples (Breed1 or Breed2)
java -jar /software/gatk/3.5/static/GenomeAnalysisTK.jar \ -T FastaAlternateReferenceMaker \ -R /share/zhoulab/Referencegenome/Wholegenomefasta/genome.fa \ --use_IUPAC_sample breed1 \ -o breed1_iupac.fa \ -V filtered.snp.parent.vcf
Second, Run HaplotypeCaller for each bamfile seprately, filtered SNPs and run FastaAlternateReferenceMaker two rounds (both with --use_IUPAC_sample argument) which first converts reference.fa with Breed1.vcf then add use output.fa to add Breed2 vcf info. -> gave error for "Input files variant and reference have incompatible contigs." so wouldn't let me run the second round
So is there any way to do it? or is this even possible?
Thank you in advance!!
I'm trying to recreate alternate RNA transcripts/alleles for each of my individuals for which I have RNA-seq data. Using FastaAlternateReferenceMaker I can recreate the whole genome from my vcf files but is there a simple way of using the reference RNA transcripts as a reference rather than the reference genome? I know I can enter a list of intervals of interest but for RNA transcripts I need to concatenate exons and produce a reverse complement if on the minus strand.
Thanks for any advice, Mark
I am using FastaAlterenateReferenceMaker to generate new reference genome using SNPs in .vcf file. I used following command:
java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R genome.fa -o newgenome.fa -V snps.vcf
I got this error:
##### ERROR MESSAGE: Input files variant and reference have incompatible contigs: The contig order in variant and referenceis not the same; to fix this please see (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files.
##### ERROR variant contigs = [10, 1, 2, 3, 4, 5, 6, 7, 8, 9, MT, Pt]
##### ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, MT, Pt]
vfc file contains SNPs for only one genome (ch6) and does not have any contig description line, and using the script in the link didn't work. I tried changing the order of the genome.fa to be same as variant contigs but this time, the chromosomes of the newgenome.fa were wrong (e.g. ch7 had the sequence of ch6, ch1 had the sequence of ch10, etc.)
How can I fix this? Thank you in advance.
I tried to use FastaAlternateReferenceMaker to create a fasta file using a variant file (with SNPs and INDELs) and GRCh37 reference genome for Chromosome 22 earlier and now Chromosome 19. I noticed that some variants are not incorporated in the output sequence though the QUAL value is 100 and FILTER value is PASS for the variants. There are no overlapping SNPs or INDELs in these locations. For example, when trying to create the output sequence for CHR 19, it did not include the variant in position 540805 in the attached VCF file. Am I missing something? Appreciate your input. I am not able to upload the VCF file. if it would help, here are a few lines from the VCF file:
19 539279 rs2288956 C T 100 PASS AA=C|||;AC=1;AF=0.500;AFR_AF=0.0915;AMR_AF=0.2493;AN=2;DP=11750;EAS_AF=0.1776;EUR_AF=0.2883;NS=2504;SAS_AF=0.1534;VT=SNP GT 1|0 19 539871 rs17672563 C T 100 PASS AA=C|||;AC=1;AF=0.500;AFR_AF=0.0915;AMR_AF=0.2493;AN=2;DP=15431;EAS_AF=0.1746;EUR_AF=0.2883;NS=2504;SAS_AF=0.1554;VT=SNP GT 1|0 19 540648 rs376850117 GC G 100 PASS AA=ccccc|CCCCCC|CCCCC|insertion;AC=1;AF=0.500;AFR_AF=0.7375;AMR_AF=0.9452;AN=2;DP=10125;EAS_AF=0.9018;EUR_AF=0.9384;NS=2504;SAS_AF=0.8814;VT=INDEL GT 1|0 19 540805 rs578255597 G AC 100 PASS AA=?|-|C|unsure;AC=1;AF=0.500;AFR_AF=0.6657;AMR_AF=0.9438;AN=2;DP=9576;EAS_AF=0.999;EUR_AF=0.9294;NS=2504;SAS_AF=0.9162;VT=INDEL GT 1|0
This is the command that I used:
java -Xmx2g -jar ../GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T FastaAlternateReferenceMaker --variant outputChr19NA07056Edited.vcf -o 1000Genomes_chr19outputNA07056.fasta -L 19
I have a question about using the GATK AlternateReferenceMaker and how it may affect down stream analysis. I apologize for the long post but I tried to include only relevant info. Any thoughts on analysis are helpful!
Here is basically what I did:
Used GATK AlternateReferenceMaker to insert differences and create a "pseudo"-reference for my species
How can I tell if these were produced from the process I used to create a "pseudo"-reference for my species? Could they be misalignments from sequence contigs to reference or between species? How would this affect the SNP calling proceess?
What is the best way to ground truth the sequencing region? Should I try to pull the alignments from the region around each SNP for each individual and look at them manually? How can I do this?
Thanks for your thoughts! Tammy
Hi, I am using FastaAlternateReferenceMaker from GenomeAnalysisTK-3.4-46
my vcf-file looks like this:
#header... ##source=SelectVariants #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample11 scaffold-50 1055905 . G C,T,A 14632.70 PASS AC=0,0,2;AF=0.00,0.00,1.00;AN=2;BaseQRankSum=-1.612;ClippingRankSum=-0.732;DP=56;FS=0.660;InbreedingCoeff=1.0000;MQ=60.00;MQ0=0;MQRankSum=-0.369;QD=30.79;ReadPosRankSum=0.672;SOR=0.742 GT:AD:DP:GQ:PL 3/3:0,0,0,56:56:99:2635,2647,2745,2635,2647,2635,178,184,178,0
this is how I start the program:
java -jar GenomeAnalysisTK.jar \ -T FastaAlternateReferenceMaker \ -R C17_ref.fna \ -o test.fasta \ -L scaffold-50:1055905-1055905 \ -V results/tmp/J65_clean.vcf
and the resulting fasta file will always have a C instead of an A which would be correct here (if i use A,T,C in the ALT column I will get an A). so it looks like the first ALT nucleotide is taken and it totally ignores the 3/3 in the last column (if I change it to 1/1 or 0/0 nothing changes)
What am I doing wrong?
I am trying to use a VCF containing snps variants to change the mouse reference (GRCm38- c57BL/6J) with BALB/cJ snps.
After running this command:
java -jar ~/programs/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R ~/genome/mouse_GRCm38.p4/GRCm38.primary_assembly/GRCm38.primary_assembly.fa -o ~/BALBcJ.snp.primary.fa -V ~/BALB_cJ.snps.vcf
The following ERROR shows up:
So Trying to fix, I used the perl script in the link to sort properly within the reference.
I did this:
./sortByRef.pl ~/BALB_cJ.snps.vcf /home/tiagocastro/genome/mouse_GRCm38.p4/GRCm38.primary_assembly/GRCm38.primary_assembly.fa.fai > ~/BALB_cJ.snps_sorted.vcf
using the new vcf file, a new error is shown:
looking the head of each, sorted and basic vcf, I can see that is different.
Can someone help me?
I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.
Now I wish to generate some genotype specific FASTA sequences using these files and a reference sequence; i.e. a sequence for each genotype which is the same as the reference sequence but with the SNPs specific to each genotype in place of their counterparts in the reference sequence.
Now I know that there is variation in the genotypes. Here is a picture visualizing three exemplar genotypes that I generated by loading the VCF file into Geneious.
I then try to create individual VCF files for each genotype using this:
java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T SelectVariants --variant ~/Path/to/complete\vcf/example.vcf -o ~/Path/to/individual/genotype.vcf -sn genotype
While I can't be sure this had the desired effect as it is difficult to assess a whole VCF file I can say that the header now only contains the relevant genotype so I assume this is the case.
I then try and use this individual VCF file for each genotype like this:
java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T FastaAlternateReferenceMaker --variant~/Path/to/individual/genotype.vcf -L chrX:XX,XXX,XXX-XX,XXX,XXX -o ~/Path/to/individual/genotype.fasta
Here the Xs represent the location on the reference sequence of the regions of interest.
I did this in a loop and got identical sequences for every genotypes. I then implemented it individually for the 3 exemplar genotypes in the picture above and in both cases I get identical sequences for every genotype. Interestingly they are not the reference sequence.
What am I doing wrong?
I have also posted this on the Biostars forum..
I'm trying to use the FastaAlternateReferenceMaker tool to make a new reference genome for read-mapping.
However, it merges all the chromosomes into one, simply called "1". Is there a command that allows chromosome/scaffold names to be maintained.
Also I guess that any frameshifts caused by insertions or deletions affect the base co-ordinates relative to the original assembly, making comparison with other data problematic.
Hello, I really like yor FastaAlternateReferenceMaker to build a new consensus from FASTA+VCF. However, it deletes all headers from the FASTA file and replaces them by numbers. Is there a way to change this behavior, e.g. simply take the old headers?
I think supplying the error/out file from the bsub job I ran for GATK FastaAlternateReferenceMaker will be more helpful than any explanation I could write. The reference fasta is ~85Mb and the output Alternate.consensus.fna ended up ~24Mb when error occurred. Let me know if you need anymore details to diagnose the problem. Thx
I was working with FastaAlternateReferenceMaker today. I wanted to use it on two samples. The first sample worked fine. The second sample came up with an error about a few REF values not conforming (e.g. :A) in the vcf. I wrote a script to correct the errors and remove the : from the vcf file. I tried to run it again, but received memory allocation errors. So, I dealt with that and am now allocating 6g for a 15G VCF and a 188M fasta reference. I got it running (I think) and then started getting the NegativeArraySizeException java error. I saw this had come up as a bug in earlier versions of GATK, but it seemed like it had been resolved...perhaps not? Anyhow, this 'quick little run' has taken up the better part of my day and I would be super greatful for some assistance troubleshooting this problem.
Command: $ java -Xmx6g -jar ../../shays/GATK/GenomeAnalysisTK.jar -l ERROR -R ../up/chrom6.2.fa -T FastaAlternateReferenceMaker -o D-chr6.cons.fa --variant D-c6.vcf
first of all thanks for the support offered here. But (sadly) I have an other two questions.
I used the FastaAlternateReferenceMaker to create a consesus sequence and it worked great. Here I have a question regarding a warning: "IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation". I do not get this warning. Which dictionary is needed here. Is this warning comming up, because I used .gz and .tbi compressed vcf-files?
And a furhter question regarding the creation of a consesus sequence. I`m unsure what to do in cases of heterozygous INDELs. What would you suggest to do? Skip them as they are only represented by one allele or include them in the consensus. Both seems to be incorrect. Do you know the general handling here?
I`m trying to use the option --useIUPAC with the FastaAlternateReferenceMaker. I tried several ways to set this option but all lead to ERRORs (see below). Without the option no ERROR message occurrs.
Here my code and...
java -Xms1g -jar "/home/GenomeAnalysisTK.jar" \ -T "FastaAlternateReferenceMaker" \ -R "/home/refGen.fa" \ -L "chromosome3" \ -lw "100" \ --variant "/home/chromosome3_raw_snp_filter3.vcf.gz" \ -o "/home/chromosome3.fasta" \ --useIUPAC
the ways I used useIUPAC:
--useIUPAC ##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined. --useIUPAC "true" ##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined. -useIUPAC "true" ##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined. -useIUPAC ##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined.
I would like to change the reference genome with the sequences of the variants I've found in my deep sequencing data. So here is my command line: java -jar GenomeAnalysisTK.jar -R ../human_g1k_v37.fasta -T FastaAlternateReferenceMaker -o ../output_new_genome.fasta -L ../cancer.intervals -- ../output.snp.indel.LOH.vcf
I got this error:
Don't really understand...
Thank you for your time and help
Dear all I am writing to you to understand why FastaAlternateReferenceMaker is missing a deletion in the final consensus. I am using gatk v. 2.8-1 and this commandline:
gatk -R ref.fa -T FastaAlternateReferenceMaker -o reads_vs_ref_gatk_consensus.fa --variant reads_vs_ref_var.vcf
output from gatk was:
INFO 08:57:46,123 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,126 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15
INFO 08:57:46,126 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 08:57:46,126 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 08:57:46,130 HelpFormatter - Program Args: -R ref.fa -T FastaAlternateReferenceMaker -o reads_vs_ref_gatk_consensus.fa --variant reads_vs_ref_var.vcf
INFO 08:57:46,130 HelpFormatter - Date/Time: 2014/02/20 08:57:46
INFO 08:57:46,131 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,131 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,137 ArgumentTypeDescriptor - Dynamically determined type of reads_vs_ref_var.vcf to be VCF
INFO 08:57:46,856 GenomeAnalysisEngine - Strictness is SILENT
INFO 08:57:46,978 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 08:57:47,010 RMDTrackBuilder - Creating Tribble index in memory for file reads_vs_ref_var.vcf
INFO 08:57:47,079 RMDTrackBuilder - Writing Tribble index to disk for file /home/aprea/test/consensus/reads_vs_ref_var.vcf.idx
INFO 08:57:47,223 GenomeAnalysisEngine - Preparing for traversal
INFO 08:57:47,224 GenomeAnalysisEngine - Done preparing for traversal
INFO 08:57:47,225 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 08:57:47,226 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 08:57:47,546 ProgressMeter - done 1.05e+04 0.0 s 30.0 s 99.0% 0.0 s 0.0 s
INFO 08:57:47,547 ProgressMeter - Total runtime 0.32 secs, 0.01 min, 0.00 hours
INFO 08:57:51,452 GATKRunReport - Uploaded run statistics report to AWS S3
There are 3 variants in the vcf file: the first is a SNP and is reported, the second is an insertion and is reported but the third, a deletion, is missing. Could you please tell me where's the problem? I attached both reference and vcf file.
I am using FastaAlternateReferenceMaker to create consensus sequences as follows:
java -Xmx12g -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R /Reference/chromosome.1.fa -o /Output/consensus.1.fa --variant /VCF/chromosome1.vcf -L /Interval/chromosome.1.list
This command line produces a single fasta file including consensus sequences for the intervals provided in the list file.
(1) Presumably, sequence header for each sequence in the consensus fasta file corresponds to the contig name in the master reference file. Is there any way to modify the command line to print, say, the interval as the sequence header instead of the contig name?
Note: I can feed the program one interval at a time and name the output file accordingly, however, I'd like to stick to submitting my query per one chromosome at a time for the sake of saving up time.
(2) Number of sequences in the consensus fasta do not match the number of non-overlapping intervals in the input list. I would think that this is because some intervals are variant-free and therefore no alternate reference is reported for them. Do you confirm this is the case? I cannot easily check because the issue mentioned in (1).
I tried to run FastaAlternateReferenceMaker and I get the following error:
WARNING 2013-09-18 16:28:28 IntervalList Ignoring interval for unknown reference: Chr1:3580210-3580286
For all the intervals I submitted. I already looked around on the web, and I did not find any answer, knowing that my chromosome names are all with the 'Chr' format in all the files and that my interval files are tab delimited.
My interval file look like:
@HD VN:1.4 SO:unsorted @SQ SN:Chr1 LN:158337067 UR:file:chromosome_3.1.fasta M5:0631b350aa263a0f714de8ba9d609eb0 @SQ SN:Chr2 LN:137060424 UR:file:Chromosome_3.1.fasta M5:15898469d6142f8bb74f769bfe9b155f @SQ SN:Chr3 LN:121430405 UR:file:Chromosome_3.1.fasta M5:c515c4da7c2cd2d24c9487db8f733cfd ... Chr1 3580210 3580286 + ID=MI0011294_1;accession_number=MI0011294 Chr1 3580220 3580240 + ID=MIMAT0011792_1;accession_number=MIMAT0011792 Chr1 3607747 3607842 - ID=MI0014499_1;accession_number=MI0014499 Chr1 3607802 3607822 - ID=MIMAT0017395_1;accession_number=MIMAT0017395 Chr1 10227277 10227339 - ID=MI0009752_1;accession_number=MI0009752 Chr1 10227315 10227337 - ID=MIMAT0009241_1;accession_number=MIMAT0009241 Chr1 19881347 19881431 - ID=MI0005457_1;accession_number=MI0005457 Chr1 19881398 19881419 - ID=MIMAT0003539_1;accession_number=MIMAT0003539 Chr1 19930459 19930542 - ID=MI0005454_1;accession_number=MI0005454 Chr1 19930511 19930532 - ID=MIMAT0004332_1;accession_number=MIMAT0004332 ...
The header of my interval file is a copy of the Chromosome_3.1.dict I do not know what is misformated and why I get this error
java -jar GenomeAnalysisTK.jar -R S288C_refseq.fasta -T FastaAlternateReferenceMaker -o WT_refseq.fasta --variant WT_common.vcf
Output (last line):
##### ERROR MESSAGE: Line 54: there aren't enough columns for line chr1 1 . CCACACCACACCCACAC CCACCCACACCACACCCACAC,CCACACCACACCCACACCACACCCACAC 5.79 . INDEL;IS=1,1.000000;DP=3;VDB=2.063840e-02;AF1=1;AC1=8;DP4=0,0,3,0;MQ=28;FQ=-33.4 GT:PL:DP:SP:GQ (we expected 9 tokens, and saw 1 )
As far as I can tell, the vcf file conforms to spec and contains 9 tab separated columns. Therefore I don't understand the error. I've tried re-parsing the vcf file to ensure that there aren't missing or hidden characters, without success.
Hi, I want to create out of my bam file a consensus file by using FastaAlternatereferneceMaker and my .vcf file. How should I care about positions where no reads mapped to. By default FastaAlternateReferenceMaker uses the REF base, but isnt it better to use here N insted? THANKS :=)
Hi, I am using FastaAlternateReferenceMaker and have a set of intervals ordered first by chromosome and then by their start positions. I have tried ordering chromosomes alphabetically(chr1, chr10, chr11,..) as well as numerically (chr1, chr2, chr3...) but the output fasta sequence returned is not in the same order as listed in interval file. I find that even the names target_1, target_2 etc are also not used as fasta headers in the output file. I am stuck with mapping the input intervals with the output fasta sequences. Thanks in advance for all the help, Ramya