GATK 3.1 was released on March 18, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
--pair_hmm_implementation VECTOR_LOGLESS_CACHING. Please see the 3.1 Version Highlights for more details about expected speed ups and some background on the collaboration that made these possible.
I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.
Now I wish to generate some genotype specific FASTA sequences using these files and a reference sequence; i.e. a sequence for each genotype which is the same as the reference sequence but with the SNPs specific to each genotype in place of their counterparts in the reference sequence.
Now I know that there is variation in the genotypes. Here is a picture visualizing three exemplar genotypes that I generated by loading the VCF file into Geneious.
I then try to create individual VCF files for each genotype using this:
java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T SelectVariants --variant ~/Path/to/complete\vcf/example.vcf -o ~/Path/to/individual/genotype.vcf -sn genotype
While I can't be sure this had the desired effect as it is difficult to assess a whole VCF file I can say that the header now only contains the relevant genotype so I assume this is the case.
I then try and use this individual VCF file for each genotype like this:
java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T FastaAlternateReferenceMaker --variant~/Path/to/individual/genotype.vcf -L chrX:XX,XXX,XXX-XX,XXX,XXX -o ~/Path/to/individual/genotype.fasta
Here the Xs represent the location on the reference sequence of the regions of interest.
I did this in a loop and got identical sequences for every genotypes. I then implemented it individually for the 3 exemplar genotypes in the picture above and in both cases I get identical sequences for every genotype. Interestingly they are not the reference sequence.
What am I doing wrong?
I have also posted this on the Biostars forum..
I'm trying to use the FastaAlternateReferenceMaker tool to make a new reference genome for read-mapping.
However, it merges all the chromosomes into one, simply called "1". Is there a command that allows chromosome/scaffold names to be maintained.
Also I guess that any frameshifts caused by insertions or deletions affect the base co-ordinates relative to the original assembly, making comparison with other data problematic.
Hello, I really like yor FastaAlternateReferenceMaker to build a new consensus from FASTA+VCF. However, it deletes all headers from the FASTA file and replaces them by numbers. Is there a way to change this behavior, e.g. simply take the old headers?
I think supplying the error/out file from the bsub job I ran for GATK FastaAlternateReferenceMaker will be more helpful than any explanation I could write. The reference fasta is ~85Mb and the output Alternate.consensus.fna ended up ~24Mb when error occurred. Let me know if you need anymore details to diagnose the problem. Thx
I was working with FastaAlternateReferenceMaker today. I wanted to use it on two samples. The first sample worked fine. The second sample came up with an error about a few REF values not conforming (e.g. :A) in the vcf. I wrote a script to correct the errors and remove the : from the vcf file. I tried to run it again, but received memory allocation errors. So, I dealt with that and am now allocating 6g for a 15G VCF and a 188M fasta reference. I got it running (I think) and then started getting the NegativeArraySizeException java error. I saw this had come up as a bug in earlier versions of GATK, but it seemed like it had been resolved...perhaps not? Anyhow, this 'quick little run' has taken up the better part of my day and I would be super greatful for some assistance troubleshooting this problem.
Command: $ java -Xmx6g -jar ../../shays/GATK/GenomeAnalysisTK.jar -l ERROR -R ../up/chrom6.2.fa -T FastaAlternateReferenceMaker -o D-chr6.cons.fa --variant D-c6.vcf
first of all thanks for the support offered here. But (sadly) I have an other two questions.
I used the FastaAlternateReferenceMaker to create a consesus sequence and it worked great. Here I have a question regarding a warning: "IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation". I do not get this warning. Which dictionary is needed here. Is this warning comming up, because I used .gz and .tbi compressed vcf-files?
And a furhter question regarding the creation of a consesus sequence. I`m unsure what to do in cases of heterozygous INDELs. What would you suggest to do? Skip them as they are only represented by one allele or include them in the consensus. Both seems to be incorrect. Do you know the general handling here?
I`m trying to use the option --useIUPAC with the FastaAlternateReferenceMaker. I tried several ways to set this option but all lead to ERRORs (see below). Without the option no ERROR message occurrs.
Here my code and...
java -Xms1g -jar "/home/GenomeAnalysisTK.jar" \ -T "FastaAlternateReferenceMaker" \ -R "/home/refGen.fa" \ -L "chromosome3" \ -lw "100" \ --variant "/home/chromosome3_raw_snp_filter3.vcf.gz" \ -o "/home/chromosome3.fasta" \ --useIUPAC
the ways I used useIUPAC:
--useIUPAC ##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined. --useIUPAC "true" ##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined. -useIUPAC "true" ##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined. -useIUPAC ##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined.
I would like to change the reference genome with the sequences of the variants I've found in my deep sequencing data. So here is my command line: java -jar GenomeAnalysisTK.jar -R ../human_g1k_v37.fasta -T FastaAlternateReferenceMaker -o ../output_new_genome.fasta -L ../cancer.intervals -- ../output.snp.indel.LOH.vcf
I got this error:
Don't really understand...
Thank you for your time and help
Dear all I am writing to you to understand why FastaAlternateReferenceMaker is missing a deletion in the final consensus. I am using gatk v. 2.8-1 and this commandline:
gatk -R ref.fa -T FastaAlternateReferenceMaker -o reads_vs_ref_gatk_consensus.fa --variant reads_vs_ref_var.vcf
output from gatk was:
INFO 08:57:46,123 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,126 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15
INFO 08:57:46,126 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 08:57:46,126 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 08:57:46,130 HelpFormatter - Program Args: -R ref.fa -T FastaAlternateReferenceMaker -o reads_vs_ref_gatk_consensus.fa --variant reads_vs_ref_var.vcf
INFO 08:57:46,130 HelpFormatter - Date/Time: 2014/02/20 08:57:46
INFO 08:57:46,131 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,131 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,137 ArgumentTypeDescriptor - Dynamically determined type of reads_vs_ref_var.vcf to be VCF
INFO 08:57:46,856 GenomeAnalysisEngine - Strictness is SILENT
INFO 08:57:46,978 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 08:57:47,010 RMDTrackBuilder - Creating Tribble index in memory for file reads_vs_ref_var.vcf
INFO 08:57:47,079 RMDTrackBuilder - Writing Tribble index to disk for file /home/aprea/test/consensus/reads_vs_ref_var.vcf.idx
INFO 08:57:47,223 GenomeAnalysisEngine - Preparing for traversal
INFO 08:57:47,224 GenomeAnalysisEngine - Done preparing for traversal
INFO 08:57:47,225 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 08:57:47,226 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 08:57:47,546 ProgressMeter - done 1.05e+04 0.0 s 30.0 s 99.0% 0.0 s 0.0 s
INFO 08:57:47,547 ProgressMeter - Total runtime 0.32 secs, 0.01 min, 0.00 hours
INFO 08:57:51,452 GATKRunReport - Uploaded run statistics report to AWS S3
There are 3 variants in the vcf file: the first is a SNP and is reported, the second is an insertion and is reported but the third, a deletion, is missing. Could you please tell me where's the problem? I attached both reference and vcf file.
I am using FastaAlternateReferenceMaker to create consensus sequences as follows:
java -Xmx12g -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R /Reference/chromosome.1.fa -o /Output/consensus.1.fa --variant /VCF/chromosome1.vcf -L /Interval/chromosome.1.list
This command line produces a single fasta file including consensus sequences for the intervals provided in the list file.
(1) Presumably, sequence header for each sequence in the consensus fasta file corresponds to the contig name in the master reference file. Is there any way to modify the command line to print, say, the interval as the sequence header instead of the contig name?
Note: I can feed the program one interval at a time and name the output file accordingly, however, I'd like to stick to submitting my query per one chromosome at a time for the sake of saving up time.
(2) Number of sequences in the consensus fasta do not match the number of non-overlapping intervals in the input list. I would think that this is because some intervals are variant-free and therefore no alternate reference is reported for them. Do you confirm this is the case? I cannot easily check because the issue mentioned in (1).
I tried to run FastaAlternateReferenceMaker and I get the following error:
WARNING 2013-09-18 16:28:28 IntervalList Ignoring interval for unknown reference: Chr1:3580210-3580286
For all the intervals I submitted. I already looked around on the web, and I did not find any answer, knowing that my chromosome names are all with the 'Chr' format in all the files and that my interval files are tab delimited.
My interval file look like:
@HD VN:1.4 SO:unsorted @SQ SN:Chr1 LN:158337067 UR:file:chromosome_3.1.fasta M5:0631b350aa263a0f714de8ba9d609eb0 @SQ SN:Chr2 LN:137060424 UR:file:Chromosome_3.1.fasta M5:15898469d6142f8bb74f769bfe9b155f @SQ SN:Chr3 LN:121430405 UR:file:Chromosome_3.1.fasta M5:c515c4da7c2cd2d24c9487db8f733cfd ... Chr1 3580210 3580286 + ID=MI0011294_1;accession_number=MI0011294 Chr1 3580220 3580240 + ID=MIMAT0011792_1;accession_number=MIMAT0011792 Chr1 3607747 3607842 - ID=MI0014499_1;accession_number=MI0014499 Chr1 3607802 3607822 - ID=MIMAT0017395_1;accession_number=MIMAT0017395 Chr1 10227277 10227339 - ID=MI0009752_1;accession_number=MI0009752 Chr1 10227315 10227337 - ID=MIMAT0009241_1;accession_number=MIMAT0009241 Chr1 19881347 19881431 - ID=MI0005457_1;accession_number=MI0005457 Chr1 19881398 19881419 - ID=MIMAT0003539_1;accession_number=MIMAT0003539 Chr1 19930459 19930542 - ID=MI0005454_1;accession_number=MI0005454 Chr1 19930511 19930532 - ID=MIMAT0004332_1;accession_number=MIMAT0004332 ...
The header of my interval file is a copy of the Chromosome_3.1.dict I do not know what is misformated and why I get this error
java -jar GenomeAnalysisTK.jar -R S288C_refseq.fasta -T FastaAlternateReferenceMaker -o WT_refseq.fasta --variant WT_common.vcf
Output (last line):
##### ERROR MESSAGE: Line 54: there aren't enough columns for line chr1 1 . CCACACCACACCCACAC CCACCCACACCACACCCACAC,CCACACCACACCCACACCACACCCACAC 5.79 . INDEL;IS=1,1.000000;DP=3;VDB=2.063840e-02;AF1=1;AC1=8;DP4=0,0,3,0;MQ=28;FQ=-33.4 GT:PL:DP:SP:GQ (we expected 9 tokens, and saw 1 )
As far as I can tell, the vcf file conforms to spec and contains 9 tab separated columns. Therefore I don't understand the error. I've tried re-parsing the vcf file to ensure that there aren't missing or hidden characters, without success.
Hi, I want to create out of my bam file a consensus file by using FastaAlternatereferneceMaker and my .vcf file. How should I care about positions where no reads mapped to. By default FastaAlternateReferenceMaker uses the REF base, but isnt it better to use here N insted? THANKS :=)
Hi, I am using FastaAlternateReferenceMaker and have a set of intervals ordered first by chromosome and then by their start positions. I have tried ordering chromosomes alphabetically(chr1, chr10, chr11,..) as well as numerically (chr1, chr2, chr3...) but the output fasta sequence returned is not in the same order as listed in interval file. I find that even the names target_1, target_2 etc are also not used as fasta headers in the output file. I am stuck with mapping the input intervals with the output fasta sequences. Thanks in advance for all the help, Ramya