# Tagged with #fastaalternatereferencemaker 0 documentation articles | 1 announcement | 20 forum discussions

No posts found with the requested search criteria.

Created 2014-03-17 16:52:43 | Updated 2014-03-19 15:13:51 | Tags: variantrecalibrator haplotypecaller randomlysplitvariants release-notes fastaalternatereferencemaker

GATK 3.1 was released on March 18, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

## Haplotype Caller

• Added new capabilities to the Haplotype Caller to use hardware-based optimizations. Can be enabled with --pair_hmm_implementation VECTOR_LOGLESS_CACHING. Please see the 3.1 Version Highlights for more details about expected speed ups and some background on the collaboration that made these possible.
• Fixed bugs in computing the weights of edges in the assembly graph. This was causing bad genotypes to be output when running the Haplotype Caller over multiple samples simultaneously (as opposed to creating gVCFs in the new recommended pipeline, which was working as expected).

## Variant Recalibrator

• Fixed issue where output could be non-deterministic with very large data sets.

## CalculateGenotypePosteriors

• Fixed several bugs where bad input were causing the tool to crash instead of gracefully exiting with an error message.

## Miscellaneous

• RandomlySplitVariants can now output splits comprised of more than 2 output files.
• FastaAlternateReferenceMaker can now output heterozygous sites using IUPAC ambiguity encoding.
• Picard, Tribble, and Variant jars updated to version 1.109.1722.

Created 2015-11-11 12:43:45 | Updated | Tags: fastaalternatereferencemaker

Hi,

I'm trying to recreate alternate RNA transcripts/alleles for each of my individuals for which I have RNA-seq data. Using FastaAlternateReferenceMaker I can recreate the whole genome from my vcf files but is there a simple way of using the reference RNA transcripts as a reference rather than the reference genome? I know I can enter a list of intervals of interest but for RNA transcripts I need to concatenate exons and produce a reverse complement if on the minus strand.

Created 2015-11-05 02:36:31 | Updated | Tags: vcf fastaalternatereferencemaker incompatible-contigs

Hello, I am using FastaAlterenateReferenceMaker to generate new reference genome using SNPs in .vcf file. I used following command: java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R genome.fa -o newgenome.fa -V snps.vcf

I got this error: ##### ERROR ##### ERROR MESSAGE: Input files variant and reference have incompatible contigs: The contig order in variant and referenceis not the same; to fix this please see (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files. ##### ERROR variant contigs = [10, 1, 2, 3, 4, 5, 6, 7, 8, 9, MT, Pt] ##### ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, MT, Pt]

vfc file contains SNPs for only one genome (ch6) and does not have any contig description line, and using the script in the link didn't work. I tried changing the order of the genome.fa to be same as variant contigs but this time, the chromosomes of the newgenome.fa were wrong (e.g. ch7 had the sequence of ch6, ch1 had the sequence of ch10, etc.)

How can I fix this? Thank you in advance.

Created 2015-10-28 19:32:07 | Updated | Tags: fastaalternatereferencemaker

I tried to use FastaAlternateReferenceMaker to create a fasta file using a variant file (with SNPs and INDELs) and GRCh37 reference genome for Chromosome 22 earlier and now Chromosome 19. I noticed that some variants are not incorporated in the output sequence though the QUAL value is 100 and FILTER value is PASS for the variants. There are no overlapping SNPs or INDELs in these locations. For example, when trying to create the output sequence for CHR 19, it did not include the variant in position 540805 in the attached VCF file. Am I missing something? Appreciate your input. I am not able to upload the VCF file. if it would help, here are a few lines from the VCF file:

19 539279 rs2288956 C T 100 PASS AA=C|||;AC=1;AF=0.500;AFR_AF=0.0915;AMR_AF=0.2493;AN=2;DP=11750;EAS_AF=0.1776;EUR_AF=0.2883;NS=2504;SAS_AF=0.1534;VT=SNP GT 1|0 19 539871 rs17672563 C T 100 PASS AA=C|||;AC=1;AF=0.500;AFR_AF=0.0915;AMR_AF=0.2493;AN=2;DP=15431;EAS_AF=0.1746;EUR_AF=0.2883;NS=2504;SAS_AF=0.1554;VT=SNP GT 1|0 19 540648 rs376850117 GC G 100 PASS AA=ccccc|CCCCCC|CCCCC|insertion;AC=1;AF=0.500;AFR_AF=0.7375;AMR_AF=0.9452;AN=2;DP=10125;EAS_AF=0.9018;EUR_AF=0.9384;NS=2504;SAS_AF=0.8814;VT=INDEL GT 1|0 19 540805 rs578255597 G AC 100 PASS AA=?|-|C|unsure;AC=1;AF=0.500;AFR_AF=0.6657;AMR_AF=0.9438;AN=2;DP=9576;EAS_AF=0.999;EUR_AF=0.9294;NS=2504;SAS_AF=0.9162;VT=INDEL GT 1|0

This is the command that I used:

java -Xmx2g -jar ../GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T FastaAlternateReferenceMaker --variant outputChr19NA07056Edited.vcf -o 1000Genomes_chr19outputNA07056.fasta -L 19

Created 2015-10-28 16:46:20 | Updated | Tags: snp fastaalternatereferencemaker

Hi everyone,

I have a question about using the GATK AlternateReferenceMaker and how it may affect down stream analysis. I apologize for the long post but I tried to include only relevant info. Any thoughts on analysis are helpful!

Here is basically what I did: - Illumina sequencing of 300 libraries with methods similar to genotype-by-sequencing (like RAD-tag libraries without shearing) - Created contigs of sequences within individuals (cap3) and compared across 300 individuals - Selected contigs found in at least 5 individuals and mapped to a draft reference genome of sister species (bwa-mem) - Used GATK AlternateReferenceMaker to insert differences and create a "pseudo"-reference for my species

• Then I have did SNP calling from raw sequences using GATK best practices (alignment of each individual to pseudo-reference, realignment, g.vcf, Haplotype Caller)
• I did hard filtering for quality, coverage, and missing data across SNPs
• I have sub-selected SNPs with minor allele frequencies > 0.25 to use in designing a SNP array for parentage (~800 SNPs)
• When I try to select the flanking regions around these SNPs using bedtools, it pulls the flanking regions from the pseudo-reference and many of them have a lot of N's... so my questions are:
1. How can I tell if these were produced from the process I used to create a "pseudo"-reference for my species? Could they be misalignments from sequence contigs to reference or between species? How would this affect the SNP calling proceess?

2. What is the best way to ground truth the sequencing region? Should I try to pull the alignments from the region around each SNP for each individual and look at them manually? How can I do this?

3. Finally, the sister species and my pseudo-reference are not indexed in the same way. I'm guessing I should've sorted them before indexing or specified the indexing to use when using the GATK AlternateReferenceMaker. Can I convert the indexing?

Created 2015-09-09 09:47:25 | Updated 2015-09-09 09:54:22 | Tags: fastaalternatereferencemaker

Hi, I am using FastaAlternateReferenceMaker from GenomeAnalysisTK-3.4-46

my vcf-file looks like this:

#header...
##source=SelectVariants
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample11


this is how I start the program:

java -jar GenomeAnalysisTK.jar \ -T FastaAlternateReferenceMaker \ -R C17_ref.fna \ -o test.fasta \ -L scaffold-50:1055905-1055905 \ -V results/tmp/J65_clean.vcf

and the resulting fasta file will always have a C instead of an A which would be correct here (if i use A,T,C in the ALT column I will get an A). so it looks like the first ALT nucleotide is taken and it totally ignores the 3/3 in the last column (if I change it to 1/1 or 0/0 nothing changes)

What am I doing wrong?

Thanks

Created 2015-08-09 21:20:47 | Updated | Tags: fastaalternatereference vcf fastaalternatereferencemaker

I am trying to use a VCF containing snps variants to change the mouse reference (GRCm38- c57BL/6J) with BALB/cJ snps.

After running this command:

java -jar ~/programs/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R ~/genome/mouse_GRCm38.p4/GRCm38.primary_assembly/GRCm38.primary_assembly.fa -o ~/BALBcJ.snp.primary.fa -V ~/BALB_cJ.snps.vcf

The following ERROR shows up:

##### ERROR reference contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, JH584299.1, GL456233.1, JH584301.1, GL456211.1, GL456350.1, JH584293.1, GL456221.1, JH584297.1, JH584296.1, GL456354.1, JH584294.1, JH584298.1, JH584300.1, GL456219.1, GL456210.1, JH584303.1, JH584302.1, GL456212.1, JH584304.1, GL456379.1, GL456216.1, GL456393.1, GL456366.1, GL456367.1, GL456239.1, GL456213.1, GL456383.1, GL456385.1, GL456360.1, GL456378.1, GL456389.1, GL456372.1, GL456370.1, GL456381.1, GL456387.1, GL456390.1, GL456394.1, GL456392.1, GL456382.1, GL456359.1, GL456396.1, GL456368.1, JH584292.1, JH584295.1]

So Trying to fix, I used the perl script in the link to sort properly within the reference.

I did this:

./sortByRef.pl ~/BALB_cJ.snps.vcf /home/tiagocastro/genome/mouse_GRCm38.p4/GRCm38.primary_assembly/GRCm38.primary_assembly.fa.fai > ~/BALB_cJ.snps_sorted.vcf

using the new vcf file, a new error is shown:

##### ERROR VCF3 VariantContext (this is an external codec and is not documented within GATK)

looking the head of each, sorted and basic vcf, I can see that is different.

Can someone help me?

Created 2015-05-01 05:07:41 | Updated | Tags: selectvariants snp gatk fastaalternatereferencemaker

I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.

Now I wish to generate some genotype specific FASTA sequences using these files and a reference sequence; i.e. a sequence for each genotype which is the same as the reference sequence but with the SNPs specific to each genotype in place of their counterparts in the reference sequence.

Now I know that there is variation in the genotypes. Here is a picture visualizing three exemplar genotypes that I generated by loading the VCF file into Geneious.

http://imgur.com/LRZHIcw

I then try to create individual VCF files for each genotype using this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T SelectVariants --variant ~/Path/to/complete\vcf/example.vcf -o ~/Path/to/individual/genotype.vcf -sn genotype

While I can't be sure this had the desired effect as it is difficult to assess a whole VCF file I can say that the header now only contains the relevant genotype so I assume this is the case.

I then try and use this individual VCF file for each genotype like this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T FastaAlternateReferenceMaker --variant~/Path/to/individual/genotype.vcf -L chrX:XX,XXX,XXX-XX,XXX,XXX -o ~/Path/to/individual/genotype.fasta

Here the Xs represent the location on the reference sequence of the regions of interest.

I did this in a loop and got identical sequences for every genotypes. I then implemented it individually for the 3 exemplar genotypes in the picture above and in both cases I get identical sequences for every genotype. Interestingly they are not the reference sequence.

What am I doing wrong?

I have also posted this on the Biostars forum..

Created 2015-03-25 23:06:11 | Updated | Tags: fastaalternatereferencemaker

Hi,

I'm trying to use the FastaAlternateReferenceMaker tool to make a new reference genome for read-mapping.

However, it merges all the chromosomes into one, simply called "1". Is there a command that allows chromosome/scaffold names to be maintained.

Also I guess that any frameshifts caused by insertions or deletions affect the base co-ordinates relative to the original assembly, making comparison with other data problematic.

Cheers,

Blue

Created 2015-03-02 10:13:32 | Updated 2015-03-02 10:30:32 | Tags: fastaalternatereference fastaalternatereferencemaker

Hello, I really like yor FastaAlternateReferenceMaker to build a new consensus from FASTA+VCF. However, it deletes all headers from the FASTA file and replaces them by numbers. Is there a way to change this behavior, e.g. simply take the old headers?

Best, Boyke

Created 2014-12-03 16:43:19 | Updated | Tags: fastaalternatereferencemaker

I think supplying the error/out file from the bsub job I ran for GATK FastaAlternateReferenceMaker will be more helpful than any explanation I could write. The reference fasta is ~85Mb and the output Alternate.consensus.fna ended up ~24Mb when error occurred. Let me know if you need anymore details to diagnose the problem. Thx

Created 2014-04-07 14:25:26 | Updated | Tags: bug java7 fastaalternatereferencemaker negativearraysizeexception

I was working with FastaAlternateReferenceMaker today. I wanted to use it on two samples. The first sample worked fine. The second sample came up with an error about a few REF values not conforming (e.g. :A) in the vcf. I wrote a script to correct the errors and remove the : from the vcf file. I tried to run it again, but received memory allocation errors. So, I dealt with that and am now allocating 6g for a 15G VCF and a 188M fasta reference. I got it running (I think) and then started getting the NegativeArraySizeException java error. I saw this had come up as a bug in earlier versions of GATK, but it seemed like it had been resolved...perhaps not? Anyhow, this 'quick little run' has taken up the better part of my day and I would be super greatful for some assistance troubleshooting this problem.

##### ERROR ------------------------------------------------------------------------------------------

Created 2014-04-02 07:06:05 | Updated | Tags: indel fastaalternatereferencemaker indexdictionaryutils

Hey,

first of all thanks for the support offered here. But (sadly) I have an other two questions.

I used the FastaAlternateReferenceMaker to create a consesus sequence and it worked great. Here I have a question regarding a warning: "IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation". I do not get this warning. Which dictionary is needed here. Is this warning comming up, because I used .gz and .tbi compressed vcf-files?

And a furhter question regarding the creation of a consesus sequence. Im unsure what to do in cases of heterozygous INDELs. What would you suggest to do? Skip them as they are only represented by one allele or include them in the consensus. Both seems to be incorrect. Do you know the general handling here?

Volavii

Created 2014-03-27 10:11:21 | Updated | Tags: fastaalternatereferencemaker useiupac

Hey,

Im trying to use the option --useIUPAC with the FastaAlternateReferenceMaker. I tried several ways to set this option but all lead to ERRORs (see below). Without the option no ERROR message occurrs.

Here my code and...

java -Xms1g -jar "/home/GenomeAnalysisTK.jar" \
-T "FastaAlternateReferenceMaker" \
-R "/home/refGen.fa" \
-L "chromosome3" \
-lw "100" \
--variant "/home/chromosome3_raw_snp_filter3.vcf.gz" \
-o "/home/chromosome3.fasta" \
--useIUPAC


the ways I used useIUPAC:

--useIUPAC
##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined.

--useIUPAC "true"
##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined.

-useIUPAC "true"
##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined.

-useIUPAC
##### ERROR MESSAGE: Argument with name 'useIUPAC' isn't defined.


Created 2014-03-24 13:32:40 | Updated | Tags: fastaalternatereferencemaker

Hi,

I would like to change the reference genome with the sequences of the variants I've found in my deep sequencing data. So here is my command line: java -jar GenomeAnalysisTK.jar -R ../human_g1k_v37.fasta -T FastaAlternateReferenceMaker -o ../output_new_genome.fasta -L ../cancer.intervals -- ../output.snp.indel.LOH.vcf

I got this error:

##### ERROR Invalid argument value '../output.snp.indel.LOH.vcf' at position 9.

Don't really understand...

Thank you for your time and help

Created 2014-02-20 08:40:28 | Updated | Tags: vcf fastaalternatereferencemaker

Dear all I am writing to you to understand why FastaAlternateReferenceMaker is missing a deletion in the final consensus. I am using gatk v. 2.8-1 and this commandline:

output from gatk was:

INFO 08:57:46,123 HelpFormatter - -------------------------------------------------------------------------------- INFO 08:57:46,126 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.8-1-g932cd3a, Compiled 2013/12/06 16:47:15 INFO 08:57:46,126 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 08:57:46,126 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 08:57:46,130 HelpFormatter - Program Args: -R ref.fa -T FastaAlternateReferenceMaker -o reads_vs_ref_gatk_consensus.fa --variant reads_vs_ref_var.vcf INFO 08:57:46,130 HelpFormatter - Date/Time: 2014/02/20 08:57:46
INFO 08:57:46,131 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,131 HelpFormatter - --------------------------------------------------------------------------------
INFO 08:57:46,137 ArgumentTypeDescriptor - Dynamically determined type of reads_vs_ref_var.vcf to be VCF
INFO 08:57:46,856 GenomeAnalysisEngine - Strictness is SILENT
INFO 08:57:46,978 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 08:57:47,010 RMDTrackBuilder - Creating Tribble index in memory for file reads_vs_ref_var.vcf
INFO 08:57:47,079 RMDTrackBuilder - Writing Tribble index to disk for file /home/aprea/test/consensus/reads_vs_ref_var.vcf.idx
INFO 08:57:47,223 GenomeAnalysisEngine - Preparing for traversal
INFO 08:57:47,224 GenomeAnalysisEngine - Done preparing for traversal
INFO 08:57:47,225 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 08:57:47,226 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 08:57:47,546 ProgressMeter - done 1.05e+04 0.0 s 30.0 s 99.0% 0.0 s 0.0 s
INFO 08:57:47,547 ProgressMeter - Total runtime 0.32 secs, 0.01 min, 0.00 hours
INFO 08:57:51,452 GATKRunReport - Uploaded run statistics report to AWS S3

There are 3 variants in the vcf file: the first is a SNP and is reported, the second is an insertion and is reported but the third, a deletion, is missing. Could you please tell me where's the problem? I attached both reference and vcf file.

Many thanks.

Created 2013-11-12 14:31:50 | Updated 2013-11-12 14:35:51 | Tags: fastaalternatereferencemaker consensus name header

Hi,

I am using FastaAlternateReferenceMaker to create consensus sequences as follows:

java -Xmx12g -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R /Reference/chromosome.1.fa -o /Output/consensus.1.fa --variant /VCF/chromosome1.vcf -L /Interval/chromosome.1.list

This command line produces a single fasta file including consensus sequences for the intervals provided in the list file.

Two questions:

(1) Presumably, sequence header for each sequence in the consensus fasta file corresponds to the contig name in the master reference file. Is there any way to modify the command line to print, say, the interval as the sequence header instead of the contig name?

Note: I can feed the program one interval at a time and name the output file accordingly, however, I'd like to stick to submitting my query per one chromosome at a time for the sake of saving up time.

(2) Number of sequences in the consensus fasta do not match the number of non-overlapping intervals in the input list. I would think that this is because some intervals are variant-free and therefore no alternate reference is reported for them. Do you confirm this is the case? I cannot easily check because the issue mentioned in (1).

Thank you.

Created 2013-09-18 17:08:31 | Updated 2013-09-18 17:12:41 | Tags: fastareference fastaalternatereferencemaker

Hello

I tried to run FastaAlternateReferenceMaker and I get the following error:

WARNING 2013-09-18 16:28:28 IntervalList    Ignoring interval for unknown reference: Chr1:3580210-3580286


For all the intervals I submitted. I already looked around on the web, and I did not find any answer, knowing that my chromosome names are all with the 'Chr' format in all the files and that my interval files are tab delimited.

My interval file look like:

@HD     VN:1.4  SO:unsorted
@SQ     SN:Chr1 LN:158337067    UR:file:chromosome_3.1.fasta    M5:0631b350aa263a0f714de8ba9d609eb0
@SQ     SN:Chr2 LN:137060424    UR:file:Chromosome_3.1.fasta    M5:15898469d6142f8bb74f769bfe9b155f
@SQ     SN:Chr3 LN:121430405    UR:file:Chromosome_3.1.fasta    M5:c515c4da7c2cd2d24c9487db8f733cfd
...
Chr1    3580210 3580286 +       ID=MI0011294_1;accession_number=MI0011294
Chr1    3580220 3580240 +       ID=MIMAT0011792_1;accession_number=MIMAT0011792
Chr1    3607747 3607842 -       ID=MI0014499_1;accession_number=MI0014499
Chr1    3607802 3607822 -       ID=MIMAT0017395_1;accession_number=MIMAT0017395
Chr1    10227277        10227339        -       ID=MI0009752_1;accession_number=MI0009752
Chr1    10227315        10227337        -       ID=MIMAT0009241_1;accession_number=MIMAT0009241
Chr1    19881347        19881431        -       ID=MI0005457_1;accession_number=MI0005457
Chr1    19881398        19881419        -       ID=MIMAT0003539_1;accession_number=MIMAT0003539
Chr1    19930459        19930542        -       ID=MI0005454_1;accession_number=MI0005454
Chr1    19930511        19930532        -       ID=MIMAT0004332_1;accession_number=MIMAT0004332
...


The header of my interval file is a copy of the Chromosome_3.1.dict I do not know what is misformated and why I get this error

Thanks

Martin

Created 2013-09-05 14:54:32 | Updated 2013-09-05 16:09:16 | Tags: fastaalternatereferencemaker

Command:

java -jar GenomeAnalysisTK.jar -R S288C_refseq.fasta -T FastaAlternateReferenceMaker -o WT_refseq.fasta --variant WT_common.vcf


Output (last line):

##### ERROR MESSAGE: Line 54: there aren't enough columns for line
chr1   1   .   CCACACCACACCCACAC   CCACCCACACCACACCCACAC,CCACACCACACCCACACCACACCCACAC   5.79   .   INDEL;IS=1,1.000000;DP=3;VDB=2.063840e-02;AF1=1;AC1=8;DP4=0,0,3,0;MQ=28;FQ=-33.4   GT:PL:DP:SP:GQ
(we expected 9 tokens, and saw 1 )


As far as I can tell, the vcf file conforms to spec and contains 9 tab separated columns. Therefore I don't understand the error. I've tried re-parsing the vcf file to ensure that there aren't missing or hidden characters, without success.

Created 2013-07-05 15:39:36 | Updated | Tags: fastaalternatereferencemaker

Hi, I want to create out of my bam file a consensus file by using FastaAlternatereferneceMaker and my .vcf file. How should I care about positions where no reads mapped to. By default FastaAlternateReferenceMaker uses the REF base, but isnt it better to use here N insted? THANKS :=)

Created 2013-05-08 08:23:44 | Updated | Tags: intervals fastaalternatereferencemaker

Hi, I am using FastaAlternateReferenceMaker and have a set of intervals ordered first by chromosome and then by their start positions. I have tried ordering chromosomes alphabetically(chr1, chr10, chr11,..) as well as numerically (chr1, chr2, chr3...) but the output fasta sequence returned is not in the same order as listed in interval file. I find that even the names target_1, target_2 etc are also not used as fasta headers in the output file. I am stuck with mapping the input intervals with the output fasta sequences. Thanks in advance for all the help, Ramya