Tagged with #hg19
0 documentation articles | 0 announcements | 5 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.

Created 2014-06-27 12:30:27 | Updated | Tags: bam hg19 1000g grch37
Comments (1)

I have a human exome experiment on which I am using hg19 resources (reference, targets, dbSNP, ... the whole shebang). I want to add some 1000Genomes exomes to this experiment, but the available BAMs are from GRCh37.

Is there a tool to port the BAMs from GRCh37 to hg19, and to continue with that? Maybe LiftOver?

Do you rather recommend re-processing the 1000Genomes BAMs on hg19? Would that mean regenerate FASTQs and re-do the whole map/MarkDup/IndelReal/BQSR steps?

For now, I have worked on the original BAMs but have renamed all the classical chromosomes from "1" to "chr1" and I got rid of the mitochondrial chromosome and all other contigs (got rid of these contigs also in the resources to avoid GATKs complaints on missing contigs). How bad would you think that is based on the differences you know between GRCh37 and hg19?

Thanks a lot for your help!

Created 2014-06-06 14:54:02 | Updated | Tags: haplotypecaller bam hg19 grch37
Comments (5)

I am preparing BAM files from the 1000 genomes project to use in my GATK pipeline (along with other already processed BAMs) and I have the following issues:

  • chromosome notation on my BAMs is from GRCh37 but my pipeline uses hg19, so I would like to replace chromosome notation (1 -> chr1)
  • the mitochondrial chromosome is slightly different in hg19 and GRCh37 (see here), so I want to leave it out
  • and actually leave out all alternate contigs

This sounds quite trivial, but I haven't found a clean way to do this yet. I have tried the following: i=INPUT.bam j=OUTPUT.bam samtools view -h $i | awk 'BEGIN{FS=OFS="\t"} (/^@/ && !/@SQ/){print $0} $2~/^SN:[1-9]|^SN:X|^SN:Y/{print $0} $3~/^[1-9]|X|Y/{$3="chr"$3; print $0} ' | sed 's/SN:/SN:chr/g' | samtools view -bS - > $j However, when I try running the HaplotypeCaller, I get the following error:

ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with

Could you help me prepare these BAM files for processing? Thanks a lot in advance

Created 2013-07-19 06:44:49 | Updated | Tags: haplotypecaller hg19
Comments (3)

Hi, I’m trying to use GATK for variant calling. I have had some problems preparing the hg19 reference genome. I haven’t done the alignment myself, but gotten the .bam files already done. I how ever know that this should be the same reference used for the alignment. I have downloaded the hg19 from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz and downloaded the new version of mitochondrial genome (NC_012920.1). and then tried: cat chr*.fa > hg19.fa

After this I have followed the instructions on (how to) Prepare a reference for use with BWA and GATK.

When I try to call the variants using HaplotypeCaller (as instructed on: (howto) Call variants on a diploid genome with the HaplotypeCaller):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I 8526RU.rmdup.bam -L 20 --genotyping_mode DISCOVERY --output_mode EMIT_VARIANTS_ONLY -stand_emit_conf 10 -stand_call_conf 30 -o raw_variants8526RU.vcf

I get the error message: “Badly formed genome loc: Contig '20' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?” Can you tell me what the problem is? And how to fix this? I know there have been some similar questions considering the contigs, but I haven’t been able to solve the problem based on them.

Thank you, Sini

Created 2012-12-16 22:27:37 | Updated 2012-12-16 22:28:40 | Tags: haplotypecaller validatevariants variantannotator fastareference vcf dbsnp hg19
Comments (7)

I ran the HaplotypeCaller, VariantAnnotator, and Variant Validatoor on chr3 locations from a human tumor sample.

The HaplotypeCaller command line is:

#Fasta from the gz in the resource bundle

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T HaplotypeCaller \
 -I chrom_bams/286T.chr3.bam \
 -o hapc_vcfs/286T.chr3.raw.vcf 

The VariantAnnotator command line is:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T VariantAnnotator \
     --dbsnp $dbsnp  --alwaysAppendDbsnpId \
    -A BaseQualityRankSumTest -A DepthOfCoverage \
    -A FisherStrand -A HaplotypeScore -A InbreedingCoeff \
    -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \
    -A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \
    -A TandemRepeatAnnotator \
    --variant:vcf hapc_vcfs/286T.chr3.raw.vcf \
    --out varanno_vcfs/286T.chr3.va.vcf

This all works nicely, but I go back and use ValidateVariants just to be sure:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T ValidateVariants \
   --dbsnp ${dbsnp} \
   --variant:vcf varanno_vcfs/286T.chr3.va.vcf \
    1> report/ValidateVariants/286T.chr3.va.valid.out \
    2> report/ValidateVariants/286T.chr3.va.valid.err &

An issue arises with a rsID that is flagged as not being present in dbSNP.

...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP

I realize this is an error message that generally would not generally qualify as an issue to post to these forums, however it is an error that seems to be generated by the Haplotype caller, illuminated by VariantAnnotator, and caught by the ValidateVariants.

The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AAAGAGAAGAGAAGAG        A       1865.98 .

There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat $dbsnp | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AA      A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG     .  PASS

My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather the call to the variantcontext function validateRSIDs() has a call to validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation..."

The UCSC genome browser for hg19 does show the specified position to be AA. It seems as thought the HaplotypeCaller simply used a different reference than dbsnp in this case.

The reference file supplied to HaplotypeCaller was the same as to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to the HaplotypeCaller as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not utilized in the calculations. It seems as though this is a difference in between the reference assembly for dbSNP and the the reference supplied by the resource bundle.

My questions are:

  1. Is this really a problem that arises from slightly different reference assemblies?
  2. Is the hg19-1.5 reference fasta different from any other hg19 reference fasta?
  3. Is there at tool that I have missed that would have prevented this error and allowed the pipeline to continue without error?"
  4. Will this strict validation failure cause problems for the VariantRecalibrator?

As it stands, I am simply going to discard the offending lines manually. There are less than twenty in the entire exome sequencing of this particular tumor-normal sequencing. However, it seems like this issue will likely arise again. I will check the dbSNP VCF for places where the reference differs from the sequence in hg19. At least that should give me an estimate of the number of times this will arise and the locations to exclude from the variant calls.

-- Colin

Created 2012-08-28 20:22:44 | Updated 2013-01-07 20:00:01 | Tags: hg19 b37
Comments (4)

Hi, We have some annotation files, for example a GTF file of UCSC's "Known Genes" in hg19 coordinates. We'd like to convert this to b37 coordinates. What's the best way to go about doing this? Assistance would be appreciated! Thanks in advance, Lao