Tagged with #fastareference
3 documentation articles | 0 events or announcements | 1 forum discussion


The GATK uses two files to access and safety check access to the reference files: a .dict dictionary of the contig names and sizes and a .fai fasta index file to allow efficient random access to the reference bases. You have to generate these files in order to be able to use a Fasta file as reference.

NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.

Creating the fasta sequence dictionary file

We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.

> java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
Runtime.totalMemory()=2112487424
44.922u 2.308s 0:47.09 100.2%   0+0k 0+0io 2pf+0w

This produces a SAM-style header file describing the contents of our fasta file.

> cat Homo_sapiens_assembly18.dict 
@HD     VN:1.0  SO:unsorted
@SQ     SN:chrM LN:16571        UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ     SN:chr1 LN:247249719    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9ebc6df9496613f373e73396d5b3b6b6
@SQ     SN:chr2 LN:242951149    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:b12c7373e3882120332983be99aeb18d
@SQ     SN:chr3 LN:199501827    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:0e48ed7f305877f66e6fd4addbae2b9a
@SQ     SN:chr4 LN:191273063    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:cf37020337904229dca8401907b626c2
@SQ     SN:chr5 LN:180857866    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:031c851664e31b2c17337fd6f9004858
@SQ     SN:chr6 LN:170899992    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bfe8005c536131276d448ead33f1b583
@SQ     SN:chr7 LN:158821424    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:74239c5ceee3b28f0038123d958114cb
@SQ     SN:chr8 LN:146274826    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
@SQ     SN:chr9 LN:140273252    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:ea244473e525dde0393d353ef94f974b
@SQ     SN:chr10        LN:135374737    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4ca41bf2d7d33578d2cd7ee9411e1533
@SQ     SN:chr11        LN:134452384    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:425ba5eb6c95b60bafbf2874493a56c3
@SQ     SN:chr12        LN:132349534    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d17d70060c56b4578fa570117bf19716
@SQ     SN:chr13        LN:114142980    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c4f3084a20380a373bbbdb9ae30da587
@SQ     SN:chr14        LN:106368585    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c1ff5d44683831e9c7c1db23f93fbb45
@SQ     SN:chr15        LN:100338915    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:5cd9622c459fe0a276b27f6ac06116d8
@SQ     SN:chr16        LN:88827254     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3e81884229e8dc6b7f258169ec8da246
@SQ     SN:chr17        LN:78774742     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2a5c95ed99c5298bb107f313c7044588
@SQ     SN:chr18        LN:76117153     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3d11df432bcdc1407835d5ef2ce62634
@SQ     SN:chr19        LN:63811651     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2f1a59077cfad51df907ac25723bff28
@SQ     SN:chr20        LN:62435964     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f126cdf8a6e0c7f379d618ff66beb2da
@SQ     SN:chr21        LN:46944323     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
@SQ     SN:chr22        LN:49691432     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2041e6a0c914b48dd537922cca63acb8
@SQ     SN:chrX LN:154913754    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d7e626c80ad172a4d7c95aadb94d9040
@SQ     SN:chrY LN:57772954     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:62f69d0e82a12af74bad85e2e4a8bd91
@SQ     SN:chr1_random  LN:1663265      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:cc05cb1554258add2eb62e88c0746394
@SQ     SN:chr2_random  LN:185571       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:18ceab9e4667a25c8a1f67869a4356ea
@SQ     SN:chr3_random  LN:749256       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9cc571e918ac18afa0b2053262cadab6
@SQ     SN:chr4_random  LN:842648       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9cab2949ccf26ee0f69a875412c93740
@SQ     SN:chr5_random  LN:143687       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:05926bdbff978d4a0906862eb3f773d0
@SQ     SN:chr6_random  LN:1875562      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d62eb2919ba7b9c1d382c011c5218094
@SQ     SN:chr7_random  LN:549659       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:28ebfb89c858edbc4d71ff3f83d52231
@SQ     SN:chr8_random  LN:943810       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:0ed5b088d843d6f6e6b181465b9e82ed
@SQ     SN:chr9_random  LN:1146434      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
@SQ     SN:chr10_random LN:113275       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:50be2d2c6720dabeff497ffb53189daa
@SQ     SN:chr11_random LN:215294       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bfc93adc30c621d5c83eee3f0d841624
@SQ     SN:chr13_random LN:186858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:563531689f3dbd691331fd6c5730a88b
@SQ     SN:chr15_random LN:784346       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bf885e99940d2d439d83eba791804a48
@SQ     SN:chr16_random LN:105485       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:dd06ea813a80b59d9c626b31faf6ae7f
@SQ     SN:chr17_random LN:2617613      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:34d5e2005dffdfaaced1d34f60ed8fc2
@SQ     SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f3814841f1939d3ca19072d9e89f3fd7
@SQ     SN:chr19_random LN:301858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:420ce95da035386cc8c63094288c49e2
@SQ     SN:chr21_random LN:1679693      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:a7252115bfe5bb5525f34d039eecd096
@SQ     SN:chr22_random LN:257318       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4f2d259b82f7647d3b668063cf18378b
@SQ     SN:chrX_random  LN:1719168      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f4d71e0758986c15e5455bf3e14e5d6f

Creating the fasta index file

We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.

> samtools faidx Homo_sapiens_assembly18.fasta 
108.446u 3.384s 2:44.61 67.9%   0+0k 0+0io 0pf+0w

This produces a text file with one record per line for each of the fasta contigs. Each record is of the: contig, size, location, basesPerLine, bytesPerLine. The index file produced above looks like:

> cat Homo_sapiens_assembly18.fasta.fai 
chrM    16571   6       50      51
chr1    247249719       16915   50      51
chr2    242951149       252211635       50      51
chr3    199501827       500021813       50      51
chr4    191273063       703513683       50      51
chr5    180857866       898612214       50      51
chr6    170899992       1083087244      50      51
chr7    158821424       1257405242      50      51
chr8    146274826       1419403101      50      51
chr9    140273252       1568603430      50      51
chr10   135374737       1711682155      50      51
chr11   134452384       1849764394      50      51
chr12   132349534       1986905833      50      51
chr13   114142980       2121902365      50      51
chr14   106368585       2238328212      50      51
chr15   100338915       2346824176      50      51
chr16   88827254        2449169877      50      51
chr17   78774742        2539773684      50      51
chr18   76117153        2620123928      50      51
chr19   63811651        2697763432      50      51
chr20   62435964        2762851324      50      51
chr21   46944323        2826536015      50      51
chr22   49691432        2874419232      50      51
chrX    154913754       2925104499      50      51
chrY    57772954        3083116535      50      51
chr1_random     1663265 3142044962      50      51
chr2_random     185571  3143741506      50      51
chr3_random     749256  3143930802      50      51
chr4_random     842648  3144695057      50      51
chr5_random     143687  3145554571      50      51
chr6_random     1875562 3145701145      50      51
chr7_random     549659  3147614232      50      51
chr8_random     943810  3148174898      50      51
chr9_random     1146434 3149137598      50      51
chr10_random    113275  3150306975      50      51
chr11_random    215294  3150422530      50      51
chr13_random    186858  3150642144      50      51
chr15_random    784346  3150832754      50      51
chr16_random    105485  3151632801      50      51
chr17_random    2617613 3151740410      50      51
chr18_random    4262    3154410390      50      51
chr19_random    301858  3154414752      50      51
chr21_random    1679693 3154722662      50      51
chr22_random    257318  3156435963      50      51
chrX_random     1719168 3156698441      50      51

liftOverVCF.pl

Contents

Introduction

This script converts a VCF file from one reference build to another. It runs 3 modules within our toolkit that are necessary for lifting over a VCF.
1. LiftoverVariants walker
2. sortByRef.pl to sort the lifted-over file
3. Filter out records whose ref field no longer matches the new reference

Obtaining the Script

The liftOverVCF.pl script is available in our public source repository under the 'perl' directory. Instructions for pulling down our source are available here.

Example

./liftOverVCF.pl -vcf calls.b36.vcf \
  -chain b36ToHg19.broad.over.chain \
  -out calls.hg19.vcf \
  -gatk /humgen/gsa-scr1/ebanks/Sting_dev
  -newRef /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19
  -oldRef /humgen/1kg/reference/human_b36_both
  -tmp /broad/shptmp [defaults to /tmp]

Usage

Running the script with no arguments will show the usage:

Usage: liftOverVCF.pl
    -vcf        <input vcf>
    -gatk       <path to gatk trunk>
    -chain      <chain file>
    -newRef     <path to new reference prefix; we will need newRef.dict, .fasta, and .fasta.fai>
    -oldRef     <path to old reference prefix; we will need oldRef.fasta>
    -out        <output vcf>
    -tmp            <temp file location; defaults to /tmp>
  • The 'tmp' argument is optional. It specifies the location to write the temporary file from step 1 of the process.


Chain files

Chain files from b36/hg18 to hg19 are located here within the Broad:

   /humgen/gsa-hpprojects/GATK/data/Liftover_Chain_Files/

External users can get them off our ftp site:

   location: ftp.broadinstitute.org
   username: gsapubftp-anonymous
   path:     Liftover_Chain_Files

1. Reference Sequence

The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see this article for more information on preparing FASTA reference sequences for use with the GATK.

Human sequence

If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.

Our Best Practice recommendation is that you use a standard GATK reference from the GATK resource bundle.

2. Sequencing Reads

The only input format for NGS reads that the GATK supports is the [Sequence Alignment/Map (SAM)] format. See [SAM/BAM] for more details on the SAM/BAM format as well as Samtools and Picard, two complementary sets of utilities for working with SAM/BAM files.

In addition to being in SAM format, we require the following additional constraints in order to use your file with the GATK:

  • The file must be binary (with .bam file extension).
  • The file must be indexed.
  • The file must be sorted in coordinate order with respect to the reference (i.e. the contig ordering in your bam must exactly match that of the reference you are using).
  • The file must have a proper bam header with read groups. Each read group must contain the platform (PL) and sample (SM) tags. For the platform value, we currently support 454, LS454, Illumina, Solid, ABI_Solid, and CG (all case-insensitive).
  • Each read in the file must be associated with exactly one read group.

Below is an example well-formed SAM field header and fields from the 1000 Genomes Project:

@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:249250621    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ     SN:2    LN:243199373    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:a0d9851da00400dec1098a9255ac712e
@SQ     SN:3    LN:198022430    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:fdfd811849cc2fadebc929bb925902e5
@SQ     SN:4    LN:191154276    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:23dccd106897542ad87d2765d28a19a1
@SQ     SN:5    LN:180915260    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:0740173db9ffd264d728f32784845cd7
@SQ     SN:6    LN:171115067    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1d3a93a248d92a729ee764823acbbc6b
@SQ     SN:7    LN:159138663    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:618366e953d6aaad97dbe4777c29375e
@SQ     SN:8    LN:146364022    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:96f514a9929e410c6651697bded59aec
@SQ     SN:9    LN:141213431    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:3e273117f15e0a400f01055d9f393768
@SQ     SN:10   LN:135534747    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:988c28e000e84c26d552359af1ea2e1d
@SQ     SN:11   LN:135006516    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:98c59049a2df285c76ffb1c6db8f8b96
@SQ     SN:12   LN:133851895    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:51851ac0e1a115847ad36449b0015864
@SQ     SN:13   LN:115169878    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:283f8d7892baa81b510a015719ca7b0b
@SQ     SN:14   LN:107349540    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:98f3cae32b2a2e9524bc19813927542e
@SQ     SN:15   LN:102531392    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:e5645a794a8238215b2cd77acb95a078
@SQ     SN:16   LN:90354753     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:fc9b1a7b42b97a864f56b348b06095e6
@SQ     SN:17   LN:81195210     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:351f64d4f4f9ddd45b35336ad97aa6de
@SQ     SN:18   LN:78077248     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c
@SQ     SN:19   LN:59128983     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1aacd71f30db8e561810913e0b72636d
@SQ     SN:20   LN:63025520     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:0dec9660ec1efaaf33281c0d5ea2560f
@SQ     SN:21   LN:48129895     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:2979a6085bfe28e3ad6f552f361ed74d
@SQ     SN:22   LN:51304566     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:a718acaa6135fdca8357d5bfe94211dd
@SQ     SN:X    LN:155270560    AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:7e0e2e580297b7764e31dbc80c2540dd
@SQ     SN:Y    LN:59373566     AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:1fa3474750af0948bdf97d5a0ee52e51
@SQ     SN:MT   LN:16569        AS:NCBI37       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:c68f52674c9fb33aef52dcf399755519
@RG     ID:ERR000162    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR000252    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001684    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001685    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001686    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001687    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001688    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001689    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR001690    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002307    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002308    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002309    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002310    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002311    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002312    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002313    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@RG     ID:ERR002434    PL:ILLUMINA     LB:g1k-sc-NA12776-CEU-1 PI:200  DS:SRP000031    SM:NA12776      CN:SC
@PG     ID:GATK TableRecalibration      VN:v2.2.16      CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, defau 
t_read_group=DefaultReadGroup, default_platform=Illumina, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, except on_if_no_tile=false, pQ=5, maxQ=40, smoothing=137       UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta    M5:b4eb71ee878d3706246b7c1dbef69299
@PG     ID:bwa  VN:0.5.5
ERR001685.4315085       16      1       9997    25      35M     *       0       0       CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT     ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@?     XT:A:U  XN:i:4    X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  RG:Z:ERR001685  NM:i:6  MD:Z:0N0N0N0N1A0A28     OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???>
ERR001689.1165834       117     1       9997    0       *       =       9997    0       CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG     >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@     RG:Z:ERR001689    OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>???????
ERR001689.1165834       185     1       9997    25      35M     =       9997    0       CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT     758A:?>>8?=@@>>?;4<>=??@@==??@?==?8     XT:A:U  XN:i:4    SM:i:25 AM:i:0  X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  RG:Z:ERR001689  NM:i:6  MD:Z:0N0N0N0N1A0A28     OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>>
ERR001688.2681347       117     1       9998    0       *       =       9998    0       CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG     5@BA@A6B???A?B??>B@B??>B@B??>BAB???     RG:Z:ERR001688    OQ:Z:=>>>><4><<?><??????????????????????       

Fixing BAM files with alternative sortings

The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To resort the BAM file please use ReorderSam.

3. Intervals

The GATK accept interval files for processing subsets of the genome in Picard-style interval lists. These files have a .interval_list extension and look like this:

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
@SQ     SN:3    LN:198022430    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:fdfd811849cc2fadebc929bb925902e5     SP:Homo Sapiens
@SQ     SN:4    LN:191154276    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:23dccd106897542ad87d2765d28a19a1     SP:Homo Sapiens
@SQ     SN:5    LN:180915260    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:0740173db9ffd264d728f32784845cd7     SP:Homo Sapiens
@SQ     SN:6    LN:171115067    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1d3a93a248d92a729ee764823acbbc6b     SP:Homo Sapiens
@SQ     SN:7    LN:159138663    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:618366e953d6aaad97dbe4777c29375e     SP:Homo Sapiens
@SQ     SN:8    LN:146364022    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:96f514a9929e410c6651697bded59aec     SP:Homo Sapiens
@SQ     SN:9    LN:141213431    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:3e273117f15e0a400f01055d9f393768     SP:Homo Sapiens
@SQ     SN:10   LN:135534747    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:988c28e000e84c26d552359af1ea2e1d     SP:Homo Sapiens
@SQ     SN:11   LN:135006516    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:98c59049a2df285c76ffb1c6db8f8b96     SP:Homo Sapiens
@SQ     SN:12   LN:133851895    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:51851ac0e1a115847ad36449b0015864     SP:Homo Sapiens
@SQ     SN:13   LN:115169878    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:283f8d7892baa81b510a015719ca7b0b     SP:Homo Sapiens
@SQ     SN:14   LN:107349540    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:98f3cae32b2a2e9524bc19813927542e     SP:Homo Sapiens
@SQ     SN:15   LN:102531392    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:e5645a794a8238215b2cd77acb95a078     SP:Homo Sapiens
@SQ     SN:16   LN:90354753     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:fc9b1a7b42b97a864f56b348b06095e6     SP:Homo Sapiens
@SQ     SN:17   LN:81195210     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:351f64d4f4f9ddd45b35336ad97aa6de     SP:Homo Sapiens
@SQ     SN:18   LN:78077248     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c     SP:Homo Sapiens
@SQ     SN:19   LN:59128983     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1aacd71f30db8e561810913e0b72636d     SP:Homo Sapiens
@SQ     SN:20   LN:63025520     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:0dec9660ec1efaaf33281c0d5ea2560f     SP:Homo Sapiens
@SQ     SN:21   LN:48129895     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:2979a6085bfe28e3ad6f552f361ed74d     SP:Homo Sapiens
@SQ     SN:22   LN:51304566     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a718acaa6135fdca8357d5bfe94211dd     SP:Homo Sapiens
@SQ     SN:X    LN:155270560    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:7e0e2e580297b7764e31dbc80c2540dd     SP:Homo Sapiens
@SQ     SN:Y    LN:59373566     AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1fa3474750af0948bdf97d5a0ee52e51     SP:Homo Sapiens
@SQ     SN:MT   LN:16569        AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:c68f52674c9fb33aef52dcf399755519     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6
...

consisting of a SAM-file-like sequence dictionary (the header), and targets in the form of + . These interval lists are tab-delimited. They are also 1-based (first position in the genome is position 1, not position 0). The easiest way to create such a file is to combine your reference file's sequence dictionary (the file stored alongside the reference fasta file with the .dict extension) and your intervals into one file.

You can also specify a list of intervals in a .interval_list file formatted as :- (one interval per line). No sequence dictionary is necessary. This file uses 1-based coordinates.

Finally, we also accept BED style interval lists. Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1.

4. Reference Ordered Data (ROD) file formats

The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:

-argumentName:name,type file

Where name is the name in the GATK tool (like "eval" in VariantEval), type is the type of the file, such as VCF or dbSNP, and file is the path to the file containing the ROD data.

The GATK supports several common file formats for reading ROD data:

  • VCF : VCF type, the recommended format for representing variant loci and genotype calls. The GATK will only process valid VCF files; VCFTools provides the official VCF validator. See [here] for a useful poster detailing the VCF specification.
  • UCSC formated dbSNP : dbSNP type, UCSC dbSNP database output
  • BED : BED type, a general purpose format for representing genomic interval data, useful for masks and other interval outputs. Please note that the bed format is 0-based while most other formats are 1-based.

Note that we no longer support the PED format. See here for converting .ped files to VCF.

Sorry, there are no publicly available documents of this type with the tag #fastareference. Try one of the other types.

I ran the HaplotypeCaller, VariantAnnotator, and Variant Validatoor on chr3 locations from a human tumor sample.

The HaplotypeCaller command line is:

gatk="/usr/local/gatk/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar"
#Fasta from the gz in the resource bundle
indx="/home/ref/ucsc.hg19.fasta" 
dbsnp="/fdb/GATK_resource_bundle/hg19-1.5/dbsnp_135.hg19.vcf"

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T HaplotypeCaller \
 -I chrom_bams/286T.chr3.bam \
 -o hapc_vcfs/286T.chr3.raw.vcf 

The VariantAnnotator command line is:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T VariantAnnotator \
     --dbsnp $dbsnp  --alwaysAppendDbsnpId \
    -A BaseQualityRankSumTest -A DepthOfCoverage \
    -A FisherStrand -A HaplotypeScore -A InbreedingCoeff \
    -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \
    -A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \
    -A TandemRepeatAnnotator \
    --variant:vcf hapc_vcfs/286T.chr3.raw.vcf \
    --out varanno_vcfs/286T.chr3.va.vcf

This all works nicely, but I go back and use ValidateVariants just to be sure:

java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T ValidateVariants \
   --dbsnp ${dbsnp} \
   --variant:vcf varanno_vcfs/286T.chr3.va.vcf \
    1> report/ValidateVariants/286T.chr3.va.valid.out \
    2> report/ValidateVariants/286T.chr3.va.valid.err &

An issue arises with a rsID that is flagged as not being present in dbSNP.

...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP

I realize this is an error message that generally would not generally qualify as an issue to post to these forums, however it is an error that seems to be generated by the Haplotype caller, illuminated by VariantAnnotator, and caught by the ValidateVariants.

The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AAAGAGAAGAGAAGAG        A       1865.98 .

There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat $dbsnp | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AA      A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG     .  PASS

My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather the call to the variantcontext function validateRSIDs() has a call to validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation..."

The UCSC genome browser for hg19 does show the specified position to be AA. It seems as thought the HaplotypeCaller simply used a different reference than dbsnp in this case.

The reference file supplied to HaplotypeCaller was the same as to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to the HaplotypeCaller as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not utilized in the calculations. It seems as though this is a difference in between the reference assembly for dbSNP and the the reference supplied by the resource bundle.

My questions are:

  1. Is this really a problem that arises from slightly different reference assemblies?
  2. Is the hg19-1.5 reference fasta different from any other hg19 reference fasta?
  3. Is there at tool that I have missed that would have prevented this error and allowed the pipeline to continue without error?"
  4. Will this strict validation failure cause problems for the VariantRecalibrator?

As it stands, I am simply going to discard the offending lines manually. There are less than twenty in the entire exome sequencing of this particular tumor-normal sequencing. However, it seems like this issue will likely arise again. I will check the dbSNP VCF for places where the reference differs from the sequence in hg19. At least that should give me an estimate of the number of times this will arise and the locations to exclude from the variant calls.

-- Colin