Input File Formats
Haploview currently accepts input data in five formats: standard linkage format, completely or partially phased haplotypes, HapMap Project data dumps, HapMap PHASE format, and PLINK output. The program can also automatically fetch phased HapMap data from the HapMap website. It also takes a separate file with marker position information, as well as several auxiliary input files. All of these formats are explained in depth below.
Linkage Format
Linkage data should be in the Linkage Pedigree (pre MAKEPED) format, with columns of family, individual, father, mother, gender, affected status and genotypes. The file should not have a header line (i.e. the first line should be for the first individual, not the names of the columns). Please note that Haploview can only interpret biallelic markers; markers with more than two alleles (e.g. microsatellites) will not work correctly. A sample line from such a file might look something like:
3 12 8 9 1 2 1 2 3 3 0 0 4 2
a b  c d e f -------g-------
-
Pedigree Name
A unique alphanumeric identifier for this individual's family. Unrelated individuals should not share a pedigree name.
-
Individual ID
An alphanumeric identifier for this individual. Should be unique within his family (see above).
-
Father's ID
Identifier corresponding to the father's individual ID, or "0" if the father is unknown. Note that if a father ID is specified, the father must also appear in the file.
-
Mother's ID
Identifier corresponding to the mother's individual ID, or "0" if the mother is unknown. Note that if a mother ID is specified, the mother must also appear in the file.
-
Sex
Individual's gender (1=MALE, 2=FEMALE).
-
Affection status
Affection status to be used for association tests (0=UNKNOWN, 1=UNAFFECTED, 2=AFFECTED).
-
Marker genotypes
Each marker is represented by two columns (one for each allele, separated by a space) and coded either ACGT or 1-4, where 1=A, 2=C, 3=G, 4=T. A 0 in any of the marker genotype positions (as in the genotypes for the third marker above) indicates missing data.
It is also worth noting that this format can be used with non-family based data. Simply use a dummy value for the pedigree name (1, 2, 3...) and fill in zeroes for father and mother ID. It is important that the "dummy" value for the pedigree name be unique for each individual. Affection status can be used to designate cases vs. controls (2 and 1, respectively).
Files should also follow these guidelines:
- Families should be listed consecutively within the file (i.e. all the lines with the same pedigree ID should be adjacent).
- If an individual has a nonzero parent ID, that parent should be included in the file on their own line.
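The rules above can be checked before loading a file. The following is a minimal sketch, not part of Haploview: the function names `parse_ped_line` and `check_pedigree` are hypothetical, and it assumes whitespace-separated columns with "0" as the missing allele code.

```python
def parse_ped_line(line):
    """Split one pre-MAKEPED pedigree line into its fields."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus allele pairs")
    family, individual, father, mother, sex, status = fields[:6]
    alleles = fields[6:]
    # Group the allele columns into one (allele1, allele2) pair per marker.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return {"family": family, "id": individual, "father": father,
            "mother": mother, "sex": sex, "status": status,
            "genotypes": genotypes}


def check_pedigree(lines):
    """Return a list of problems; an empty list means none were detected."""
    records = [parse_ped_line(line) for line in lines if line.strip()]
    problems = []
    members = {(r["family"], r["id"]) for r in records}
    finished = set()   # families whose block of lines has already ended
    previous = None
    for r in records:
        fam = r["family"]
        if fam != previous:
            if fam in finished:
                problems.append(f"family {fam} is not listed consecutively")
            if previous is not None:
                finished.add(previous)
            previous = fam
        # A nonzero parent ID must have its own line in the same family.
        for role in ("father", "mother"):
            parent = r[role]
            if parent != "0" and (fam, parent) not in members:
                problems.append(f"{role} {parent} of {fam}/{r['id']} is missing")
    # Biallelic check: at most two distinct non-missing alleles per marker.
    n_markers = max((len(r["genotypes"]) for r in records), default=0)
    for m in range(n_markers):
        seen = set()
        for r in records:
            if m < len(r["genotypes"]):
                seen.update(a for a in r["genotypes"][m] if a != "0")
        if len(seen) > 2:
            problems.append(f"marker {m + 1} has more than two alleles")
    return problems
```

Calling `check_pedigree` on the sample line alone reports the father (8) and mother (9) as missing, since neither has a line of their own. The sketch does not reconcile files that mix ACGT and 1-4 coding for the same marker.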
Phased Haplotypes
Haplotype data for Haploview's input must be formatted in columns of Family, Individual and Genotypes. There should be two lines (chromosomes) for each individual. This is the standard format of Genehunter's TDT output. See the sample below:
FAM1 FAM1M01 0 4 2 2
FAM1 FAM1M01 0 4 2 2
FAM1 FAM1F02 3 h 1 2
FAM1 FAM1F02 3 h 1 2
The data format uses the numerals 1-4 to represent alleles, the number zero to represent missing data, and the letter "h" to represent a heterozygous genotype of unknown phase. That is, if an individual is heterozygous at a locus and the phasing (which allele falls on which chromosome) is uncertain, both alleles should be coded as "h".
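As an illustration of the two-lines-per-individual layout, here is a small sketch that pairs consecutive lines and checks that "h" appears on both chromosomes at a locus. The function name `read_haps` is hypothetical and the code is not taken from Haploview; it assumes whitespace-separated columns.

```python
def read_haps(lines):
    """Pair consecutive lines into (family, individual, chrom_a, chrom_b)."""
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) % 2 != 0:
        raise ValueError("each individual needs exactly two lines")
    individuals = []
    for first, second in zip(rows[0::2], rows[1::2]):
        # Both lines of a pair must name the same family and individual.
        if first[:2] != second[:2]:
            raise ValueError(f"unpaired lines: {first[:2]} and {second[:2]}")
        chrom_a, chrom_b = first[2:], second[2:]
        if len(chrom_a) != len(chrom_b):
            raise ValueError(f"{first[1]}: chromosomes differ in length")
        for locus, (x, y) in enumerate(zip(chrom_a, chrom_b), start=1):
            # "h" marks an unphased heterozygote, so it must be on both lines.
            if (x == "h") != (y == "h"):
                raise ValueError(f"{first[1]}: 'h' on one chromosome only "
                                 f"at locus {locus}")
        individuals.append((first[0], first[1], chrom_a, chrom_b))
    return individuals
```

Applied to the four sample lines above, this yields two individuals, each with two four-locus chromosomes.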
HapMap Project Data Dumps
Data from the HapMap Project can be dumped by region using the GBrowse interface. The saved data file is in a marker-per-line format which can be loaded in Haploview.
GBrowse dumps only one file, which has one marker per line and which includes familial relationships among the HapMap samples as well as marker position information. The file format has several header lines (beginning with "#") which Haploview parses. Open the file by selecting the "Browse HapMap Data" option and selecting the downloaded file.
If you wish to load data from another source in HapMap style format, you will need to specify pedigree information in the header of the file you've created. This can be done by creating lines of the following format at the top of your file:
#@ FAM01 NA0001 0 0 1 1
These fields are the same as in the pedfile format discussed above: family, individual, father, mother, gender, affected status. You would then replace the NAXXXX identifiers in the header row of the HapMap file with your identifiers, subject to two important constraints: they must be unique across the entire dataset (not just within a family), and they must begin with the characters NA.
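The header lines can be generated programmatically. The sketch below is illustrative only: `pedigree_header` is a hypothetical helper, and it enforces just the two constraints stated above (dataset-wide uniqueness and the NA prefix).

```python
def pedigree_header(samples):
    """Build '#@' pedigree lines from (family, individual, father, mother,
    sex, status) tuples."""
    lines = []
    seen = set()
    for family, individual, father, mother, sex, status in samples:
        if not individual.startswith("NA"):
            raise ValueError(f"{individual}: ID must begin with 'NA'")
        # IDs must be unique across the whole dataset, not just per family.
        if individual in seen:
            raise ValueError(f"{individual}: ID is not unique")
        seen.add(individual)
        lines.append(f"#@ {family} {individual} {father} {mother} {sex} {status}")
    return lines
```

For example, `pedigree_header([("FAM01", "NA0001", "0", "0", 1, 1)])` reproduces the sample header line shown above.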
HapMap PHASE Format
Data in the HapMap PHASE format can be loaded into Haploview using three separate files. The first is the data file containing binary allele information. The second is a sample file containing a single column of the individual IDs used in the dataset. The third is a legend file containing four columns: marker, position, 0, and 1. Only the legend file requires a header; it is used to decode the information in the data file. These files can be loaded as GZIP compressed files using the "Files are GZIP compressed" checkbox on the initial loading screen. For more information on the HapMap PHASE format, please see the HapMap PHASE readme.
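To show how the legend decodes the data file, here is a minimal sketch. It assumes, without confirmation from this page, that each row of the data file is one chromosome with one whitespace-separated 0/1 value per marker, in the same order as the legend rows. The function names, the header text, and the marker names in the usage note are all hypothetical.

```python
def read_legend(lines):
    """Return one (marker, position, allele_for_0, allele_for_1) per marker."""
    rows = [line.split() for line in lines if line.strip()]
    # The first row is the required header and is skipped.
    return [(marker, int(position), a0, a1)
            for marker, position, a0, a1 in rows[1:]]


def decode_row(row, legend):
    """Translate one row of 0/1 codes into allele letters via the legend."""
    codes = row.split()
    if len(codes) != len(legend):
        raise ValueError("row length does not match the number of markers")
    alleles = []
    for code, (marker, _position, a0, a1) in zip(codes, legend):
        if code not in ("0", "1"):
            raise ValueError(f"{marker}: unexpected code {code!r}")
        alleles.append(a0 if code == "0" else a1)
    return alleles
```

With an illustrative legend of `marker position 0 1`, `rsA 100 A G`, `rsB 200 C T`, the data row `0 1` decodes to A at the first marker and T at the second. GZIP compressed inputs could be read with Python's standard `gzip.open(path, "rt")` before being passed to these functions.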
HapMap Download
Data in the HapMap PHASE format can also be automatically downloaded into Haploview using the "HapMap Download" tab in the load screen by specifying the HapMap Release, chromosome, analysis panel, and start and end positions (in kb). These options can also be automatically filled in by querying the GeneCruiser database with a gene or SNP ID. More information about the GeneCruiser database can be found at the GeneCruiser website.
Marker Information File
The marker info file has two columns: marker name and position. The positions can be either absolute chromosomal coordinates or relative positions. It might look something like this:
marker01 190299
marker02 190950
marker03 191287
An optional third column can be included in the info file to make additional notes for specific SNPs. SNPs with additional information are highlighted in green on the LD display. For instance, you could make note that the first SNP is a coding variant as follows:
marker01 190299 CODING_SNP
marker02 190950
marker03 191287
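A short sketch of reading this layout, including the optional third column, is given here for illustration. `read_info` is a hypothetical name, and the code assumes integer positions and whitespace-separated columns.

```python
def read_info(lines):
    """Return a list of (marker_name, position, note_or_None) tuples."""
    markers = []
    for line in lines:
        # Split into at most three parts so a note may contain spaces.
        parts = line.strip().split(None, 2)
        if not parts:
            continue  # skip blank lines
        if len(parts) < 2:
            raise ValueError(f"missing position for marker {parts[0]}")
        name, position = parts[0], int(parts[1])
        note = parts[2] if len(parts) == 3 else None
        markers.append((name, position, note))
    return markers
```

For the sample above, the first entry carries the note `CODING_SNP` and the other two have no note.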
PLINK Format
Output files from PLINK can be loaded into Haploview using the PLINK tab on the initial loading screen. PLINK files must contain a header row, and at least one column must be titled "SNP" and contain the marker IDs for the results in the file. PLINK loading also requires a standard PLINK map or binary map file corresponding to the markers in the output file. The map file can have either three or four headerless columns (the Morgan distance column is optional). Alternatively, the map information can be embedded in the results file as its first few columns by using the "Integrated Map Info" checkbox. You can also load non-SNP based files by checking the "Non-SNP" box; these files do not require a map file. You can choose to load only one chromosome from your results file using the "Only load results from Chromosome" checkbox and selecting a chromosome from the dropdown list. You can also select which columns to load from your results file by checking the "Select Columns" checkbox. For more information on PLINK outputs, please see Shaun Purcell's PLINK website.
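The header and map requirements can be sanity-checked with a few lines of code. This sketch is not Haploview's loader. The helper names are hypothetical, and it assumes the standard PLINK text map layout (chromosome, marker ID, optional Morgan distance, base-pair position). The binary map format is not handled.

```python
def snp_column(header_line):
    """Return the index of the required "SNP" column in a results header."""
    columns = header_line.split()
    if "SNP" not in columns:
        raise ValueError('results file needs a column titled "SNP"')
    return columns.index("SNP")


def parse_map_line(line):
    """Parse a three- or four-column map line into (chrom, marker, position)."""
    fields = line.split()
    if len(fields) == 4:
        chrom, marker, _morgans, position = fields
    elif len(fields) == 3:
        chrom, marker, position = fields
    else:
        raise ValueError("map lines must have three or four columns")
    return chrom, marker, int(position)
```

Both `parse_map_line("1 rs123 0 1000")` and `parse_map_line("1 rs123 1000")` describe the same (made-up) marker, with and without the Morgan distance column.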
Batch Load File
The "-batch" flag on the command line allows you to run Haploview automatically (in nogui mode) on several files. Batch input files should have one genotype file per line, along with an info file (if desired) separated by a space. Filenames must conform to the following rules:
- Pedfile names must end in ".ped"
- Phased haplotype file names must end in ".haps"
- HapMap file names must end in ".hmp"
- Info file names must end in ".info"
The following example shows two pedfiles (with info files) and a HapMap file:
sample1.ped sample1.info
sample2.ped sample2.info
sample3.hmp
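A batch file can be checked against the naming rules before it is passed to the "-batch" flag. The following is a sketch only: `parse_batch` is a hypothetical helper, and it assumes filenames contain no spaces.

```python
GENOTYPE_SUFFIXES = (".ped", ".haps", ".hmp")


def parse_batch(lines):
    """Return (genotype_file, info_file_or_None) for each batch line."""
    jobs = []
    for line in lines:
        names = line.split()
        if not names:
            continue  # skip blank lines
        if len(names) > 2:
            raise ValueError(f"too many entries on line: {line.strip()}")
        data = names[0]
        info = names[1] if len(names) == 2 else None
        # The genotype file's extension must be one of the recognised ones.
        if not data.endswith(GENOTYPE_SUFFIXES):
            raise ValueError(f"{data}: must end in .ped, .haps or .hmp")
        if info is not None and not info.endswith(".info"):
            raise ValueError(f"{info}: info file must end in .info")
        jobs.append((data, info))
    return jobs
```

For the example above, this returns three jobs, the last of which has no info file.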