Frequently Asked Questions

What is Tagger?

Tagger is a computer program and web-based service for the selection and evaluation of tag SNPs based on the empirical patterns of linkage disequilibrium from a genetic variation resource such as the International HapMap Project. Tagger can help you with the design and analysis of genetic association studies, namely to test variants that are common in the population for a role in complex traits.

Tagger combines the simplicity of pairwise r2 methods with the potential efficiency of multimarker haplotype approaches. This is achieved by aggressively searching for multimarker predictors as effective surrogates for single tag SNPs. These predictors are explicitly recorded so that they can be tested in the downstream association analysis. Particularly attractive features for the user include the ability to specify sets of SNPs to be included and/or excluded as tag SNPs. You can also provide SNP design scores for your genotyping platform, allowing for preferential picking of SNPs that are likely to give better conversion rates.

How can I run Tagger locally on my own computer?

You can download Haploview, which includes an implementation of Tagger (by Jeffrey Barrett, Julian Maller and David Bender). Most of the functionality is there. Note that Haploview's Tagger and this web server may give different results because of the random choice among equivalent tag SNPs, although the numbers should generally agree. (Let us know if they don't!)

What are chromosomal landmarks?

For your convenience, you can specify chromosomal regions of interest without getting the data from HapMap first and uploading it yourself. More importantly, this functionality allows you to specify multiple regions simultaneously. Tagger uses a local copy of the HapMap (currently, the phased data set of Release 21). (We no longer host the 1.5M SNP Perlegen data set as described by Hinds et al.) You can specify such chromosomal regions in a format used by the genome browsers (including HapMap gbrowse): for example, chr5:58,302,467-58,918,031 for gene PDE4D. You can specify multiple regions; one region per line. Note that chromosomal positions MUST be in NCBI build 35 coordinates.
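If you build your region list with a script, it can help to validate each landmark before submitting. The following is a minimal sketch only, not part of Tagger: the names parse_region and parse_regions are hypothetical, and the accepted pattern (a "chr" prefix, optional thousands separators) is an assumption based on the example above.

```python
import re

# Hypothetical helper, not part of Tagger: matches landmarks like chr5:58,302,467-58,918,031.
_REGION = re.compile(r"^(chr[0-9A-Za-z]+):([\d,]+)-([\d,]+)$")


def parse_region(text):
    """Return (chromosome, start, end) for one landmark string."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"not a chromosomal landmark: {text!r}")
    chrom, start, end = match.groups()
    # Drop the thousands separators before converting to integers.
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return chrom, start, end


def parse_regions(block):
    """Parse one region per line, skipping blank lines."""
    return [parse_region(line) for line in block.splitlines() if line.strip()]
```

The sketch checks only the syntax; it cannot tell whether the positions are in NCBI build 35 coordinates.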

What is HapMap format?

This refers to the genotype data dump (NOT the frequency or LD data dump!) that you can obtain from the HapMap website. In this file format, the columns correspond to the HapMap samples (depending on the population sample selected), and every line corresponds to a SNP. If you upload a HapMap file, it will be automatically converted and phased, recognizing the familial (trio) relationships in YRI and CEU samples. It is preferable to use the chromosomal landmarks method instead of uploading HapMap data directly, especially if you have a large genomic region or multiple regions. We aim to keep the most current version of the HapMap installed locally.

What is "ped" file format?

The "ped" file format refers to the widely-used format for linkage pedigree data. Each line describes a single (diploid) individual in the following format:

family_ID individual_ID father_ID mother_ID gender phenotype genotype_1 genotype_2 ...

If your data lacks pedigree information (for example, unrelated case/control individuals), set the father_ID and mother_ID to 0. gender denotes the individual's sex, with 1=male and 2=female. phenotype refers to the affected status (for association studies), where 0=unknown, 1=unaffected, 2=affected. Finally, each genotype is written as two integers (one per allele of the diploid genotype, separated by whitespace), where 1=A, 2=C, 3=G, 4=T. No header lines are allowed and all columns must be separated by whitespace. Check out the information at the PLINK website on the "ped" file format.
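To make the column layout concrete, here is a minimal sketch of reading one "ped" line. It is an illustration, not Tagger's own parser: the function name parse_ped_line is hypothetical, and treating allele code 0 as missing is an assumption (the usual linkage convention) that is not stated above.

```python
# Allele coding from the description above; "0" as missing is an assumption.
ALLELE_CODES = {"0": None, "1": "A", "2": "C", "3": "G", "4": "T"}


def parse_ped_line(line):
    """Split one whitespace-separated ped line into its six leading columns and genotypes."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns followed by pairs of alleles")
    family_id, individual_id, father_id, mother_id, gender, phenotype = fields[:6]
    alleles = []
    for code in fields[6:]:
        if code not in ALLELE_CODES:
            raise ValueError(f"unknown allele code: {code!r}")
        alleles.append(ALLELE_CODES[code])
    # Consecutive allele columns form one diploid genotype per marker.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return {
        "family_id": family_id,
        "individual_id": individual_id,
        "father_id": father_id,
        "mother_id": mother_id,
        "gender": int(gender),        # 1=male, 2=female
        "phenotype": int(phenotype),  # 0=unknown, 1=unaffected, 2=affected
        "genotypes": genotypes,
    }
```

For an unrelated affected male typed at two markers, a line such as "FAM1 IND1 0 0 1 2 1 3 4 4" would give the genotypes A/G and T/T.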

What is Nick Patterson format?

If you submit a "ped" or HapMap file, Tagger will use Nick Patterson's emphase program for phasing the genotype data. The method is based on the expectation-maximization (E-M) algorithm of Excoffier and Slatkin (1995), modified to process larger data files using the partition-ligation approach. Uploading pre-phased data (or just using the local HapMap by specifying landmarks) can save computing time considerably.

Below is an example of the Patterson file format. Every line is a haplotype, and every (human) individual has two consecutive lines. The first column contains the haplotype, written as a string of allele codes with NO spaces between them. The remaining columns are optional.

3413421133414212342421123 0 0 1334F01 a
4222234211234212242421211 1 0 1334F01 b
4222234211214212242421111 2 0 1334M02 a
3413421133414212222421211 3 0 1334M02 b
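If you prepare pre-phased data yourself, a quick sanity check before uploading can catch truncated lines. The sketch below is an assumption-laden illustration, not the emphase or Tagger code: the function names are hypothetical, and it relies only on what is described above (first column is the haplotype, two consecutive lines per individual).

```python
def read_haplotypes(lines):
    """Return the first column of every non-blank line, checking basic consistency."""
    haplotypes = [line.split()[0] for line in lines if line.strip()]
    if len({len(h) for h in haplotypes}) > 1:
        raise ValueError("haplotypes do not all have the same number of markers")
    if len(haplotypes) % 2 != 0:
        raise ValueError("expected two haplotype lines per individual")
    return haplotypes


def pair_individuals(haplotypes):
    """Group consecutive haplotype lines into (first, second) pairs, one per individual."""
    return [(haplotypes[i], haplotypes[i + 1]) for i in range(0, len(haplotypes), 2)]


def heterozygous_sites(pair):
    """Return 0-based marker indices where the two haplotypes of an individual differ."""
    first, second = pair
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
```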

What should be in the marker information file?

If you're uploading a "ped" file or Nick Patterson formatted file, you MUST upload a file with SNP identifiers and their chromosomal position (in bp). These positions will be used to limit marker comparisons to within a reasonable window size (to avoid overfitting). Every line in this file corresponds to a marker in the same order as they appear in the "ped" file (but not necessarily ordered by chromosomal location), listing the SNP identifier (mandatory), chromosome (optional), physical location (mandatory), and strand (optional). SNP identifiers are case-sensitive (be aware of this when you include or exclude SNPs).

Note that if you're submitting a HapMap file, you don't need to submit an .info file (as the HapMap file already contains this information).

rs1450878  chr7  128770786  +
rs1562833  chr7  128803549  -
rs753947   chr7  128805866  -
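Because the chromosome and strand columns are optional, a line can have two, three or four fields. The following sketch shows one way to read such a line; it is not Tagger's parser, the name parse_info_line is hypothetical, and it assumes the strand (when present) is a final "+" or "-" column.

```python
def parse_info_line(line):
    """Return (snp_id, chromosome, position, strand); optional fields are None if absent."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("need at least a SNP identifier and a position")
    snp_id, rest = fields[0], fields[1:]
    strand = None
    # Assumption: a trailing "+" or "-" is the strand column.
    if rest[-1] in ("+", "-"):
        strand = rest.pop()
    if len(rest) == 1:
        chromosome, position = None, rest[0]
    elif len(rest) == 2:
        chromosome, position = rest
    else:
        raise ValueError(f"cannot interpret marker line: {line!r}")
    return snp_id, chromosome, int(position), strand
```

The SNP identifier is returned unchanged, since identifiers are case-sensitive.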

What should be in the design scores file?

You may submit a file with design scores for preferential picking of SNPs that are more likely to work on your genotyping platform. Every line corresponds to a SNP, listing its SNP identifier and its given design score.

rs1450878	0.76
rs1562833	1.00
rs753947	0.47
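To see which SNPs a minimal design score would rule out before you submit, you could use something like the sketch below. The function names are hypothetical, and the rule that a SNP is excluded when its score is strictly below the minimum is an assumption; Tagger's exact comparison is not documented here.

```python
def parse_design_scores(text):
    """Read 'SNP_ID score' lines into a dict, skipping blank lines."""
    scores = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"expected two columns: {line!r}")
        scores[fields[0]] = float(fields[1])
    return scores


def below_minimum(scores, minimum=0.0):
    """Return SNP identifiers whose score is below the minimum (assumed strict comparison)."""
    return sorted(snp for snp, score in scores.items() if score < minimum)
```

With the three example scores above and a minimum of 0.5, only rs753947 would be flagged; with the default of 0, nothing is flagged.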

What is a "haplotype blocks" file?

Haplotype blocks are segments of strong LD within which there is little evidence for ancestral recombination events (see Gabriel et al.). You can manually specify boundaries of such blocks. Please note that this is optional: you do not need to tell Tagger which markers are in LD, as it will figure this out by itself. This option has two potential uses: (1) You can force Tagger to allow all SNPs within a block to form multi-marker tests. This may increase the risk of overfitting. (2) You can tell Tagger to pick tags to capture the (common) haplotypes within the blocks by clicking the "capture haplotypes" box. For every block, use a separate line to list the chromosome and the start and stop chromosomal positions (like the positions of markers in the .info file).

chr2 128770786 128803549
chr2 128805866 128856789
chr2 129305844 130024023
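If you want to check which of your markers fall inside the blocks you defined, a small script can do this from the blocks file and the marker positions. This is only a sketch: the function names are hypothetical, and treating both block boundaries as inclusive is an assumption.

```python
def parse_blocks(text):
    """Read 'chromosome start stop' lines into a list of (chromosome, start, stop)."""
    blocks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected three columns: {line!r}")
        chromosome, start, stop = fields[0], int(fields[1]), int(fields[2])
        if start > stop:
            raise ValueError(f"block start is after stop: {line!r}")
        blocks.append((chromosome, start, stop))
    return blocks


def block_index(blocks, chromosome, position):
    """Return the index of the block containing the position, or None (inclusive bounds assumed)."""
    for index, (block_chromosome, start, stop) in enumerate(blocks):
        if chromosome == block_chromosome and start <= position <= stop:
            return index
    return None
```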

What are include tag SNPs?

You can force in sets of SNPs by submitting a file containing SNP identifiers, or by pasting the identifiers in the textbox. Tagger will then take these SNPs into account when picking new tag SNPs. This is useful if you already have genotype data for some SNPs but want to pick extra tag SNPs, or if you want to include some SNPs that have previously been shown to be associated. Make sure that the SNP identifiers are exactly the same (case-sensitive!) as those in the .info file.

I already have a list of tag SNPs. How can I see how well they capture the untyped SNPs?

If you want to evaluate the tagging performance of a set of tag SNPs, submit an include tags file (with the SNP names) and click the "evaluate" checkbox (to tell Tagger not to pick extra tags). This may be useful if more SNPs have been added to the reference panel (e.g. HapMap), or if you just want to compare different tag sets. Generally, once you have the data from your disease study, you should check how well the working SNPs capture the untyped SNPs of the reference panel, as the panel is likely to have changed or improved by then.

I suspect some SNPs will not work on my genotyping platform. How can I prevent them from being picked as tag SNPs?

You can submit a file with SNPs to be excluded from the final list of tags. Tagger will automatically try to capture these SNPs the best it can. This is useful if some of the SNPs have poor design scores (or are known to have high failure rates). If you have design scores for the SNPs you may decide to exclude the low-scoring ones by setting a minimal design score (default is 0=exclude no SNPs). Make sure the SNP identifiers are exactly the same as those in the .info file (case-sensitive!).

I use Excel for all my files. Can I upload these?

No. You cannot submit Excel files. Instead, convert your files to plain (ASCII) text files. You can do this in Excel by selecting Save As and choosing the Text (Windows) file format.

What is the allele frequency threshold?

Picking tag SNPs from your genetic map will be intimately dependent on the specific hypotheses that you want to test in the downstream association study. You can specify which alleles you want to capture by their frequency. This setting filters the data by the chosen frequency threshold (default is 5%). Example: If you want to test all >5% alleles but want to include a less common coding SNP, set the allele frequency here to 5% (for SNPs) and force in that particular SNP separately (as an include tag SNP; see above).
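As an illustration of what such a filter does, the sketch below computes the minor allele frequency of each SNP from its alleles across haplotypes and keeps the SNPs at or above the threshold. It is not Tagger's code: the function names are hypothetical, SNPs are assumed to be biallelic, and whether the real threshold is inclusive is an assumption.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the rarer allele of a biallelic SNP; 0.0 if only one allele is seen."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / len(alleles)


def common_snps(alleles_by_snp, threshold=0.05):
    """Return SNP identifiers whose minor allele frequency is at least the threshold."""
    return [
        snp
        for snp, alleles in alleles_by_snp.items()
        if minor_allele_frequency(alleles) >= threshold
    ]
```

Here alleles_by_snp maps each SNP identifier to a sequence with one allele per haplotype, for example a column taken from phased data.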

What are specific alleles to capture?

In Tagger, you can specify exactly which alleles you want to capture. By default, the specified alleles will be added to the list of SNPs (and haplotypes, if any) that pass the set frequency threshold. Click on the check box if you are exclusively interested in capturing these specific alleles. The format is two columns: SNP_ID(s), allelic genotype(s). Specify single SNPs by their identifiers (SNP_ID) (see first example). Specify haplotypes by their SNP_IDs (separated by commas, no spaces!!) and the alleles (separated by commas, no spaces!!) that make up the specific haplotype you want to tag (second example). If you want to capture all haplotypes of that marker combination, omit the second column (third example). Note that, in the latter case, haplotypes with a frequency below the specified frequency threshold (see above) will be ignored.

rs2051773
rs2051773,rs757081 4,4
rs2051773,rs757081
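The three examples above can be read with a few lines of code, which is a convenient way to catch stray spaces before uploading. This is a sketch under the format described above; the name parse_allele_spec is hypothetical and the check that the number of alleles matches the number of SNPs is an assumption about what a valid line looks like.

```python
def parse_allele_spec(line):
    """Return (snp_ids, alleles); alleles is None when the second column is omitted."""
    fields = line.split()
    if not 1 <= len(fields) <= 2:
        # A space inside a comma-separated list would produce extra columns.
        raise ValueError(f"expected one or two columns: {line!r}")
    snp_ids = fields[0].split(",")
    alleles = fields[1].split(",") if len(fields) == 2 else None
    if alleles is not None and len(alleles) != len(snp_ids):
        raise ValueError(f"need one allele per SNP: {line!r}")
    return snp_ids, alleles
```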

What is the r2 threshold?

This is the minimal coefficient of determination r2 at which all alleles are to be captured. Setting the threshold to 1.0 will result in a non-redundant set of tag SNPs where all untyped SNPs will have a perfect proxy.
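For two biallelic SNPs with allele frequencies pA and pB and a frequency pAB for the haplotype carrying both alleles, r2 is (pAB - pA*pB)^2 divided by pA*(1-pA)*pB*(1-pB). The sketch below computes this from phased haplotypes; it illustrates the standard definition and is not taken from Tagger's source. The function name is hypothetical, and returning 0.0 for a monomorphic SNP is a convention chosen here.

```python
def r_squared(alleles_a, alleles_b):
    """r2 between two biallelic SNPs, given one allele per phased haplotype for each SNP."""
    if len(alleles_a) != len(alleles_b) or len(alleles_a) == 0:
        raise ValueError("need the same non-zero number of haplotypes for both SNPs")
    n = len(alleles_a)
    # Use the allele on the first haplotype as the reference allele of each SNP.
    ref_a, ref_b = alleles_a[0], alleles_b[0]
    p_a = sum(a == ref_a for a in alleles_a) / n
    p_b = sum(b == ref_b for b in alleles_b) / n
    p_ab = sum(a == ref_a and b == ref_b for a, b in zip(alleles_a, alleles_b)) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return 0.0  # convention chosen here: a monomorphic SNP cannot act as a proxy
    d = p_ab - p_a * p_b
    return d * d / denominator
```

Two SNPs whose alleles always travel together on the same haplotypes give r2 = 1, which is what a perfect proxy means at a threshold of 1.0.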

What is pairwise tagging?

Pairwise tagging means that all tag SNPs will act as direct proxies to all other untyped SNPs because they are highly correlated with one another. In pairwise tagging mode, Tagger should behave similarly to ldSelect developed by Carlson et al.
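The idea of pairwise tagging can be sketched as a greedy selection: repeatedly pick the SNP that captures the most not-yet-captured SNPs at the chosen r2 threshold. The code below is a simplified illustration in that spirit, not the actual algorithm of Tagger or ldSelect. The function name, the dictionary of pairwise r2 values and the default threshold of 0.8 are all assumptions made for the example, and ties are broken by input order rather than randomly.

```python
def greedy_pairwise_tags(snps, r2, threshold=0.8):
    """Pick tags so that every SNP is itself a tag or has r2 >= threshold with a tag.

    r2 maps (snp, snp) pairs to their r2 value; missing pairs count as 0.
    """
    def captures(tag, snp):
        if tag == snp:
            return True
        value = r2.get((tag, snp), r2.get((snp, tag), 0.0))
        return value >= threshold

    uncaptured = set(snps)
    tags = []
    while uncaptured:
        # Choose the candidate that captures the most remaining SNPs.
        best = max(snps, key=lambda c: sum(captures(c, s) for s in uncaptured))
        tags.append(best)
        uncaptured -= {s for s in uncaptured if captures(best, s)}
    return tags
```

Every uncaptured SNP captures at least itself, so each round removes at least one SNP and the loop always terminates.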

What is aggressive tagging?

The major advance in Tagger is to improve tagging efficiency by aggressively searching for multi-marker predictors to capture all alleles of interest (SNPs and/or haplotypes). Since all marker combinations (even for a modest number of SNPs in the data set) are impossible to evaluate, the search for an effective predictor for an allele is limited by a heuristic based on the underlying LD structure. You can limit the search by setting the LOD score between markers (higher is stricter but may give less efficiency gain); number of iterations (lower is less computational burden); the maximal number of markers to be included in a predictor (lower is less computational burden); and the maximum number of tests allowed to find a good predictor for each allele (lower is less computational burden).

Why would I want to ignore marker pairs separated at some distance or greater?

By setting a maximum separation distance for marker pairs, you can prevent distant SNPs from "seeing" each other or forming a haplotype predictor, and thus minimize the risk of relying on spurious correlations between SNPs.
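In effect, the setting restricts which marker pairs are compared at all. A minimal sketch of such a filter follows; the function name is hypothetical, all markers are assumed to lie on the same chromosome, and pairs separated by exactly the maximum distance are dropped, following the "some distance or greater" wording of the question.

```python
from itertools import combinations


def pairs_within(markers, max_distance):
    """Return pairs of SNP identifiers whose positions are less than max_distance bp apart.

    markers is a list of (snp_id, position) tuples on one chromosome.
    """
    kept = []
    for (snp_a, pos_a), (snp_b, pos_b) in combinations(markers, 2):
        if abs(pos_a - pos_b) < max_distance:
            kept.append((snp_a, snp_b))
    return kept
```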

Why do some SNPs have no tests listed as predictors?

While Tagger tries very hard to find an effective predictor on the basis of the picked tags to capture all alleles of interest (SNPs and, if specified, haplotypes), some alleles may not be captured well (that is, with adequate r2). If the best test for an allele falls below a certain cutoff (default is r2 < 0.1), Tagger will ignore that test and print -1 to indicate that the allele is not captured.

I get a warning that says "Lots of missing data or hets". What is this?

This is a warning generated by emphase (written by Nick Patterson), and you can safely ignore it if it's printed only a few times.