Plotting Genome-Wide Association Results

The interpretation of genome-wide association results can be greatly facilitated by visualization.

As part of the type 2 diabetes whole-genome scan, we developed scripts (written in R) to generate quantile-quantile (Q-Q) plots as well plots of the association results within their genomic context (gene annotations and local linkage disequilibrium patterns). These scripts are provided here "as is"; users should have a working knowledge of the R package.

The output of these scripts are PDF files that can be edited manually using programs like Canvas and Photoshop etc.

Q-Q Plot

Genome-wide analysis involves hundreds of thousands of statistical tests, and even modest levels of bias can distort the null distribution and overwhelm a small number of true associations. To search for evidence of systematic bias (from unrecognized population structure, analytical approach, genotyping artifacts, etc.), a quantile-quantile (Q-Q) plot can be used to characterize the extent to which the observed distribution of the test statistic follows the expected (null) distribution.

Download here the R script to generate a Q-Q plot
Download here an example file (46604 P-values for SNPs on chromosome 3 from the DGI scan)

Regional Association Plot

We have developed a plot that highlights the statistical strength of an association in the context of the association results for surrounding markers, gene annotations, estimated recombination rates and pairwise correlations between the surrounding markers and the putative associated variant.

You have to provide a file that contains the following data for every SNP across the region of interest: position (as Build 35 coordinates), P-value, a label to indicate whether a SNP is quot;typed" or "imputed", and the r-squared between that SNP and the putative associated variant. All SNPs in this file will be plotted with their corresponding P-values
(as -log10 values) as a function of chromosomal position. SNPs that are "typed" are plotted as diamonds; "imputed" SNPs are plotted as circles. Estimated recombination rates are plotted to reflect the local LD structure around the associated SNP and their correlated proxies (bright red indicating highly correlated, faint red indicating weakly correlated).

Download here the R script to generate a regional association plot
Download here an example file (the CDKN2A/CDKN2B gene region that we identified as a novel T2D locus)
Download here the estimated recombination rates from HapMap (using Build 35 coordinates)
Download here the gene annotations from the UCSC genome browser (using Build 35 coordinates). Note: this file is only provided for convenience. It is most certainly NOT a complete list of all human genes.

The R script needs to be modified as follows:

Specify which SNP is the key association ("rs7574865"). Of course, this SNP needs to be present in the data file.
Provide a header title ("CDKN2A/CDKN2B gene region").
Specify which chromosome it is ("9").
Provide the replication or joint P-value (5.4e-8). The blue diamond will represent this data point with the red diamond representing the initial P-value (present in the data file).

If you're a local Broad user...

I have set up an example directory in my home directory ag:~debakker/figures/ which contains the two R scripts and the two example files. Run the R scripts as follows:

R --slave < assocplot.R
R --slave < ppplot.R

(These R scripts will automatically point to a local copy of the gene annotations and the recombination rates, so there's no need to download these yourself if you run these scripts locally on the Broad server.)

I have also put here a potentially useful script to grep out the HapMap r-squared values (pre-computed with Haploview). Run as follows:

./get_rsquared.csh SNP CHR FILE

where SNP refers to the central associated SNP (give the rsID), CHR refers to the chromosome it's on, and FILE is the name of the file containing the association data (like DGI_chr9_hit.txt). Make sure the FILE contains a proper header line so that R will recognize the columns (for example, look at the DGI_chr9_hit.txt file). The output will be a file with the same filename appended with ".new", with the rsquared values added for every SNP in a separate column ("RSQR").

Questions?

You can contact the author here.

Reference

Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University, and Novartis Institutes of BioMedical Research (2007) Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science. 316: 1331-1336.