Frequently Asked Questions


1. What is DAPPLE testing?
2. What type of input does DAPPLE take?
3. What goes in the "Genes to Specify" box?
4. If I input a SNP, how is the region defined around that SNP?
5. For a region, how are overlapping genes defined?
6. Where does the protein-protein interaction data come from?
7. What are all these output files?


1. What is DAPPLE testing?


The hypothesis behind DAPPLE is that causal genetic variants affect common mechanisms and that these mechanisms can be inferred by looking for physical connections between proteins encoded in disease-associated regions. DAPPLE therefore tests whether the networks built from seed regions - both direct networks and indirect networks - are more connected than expected by chance. Chance expectation is defined as the connectivity that would be observed if connectivity were purely a function of the binding degree of the participating proteins.



2. What type of input does DAPPLE take?


DAPPLE takes 4 types of input:

(1) A list of SNPs. These are entered as one SNP per line, either in a specified file or directly via the webpage interface. These SNPs must be in HapMap, because DAPPLE uses HapMap to define the 'wingspan' region around each SNP, which is a function of linkage disequilibrium. If a seed SNP is not in DAPPLE's database, a warning will be returned that the seed SNP could not be found. We suggest that users find a proxy for that SNP and only use results that derive from a DAPPLE run with no such warnings.

(2) A list of regions. These are entered as one region per line, either in a specified file or directly via the webpage interface. Each region should be entered as 'ID chr left right' where ID is a region identifier, chr is a number from 1-23, left is the left boundary in genomic coordinates and right is the right boundary. The entries can be space or tab delimited.

(3) A list of genes with region identifiers, or "gene-regions". These are entered as one entry per line, either in a specified file or directly via the webpage interface. Each entry should be defined as 'gene ID' where gene is a gene symbol (i.e., HUGO) ID, such as "ATXN1", and ID refers to the region to assign the gene to. Since DAPPLE is specifically looking for connectivity between regions - and not within regions - users can group genes however they want to define regions.

(4) A list of genes. These are entered as one entry per line, either in a specified file or directly via the webpage interface. Each gene should be identified with its gene symbol (i.e., HUGO) ID, such as "ATXN1". This mode should be used if the user does not want to group genes into regions, but rather wants each gene to stand as its own region.
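
As an illustration of input types (2) and (3), the following is a minimal Python sketch that checks entries before submission. It is not part of DAPPLE; the function names, the Region type and the example coordinates are hypothetical, and the only rules assumed are the ones stated above (four fields 'ID chr left right' with chr from 1-23, or two fields 'gene ID', space or tab delimited).

```python
from typing import NamedTuple


class Region(NamedTuple):
    region_id: str
    chrom: int
    left: int
    right: int


def parse_region_line(line: str) -> Region:
    """Parse one 'ID chr left right' entry (space or tab delimited)."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    region_id, chrom, left, right = fields
    chrom_n, left_n, right_n = int(chrom), int(left), int(right)
    if not 1 <= chrom_n <= 23:
        raise ValueError(f"chr must be a number from 1-23: {line!r}")
    if left_n > right_n:
        raise ValueError(f"left boundary exceeds right boundary: {line!r}")
    return Region(region_id, chrom_n, left_n, right_n)


def parse_gene_region_line(line: str) -> tuple:
    """Parse one 'gene ID' entry and return (gene symbol, region ID)."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
    return fields[0], fields[1]


# Hypothetical entries, for illustration only.
print(parse_region_line("locus1\t6\t16299343\t16761722"))
print(parse_gene_region_line("ATXN1 locus1"))
```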



3. What goes in the "Genes to Specify" box?


DAPPLE will treat each protein in the input regions equally, as long as the protein is in the InWeb database. If the user feels that, for any of the regions, only a subset of genes should be considered, then the user should enter those genes here as gene symbol (i.e., HUGO) IDs, such as "ATXN1". A common example is the HLA locus, which is large, contains many genes and can often introduce a lot of noise into the analysis. If the user specifies one or more HLA genes likely to be the candidate(s), the rest of the genes in the region will not be considered.



4. If I input a SNP, how is the region defined around that SNP?


The region is defined using LD according to HapMap. For a given SNP, we first take the interval spanned by the SNPs in LD with it at r^2 >= 0.5, and then extend that interval outward on each side to the nearest recombination hotspot.
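
The two-step definition can be sketched in Python as follows. This is only an illustration under stated assumptions, not DAPPLE's implementation: the function name is hypothetical, LD partners are given as (position, r^2) pairs on the same chromosome as the seed SNP, each hotspot is reduced to a single position, and a side with no hotspot keeps its LD boundary.

```python
def snp_region(seed_pos, ld_partners, hotspots, r2_threshold=0.5):
    """Return (left, right) for the region around a seed SNP.

    ld_partners: iterable of (position, r2) pairs relative to the seed SNP.
    hotspots: iterable of hotspot positions (assumed single coordinates).
    """
    # Step 1: interval spanned by the seed and all SNPs with r^2 >= threshold.
    in_ld = [pos for pos, r2 in ld_partners if r2 >= r2_threshold]
    left = min(in_ld + [seed_pos])
    right = max(in_ld + [seed_pos])

    # Step 2: extend outward to the nearest hotspot on each side, if any.
    left_hotspots = [h for h in hotspots if h <= left]
    right_hotspots = [h for h in hotspots if h >= right]
    if left_hotspots:
        left = max(left_hotspots)
    if right_hotspots:
        right = min(right_hotspots)
    return left, right


# Toy positions, for illustration only.
region = snp_region(
    seed_pos=1000,
    ld_partners=[(900, 0.8), (1200, 0.6), (1500, 0.2)],
    hotspots=[100, 700, 1300, 2000],
)
print(region)  # (700, 1300)
```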



5. For a region, how are overlapping genes defined?


The hg18 gene list was downloaded from UCSC using Ensembl transcripts. Splice isoforms were then collapsed to define the largest gene footprint, from transcription start to transcription stop. Gene footprints were then extended on either end to include 50kb of regulatory sequence by default, though the user can specify a different regulatory region. Any gene footprint that overlaps a region is included in that region. If a gene overlaps 2 regions, those regions are merged. If the user would like to keep the regions separate, they should input genes and explicitly assign them to regions (option #3 on "What type of input does DAPPLE take?").
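
The overlap and merge rules above can be sketched in Python. This is an illustration under assumptions, not DAPPLE's code: the function names and data layout are hypothetical, coordinates are treated as closed intervals, and the gene table is a plain dictionary rather than the UCSC download.

```python
def extend_footprint(tx_start, tx_end, flank=50_000):
    """Extend a collapsed gene footprint by `flank` bases on either end."""
    return max(0, tx_start - flank), tx_end + flank


def genes_in_region(region, genes, flank=50_000):
    """region: (chrom, left, right); genes: name -> (chrom, tx_start, tx_end)."""
    chrom, left, right = region
    hits = []
    for name, (g_chrom, start, end) in genes.items():
        g_left, g_right = extend_footprint(start, end, flank)
        # Closed-interval overlap test on the same chromosome.
        if g_chrom == chrom and g_left <= right and left <= g_right:
            hits.append(name)
    return sorted(hits)


def merge_regions_sharing_genes(region_genes):
    """Merge regions that share at least one gene.

    region_genes: region ID -> set of gene names.
    Returns a list of (set of region IDs, set of gene names).
    """
    merged = []
    for rid, genes in region_genes.items():
        ids, gene_set = {rid}, set(genes)
        remaining = []
        for m_ids, m_genes in merged:
            if m_genes & gene_set:
                ids |= m_ids
                gene_set |= m_genes
            else:
                remaining.append((m_ids, m_genes))
        remaining.append((ids, gene_set))
        merged = remaining
    return merged


# Toy coordinates, for illustration only.
genes = {"A": (1, 100_000, 120_000), "B": (1, 400_000, 410_000), "C": (2, 100_000, 120_000)}
print(genes_in_region((1, 160_000, 200_000), genes))           # ['A']
print(genes_in_region((1, 160_000, 200_000), genes, flank=0))  # []
```

In the toy example, gene A is picked up only because of the 50kb extension; with no regulatory flank it falls outside the region.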



6. Where does the protein-protein interaction data come from?


We use the InWeb database, published by Kasper Lage in 2007. This database contains 428,430 reported interactions, 169,810 of which are deemed high-confidence, non-self interactions across 12,793 proteins. High-confidence is defined by a rigorously tested signal to noise threshold as determined by comparison to well-established interactions. Briefly, InWeb combines reported protein interactions from MINT, BIND, IntAct, KEGG annotated protein-protein interactions (PPrel), KEGG Enzymes involved in neighboring steps (ECrel), Reactome and others as described elsewhere in detail. All human interactions were pooled and interactions in orthologous protein pairs passing a strict threshold for orthology were included. Each interaction was assigned a probabilistic score based on the neighborhood of the interaction, the scale of the experiment in which the interaction was reported and the number of different publications in which the interaction had been cited.



7. What are all these output files?


DAPPLE outputs a number of files, all of which are described here.

FILE_summary: This file contains the parameter values for the 4 network statistics measured: (1) the number of direct connections between seed proteins from different loci, (2) the average seed protein direct connectivity (a.k.a. direct binding degree), (3) the average seed protein indirect connectivity (a.k.a. indirect binding degree) and (4) the average common interactor binding degree (the average number of seed proteins that common interactors bind to).

FILE_NetStats: This file contains the permutation p-values for the 4 network statistics described in FILE_summary (i.e., what is the probability of seeing a parameter value >= the observed value by chance?).
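
For readers unfamiliar with permutation p-values, the idea can be sketched in Python. This is a generic illustration, not DAPPLE's code: the function name is hypothetical, and the permuted values are assumed to be the same statistic recomputed on each permuted network. Some implementations add 1 to both the numerator and the denominator; which convention DAPPLE uses is not stated here.

```python
def permutation_p_value(observed, permuted):
    """Fraction of permuted statistics that are >= the observed statistic."""
    if not permuted:
        raise ValueError("at least one permuted value is required")
    hits = sum(1 for value in permuted if value >= observed)
    return hits / len(permuted)


# Toy values: 2 of the 4 permuted statistics are >= 5.
print(permutation_p_value(5, [1, 5, 7, 3]))  # 0.5
```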

FILE_SeedScores: This file contains the individual p-values for seed proteins - generally, the probability that by chance the seed protein would be as connected to other seed proteins (directly or indirectly) as is observed. Please refer to the publication's supplementary materials for exact details of p-value calculation. The file contains 4 columns: gene ID, region ID, uncorrected p-value, corrected p-value.

FILE_GenesToPrioritize: This file contains genes that achieved a corrected p-value less than 0.05.
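
The relationship between FILE_SeedScores and FILE_GenesToPrioritize can be sketched in Python. This is an illustration only: the function name is hypothetical, and the sketch assumes whitespace-delimited rows with the 4 columns in the order listed above. Rows whose last column is not a number (such as a possible header) are skipped.

```python
def genes_to_prioritize(lines, alpha=0.05):
    """Return gene IDs whose corrected p-value is below alpha.

    lines: rows of 'gene ID, region ID, uncorrected p-value, corrected p-value',
    assumed here to be whitespace delimited.
    """
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        gene, _region, _p_uncorrected, p_corrected = fields
        try:
            p_value = float(p_corrected)
        except ValueError:
            continue  # skip rows such as a header
        if p_value < alpha:
            hits.append(gene)
    return hits


# Made-up rows, for illustration only.
rows = ["GENE REGION P P_CORR", "ATXN1 r1 0.001 0.01", "GENE2 r2 0.2 0.6"]
print(genes_to_prioritize(rows))  # ['ATXN1']
```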

FILE_CIscores: This file contains the p-values for common interactors that describe the probability that by chance individual common interactors would be as connected to seed proteins as was observed.

FILE_directConnections: This file contains a list of the direct connections in the network.

FILE_plot: If the user chose plot=true, this is the visualization of the network. Page 1 shows the direct network and pages 2-3 show the indirect network. Colors of seeds correspond to region.

FILE_Iterate_*: If the user chose to iterate, there will be a mirror set of files that describe the same statistics after iteration. Iteration refers to the process by which DAPPLE is run and genes that reach a p-value of less than 0.05 after correction are prioritized for the next iteration, while the rest of the genes in the region that did not pass the same threshold are removed. DAPPLE will iterate until it converges; that is, until it cannot propose any new candidate genes.
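
The iteration loop can be sketched in Python. This reflects one reading of the description above and is not DAPPLE's implementation: the function name, the score_fn callback (standing in for a full DAPPLE run that returns corrected p-values per gene) and the max_rounds safety limit are all hypothetical. The sketch also assumes that regions without any prioritized gene keep all of their genes.

```python
def iterate_until_converged(region_genes, score_fn, alpha=0.05, max_rounds=100):
    """Repeat scoring until no new candidate genes are proposed.

    region_genes: region ID -> set of gene names.
    score_fn: callable taking the current region -> genes mapping and
              returning gene -> corrected p-value (hypothetical stand-in).
    """
    current = {rid: set(genes) for rid, genes in region_genes.items()}
    prioritized = set()
    for _ in range(max_rounds):
        p_values = score_fn(current)
        new_hits = {g for g, p in p_values.items() if p < alpha} - prioritized
        if not new_hits:
            break  # converged: no new candidate genes
        prioritized |= new_hits
        for rid, genes in current.items():
            keep = genes & prioritized
            if keep:
                # Drop the genes in this region that did not pass the threshold.
                current[rid] = keep
    return prioritized, current


def toy_score(current):
    """Made-up scores: gene C only becomes significant once B is removed."""
    if "B" in current["r1"]:
        return {"A": 0.01, "B": 0.5, "C": 0.2, "D": 0.3}
    return {"A": 0.01, "C": 0.02, "D": 0.3}


genes, regions = iterate_until_converged({"r1": {"A", "B"}, "r2": {"C", "D"}}, toy_score)
print(sorted(genes))  # ['A', 'C']
```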

FILE_MissingGenes: Pay close attention to this file. If the input is SNPs or regions, it lists which genes in those input regions are in the InWeb database and which are not. If too many input proteins are missing from the InWeb database (less than 60% average inclusion), DAPPLE results should be interpreted with caution.
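
The 60% guideline can be checked with a short Python sketch. This is an illustration under assumptions: the function name is hypothetical, and "average inclusion" is taken to mean the per-region fraction of genes found in InWeb, averaged across regions. DAPPLE's own calculation may differ.

```python
def average_inclusion(region_genes, inweb_genes):
    """Average, across regions, of the fraction of genes present in InWeb.

    region_genes: region ID -> set of gene names.
    inweb_genes: set of gene names present in the interaction database.
    """
    fractions = [
        len(genes & inweb_genes) / len(genes)
        for genes in region_genes.values()
        if genes  # ignore regions with no genes
    ]
    if not fractions:
        raise ValueError("no regions with genes")
    return sum(fractions) / len(fractions)


# Toy input: region r1 has 1 of 2 genes in InWeb, region r2 has 1 of 1.
inclusion = average_inclusion({"r1": {"A", "B"}, "r2": {"C"}}, {"A", "C"})
print(inclusion)  # 0.75
if inclusion < 0.6:
    print("Warning: low InWeb coverage; interpret results with caution.")
```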


