Gencrypt: One-way cryptographic hashes to identify overlapping individuals

Gencrypt is a lightweight tool that lets researchers securely compare individual-level data between genomic datasets using a class of security algorithms known as one-way cryptographic hashes. By converting individual-level genotypes into one-way hashes, researchers can compare SNP data indirectly and identify individuals who appear in both datasets without violating IRB guidelines or compromising patient privacy.

Input is individual-level data in the form of PLINK .ped and .bim files, plus a seed value. When Gencrypt builds each hash, it includes SNPs in a random order. Doing so helps avoid cryptic relationships between hash outputs caused by linkage disequilibrium between SNPs. The random order is determined by the input seed value, so independent runs of Gencrypt with the same seed reproduce the same ordering every time.
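The seeded ordering described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Gencrypt's actual implementation: the function name `hash_genotypes`, the `snps_per_hash` parameter, the choice of SHA-256, and the representation of genotypes as short strings are all assumptions made for the example. It also assumes both datasets list the same SNPs in the same order before shuffling.

```python
import hashlib
import random


def hash_genotypes(genotypes, seed, snps_per_hash=3):
    """Hash one individual's genotypes in groups drawn in a seeded random order.

    genotypes: list of genotype strings (e.g. "AG"), one per SNP (hypothetical format).
    seed: value shared by both parties so the SNP ordering is reproducible.
    snps_per_hash: hypothetical group size; leftover SNPs are dropped.
    """
    order = list(range(len(genotypes)))
    # A dedicated Random instance keeps the shuffle independent of global state;
    # the same seed yields the same ordering on every run.
    random.Random(seed).shuffle(order)

    digests = []
    for start in range(0, len(order) - snps_per_hash + 1, snps_per_hash):
        group = order[start:start + snps_per_hash]
        payload = "|".join(genotypes[i] for i in group)
        # SHA-256 stands in here for whichever one-way hash is actually used.
        digests.append(hashlib.sha256(payload.encode("utf-8")).hexdigest())
    return digests
```

Because the shuffle depends only on the seed, two sites that agree on a seed and a SNP list produce hashes that line up position by position, while the raw genotypes never leave either site.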

Output is a list of pair-wise comparisons between individuals in the datasets being compared, showing the percent of one-way hash outputs that are identical for each pair. Pairs of individuals with a large percentage of identical hash outputs are likely to be the same individual.
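As a rough illustration of the comparison step, the sketch below scores every pair of individuals by the percent of hash positions that match. The names `percent_identical` and `compare_datasets`, and the dictionary layout mapping an individual ID to a list of hashes, are assumptions for this example and do not reflect Gencrypt's actual output format.

```python
def percent_identical(hashes_a, hashes_b):
    """Percent of positions at which two equal-length hash lists agree."""
    if len(hashes_a) != len(hashes_b):
        raise ValueError("hash lists must have the same length")
    if not hashes_a:
        return 0.0
    matches = sum(x == y for x, y in zip(hashes_a, hashes_b))
    return 100.0 * matches / len(hashes_a)


def compare_datasets(dataset_a, dataset_b):
    """Score every cross-dataset pair, highest percent first.

    dataset_a, dataset_b: dicts mapping an individual ID to its list of
    hashes (hypothetical layout). Returns (id_a, id_b, percent) tuples.
    """
    rows = [
        (id_a, id_b, percent_identical(h_a, h_b))
        for id_a, h_a in dataset_a.items()
        for id_b, h_b in dataset_b.items()
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows
```

Sorting by percent puts likely duplicates at the top of the list; where to draw the cutoff is left to the user and is not specified here.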

Gencrypt was developed by Michael Turchin in the lab of Joel Hirschhorn at Children's Hospital Boston and the Broad Institute. Gencrypt is described in the reference below, and example data is available through the link below.

Links

See https://github.com/mturchin20/gencrypt for software and example dataset information

Reference

Turchin, M.C. and Hirschhorn, J.N. 2012. Gencrypt: one-way cryptographic hashes to detect overlapping individuals across samples. Bioinformatics. 28(6): 868-8

Questions?

Have any questions? Contact j o e l h at broadinstitute.org
