

Hye-Chung Kum, PhD, Darshana Pathak, Gautam Sanka, Stanley Ahalt, PhD
University of North Carolina, Chapel Hill, NC

ABSTRACT

Ambiguous links must be manually reviewed during approximate record linkage to enable accurate data integration. This requirement would seem to make it impossible for researchers to protect patients' privacy when integrating health informatics data. To address this problem, we propose a novel decoupled data system that blocks sensitive attribute disclosure via encryption and chaffing. We also evaluate three methods (chaffing, display control for clerical review, and manipulation of the universe around the data) that can minimize identity disclosure.

Let


I_PPIRL = the category of information I in the Minimal Sharing model;
h = a person tuning the false matches manually;
α, β = respective error terms;
such that

InteractiveRL(h, α) <= I_PPIRL < Disclosure(h, β)

where InteractiveRL(h, α) is the minimum amount of information the person h needs to make decisions on linkage with high confidence, and Disclosure(h, β) is the level of information disclosed to the honest-but-curious user h. Then Privacy Preserving Interactive Record Linkage (PPIRL) is defined as the query operation PPIRL(DR, DS, I_PPIRL, h) in the minimal sharing model*, where DR and DS are the two tables to be linked, h is an honest-but-curious human in the loop making a final judgment on linkage, and I_PPIRL is the minimal information to be shared with the human h.

Approximate record linkage; human in the loop to resolve ambiguous links; threat of sensitive attribute disclosure.

Why is record linkage (RL) important? There is a constant need for record linkage to create a coherent Big Data system from data originating in heterogeneous, uncoordinated systems.

Why is record linkage challenging? Redundant and fragmented datasets are split over multiple systems. Missing and erroneous attribute values, with no unique, error-free identifiers, require approximate record linkage, which results in errors from false matches or uncertain matches [3,4].

What is Privacy Preserving Record Linkage? Identifying the records in one or more datasets that represent the same real-world entity, without compromising the privacy of the subjects involved [5,8].

What is Interactive Record Linkage? Record linkage with people tuning and managing the false matches from the approximate record linkage algorithms. We define the properly tuned output from a hybrid human-machine data integration system as high quality record linkage [7].

Decoupled Information System for Privacy Preserving Interactive Record Linkage

METHOD: SECURE DECOUPLED LINKAGE (SDLink)

A tractable computational model for privacy preserving interactive record linkage (PPIRL) focusing on protection against attribute disclosure. SDLink uses three techniques for privacy protection:
1. Strict decoupling via TPM (Trusted Platform Module) based encryption (pseudonym method).
2. Minimum information sharing during human interaction via information suppression.
3. Chaffing: adding fake data to block attribute inference from group membership.
*Minimal Sharing Model [Agrawal 2003]: Let there be two parties R (receiver) and S (sender) with databases DR and DS respectively. Given a database query Q spanning the tables in DR and DS, and some categories of information I, compute the answer to Q and return it to R without revealing any additional information to either party except for the information contained in I.

First generation: Hash-based exact match (2003) [1].
Second generation: Improve the quality of linkages by allowing approximate matches, using privacy preserving approximate string comparison operations such as Bloom filters (2009) [6].
Third generation [our model]: High quality RL using a hybrid human-machine data integration system for privacy preserving interactive record linkage (2012) [5].
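The difference between the first two generations can be illustrated with a short Python sketch. It contrasts a keyed-hash exact match with a commonly used Bloom-filter construction (character bigrams hashed into a bit array and compared with the Dice coefficient). This is an illustration under assumptions, not the method of the cited works: the function names, the filter size `m`, the number of hash functions `k`, and the shared key are all hypothetical choices.

```python
import hashlib
import hmac


def exact_token(value, key=b"shared-secret"):
    """First generation: keyed hash; only identical strings produce equal tokens."""
    return hmac.new(key, value.lower().encode(), hashlib.sha256).hexdigest()


def bigrams(value):
    """Character bigrams of a padded, lower-cased string."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_positions(value, m=1000, k=20, key=b"shared-secret"):
    """Second generation: positions set in an m-bit Bloom filter of the bigrams.

    Uses double hashing (h1 + i * h2) mod m to derive k positions per bigram.
    m, k and key are illustrative values, not recommendations.
    """
    positions = set()
    for gram in bigrams(value):
        h1 = int.from_bytes(
            hmac.new(key, b"1" + gram.encode(), hashlib.sha256).digest(), "big")
        h2 = int.from_bytes(
            hmac.new(key, b"2" + gram.encode(), hashlib.sha256).digest(), "big")
        for i in range(k):
            positions.add((h1 + i * h2) % m)
    return positions


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    return 2 * len(a & b) / (len(a) + len(b))
```

With exact hashing, "Smith" and "Smyth" yield unrelated tokens, so a typo breaks the link. With the Bloom-filter encoding, the two names share most bigrams and therefore score a higher Dice similarity than unrelated names, which is what allows approximate matching on encoded values.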

Hye-Chung Kum, PhD (
See poster at for references.
We thank Mike Reiter and Ashwin Machanavajjhala for their insightful comments, Fabian Monrose for supporting the research, and Ian SangJun Kim and Ren Bauer for their assistance with the experiment. This research was supported in part by funding from the NC Department of Health and Human Services, NIH CTSA UL1TR000083, and NSF award no. CNS0915364.

A simple but powerful data system for Privacy Preserving Interactive Record Linkage. It decouples (i.e., isolates) sensitive data (SD) from the personally identifying information (PII). It provides both error management in data integration and privacy protection, by blocking attribute disclosure and minimizing identity disclosure.

KEY INSIGHT
The innovation in decoupling data is the focus on revealing information rather than hiding it. The key is to understand the minimum information required for quality linkage, and then to design protocols that reveal, in a secure manner, only that information.
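The idea of revealing only the minimum information needed for a linkage decision can be sketched as a small display filter for clerical review. This is a hypothetical illustration, not the SDLink display: the function name `review_view`, the record layout, and the choice of revealed fields are assumptions made for the example.

```python
def review_view(record_a, record_b, revealed_fields):
    """Build the reviewer's view of a candidate pair.

    Only fields listed in revealed_fields are copied into the view, so any
    other attribute (for example a diagnosis) never reaches the reviewer.
    """
    view = []
    for field in revealed_fields:
        value_a = record_a.get(field)
        value_b = record_b.get(field)
        view.append({
            "field": field,
            "a": value_a,
            "b": value_b,
            "agree": value_a == value_b,  # quick cue for the human reviewer
        })
    return view
```

A reviewer deciding whether "Ann Lee" and "Anne Lee" with the same date of birth are the same person would see the name and date-of-birth rows only. The sensitive attributes stay out of the view by construction rather than being hidden after the fact.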

Information Suppression During Clerical Review

It is important to note that the current norm for data integration in the US is full disclosure of all information to a fully trusted human entity. For example, full disclosure of both attribute and identity to certain trusted parties is HIPAA compliant.

Identity disclosure without sensitive attribute disclosure has a little potential for harm

We evaluate three methods for information disclosure:
1. Chaffing
2. Manipulation of universe
2.1 Fabrication
2.2 Non-disclosure

FOUR STEPS IN DECOUPLING DATA
1. Split the data set into two tables: one for the identifying information (PII) and the other for the remaining, mostly sensitive, information. T = T_PII + T_SD
2. Shuffling: Randomly shuffle the rows in the PII table, T_PII.
3. Chaffing: Add fake rows of PII to T_PII.
4. Encryption: Apply asymmetric encryption to lock the row association between T_PII and T_SD.
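The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SDLink implementation: the function name `decouple`, the `pii_id` and `sd_id` row identifiers, and the field names in the usage note are hypothetical. Step 4 is left as a placeholder, because the poster does not specify the encryption scheme beyond it being asymmetric and TPM based.

```python
import random
import secrets


def decouple(rows, pii_fields, chaff_rows, rng=None):
    """Split rows into a PII table and an SD table, then shuffle and chaff.

    Returns (t_pii, t_sd, link), where link maps each real PII row id to
    its SD row id. Chaff rows have no entry in link.
    """
    rng = rng or random.Random()
    t_pii, t_sd, link = [], [], {}

    # Step 1: split each row into a PII part and a sensitive-data (SD) part,
    # each with its own random identifier.
    for row in rows:
        pii_id, sd_id = secrets.token_hex(8), secrets.token_hex(8)
        t_pii.append({"pii_id": pii_id, **{f: row[f] for f in pii_fields}})
        t_sd.append({"sd_id": sd_id,
                     **{f: v for f, v in row.items() if f not in pii_fields}})
        link[pii_id] = sd_id

    # Step 2: shuffle the PII table so row order no longer aligns with T_SD.
    rng.shuffle(t_pii)

    # Step 3: chaff, inserting fake PII rows at random positions.
    for fake in chaff_rows:
        fake_row = {"pii_id": secrets.token_hex(8), **fake}
        t_pii.insert(rng.randrange(len(t_pii) + 1), fake_row)

    # Step 4 (placeholder): in the poster's design the row association held
    # in `link` is locked with asymmetric encryption. It is returned in the
    # clear here only so the sketch stays self-contained.
    return t_pii, t_sd, link
```

Called with two patient rows, `pii_fields=["name", "dob"]`, and one fake name, the function returns a three-row PII table, a two-row SD table, and a two-entry link. Someone holding only the PII table cannot tell which names are real members of the data set, and someone holding only the SD table sees no identifiers.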

The survey results confirmed that chaffing, combined with either falsifying or not defining the universe around the data, was effective in introducing uncertainty into the information disclosed. When the universe around the data was not defined, 56% of the participants were uncertain about the identity given a common name. Even for rare names, if the list is chaffed and the universe is not defined, 66% of the participants were uncertain about the identity.

Today, nearly all of our activities from birth until death leave digital traces in large databases. Together, these digital traces capture our social genome, the footprints of our society. Like the human genome, the social genome has much buried within its massive, almost chaotic data. If properly analyzed and interpreted, this social genome could offer crucial insights into many of the most challenging problems facing our society (e.g., affordable and accessible quality healthcare). The burgeoning field of population informatics is the systematic study of populations via secondary analysis of massive data collections (termed big data) about people. In particular, health informatics analyzes electronic health records to improve health outcomes for a population.

Information suppression is essential during clerical review to avoid sensitive attribute disclosure. Furthermore, when chaffing is used in combination with non-disclosure of the universe, even rare names can be displayed with minimal risk of attribute disclosure during clerical review. Our proposed methods are effective in the presence of missing and erroneous data.