[Diagram: local data is randomized before reaching the data warehouse; a mining algorithm builds a classification model from the warehouse data]
Reconstruction Problem
[Figure: Age axis from 10 to 90 showing randomized value V, the original distribution for Age, and a probabilistic estimate of the original value of V]
Intuition (Reconstruct single point)
• Use Bayes' rule for density functions
[Figure: Age axis from 10 to 90 showing value V, the original distribution for Age, and the probabilistic estimate of the original value of V]
Reconstructing the Distribution
• Combine estimates of where point came from for all the points:
– Gives estimate of original distribution.
[Figure: reconstructed distribution for Age on [10, 90]]

f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y\big((x_i + y_i) - a\big)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y\big((x_i + y_i) - z\big)\, f_X^{j}(z)\, dz}
Reconstruction: Bootstrapping
[Figure: number of people (0–1200) vs. age (20–60) for the original, randomized, and reconstructed distributions]
Recap: Why is privacy preserved?
• Cannot reconstruct individual values accurately.
• Can only reconstruct distributions.
Classification
• Naïve Bayes
– Assumes independence between attributes.
• Decision Tree
– Correlations are weakened by randomization, not destroyed.
Decision Tree Example
[Figure: classification accuracy (50–90%) on functions Fn 1–Fn 5 for decision trees trained on original, randomized, and reconstructed data]
Accuracy vs. Randomization Level
[Figure: accuracy (40–100%) on Fn 3 vs. randomization level (10–200) for Original, Randomized, and ByClass]
Privacy Metric
• Issues
– For very high privacy, discretization will lead to a poor model
– Gaussian provides more privacy at higher confidence levels
Problem: Reconstructing the
Original Distribution
• Reconstruction Problem:
– Given F_Y and the w_i's, estimate F_X
Original Distribution
Reconstruction: Method
• The estimated density function:
f'_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}
• Given a sufficiently large number of samples, it is
expected to be close to the real density function
– However, f_X is not known
– Iteratively,

f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X^{j}(z)\, dz}
• Stopping Criterion
– χ² test between successive iterations
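The iterative procedure above can be sketched in a few lines of Python (an illustrative NumPy implementation, not the authors' code; all names are mine, and a simple L1-change stopping rule stands in for the χ² test):

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, grid, n_iter=50, tol=1e-4):
    """Iteratively estimate f_X on `grid` from perturbed values w_i = x_i + y_i,
    given the noise density f_Y (`noise_pdf`), via the Bayes update above."""
    da = grid[1] - grid[0]
    fx = np.full(grid.size, 1.0 / (grid[-1] - grid[0]))  # uniform prior f_X^0
    ker = noise_pdf(w[:, None] - grid[None, :])          # f_Y(w_i - a), shape (n, |grid|)
    for _ in range(n_iter):
        post = ker * fx[None, :]                         # f_Y(w_i - a) f_X^j(a)
        post /= post.sum(axis=1, keepdims=True) * da     # normalize each posterior
        new_fx = post.mean(axis=0)                       # average over all samples
        if np.abs(new_fx - fx).sum() * da < tol:         # stand-in for the chi^2 test
            fx = new_fx
            break
        fx = new_fx
    return fx

# Demo: true X ~ N(5, 1), additive noise Y ~ N(0, 2)
rng = np.random.default_rng(0)
w = rng.normal(5.0, 1.0, 2000) + rng.normal(0.0, 2.0, 2000)
grid = np.linspace(0.0, 10.0, 201)
gauss2 = lambda t: np.exp(-t ** 2 / 8.0) / np.sqrt(8.0 * np.pi)  # N(0, 2) pdf
fx = reconstruct_distribution(w, gauss2, grid)
```

The reconstructed density integrates to one by construction, and its mean should land near the true mean of X even though each individual w_i is heavily perturbed.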
Original Distribution
Reconstruction: Empirical Evaluation
Decision-Tree Classification
Overview
• Recursively partition the data until each partition contains
mostly examples from the same class
• Data Partitioning
– Reconstruction phase gives an estimate of the number of points
in each interval.
– Partition data into S1 and S2 where S1 contains the lower valued
data points.
Decision-Tree Classification
Using the Perturbed Data (cont.)
• When and How are the distributions
reconstructed?
– Global
• Reconstruct for each attribute once at the beginning
• Build the decision tree using the reconstructed data
– ByClass
• First split the training data
• Reconstruct for each class separately
• Build the decision tree using the reconstructed data
– Local
• First split the training data
• Reconstruct for each class separately
• Reconstruct at each node while building the tree
Experiments
Methodology
• What to compare?
– Classification accuracy of decision trees induced from
• Global, ByClass, and Local algorithms vs.
• Original and Randomized Distributions
• Synthetic Data
– 100,000 training and 5,000 test records
– All attributes are uniformly distributed
– Equally split between two classes
– Five classification functions (five datasets)
• (age < 40 and 50K ≤ salary ≤ 100K) or (40 ≤ age < 60 and 75K ≤ salary ≤ 125K) or
(age ≥ 60 and 25K ≤ salary ≤ 75K)
• Issues
– 95% confidence level
– Privacy ranges from 25% to 200%
– k (number of intervals) is selected heuristically such that
• 10 < k < 100 and the average number of points in each interval is 100
Experiments
Results
• Privacy of X: Π(X) = 2^{h(X)} = 2^1 = 2
• One can calculate h(Z) = 9/4, which gives:
  I(X;Z) = h(Z) − h(Z|X) = 9/4 − h(Y) = 9/4 − 1 = 5/4
  where we substitute h(Z|X) = h(Y) because X and Y are
  independent and Z = X + Y, and h(Y) = log₂ 2 = 1
• Privacy loss of X after revealing Z:
  P(X|Z) = 1 − 2^{−5/4} = 0.5796
• Privacy of X after revealing Z:
  Π(X|Z) = Π(X) × (1 − P(X|Z)) = 2 × (1.0 − 0.5796) = 0.8408
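The arithmetic on this slide checks out directly (a quick Python verification; h(Z) = 9/4 is taken from the slide as given):

```python
import math

h_Y = math.log2(2)         # Y uniform on an interval of length 2: h(Y) = 1
Pi_X = 2 ** 1              # Pi(X) = 2^h(X) = 2^1 = 2
I_XZ = 9 / 4 - h_Y         # I(X;Z) = h(Z) - h(Z|X) = h(Z) - h(Y) = 5/4

P_loss = 1 - 2 ** (-I_XZ)              # privacy loss P(X|Z)
Pi_X_given_Z = Pi_X * (1 - P_loss)     # privacy of X after revealing Z

print(round(P_loss, 4))        # 0.5796
print(round(Pi_X_given_Z, 4))  # 0.8409 (the slide truncates to 0.8408)
```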
Quantification of information
loss
• Given perturbed values z1, …, zn, it is (in general)
not possible to reconstruct the original density
function fX(x)
• The lack of precision in estimating fX(x) is the
information loss
• Let \hat{f}_X(x) denote the estimated density function
• Proposed metric for information loss:

I(f_X, \hat{f}_X) = E\left[ \frac{1}{2} \int_{\Omega_X} \big| f_X(x) - \hat{f}_X(x) \big| \, dx \right]

– I(f_X, \hat{f}_X) = 0: perfect reconstruction
– I(f_X, \hat{f}_X) = 1: no correspondence
Illustration
• I(f_X, \hat{f}_X) is equal to half the sum of the areas A, B, C and D
• It is equal to 1 − α, where α is the area shared by both distributions
[Figure: two overlapping density curves; shared area α, non-overlapping regions A–D]
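The metric and the 1 − α identity can be illustrated numerically with two overlapping uniform densities (an illustrative sketch; the grid and the two densities are made up):

```python
import numpy as np

def information_loss(f, f_hat, dx):
    # I(f_X, f_hat_X) = 1/2 * integral |f_X - f_hat_X| dx  (0 = perfect, 1 = disjoint)
    return 0.5 * np.sum(np.abs(f - f_hat)) * dx

x = np.linspace(0.0, 2.0, 2001)
dx = x[1] - x[0]
f = ((x >= 0.0) & (x < 1.0)).astype(float)      # uniform density on [0, 1)
f_hat = ((x >= 0.5) & (x < 1.5)).astype(float)  # uniform density on [0.5, 1.5)

loss = information_loss(f, f_hat, dx)
alpha = np.sum(np.minimum(f, f_hat)) * dx       # area shared by both densities
print(round(loss, 6), round(alpha, 6))          # both are 0.5, so loss == 1 - alpha
```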
Expectation Maximization
algorithm
• Problem definition: given the perturbed values z = (z_1, …, z_N), estimate Θ = (θ_1, …, θ_K), the density heights over the K intervals Ω_1, …, Ω_K (interval i has width m_i)
1. k = 0; initialize Θ^0
2. Update Θ as follows: θ_i^{(k+1)} = ψ_i(z; Θ^k) / (m_i N)
3. k = k + 1
4. If not termination-criterion, goto 2
• Termination criterion is based on how
much Θ^k has changed since the last
iteration
Theorems
• 4.1: The value of Q(Θ, Θ̂) during the E-step is:

Q(\Theta, \hat{\Theta}) = \sum_{i=1}^{K} \psi_i(z; \hat{\Theta}) \ln \theta_i, where

\psi_i(z; \hat{\Theta}) = \hat{\theta}_i \sum_{j=1}^{N} \frac{\Pr(Y \in z_j - \Omega_i)}{f_{Z; \hat{\Theta}}(z_j)}, and v \in z_j - \Omega_i iff z_j - v \in \Omega_i
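The E- and M-steps above can be sketched as a short discretized loop (illustrative Python; uniform noise is assumed for the demo, and all names are mine):

```python
import numpy as np

def em_reconstruct(z, edges, noise_cdf, n_iter=100):
    """EM sketch: theta_i is the density height on Omega_i = [edges[i], edges[i+1])."""
    widths = np.diff(edges)
    theta = np.full(edges.size - 1, 1.0 / (edges[-1] - edges[0]))  # uniform Theta^0
    # p[j, i] = Pr(Y in z_j - Omega_i) = F_Y(z_j - left_i) - F_Y(z_j - right_i)
    p = noise_cdf(z[:, None] - edges[None, :-1]) - noise_cdf(z[:, None] - edges[None, 1:])
    for _ in range(n_iter):
        fz = np.maximum(p @ theta, 1e-300)                    # f_Z(z_j) under Theta^k
        psi = (p * theta[None, :] / fz[:, None]).sum(axis=0)  # E-step: psi_i(z; Theta^k)
        theta = psi / (widths * z.size)                       # M-step: psi_i / (m_i N)
    return theta

# Demo: X has mass 0.7 on [0, 1) and 0.3 on [1, 2); noise Y ~ Uniform(-0.5, 0.5)
rng = np.random.default_rng(1)
x = np.where(rng.random(5000) < 0.7, rng.random(5000), 1.0 + rng.random(5000))
z = x + rng.uniform(-0.5, 0.5, 5000)
uniform_cdf = lambda t: np.clip(t + 0.5, 0.0, 1.0)
theta = em_reconstruct(z, np.array([0.0, 1.0, 2.0]), uniform_cdf)
```

By construction Σ θ_i m_i = Σ ψ_i / N = 1, and with enough samples the interval masses approach the true 0.7 / 0.3 split.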
[Figures: privacy loss and information loss for the original–perturbing distribution pairs U-G, G-G, G-U, U-U]
Information loss vs. privacy
loss
• Combines the two plots from the last slide
• OBS: The best perturbing distribution is sensitive to the original distribution
[Figure: information loss vs. privacy loss for the pairs U-G, G-G, G-U, U-U]
Information loss vs. data
points
• Gaussian distributions: standard deviation of 2/πe for the original, 0.8 for the perturbing distribution
• OBS: Privacy loss is independent of the number of data points
[Figure: information loss vs. number of data points (×10⁴)]
Talk Outline
• Motivation
• Association Rules
– A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke, “Privacy
Preserving Mining of Association Rules”, KDD 2002.
• Open Problems
Associations
[Diagram: users' interest lists — "B. Spears, soccer, bbc.co.uk", "B. Spears, baseball, cnn.com", "B. Marley, baseball", "B. Marley, camping" — are mined for associations to generate recommendations for users such as Bob and Chris]
Uniform Randomization
• Given a transaction,
– keep item with 20% probability,
– replace with a new random item with 80% probability.
Is there a problem?
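A minimal sketch of this randomization (illustrative Python; the item universe and all names are assumptions):

```python
import random

def uniform_randomize(transaction, items, p_keep=0.2, rng=None):
    """Keep each item with probability p_keep; otherwise replace it
    with a uniformly random *different* item from the universe."""
    rng = rng or random.Random()
    out = []
    for item in transaction:
        if rng.random() < p_keep:
            out.append(item)                    # keep with 20% probability
        else:
            replacement = rng.choice(items)
            while replacement == item:          # must be a new random item
                replacement = rng.choice(items)
            out.append(replacement)
    return out

items = list(range(1000))
print(uniform_randomize([3, 17, 42], items, rng=random.Random(0)))
```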
Example: {x, y, z}
• 10M transactions of size 3 over 1000 items:
– 100,000 (1%) contain {x, y, z}; each keeps all three items with probability 0.2³ = .008
– 9,900,000 (99%) contain zero items from {x, y, z}; each becomes {x, y, z} with probability 6 × (0.8/999)³ ≈ 3 × 10⁻⁹
• A transaction t = {a, b, c, u, v, w, x, y, z} may be randomized to t′ = {b, v, x, z, …}, keeping j = 4 of its items and adding random new items (œ, å, ß, ξ, ψ, €, א, ъ, ђ, …)
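The two per-transaction probabilities, and the expected counts they imply, can be checked directly:

```python
p_true = 0.2 ** 3                # a {x, y, z} transaction keeps all three items
p_false = 6 * (0.8 / 999) ** 3   # all 3 replacements land on x, y, z (3! orders)

true_hits = 100_000 * p_true       # expected genuine {x, y, z} transactions
false_hits = 9_900_000 * p_false   # expected accidental {x, y, z} transactions

print(round(p_true, 3), round(true_hits))             # 0.008 800
print(round(p_false * 1e9, 2), round(false_hits, 2))  # 3.08 0.03
```

Note that a transaction already sharing two of the three items needs only one lucky replacement (on the order of 0.2² × 0.8/999 per arrangement), many orders of magnitude more likely than 3 × 10⁻⁹, so partial overlaps are the dominant source of false hits.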
Partial Supports
To recover the original support of an itemset, we need the randomized
supports of its subsets.
• Given an itemset A of size k and transaction size m,
• a vector of partial supports of A is

\vec{s} = (s_0, s_1, \ldots, s_k), where s_l = \frac{1}{|T|} \cdot \#\{\, t \in T \mid \#(t \cap A) = l \,\}

where D[l]_{i,j} = P_{i,l} \cdot \delta_{i=j} - P_{i,l} \cdot P_{j,l}

\Pr[z \in t \mid A \subseteq t'] = \frac{\sum_{l=0}^{k} s_l^{+} \cdot P_{k,l}}{\sum_{l=0}^{k} s_l \cdot P_{k,l}}
Computing Candidate Sets
[Diagram: three sites pass commutatively encrypted candidate itemsets around a ring — site 1 holds E2(E3(ABC)), E2(E3(ABD)), and E1(ABC); site 2 holds E2(E3(E1(ABC))); site 3 holds E3(E1(ABC)); the local candidates at the sites are ABC and ABD]
Compute Which Candidates
Are Globally Supported?
• Goal: To check whether

X.\mathit{sup} \ge s \cdot \sum_{i=1}^{n} |DB_i|    (1)

\sum_{i=1}^{n} X.\mathit{sup}_i \ge \sum_{i=1}^{n} s \cdot |DB_i|    (2)

\sum_{i=1}^{n} \big( X.\mathit{sup}_i - s \cdot |DB_i| \big) \ge 0    (3)

Note that checking inequality (1) is equivalent to
checking inequality (3)
Which Candidates Are Globally
Supported? (Continued)
• Now securely compute Sum ≥ 0:
• Site 0 generates random R and
sends R + count_0 − frequency × dbsize_0 to site 1
• Site k adds count_k − frequency × dbsize_k and sends the total to
site k+1
• Final result: is the sum at site n minus R ≥ 0?
• Use Secure Two-Party Computation for this comparison
• This protocol is secure in the semi-honest model
Computing Frequent:
Is ABC ≥ 5%?
• Site 3 (ABC = 5, DBSize = 100) generates R = 17 and sends
R + count − freq × DBSize = 17 + 5 − 0.05 × 100 = 17 to Site 2
• Site 2 (ABC = 9, DBSize = 200) sends 17 + 9 − 0.05 × 200 = 16 to Site 1
• Site 1 (ABC = 18, DBSize = 300) computes 16 + 18 − 0.05 × 300 = 19
• Secure comparison: 19 ≥ R = 17? YES → ABC is globally frequent
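The ring protocol can be simulated end-to-end (plain Python sketch; the final comparison is done in the clear here, whereas the real protocol uses secure two-party computation for it):

```python
import random

def is_globally_frequent(counts, dbsizes, freq, rng=None):
    """Ring-based secure sum: the first site adds a random R to its
    excess support, each later site adds its own excess, and the
    final 'total - R >= 0' check corresponds to inequality (3)."""
    rng = rng or random.Random()
    R = rng.randrange(10 ** 6)
    total = R
    for count, dbsize in zip(counts, dbsizes):
        total += count - freq * dbsize   # each site adds its excess support
    return total - R >= 0                # secure comparison step (in the clear here)

# Worked example from the slide: ABC counts 5, 9, 18 at the three sites
print(is_globally_frequent([5, 9, 18], [100, 200, 300], 0.05))  # True
```

With the slide's numbers the excesses are 0, −1, and +3, so the total exceeds R by 2 regardless of which R is drawn.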
\frac{\{X \cup Y\}.\mathit{sup}}{X.\mathit{sup}} \ge c \;\Rightarrow\; \frac{\sum_{i=1}^{n} XY.\mathit{sup}_i}{\sum_{i=1}^{n} X.\mathit{sup}_i} \ge c \;\Rightarrow\; \sum_{i=1}^{n} \big( XY.\mathit{sup}_i - c \cdot X.\mathit{sup}_i \big) \ge 0
i =1
Association Rules in
Vertically Partitioned Data
• Two parties – Alice (A) and Bob (B)
• Same set of entities (data cleansing, join
assumed done)
• A has p attributes, A1 … Ap
• B has q attributes, B1 … Bq
• Total number of transactions, n
• Support Threshold, k
[Diagram: one entity (JSV) has its attributes split across two databases — medical (Brain Tumor, Diabetic) and cell phone (5210, Li/Ion, Piezo)]
Vertically Partitioned Data
(Vaidya and Clifton ’02)
• Learn globally valid association rules
• Prevent disclosure of individual
relationships
– Join key revealed
– Universe of attribute values revealed
• Many real-world examples
– Ford / Firestone
– FBI / IRS
– Medical records
Basic idea
• Find out if itemset {A1, B1} is frequent (i.e., if support of {A1, B1} ≥ k)

  A:  Key  A1        B:  Key  B1
      k1   1             k1   0
      k2   0             k2   1
      k3   0             k3   0
      k4   1             k4   1
      k5   1             k5   1

\mathit{Support} = \sum_{i=1}^{n} A_i \times B_i
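With the table above, the support computation is just a scalar product (quick Python check):

```python
A1 = [1, 0, 0, 1, 1]  # Alice's column over keys k1..k5
B1 = [0, 1, 0, 1, 1]  # Bob's column over the same keys

# Support of {A1, B1} counts the keys where both attributes are 1.
support = sum(a * b for a, b in zip(A1, B1))
print(support)  # 2: only k4 and k5 have both attributes set
```

The challenge the next slides address is computing this scalar product without either party revealing its column.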
• A sends to B the n masked values

x_i + a_{i,1} R_1 + a_{i,2} R_2 + \cdots + a_{i,n/2} R_{n/2}, \quad i = 1, \ldots, n

• B multiplies the i-th value by y_i and sums. With S = \sum_{i=1}^{n} x_i y_i, the sum expands to

S + a_{1,1} y_1 R_1 + a_{1,2} y_1 R_2 + \cdots + a_{1,n/2} y_1 R_{n/2}
  + a_{2,1} y_2 R_1 + a_{2,2} y_2 R_2 + \cdots + a_{2,n/2} y_2 R_{n/2}
  + \cdots
  + a_{n,1} y_n R_1 + a_{n,2} y_n R_2 + \cdots + a_{n,n/2} y_n R_{n/2}

• Grouping components vertically and factoring out the R_j gives the form on the next slide.
Protocol (complete)

S = \sum_{i=1}^{n} x_i y_i
    + R_1 \big( a_{1,1} y_1 + a_{2,1} y_2 + \cdots + a_{n,1} y_n \big)
    + R_2 \big( a_{1,2} y_1 + a_{2,2} y_2 + \cdots + a_{n,2} y_n \big)
    + \cdots
    + R_{n/2} \big( a_{1,n/2} y_1 + a_{2,n/2} y_2 + \cdots + a_{n,n/2} y_n \big)
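The algebra above can be checked with a small simulation (an illustrative sketch, not the full protocol: here the coefficient matrix a is treated as public and B honestly returns the column sums Σ_i a_{i,j} y_i along with the masked total):

```python
import numpy as np

def masked_scalar_product(x, y, seed=0):
    """Alice holds x, Bob holds y (n even). Alice hides x behind n/2
    random values R_j spread through a coefficient matrix a."""
    rng = np.random.default_rng(seed)
    n = x.size
    a = rng.integers(-5, 6, size=(n, n // 2)).astype(float)  # known coefficients
    R = rng.integers(1, 100, size=n // 2).astype(float)      # Alice's secrets
    sent = x + a @ R                 # Alice -> Bob: masked vector
    s_masked = sent @ y              # Bob: sum_i (x_i + sum_j a_ij R_j) y_i
    col_sums = a.T @ y               # Bob: sum_i a_ij y_i for each j
    return s_masked - R @ col_sums   # Alice unmasks: recovers x . y

x = np.array([1, 0, 0, 1, 1, 0], dtype=float)  # Alice's attribute vector
y = np.array([0, 1, 0, 1, 1, 1], dtype=float)  # Bob's attribute vector
print(masked_scalar_product(x, y))  # 2.0, the true scalar product
```

Subtracting Σ_j R_j (Σ_i a_{i,j} y_i) strips off exactly the masking terms in the equation above, leaving S = Σ x_i y_i.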
Estimation
Error

\int \big( c - L(a) \big)^2 \, P(a, c) \, da \, dc

• L^*: "best possible" classifier
• L_n: classifier learned from the sample
• L_c: best classifier from those that can be described by the given
classification model (e.g., decision tree, neural network)
Goal: determine the sample size so that the expected error is sufficiently large
regardless of the technique used to learn the classifier.
Sample size needed to classify
when zero approximation error
• Let C be a class of discrimination functions with
VC dimension V ≥ 2. Let X be the set of all
random variables (X, Y) for which L_C = 0.
For δ ≤ 1/10 and ε < 1/2,

N(\varepsilon, \delta) \ge \frac{V - 1}{12\,\varepsilon}

Intuition: This is difficult, because we must have a
lot of possible classifiers for one to be perfect.
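Plugging illustrative numbers into the bound (the values V = 10 and ε = 0.1 are mine, chosen only to show the scale):

```python
def n_lower_bound(V, eps):
    # N(eps, delta) >= (V - 1) / (12 * eps), valid for delta <= 1/10, eps < 1/2
    return (V - 1) / (12 * eps)

print(round(n_lower_bound(10, 0.1), 2))    # 7.5 -> at least 8 samples needed
print(round(n_lower_bound(100, 0.01), 2))  # 825.0
```

The bound grows linearly in the VC dimension and inversely in the error tolerance, which matches the intuition stated above.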
Sample size needed to classify
when no perfect classifier exists
• Let C be a class of discrimination functions with VC dimension V ≥
2. Let X be the set of all random variables (X, Y) for which, for fixed L
∈ (0, ½),

L = \inf_{g \in C} \Pr\{\, g(X) \ne Y \,\}.

Then for every discrimination rule g_n based on X_1, Y_1, ..., X_n, Y_n,

N(\varepsilon, \delta) \ge \frac{L\,(V - 1)\,e^{-10}}{32} \times \min\!\Big( \frac{1}{\delta^2}, \frac{1}{\varepsilon^2} \Big)

and also, for ε ≤ L ≤ ¼,

N(\varepsilon, \delta) \ge \frac{L}{4\,\varepsilon^2} \log \frac{1}{4\,\delta}

Intuition: If the space of classifiers is small, getting the best one is easy
(but it isn't likely to be very good).
Summary
• Privacy and Security Constraints can be
impediments to data mining
– Problems with access to data
– Restrictions on sharing
– Limitations on use of results
• Technical solutions possible
– Randomizing / swapping data doesn’t prevent
learning good models
– We don’t need to share data to learn global results
– References to more solutions in the tutorial notes
• Still lots of work to do!
Distributed Data Mining:
What next?
• Research/leverage key related background/supporting
technologies
– Parallel data mining
– Distributed query processing
• Identify specific challenge problem/goals
– What type of data mining to address first
– Likely parameters for a meaningful solution
– Success measures
• Initial focuses
– Clustering: Appears likely to be tractable
– Association rules: Mathematically “clean” problem
– Anomaly detection: Obvious cases where global results ≠ sum
of local results
Research Approach
• Define a challenge problem:
– Central data mining task (e.g., clustering)
– Constraints on data (e.g., must not release any
specific entities from individual sites)
• Develop algorithms that solve problem
– Within bounded approximation of central approach
– For the specific problem, or a class of problems with
varying constraints
• Develop techniques that enable easy solution of
new tasks/problems as they are defined
Current Challenge Problems
• Association Rules in Vertically Partitioned Data
– Data about each entity split between sites
– Early result: algebraic method for non-binary data (with
Jaideep Vaidya, CS)
• Association Rules in Horizontally Partitioned Data
– Each site contains complete information about some entities.
– Early result: Can not only avoid sharing data, but also which
rules are supported by which site (with Murat Kantarcioglu, CS)
• Distributed Clustering
– Share cluster centers, not data?
– Distributed EM Clustering (with Xiaodong Lin, Statistics)
What We’re Doing Now
• Define a challenge problem:
– Central data mining task (e.g., clustering)
– Constraints on data (e.g., must not release entities from individual sites)
• Develop algorithms that solve problem
– Within bounded approximation of central approach
– For the specific problem, or a class of problems with varying constraints
See Posters of work by Murat Kantarcioglu and Jaideep Vaidya
www.almaden.ibm.com/u/srikant/talks.html