Abstract It has become a common practice to provide public access to primary nonaggregated statistical data. Necessary precautions need to be taken to guarantee that
sensitive data features are masked, and data privacy cannot be violated.
In the case of providing group anonymity in a given dataset, i.e. protecting information about a group of people, it is important to protect intrinsic data features and
distributions. To solve this task, it is expedient to modify the dataset by introducing
a certain level of distortion, which can be done using different approaches. Once a
particular modification approach is chosen, the goal is to introduce the minimal amount
of distortion under that approach.
In this work, we show that the task of providing group anonymity can be reduced to
the minimum cost network flow problem, where the network represents a model of
data modification, subject to fuzzy restrictions that encompass the chosen modification approach. The problem of determining such fuzzy restrictions is an ill-defined
one, and often can be solved only by expert evaluation.
The flow with a minimum cost can be found by well-known algorithms of polynomial complexity. However, the general task of choosing the modification approach
to provide group anonymity at the minimum cost possible is deemed to be of high
computational complexity. We propose to solve this task using an appropriately tailored
memetic algorithm.
Oleg Chertov
National Technical University of Ukraine Kyiv Polytechnic Institute, 37 Peremohy Prospekt,
03056 Kyiv, Ukraine e-mail: chertov@i.ua
Dan Tavrov
National Technical University of Ukraine Kyiv Polytechnic Institute, 37 Peremohy Prospekt,
03056 Kyiv, Ukraine e-mail: dan.tavrov@i.ua
1 Introduction
Man is a social animal [1]. Most of our everyday actions depend on or are based
on how people treat us, especially those closest to us (or whose opinion is the
most relevant to us). Such people constitute what may be called a close circle.
At the same time, a person may not always be eager to disclose information about
members of her close circle. This reluctance may originate either from subjective
views, or from the nature of the circle (religious identity, professional community,
income level, LGBT community, radical group etc.).
What we face here is a problem of concealing membership of a given respondent in a certain group. This problem can be formulated as the task of masking
certain characteristics of a given respondent [2, 3, 4]. This task is usually called the
task of providing individual anonymity, where anonymity means [5] the property of
a subject to be unidentifiable within a set of other subjects. It is possible to set a
complementary task of providing group anonymity, where we need to conceal information not about a single respondent, but about a group of respondents (e.g., we
need to mask regional, age, or other kinds of distributions of a certain group).
In Sect. 2, we set the stage by discussing the recent advances in the field of
providing group anonymity. We also discuss different ways to apply evolutionary
computing methods to solving the task of providing group anonymity with minimal
data distortion.
In Sect. 3, we provide the formal definition of the task of providing group
anonymity, introduce appropriate notation, and discuss how this task can be reduced
to the minimum cost flow network optimization problem with fuzzy restrictions on
the network architecture.
In Sect. 4, we describe the memetic algorithm for solving the task of providing
group anonymity. In Sect. 5, we introduce a novel idea of applying this memetic
algorithm in two phases.
In Sect. 6, we illustrate the application of the proposed algorithm to a real-data-based
task of protecting the regional distribution of military personnel. In Sect. 7,
we draw conclusions and outline directions for further research.
2 Related Work
The procedure for providing data group anonymity should meet the following conditions [6, p. 399]:
1. Disclosure risk is low or at least adequate to protected information importance.
2. Both original and protected data, when analyzed, yield close, or even equal results.
3. The cost of transforming the data is acceptable.
Approaches for solving the task of providing group anonymity (TPGA) can be
divided into two groups [7]: those that obtain the needed solution in two stages,
and those that achieve the same goal in a single stage.
All two-stage methods imply carrying out the following two steps:
1. Define the modified distribution that masks important aggregate characteristics
of a certain group of respondents.
2. Define the procedure for modifying the primary data so that they correspond to
the new distribution.
In many cases, the results of the first stage have to comply with the requirement
of preserving certain properties of the distribution being modified. To this end, we
need to apply specific methods. For instance:
if we need to preserve core statistical characteristics of the distribution, such as
its mean value and standard deviation, we can use the normalization method [8];
if we need to preserve high-frequency features of the distribution, we can use the
wavelet transforms method [9];
if we need to preserve the cyclic and periodic components of the distribution, we
can use the singular spectrum analysis method [10].
How well a particular modified distribution masks the sensitive data features
can typically be assessed only subjectively by an expert, and such an assessment
is usually imprecise.
After having obtained the modified distribution, we need to modify the data to
adapt them to this distribution. Typically, we want to introduce as little distortion
as possible. In this work, we show that modifying the data with minimal distortion
can be viewed as a well-known minimum cost flow problem in network flow optimization [11, p. 296]. Moreover, there exists a unique correspondence between the
modified distribution and the architecture of the network. In the last decades, many
algorithms of polynomial and pseudopolynomial complexity have been proposed
for solving this task.
In the absence of external restrictions, we can obtain modified distributions that
are equivalent in terms of masking data features. Consider the case of
masking the regional distribution of military personnel. Extreme values in this distribution might correspond to the sites of classified military facilities. To provide
group anonymity in this case, we need to mask these extreme values, which can be
done using at least two completely different approaches [12, p. 20]:
the extremum transition approach tells us to modify the distribution in such a way
that completely new extreme values emerge, which are different from the initial
ones;
the Ali Baba's wife approach implies that we do not eliminate some or all of the
extreme values but rather add new alleged ones, so that it becomes impossible to
discriminate between the real and made-up extremums.
For instance, under the Ali Baba's wife approach, we can create new extreme values
for any region without any specific considerations, as long as the old ones become
features like high frequency [9] or periodic [10] components, as mentioned above.
It is therefore possible to limit search space by using a special form of individual
representation in the memetic algorithm [22]. In general, however, there are no specific requirements for the solution to meet, and using penalty functions seems to be
the most appropriate alternative.
In general, it is not clear what fuzzy restrictions should be included in the problem definition, as different restrictions guide evolution in different directions, not
necessarily leading to obtaining optimal solutions. If the restrictions are too severe
(we know the modified distribution to a high precision), the level of distortion introduced into the dataset might become too high. On the other hand, if the restrictions
are too mild, the modified distribution may not mask sensitive data features at all.
Since it is not always possible to define appropriate fuzzy restrictions at the outset
of solving the TPGA, we will discuss the two-phase memetic algorithm for solving
the TPGA, which possesses the following characteristics:
1. In the first phase, we define preliminary fuzzy restrictions, and obtain the approximate solutions.
2. In the second phase, we analyze the results obtained after the first phase, adapt
the fuzzy restrictions accordingly, and obtain the refined solutions.
3 Theoretic Background
3.1 The Generic Scheme of Providing Group Anonymity
Let the data for anonymizing be gathered in a depersonalized microfile $M$. Each
record $r_i$, $i = \overline{1, \rho}$, of this microfile contains values of attributes $w_j$, $j = \overline{1, \eta}$. The set
of all the values of $w_j$ is denoted by $\nabla w_j$.
Let $w_{v_j}$, $j = \overline{1, l}$, denote vital microfile attributes. Then, a vital value combination
$V$ can be defined as an element of the Cartesian product $\nabla w_{v_1} \times \nabla w_{v_2} \times \ldots \times \nabla w_{v_l}$. We
will denote a set of vital value combinations by $\mathbf{V} = \{V_1, \ldots, V_{l_v}\}$. We call microfile
records whose attribute values belong to $\mathbf{V}$ vital records.
Let $w_p$ denote a parameter microfile attribute. Then, a parameter value $P$ can be
defined as a value of this attribute, $P \in \nabla w_p$. We will denote a set of parameter values
by $\mathbf{P} = \{P_1, \ldots, P_{l_p}\}$. Parameter values can be used to divide the microfile $M$ into
submicrofiles $M_1, \ldots, M_{l_p}$. Each submicrofile $M_k$ has exactly $\rho_k$ records, $k = \overline{1, l_p}$,
$\sum_k \rho_k = \rho$.
Let us denote by G (V, P) a group of respondents whose distribution needs to
be protected when providing group anonymity. The group is thereby determined by
appropriately defined values of parameter and vital microfile attributes.
The task of providing group anonymity [7] lies in modifying
$M$ in order to mask sensitive data features. The generic scheme of providing group
anonymity according to the single-stage approach to solving the TPGA goes as follows:
$$\mathrm{InfM}(r, r^*) = \sum_{p=1}^{n_{ord}} \omega_p \left( \frac{r(I_p) - r^*(I_p)}{r(I_p) + r^*(I_p)} \right)^2 + \sum_{k=1}^{n_{cat}} \gamma_k \chi^2\left(r(J_k), r^*(J_k)\right), \qquad (1)$$
where $I_p$ ($J_k$) stands for the $p$th ordinal ($k$th categorical) influential attribute (an attribute whose distribution over parameter values is of interest for researchers), $r(\cdot)$
stands for the operator returning the attribute value of the record $r$, $\chi(v_1, v_2)$ stands
for the operator, which is equal to a certain number $\chi_1$ if its arguments belong to
the same category, and $\chi_2$ otherwise, and $\omega_p$ and $\gamma_k$ are nonnegative weights (the more
important the attribute, the greater the weight).
A properly defined modifying algorithm has to modify the quantity signal to
mask sensitive data features, and at the same time minimize the distortion introduced
into the microfile in terms of (1).
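As a minimal sketch of how the metric (1) could be evaluated for a pair of records, the following Python function follows the definition term by term. The function name, the representation of records as tuples, the default equality test used for "same category", and the values chosen for the weights and for chi1 and chi2 in the usage lines are all assumptions made for illustration; they are not taken from the source.

```python
from typing import Callable, Sequence


def influential_metric(
    r: Sequence,
    r_star: Sequence,
    ordinal: Sequence[int],
    categorical: Sequence[int],
    omega: Sequence[float],
    gamma: Sequence[float],
    chi1: float,
    chi2: float,
    same_category: Callable[[object, object], bool] = lambda a, b: a == b,
) -> float:
    """Evaluate metric (1) for two records given as sequences of attribute values.

    ordinal / categorical hold the positions of the influential attributes,
    omega / gamma hold the matching nonnegative weights.
    """
    total = 0.0
    # Ordinal part: weighted squared relative difference.
    for weight, p in zip(omega, ordinal):
        a, b = r[p], r_star[p]
        if a + b != 0:  # assumption: a 0/0 term contributes nothing
            total += weight * ((a - b) / (a + b)) ** 2
    # Categorical part: chi1 if the values share a category, chi2 otherwise.
    for weight, k in zip(gamma, categorical):
        chi = chi1 if same_category(r[k], r_star[k]) else chi2
        total += weight * chi ** 2
    return total


# Hypothetical records: (age, income, sex, marital status).
r = (30, 50000, "M", "married")
r_star = (50, 50000, "F", "married")
value = influential_metric(
    r, r_star,
    ordinal=[0, 1], categorical=[2, 3],
    omega=[1.0, 1.0], gamma=[1.0, 1.0],
    chi1=0.0, chi2=1.0,  # illustrative values only
)
print(value)
```

With these illustrative settings, the age difference contributes (20/80)^2 and the differing categorical value contributes one unit, so the two parts of (1) can be inspected separately.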
$$\begin{aligned} R(X_1) &: X_1 \text{ is } A_1, \\ R(X_2 \mid u_1) &: X_2 \text{ is } A_2, \\ &\ldots \\ R(X_n \mid u_1, \ldots, u_{n-1}) &: X_n \text{ is } A_n, \end{aligned} \qquad (7)$$
where $R(X_i \mid u_1, \ldots, u_{i-1})$ is a fuzzy restriction conditioned on the values $u_1, \ldots, u_{i-1}$,
and $A_i$ is the fuzzy subset of $U_i$ with the membership function $\mu_{A_i \mid u_1, \ldots, u_{i-1}}(u_i)$,
$i = \overline{1, n}$, which depends on the values $u_1, \ldots, u_{i-1}$. The compatibility degree of the
value $u = (u_1, \ldots, u_n)$ of the variable $X$ with the fuzzy restriction $R(X)$ in this case
equals
$$\mu(u_1, \ldots, u_n) = \mu_{A_1}(u_1) \wedge \ldots \wedge \mu_{A_n \mid u_1, \ldots, u_{n-1}}(u_n), \qquad (8)$$
where $\wedge$ is the fuzzy intersection.
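The compatibility degree of a value with a chain of conditioned fuzzy restrictions can be sketched in a few lines of Python. This sketch assumes that the fuzzy intersection is the standard minimum operator; the source does not fix a particular operator here, and the two membership functions below are hypothetical examples.

```python
from typing import Callable, Sequence

# A conditioned membership function receives u_i and the preceding values u_1..u_{i-1}.
Membership = Callable[[float, Sequence[float]], float]


def compatibility(u: Sequence[float], memberships: Sequence[Membership]) -> float:
    """Degree of compatibility of u with the restrictions, using min as intersection."""
    degrees = [mu(u[i], u[:i]) for i, mu in enumerate(memberships)]
    return min(degrees)


def mu_a1(x: float, previous: Sequence[float]) -> float:
    # Hypothetical "small value" restriction: 1 up to 10, falling linearly to 0 at 20.
    return max(0.0, min(1.0, (20.0 - x) / 10.0))


def mu_a2(x: float, previous: Sequence[float]) -> float:
    # Hypothetical conditioned restriction: fully compatible if not below u_1.
    return 1.0 if x >= previous[0] else 0.5


degree = compatibility((12.0, 15.0), [mu_a1, mu_a2])
print(degree)
```

Replacing `min` with another t-norm (for example, the product) changes only the last line of `compatibility`.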
$$R(X_{l_p} \mid q^*_1, \ldots, q^*_{l_p - 1}) : X_{l_p} \text{ is } A_{l_p}, \qquad (10)$$
which also guarantee that (9) is satisfied.
In practice, to mask sensitive features of the signal q, and also to minimally
constrain the search space for the algorithm A, it is sufficient to impose restrictions
only on a certain proper subset of atomic variables. In this work, we will additionally
assume without loss of generality that the variables in this proper subset are
mutually noninteractive.
Let us denote by $J = (i_1, \ldots, i_k)$ the index sequence, which is a proper subsequence of $I = (1, \ldots, l_p)$. Let us denote by $\bar{J}$ the complement
of $J$ to $I$. Also, let us
denote by $X_J$ the $k$-ary variable $X_J = (X_{i_1}, \ldots, X_{i_k})$. We can then rewrite (10) as
$$R(X_{\bar{J}} \mid q^*_J) : \sum_{i \in \bar{J}} q^*_i = Q_0 - \sum_{j \in J} q^*_j. \qquad (11)$$
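$$\text{minimize} \quad \sum_{(i,j) \in A} c_{ij} x_{ij},$$
subject to:
$$\sum_{j \mid (i,j) \in A} x_{ij} - \sum_{j \mid (j,i) \in A} x_{ji} = b(i) \quad \forall i \in N,$$
$$0 \le x_{ij} \le u_{ij} \quad \forall (i,j) \in A, \qquad (12)$$
$$\sum_{i=1}^{|N|} b(i) = 0.$$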
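To make the minimum cost flow formulation concrete, the following Python sketch solves a small instance with the successive shortest path method (Bellman-Ford on the residual network). This is one textbook way to solve such problems and is not claimed to be the algorithm used by the authors. The function name, the node labels, and the arc costs in the toy network are hypothetical; the sketch assumes integer supplies that sum to zero, nonnegative arc costs, and no parallel arcs.

```python
from typing import Dict, List, Tuple

Arc = Tuple[str, str, int, float]  # (tail i, head j, capacity u_ij, cost c_ij)


def min_cost_flow(supplies: Dict[str, int], arcs: List[Arc]):
    """Return (total cost, flow per arc) for supplies b(i) > 0 and demands b(i) < 0."""
    # Residual edges are stored in pairs: forward at an even index, reverse right after it.
    edges = []
    for i, j, cap, cost in arcs:
        edges.append([i, j, cap, cost])
        edges.append([j, i, 0, -cost])
    source, sink = "_source", "_sink"
    need = 0
    for node, b in supplies.items():
        if b > 0:
            edges.append([source, node, b, 0.0])
            edges.append([node, source, 0, 0.0])
            need += b
        elif b < 0:
            edges.append([node, sink, -b, 0.0])
            edges.append([sink, node, 0, 0.0])
    nodes = {source, sink} | {e[0] for e in edges} | {e[1] for e in edges}
    sent = 0
    while sent < need:
        # Bellman-Ford shortest path from the super source in the residual network.
        dist = {v: float("inf") for v in nodes}
        pred = {}
        dist[source] = 0.0
        for _ in range(len(nodes) - 1):
            changed = False
            for idx, (a, b, cap, cost) in enumerate(edges):
                if cap > 0 and dist[a] + cost < dist[b]:
                    dist[b] = dist[a] + cost
                    pred[b] = idx
                    changed = True
            if not changed:
                break
        if dist[sink] == float("inf"):
            raise ValueError("supplies and demands cannot be balanced")
        # Trace the path back and push the bottleneck amount along it.
        path = []
        v = sink
        while v != source:
            path.append(pred[v])
            v = edges[pred[v]][0]
        push = min(edges[idx][2] for idx in path)
        for idx in path:
            edges[idx][2] -= push
            edges[idx ^ 1][2] += push
        sent += push
    flow = {}
    total = 0.0
    for k, (i, j, cap, cost) in enumerate(arcs):
        flow[(i, j)] = cap - edges[2 * k][2]
        total += cost * flow[(i, j)]
    return total, flow


# Toy network: one unit must move from submicrofile 1 to submicrofile 2.
supplies = {"N1_1": 1, "N4_2": -1}
arcs = [
    ("N1_1", "N2_1_1", 1, 0.0), ("N1_1", "N2_1_2", 1, 0.0),
    ("N2_1_1", "N3_2_1", 1, 3.0), ("N2_1_1", "N3_2_2", 1, 2.0),
    ("N2_1_2", "N3_2_1", 1, 1.0), ("N2_1_2", "N3_2_2", 1, 4.0),
    ("N3_2_1", "N4_2", 1, 0.0), ("N3_2_2", "N4_2", 1, 0.0),
]
total, flow = min_cost_flow(supplies, arcs)
print(total, [arc for arc, f in flow.items() if f > 0])
```

In the toy network, a unit of flow on a middle arc can be read as swapping one record of the first submicrofile with one record of the second; the cheapest such arc is selected.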
The network for the TPGA has the architecture with the following characteristics:
the set N consists of four disjoint subsets N1 , N2 , N3 , N4 ;
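nodes $N_1^k \in N_1$ and $N_4^k \in N_4$ represent the submicrofiles $M_k$, $k = \overline{1, l_p}$, i.e. $|N_1| = |N_4| = l_p$;
nodes $N_2^{(k,i)} \in N_2^{(k)} \subset N_2$ represent vital records of the submicrofile $M_k$, $k = \overline{1, l_p}$, $i = \overline{1, q_k}$, i.e., $|N_2^{(k)}| = q_k$, $k = \overline{1, l_p}$;
nodes $N_3^{(k,i)} \in N_3^{(k)} \subset N_3$ represent non-vital records of the submicrofile $M_k$, $k = \overline{1, l_p}$, $i = \overline{1, \rho_k - q_k}$, i.e., $|N_3^{(k)}| = \rho_k - q_k$, $k = \overline{1, l_p}$;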
the set $A$ consists of the three disjoint subsets $A_{N_1 N_2}$, $A_{N_2 N_3}$, $A_{N_3 N_4}$;
the arcs in $A_{N_1 N_2}$ connect the nodes $N_1^k \in N_1$ with the nodes $N_2^{(k,i)} \in N_2^{(k)}$, $k = \overline{1, l_p}$, $i = \overline{1, q_k}$. There are no other arcs in $A_{N_1 N_2}$;
the arcs in $A_{N_3 N_4}$ connect the nodes $N_3^{(k,i)} \in N_3^{(k)}$ with the nodes $N_4^k \in N_4$, $k = \overline{1, l_p}$, $i = \overline{1, \rho_k - q_k}$. There are no other arcs in $A_{N_3 N_4}$;
the arcs in $A_{N_2 N_3}$ connect each node $N_2^{(k,i)} \in N_2^{(k)}$ with each node $N_3^{(l,j)} \in N_3^{(l)}$, $k = \overline{1, l_p}$, $l = \overline{1, l_p}$, $k \neq l$, $i = \overline{1, q_k}$, $j = \overline{1, \rho_l - q_l}$;
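the supplies of the nodes $N_1^k \in N_1$, $k = \overline{1, l_p}$, equal $b(N_1^k) = \begin{cases} q_k - q^*_k, & q_k - q^*_k > 0 \\ 0, & q_k - q^*_k \le 0 \end{cases}$;
the demands of the nodes $N_4^k \in N_4$, $k = \overline{1, l_p}$, equal $b(N_4^k) = \begin{cases} q_k - q^*_k, & q_k - q^*_k < 0 \\ 0, & q_k - q^*_k \ge 0 \end{cases}$;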
the supplies/demands of the nodes in N2 and N3 equal 0;
the capacities of all the arcs in A equal 1;
the costs of the arcs in $A_{N_1 N_2}$ and $A_{N_3 N_4}$ equal 0;
the costs of the arcs $(N_2^{(k,i)}, N_3^{(l,j)}) \in A_{N_2 N_3}$, $k = \overline{1, l_p}$, $l = \overline{1, l_p}$, $k \neq l$, $i = \overline{1, q_k}$,
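$$\text{minimize} \quad \sum_{(i,j) \in A} c_{ij} x_{ij},$$
subject to:
$$\sum_{j \mid (i,j) \in A} x_{ij} - \sum_{j \mid (j,i) \in A} x_{ji} = b(i) \quad \forall i \in N,$$
$$0 \le x_{ij} \le u_{ij} \quad \forall (i,j) \in A, \qquad (13)$$
$$\sum_{i=1}^{|N|} b(i) = 0,$$
$$\mu(q^*_1, \ldots, q^*_{l_p}) \ge \alpha,$$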
where $\mu(q^*_1, \ldots, q^*_{l_p})$ is the degree of compatibility of the modified quantity signal
$q^*$ (or, equivalently, of the supplies/demands of the network) with the fuzzy restriction $R(X)$ in the form (11), and $\alpha$ is the user-defined constant, which expresses the
lowest acceptable degree of compatibility of the modified quantity signal with the
fuzzy restriction.
If there are additional constraints that need to be imposed on the modified quantity signal in order to preserve its intrinsic features (such as statistical characteristics,
high-frequency features, or cyclic and periodic components), we can add them
to (13). In this work, we will assume that the problem to solve can be fully described
by (13).
Fig. 1 The architecture of a network for the task of providing group anonymity
To solve this complex problem, we propose to use the appropriately tailored
memetic modifying algorithm (MMA) introduced in [7], which is described in the
following sections.
where $\varphi(U)$ represents the estimate of the quality of a TPGA solution in terms of minimizing
microfile distortion, $\psi(U)$ is a penalty function representing the estimate of the quality of a TPGA solution in terms of its compatibility with the imposed fuzzy
restrictions (11), and $\zeta(U)$ is a penalty term against obtaining individuals with too many rows. As all three terms are of equal importance, their values
lie in the interval [0, 1].
We propose to use the following expression for the first term of (14):
$$\varphi(U) = \frac{C_{max} - \sum_{i=1}^{Q} \mathrm{InfM}\left(M_{u_{i1}}(u_{i2}), M_{u_{i3}}(u_{i4})\right)}{C_{max}}, \qquad (15)$$
where $C_{max}$ is the greatest possible value of the cumulative influential metric (1), and
$M_i(j)$ is the operator yielding the $j$th record of the submicrofile $M_i$, $i = \overline{1, l_p}$.
We also propose to use the following expression for the penalty function:
$$\psi(U) = \bigwedge_{j=i_1}^{i_k} \mu_{A_j}\left(|U|(j)\right), \qquad (16)$$
where indexes $i_1, \ldots, i_k$ define the proper subset of the submicrofiles subject to the
fuzzy restrictions (11). It is worth noting that we do not need to explicitly incorporate
the last restriction from (11) into the penalty function, as by the nature of pairwise
swapping, the sum of all the elements of the modified quantity signal $q^*$ will remain
equal to the sum of all the elements of the initial quantity signal $q$.
1. Based on analyzing the quantity signal q representing microfile M, define suitable decreasing fuzzy restrictions for those signal elements that violate the requirement of masking maximum signal values.
2. Apply the memetic algorithm as described in Sect. 4.
3. Classify obtained individuals into feasible solutions, which are compatible with
decreasing restrictions to a high degree, and mask maximum signal values; subfeasible solutions, which are compatible with decreasing restrictions to a high
degree, but do not mask maximum signal values; and infeasible solutions, which
are not compatible with decreasing restrictions to a high degree.
4. Group all subfeasible solutions for which it is possible to define the same increasing fuzzy restrictions into clusters. One solution may belong to several clusters.
5. Choose the cluster with the smallest mean value of the cumulative metric (1). If
the cluster contains fewer solutions than the population size of the
algorithm, increase its size to the population size by duplicating solutions at random. If the cluster
contains more solutions than the population size, decrease its size to the population size by removing solutions at
random.
6. Apply the memetic algorithm of step 2 to the set of solutions obtained in step 5, using it
as the initial population.
The first two steps of this procedure constitute the first phase of the MMA; the
other four constitute the second phase of the MMA.
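Step 5 of the procedure (choosing the cluster with the smallest mean cumulative metric and bringing it to the population size) can be sketched as follows. The function name, the dictionary layout of the clusters, and the toy data are assumptions made for illustration; in the toy data each solution is represented simply by its metric value.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Tuple


def next_population(
    clusters: Dict[str, List],
    metric: Callable[[object], float],
    pop_size: int,
    rng: random.Random,
) -> Tuple[str, List]:
    """Pick the cluster with the smallest mean metric and resize it to pop_size."""
    best = min(clusters, key=lambda name: mean(metric(s) for s in clusters[name]))
    members = list(clusters[best])
    if len(members) < pop_size:
        # Too few solutions: duplicate randomly chosen members.
        members += rng.choices(members, k=pop_size - len(members))
    elif len(members) > pop_size:
        # Too many solutions: keep a random subset.
        members = rng.sample(members, pop_size)
    return best, members


# Hypothetical clusters keyed by the signal elements to increase.
clusters = {
    "1 and 6": [45.0, 46.0],
    "8 and 10": [44.0, 44.0, 45.0, 43.0],
}
rng = random.Random(0)
best, shrunk = next_population(clusters, lambda s: s, 3, rng)
_, grown = next_population(clusters, lambda s: s, 6, rng)
print(best, len(shrunk), len(grown))
```

The returned list would then serve as the initial population for the second run of the memetic algorithm.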
It is worth noting that the quality of the solution heavily depends on the quality
of the decreasing restriction functions. If the restrictions are too severe (too many
swaps need to be performed), the cumulative metric (1) might become too great.
On the other hand, if they are too mild, the solution may not be feasible, since
maximums may remain greater than other signal values, even though their absolute
values have decreased.
6 Practical Results
6.1 General Description of the Task
To illustrate the application of the memetic modifying algorithm to a real-data-based
task of providing group anonymity, we decided to consider the problem of
masking the regional distribution of active military personnel in the state of Massachusetts (the U.S.) according to the 5-Percent Public Use Microdata Sample File
of the 2000 U.S. census [28]. A total of 141,838 records was taken for analysis.
To define the group of active military personnel distributed by place of work, we
took "Military Service" as the vital attribute, its value "Active Duty" as the only
vital value, "Place of Work PUMA", where PUMA stands for the Public Use
Microdata Area, as the parameter attribute, and its values 25010–25120 with the
step 10 as parameter values. The latter values stand for the codes of Massachusetts
statistical areas.
Quantity signal $q$ corresponding to the group is presented in Fig. 8 (solid line).
Signal elements 1, 2, ..., 12 correspond to statistical areas 25010, 25020, ..., 25120,
respectively.
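$$\mu_2(x) = \begin{cases} 1, & x \le 20 \\ 1 - 2\left(\frac{x - 20}{47}\right)^2, & 20 \le x \le 43.5 \\ 2\left(\frac{x - 67}{47}\right)^2, & 43.5 \le x \le 67 \\ 0, & x \ge 67 \end{cases}$$
$$\mu_7(x) = \begin{cases} 1, & x \le 25 \\ 1 - 2\left(\frac{x - 25}{5}\right)^2, & 25 \le x \le 27.5 \\ 2\left(\frac{x - 30}{5}\right)^2, & 27.5 \le x \le 30 \\ 0, & x \ge 30 \end{cases}$$
$$\mu_9(x) = \begin{cases} 1, & x \le 25 \\ 1 - 2\left(\frac{x - 25}{3}\right)^2, & 25 \le x \le 26.5 \\ 2\left(\frac{x - 28}{3}\right)^2, & 26.5 \le x \le 28 \\ 0, & x \ge 28 \end{cases}$$
$$\mu_{12}(x) = \begin{cases} 1, & x \le 25 \\ 1 - 2\left(\frac{x - 25}{13}\right)^2, & 25 \le x \le 31.5 \\ 2\left(\frac{x - 38}{13}\right)^2, & 31.5 \le x \le 38 \\ 0, & x \ge 38 \end{cases}$$
Fig. 2 Membership function associated with the decreasing fuzzy restriction for the second signal
element for the example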
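The decreasing membership functions of the example share one shape: they equal 1 up to a lower breakpoint, 0 from an upper breakpoint, and follow two quadratic pieces joined at the midpoint in between. A minimal Python sketch of this shape, parameterized by the two breakpoints, is given below; the factory name `z_membership` is an assumption, while the breakpoint pairs (20, 67), (25, 30), (25, 28) and (25, 38) are those of the second, seventh, ninth and twelfth signal elements.

```python
from typing import Callable


def z_membership(a: float, b: float) -> Callable[[float], float]:
    """Return mu(x) that is 1 for x <= a, 0 for x >= b, and quadratic in between."""
    mid = (a + b) / 2.0
    width = b - a

    def mu(x: float) -> float:
        if x <= a:
            return 1.0
        if x <= mid:
            return 1.0 - 2.0 * ((x - a) / width) ** 2
        if x < b:
            return 2.0 * ((x - b) / width) ** 2
        return 0.0

    return mu


mu_2 = z_membership(20, 67)
mu_7 = z_membership(25, 30)
mu_9 = z_membership(25, 28)
mu_12 = z_membership(25, 38)

for name, mu, x in [("mu_2", mu_2, 43.5), ("mu_7", mu_7, 27.5),
                    ("mu_9", mu_9, 26.5), ("mu_12", mu_12, 31.5)]:
    print(name, x, mu(x))
```

Each function takes the value 0.5 at its midpoint, which is a quick way to check that the two quadratic pieces join continuously.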
Indexes from $J$ were chosen to appear in the first column of individuals in the
MMA population, indexes from $\bar{J}$ were chosen to appear in the third column.
To minimize the distortion introduced into the microfile, we took "Sex", "Age",
"Hispanic or Latino Origin", "Marital Status", "Educational Attainment", "Citizenship Status", and "Person's Total Income in 1999" as the influential attributes. We
considered all these attributes to be categorical ones. To simplify the matter, we
chose the following parameters of (1): $\gamma_k = 1$ $\forall k = \overline{1, 7}$, $\chi_1 = 1$, $\chi_2 = 0$. In this case,
the metric (1) shows the number of attribute values to be altered during one swap of
the records between the submicrofiles.
To prevent individuals in the MMA from growing indefinitely, we used the following penalty term (Fig. 6):
$$\zeta_{ex}(U) = \frac{1}{1 + e^{\frac{1}{2}(Q_U - 90)}}$$
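The penalty term of the example is a logistic function of the number of rows of an individual, centered at 90 rows with a steepness of 1/2. A short Python sketch shows how sharply it drops around that point; the function and parameter names are assumptions introduced for illustration.

```python
import math


def row_penalty(rows: int, center: float = 90.0, steepness: float = 0.5) -> float:
    """Logistic penalty term: close to 1 for few rows, close to 0 for many rows."""
    return 1.0 / (1.0 + math.exp(steepness * (rows - center)))


for rows in (80, 90, 100):
    print(rows, round(row_penalty(rows), 4))
```

The value is exactly 0.5 at 90 rows and falls below 0.01 at 100 rows, which agrees with the description of the term as heavily penalizing individuals with more than 100 rows.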
The fitness function (14) for the first phase of the example is as follows:
$$f_{ex1}(U) = \frac{1099 - \sum_{i=1}^{Q} \sum_{k=1}^{7} \chi^2\left(M_{u_{i1}}(u_{i2})(J_k), M_{u_{i3}}(u_{i4})(J_k)\right)}{1099} + \bigwedge_{j \in J} \mu_j\left(|U|(j)\right) + \frac{1}{1 + e^{\frac{1}{2}(Q_U - 90)}}.$$
Fig. 3 Membership function associated with the decreasing fuzzy restriction for the seventh signal
element for the example
Fig. 4 Membership function associated with the decreasing fuzzy restriction for the ninth signal
element for the example
Fig. 5 Membership function associated with the decreasing fuzzy restriction for the twelfth signal
element for the example
Fig. 6 Penalty term heavily discriminating against individuals with more than 100 rows
Table 1 Clusters obtained after the first phase of the memetic modifying algorithm
Quantity Signal Elements to Increase    Cluster Size    Mean Metric
1 and 6                                 78              45.436
3 and 6                                 84              46.048
3 and 10                                26              46.269
4 and 6                                 43              48.488
6 and 8                                 183             46.519
8 and 10                                101             44.238
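$$\mu_8(x) = \mu_{10}(x) = \begin{cases} 0, & x \le 15 \\ 2\left(\frac{x - 15}{12}\right)^2, & 15 \le x \le 21 \\ 1 - 2\left(\frac{x - 27}{12}\right)^2, & 21 \le x \le 27 \\ 1, & x \ge 27 \end{cases}$$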
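The fitness function (14) for the second phase of the example is as follows:
$$f_{ex2}(U) = \frac{1099 - \sum_{i=1}^{Q} \sum_{k=1}^{7} \chi^2\left(M_{u_{i1}}(u_{i2})(J_k), M_{u_{i3}}(u_{i4})(J_k)\right)}{1099} + \bigwedge_{j \in J \cup \{10, 12\}} \mu_j\left(|U|(j)\right) + \frac{1}{1 + e^{\frac{1}{2}(Q_U - 90)}}.$$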
Among the 3000 solutions obtained as the result of the second phase of applying
the MMA, 2693 (or 89.767%) were feasible. The two solutions with the lowest
cumulative metrics (1) are given in Fig. 8(b) (dashed-dotted and dotted lines). The mean
Fig. 7 Membership functions associated with the increasing fuzzy restriction function for the
eighth and tenth signal elements from the example
Fig. 8 Initial (solid line) and modified quantity signals: (a) feasible one with the metric 40 (dashed-dotted line), feasible one with the metric 43 (dotted line), subfeasible one (dashed line); (b) the one
with the metric 37 (dashed-dotted line), the one with the metric 38 (dotted line)
References
1. Brooks, D.: The Social Animal: The Hidden Sources of Love, Character, and Achievement.
N.Y. Random House Trade Paperbacks (2011)
2. Aggarwal, C. C., Yu, P. S.: A general survey of privacy-preserving data mining: models and
algorithms. In: Aggarwal, C. C., Yu, P. S. (eds.) Privacy-Preserving Data Mining: Models and
Algorithms. Advances in Database Systems, vol. 34, pp. 11–52. Springer Science+Business
Media, LLC, New York (2008)
3. Fung, B., Wang, K., Chen, R., Yu, P.: Privacy-preserving data publishing: a survey of recent
developments. ACM Comp. Surv. 42(4), 1–53 (2010)
25. Zadeh, L. A.: Fuzzy sets. Inf. and Cont. 8, 338–353 (1965)
26. Zadeh, L. A.: The concept of a linguistic variable and its application to approximate
reasoning-II. Inf. Sci. 8, 301–357 (1975)
27. Goldberg, D. E., Korb, B., Deb, K.: Messy genetic algorithms: motivation, analysis, and first
results. Comp. Sys. 3, 493–530 (1989)
28. U. S. Census 2000. 5-Percent Public Use Microdata Sample Files,
http://www.census.gov/census2000/PUMS5.html
29. Syswerda, G.: Schedule optimization using genetic algorithms. In: Davis, L. (ed.) Handbook
of Genetic Algorithms, pp. 332–349. Van Nostrand Reinhold, New York (1991)
30. Brindle, A.: Genetic algorithms for function optimization. Doctoral Dissertation, Department
of Computer Science, Tech. Rep. TR81-2, University of Alberta (1981)
31. Goldberg, D. E.: Genetic Algorithms in Search, Optimization, and Machine Learning.
Addison-Wesley (1989)